Llama.cpp Models

Adding and managing Llama.cpp Models in Msty Studio

llama.cpp is an open source C/C++ library that performs efficient inference on large language models with minimal setup and optimized performance across various hardware.

This gives Msty Studio another local inference engine option, in addition to Ollama and MLX.

Llama.cpp can be installed during new user onboarding or from Model Hub > Llama.cpp.

Adding Llama.cpp Models

In Model Hub, you can view featured Llama.cpp models and search for models on the Llama.cpp Hugging Face Community tab.

Click on the download icon to download and install a model locally.

Managing Llama.cpp Service

You can manage and configure the Llama.cpp service in Settings > Llama.cpp Service. The service must be running for Llama.cpp models to work locally, and you can start or stop it here.

You can also check the service's health and view its endpoint, version, logs, and more.
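
If you prefer to check the service from a script rather than the UI, the endpoint shown on this settings page can be queried directly. The sketch below is a minimal Python example; it assumes the service exposes llama.cpp's standard llama-server /health route on that server's default port of 8080, so substitute the endpoint Msty Studio actually displays.

```python
import json
import urllib.request

# Assumed endpoint: replace with the one shown in
# Settings > Llama.cpp Service. llama.cpp's bundled
# llama-server exposes a GET /health route; 8080 is
# only its default port.
ENDPOINT = "http://127.0.0.1:8080"

try:
    with urllib.request.urlopen(f"{ENDPOINT}/health", timeout=5) as resp:
        body = json.loads(resp.read())
        print("service status:", body.get("status"))  # "ok" when ready
except OSError as exc:
    # Connection refused, timeout, or non-2xx response
    print("service unreachable:", exc)
```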

Llama.cpp Model Parameters

Llama.cpp models offer a few parameters that can give them an edge in conversations.

With a Llama.cpp model selected, click the Model Parameters icon next to the model selector.

Here, you will see options specific to Llama.cpp, including:

Num Ctx (default to model max)

This is the context window setting, where you can choose to use the maximum amount of context the model supports.

The specific maximum may not be known to Llama.cpp or Msty Studio ahead of time; this option simply sets the context window to whatever maximum the model allows.

Using the maximum benefits conversations because more context is retained, resulting in fewer hallucinations and better conversation continuity. The trade-off is that a larger context window consumes more of your device's memory and compute, which may decrease overall system performance.
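
This mirrors a convention in llama.cpp itself: a requested context size of 0 means "read the maximum from the model's GGUF metadata." The sketch below illustrates that convention using the llama-cpp-python bindings; it is not how Msty Studio invokes the engine, and the model path is a placeholder.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Illustration of the underlying llama.cpp convention, not Msty
# Studio's internals: a context size of 0 tells llama.cpp to use
# the maximum context length stored in the model's GGUF metadata.
llm = Llama(
    model_path="model.gguf",  # placeholder path
    n_ctx=0,                  # 0 = use the model's trained maximum
)
print(llm.n_ctx())  # resolved context window after loading
```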

Truncation Strategy

As conversations get longer, you can set a truncation strategy so that you can continue them without hitting a context-limit-reached message. The options are described here, with a simplified sketch of their behavior after the list.

Options:

  • Truncate Middle - this truncates the middle portion of the conversation. This setting is ideal when the first parts of the conversation are important to the overall context. Use this if the first interactions set the stage for the conversation.
  • Truncate Old - this truncates the earliest messages of the conversation. This setting is ideal for continued conversations where the first messages are not critical to continuity. Use this if the latest messages matter most.
  • None - this does not truncate any messages in an ongoing conversation. Use this setting if the full historical context is needed; however, it runs the risk of exceeding the context limit.
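
To make the strategies concrete, here is a simplified sketch of the three behaviors. It trims by message count for readability; a real implementation (including Msty Studio's) would count tokens, and the function shown is illustrative rather than an actual Msty Studio API.

```python
def truncate(messages: list[str], limit: int, strategy: str) -> list[str]:
    """Simplified sketch: trims a conversation to `limit` messages."""
    if strategy == "none" or len(messages) <= limit:
        return messages  # may eventually exceed the context limit
    if strategy == "truncate_old":
        return messages[-limit:]  # drop the earliest messages
    if strategy == "truncate_middle":
        head = limit // 2          # keep the scene-setting start...
        tail = limit - head        # ...and the latest messages
        return messages[:head] + messages[-tail:]
    raise ValueError(f"unknown strategy: {strategy}")

history = [f"message {i}" for i in range(10)]
print(truncate(history, 4, "truncate_middle"))
# ['message 0', 'message 1', 'message 8', 'message 9']
print(truncate(history, 4, "truncate_old"))
# ['message 6', 'message 7', 'message 8', 'message 9']
```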