LlamaCPP Extension
Overview
Nitro is an inference server built on top of llama.cpp. It provides an OpenAI-compatible API, a request queue, and scaling.
note
Nitro is the default AI engine downloaded with Jan. There is no additional setup needed.
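Because the API is OpenAI-compatible, you can sanity-check a running instance with a standard chat completion request. The sketch below assumes Nitro's default local port (3928) and that Jan has already loaded a model; adjust the URL and payload to your setup.

```python
# Minimal sketch: query Nitro's OpenAI-compatible chat endpoint.
# Assumes Nitro is running locally on its default port (3928) and
# that a model has already been loaded through Jan.
import requests

response = requests.post(
    "http://localhost:3928/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Hello, who are you?"}
        ]
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```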
In this guide, we'll walk you through the process of customizing your engine settings by configuring the `nitro.json` file.
- Navigate to **App Settings** > **Advanced** > **Open App Directory**, then open the `~/jan/engines` folder:
  - MacOS: `cd ~/jan/engines`
  - Windows: `C:/Users/<your_user_name>/jan/engines`
  - Linux: `cd ~/jan/engines`
- Modify the `nitro.json` file based on your needs. The default settings are shown below.
`~/jan/engines/nitro.json`
```json
{
  "ctx_len": 2048,
  "ngl": 100,
  "cpu_threads": 1,
  "cont_batching": false,
  "embedding": false
}
```
The table below describes the parameters in the `nitro.json` file.
| Parameter | Type | Description |
|---|---|---|
| `ctx_len` | Integer | The context length for model operations. Typically set to 2048, which provides ample context for most models. (Minimum: 1, Maximum: 4096) |
| `ngl` | Integer | The number of model layers to offload to the GPU. Defaults to 100, which offloads all layers. |
| `cpu_threads` | Integer | The number of CPU threads used for inference. The maximum is determined by your hardware and OS. |
| `cont_batching` | Boolean | Enables continuous batching, which enhances throughput for LLM inference. |
| `embedding` | Boolean | Enables embeddings, used for tasks like document-enhanced chat in RAG-based applications. |
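As a sketch of how you might tune these values, the configuration below raises the context length and CPU thread count. The numbers are illustrative, not recommendations; match them to your hardware and model.

`~/jan/engines/nitro.json`
```json
{
  "ctx_len": 4096,
  "ngl": 100,
  "cpu_threads": 4,
  "cont_batching": false,
  "embedding": false
}
```

You may need to restart Jan after editing the file for the engine to pick up the new settings.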
tip
- By default, `ngl` is set to 100, which offloads all model layers to the GPU. If you wish to offload only about 50% of the layers, you can set `ngl` to 15, because most Mistral or Llama models have around 30 layers.
- To utilize the embedding feature, include the JSON parameter `"embedding": true`. This enables Nitro to process inferences with embedding capabilities (see the example request after this list). Please refer to Embedding in the Nitro documentation for a more detailed explanation.
- To utilize the continuous batching feature for boosting throughput and minimizing latency in large language model (LLM) inference, include `"cont_batching": true`. For details, please refer to Continuous Batching in the Nitro documentation.
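With `"embedding": true` set, a sketch of an embedding request against the OpenAI-style `/v1/embeddings` route is shown below. The port, route, and `model` value are assumptions based on Nitro's OpenAI-compatible defaults; adjust them to your setup.

```python
# Minimal sketch: request an embedding from Nitro after enabling
# "embedding": true in nitro.json. Port 3928 and the /v1/embeddings
# route follow Nitro's OpenAI-compatible defaults; the model name
# here is a placeholder for whatever model Jan has loaded.
import requests

response = requests.post(
    "http://localhost:3928/v1/embeddings",
    json={
        "input": "Hello, Jan!",
        "model": "my-loaded-model",  # placeholder model name
        "encoding_format": "float",
    },
    timeout=60,
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]
print(f"Embedding length: {len(embedding)}")
```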
Assistance and Support
If you have questions, please join our Discord community for support, updates, and discussions.