Models and performance

LlamaBoss bundles CPU and NVIDIA CUDA llama.cpp runtimes, scans local GGUF files, and can also list models from configured OpenAI-compatible endpoints.

Applies to LlamaBoss v0.1.10 · Updated 2026-07-24

Local GGUF models

Use Settings → Download models for the curated catalog, or place compatible .gguf files in the active models folder. The model picker displays models found by the scanner.

The default folder is:

%LOCALAPPDATA%\LlamaBoss\models
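
To check what a scanner would have to work with, the folder can be listed from a script. The following is a minimal sketch, not LlamaBoss's actual scanner: the function names are hypothetical, and the recursive, case-insensitive match on the .gguf extension is an assumption about how files are discovered.

```python
import os
from pathlib import Path


def default_models_dir() -> Path:
    # Mirrors the documented default: %LOCALAPPDATA%\LlamaBoss\models
    local_app_data = os.environ.get("LOCALAPPDATA", "")
    return Path(local_app_data) / "LlamaBoss" / "models"


def find_gguf_files(folder: Path) -> list[Path]:
    # Assumption: search subfolders too and ignore extension case.
    # The real scanner's rules may differ.
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


if __name__ == "__main__":
    for path in find_gguf_files(default_models_dir()):
        size_gib = path.stat().st_size / 2**30
        print(f"{path.name}  {size_gib:.2f} GiB")
```

If the folder does not exist, the sketch returns an empty list instead of raising an error.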

Bring your own model folder

In Settings → Model → Location, select Change to point LlamaBoss at another folder. The downloader, scanner, model manager, and picker then use that folder. Select Reset to return to the default location.

A custom folder is useful when models already live on another SSD. LlamaBoss stores the folder choice; it does not copy the model files.

Vision models and companion projectors

A vision-capable language model usually needs a compatible multimodal projector file, commonly named with mmproj. Keep the matching model and projector together in the model's folder. Curated vision downloads handle the intended pairing automatically.

Do not keep several unrelated projector files beside one model. When pairing is ambiguous, LlamaBoss declines to choose a projector rather than silently attaching the wrong one.
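
The "no guessing" rule can be illustrated with a short sketch. This is an assumption about the general approach, not LlamaBoss's real pairing logic: the helper name pick_projector is hypothetical, and it treats any other .gguf file in the same folder whose name contains mmproj as a candidate.

```python
from pathlib import Path
from typing import Optional


def pick_projector(model_path: Path) -> Optional[Path]:
    """Return the only projector beside the model, or None if absent or ambiguous."""
    candidates = [
        p for p in model_path.parent.glob("*.gguf")
        if "mmproj" in p.name.lower() and p != model_path
    ]
    # Exactly one candidate is an unambiguous pairing; anything else is not attached.
    return candidates[0] if len(candidates) == 1 else None
```

With one projector in the folder, the sketch returns it. With none or several, it returns None, which is why keeping one model and one matching projector per folder avoids the ambiguous case.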

Context length and 8-bit KV cache

Settings exposes context choices from 4K through 256K tokens. A larger context lets the model retain more conversation and tool history, but requires more memory and may take longer to process.

8-bit KV cache is recommended. Compared with a 16-bit cache, it roughly halves KV-cache memory use, so about twice as much context fits in the same amount of VRAM. Changing context length or KV-cache mode reloads the local model.

The model itself must support the selected context. Setting 256K in the UI cannot give a model a reliable 256K context if its architecture or training does not support it.
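
The memory effect of context length and cache precision can be estimated with simple arithmetic. The sketch below assumes a standard transformer in which every layer stores one key and one value vector per token. The model dimensions (32 layers, 8 KV heads, head size 128) are illustrative values, not a specific LlamaBoss catalog entry, and the estimate ignores architectures that use sliding-window or hybrid attention. The 8-bit figure assumes llama.cpp's q8_0 layout, which stores 32 int8 values plus one 16-bit scale per block.

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 one-byte values + one 2-byte scale per block


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_element):
    # Factor 2: each layer keeps both a K and a V tensor.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element


if __name__ == "__main__":
    # Illustrative dimensions (assumed, see lead-in).
    layers, kv_heads, head_dim = 32, 8, 128
    for n_ctx in (8 * 1024, 256 * 1024):
        f16 = kv_cache_bytes(layers, kv_heads, head_dim, n_ctx, F16_BYTES)
        q8 = kv_cache_bytes(layers, kv_heads, head_dim, n_ctx, Q8_0_BYTES)
        print(f"{n_ctx:>7} tokens: f16 {f16 / 2**30:.2f} GiB, q8_0 {q8 / 2**30:.2f} GiB")
    # prints:
    #    8192 tokens: f16 1.00 GiB, q8_0 0.53 GiB
    #  262144 tokens: f16 32.00 GiB, q8_0 17.00 GiB
```

Under these assumptions the 8-bit cache uses about 53% of the 16-bit size, which matches the "roughly halves" guidance, and the cost grows linearly with the selected context length.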

Choosing a model that feels responsive

Prefer a model that fits fully in GPU VRAM when possible.
If generation becomes extremely slow, the model may be spilling into system RAM.
Close games, image generators, and other GPU-heavy applications before loading a large model.
A smaller quantization or smaller parameter count often feels much faster with only a modest capability tradeoff.
Long contexts consume memory even before the answer becomes long.
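
The points above amount to a budget check: model weights plus KV cache must fit in the VRAM that is actually free. Here is a rough sketch of that check. The function name and the 10% headroom for compute buffers are assumptions for illustration, not values LlamaBoss uses, and the model file size is only an approximation of the memory the weights occupy once loaded.

```python
GIB = 2**30


def fits_in_vram(model_file_bytes, kv_cache_bytes, free_vram_bytes, headroom_fraction=0.10):
    # Reserve a slice of VRAM for compute buffers and other overhead (assumed 10%).
    usable = free_vram_bytes * (1 - headroom_fraction)
    return model_file_bytes + kv_cache_bytes <= usable


if __name__ == "__main__":
    free_vram = 12 * GIB
    model_file = 4.9 * GIB
    print(fits_in_vram(model_file, 1 * GIB, free_vram))  # True: 5.9 GiB needed, 10.8 GiB usable
    print(fits_in_vram(model_file, 8 * GIB, free_vram))  # False: 12.9 GiB needed
```

The second call shows the same model no longer fitting once a long context inflates the KV cache, even though the weights are unchanged. On NVIDIA systems, free VRAM can be read with nvidia-smi --query-gpu=memory.free --format=csv before loading a model.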

Remote models

Remote endpoint models appear in the same model picker, grouped by endpoint. They use the endpoint's API rather than the local llama.cpp server, so prompts and attachments sent to that model leave the computer. See Remote endpoints.
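
For reference, OpenAI-compatible servers conventionally expose their model list at GET <base URL>/models, returning a JSON object whose data array holds entries with an id field. The sketch below shows that request using only the Python standard library. It is not LlamaBoss's own client code, the function names are hypothetical, and the base URL and API key in the usage note are placeholders.

```python
import json
import urllib.request
from typing import Optional


def parse_model_ids(payload: dict) -> list[str]:
    # Expected shape: {"object": "list", "data": [{"id": "..."}, ...]}
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_remote_models(base_url: str, api_key: Optional[str] = None,
                       timeout: float = 10.0) -> list[str]:
    # base_url usually already ends in /v1 for OpenAI-compatible servers.
    request = urllib.request.Request(base_url.rstrip("/") + "/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_model_ids(json.load(response))
```

A call such as list_remote_models("http://localhost:8000/v1") sends a request to that host. As with chat traffic, anything sent to a remote base URL leaves the computer.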
