# Recommended llama.cpp parameters for different models
The following are the llama.cpp settings I use to run different models efficiently.
## Gemma3

**Gemma 3 27B**

```bash
./llama-cli --model bartowski/google_gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-Q8_0.gguf \
  --ctx-size 131072 --temp 1.0 --repeat-penalty 1.0 \
  --min-p 0.01 --top-k 64 --top-p 0.95
```
> **Note:** According to the Unsloth team, the recommended `--min-p` value is 0.00 (optional), but 0.01 works well in llama.cpp.
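If you want the same settings behind an OpenAI-compatible endpoint, they carry over unchanged to `llama-server`; a minimal sketch (the port is an arbitrary choice, not part of the original setup):

```bash
# Serve Gemma 3 27B with the same sampling defaults over HTTP
# (port 8080 is an assumption).
./llama-server --model bartowski/google_gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-Q8_0.gguf \
  --ctx-size 131072 --temp 1.0 --repeat-penalty 1.0 \
  --min-p 0.01 --top-k 64 --top-p 0.95 --port 8080
```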
## Microsoft/Phi-4-reasoning

**Microsoft Phi-4 Reasoning Plus**

```bash
./llama-cli --model bartowski/microsoft_Phi-4-reasoning-plus-GGUF/microsoft_Phi-4-reasoning-plus-Q8_0.gguf \
  --ctx-size 32768 --temp 0.8 --top-k 50 --top-p 0.95 \
  --reasoning-format deepseek
```
> **Note:** For more complex queries, set `--predict` to 32768 to allow for a longer chain of thought (CoT), as in the sketch below.
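Concretely, that variant would look like this (the same flags as above, with only `--predict` added):

```bash
# Phi-4 Reasoning Plus with an explicit generation budget of
# 32768 tokens for longer chains of thought.
./llama-cli --model bartowski/microsoft_Phi-4-reasoning-plus-GGUF/microsoft_Phi-4-reasoning-plus-Q8_0.gguf \
  --ctx-size 32768 --temp 0.8 --top-k 50 --top-p 0.95 \
  --reasoning-format deepseek --predict 32768
```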
## Mistralai/Devstral-Small-2505

**Devstral-Small-2505**

```bash
./llama-cli --model bartowski/Devstral-Small-2505-GGUF/Devstral-Small-2505-Q8_0.gguf \
  --ctx-size 131072 --temp 0.15
```
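Devstral is a coding model, hence the low temperature. For scripted, one-shot use you can pass a prompt directly; a sketch assuming a recent llama.cpp build (the prompt text is purely illustrative):

```bash
# One-shot, non-interactive completion: -p supplies the prompt and
# -no-cnv disables interactive conversation mode.
./llama-cli --model bartowski/Devstral-Small-2505-GGUF/Devstral-Small-2505-Q8_0.gguf \
  --ctx-size 131072 --temp 0.15 \
  -p "Write a Python function that reverses a linked list." -no-cnv
```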
## Qwen/Qwen3

### Thinking

**Qwen3 Thinking**

```bash
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf \
  --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 \
  --ctx-size 40960 --predict 32768
```
### Non-Thinking

**Qwen3 Non-Thinking**

```bash
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf \
  --jinja --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0 \
  --ctx-size 40960 --predict 32768
```
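Qwen3 also supports soft switches inside the prompt itself: per Qwen's model card, appending `/no_think` to a user message disables the thinking block for that turn (and `/think` re-enables it). A sketch with the non-thinking sampling settings (the prompt text is illustrative):

```bash
# The /no_think soft switch suppresses the <think> block for this turn.
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf \
  --jinja --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0 \
  --ctx-size 40960 --predict 32768 \
  -p "Summarize the difference between top-k and top-p sampling. /no_think"
```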
### Notes

- The Qwen team suggests setting the `--presence-penalty` parameter between 0 and 2 to reduce endless repetition, but adds that higher values may occasionally cause language mixing and a slight drop in model performance; see the sketch below.
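For example, the thinking-mode command with a moderate presence penalty added (the value 1.5 is an illustration within the suggested range, not a Qwen recommendation):

```bash
# Qwen3 thinking settings plus --presence-penalty to curb repetition;
# 1.5 is an example value from the suggested 0-2 range.
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf \
  --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 \
  --ctx-size 40960 --predict 32768 --presence-penalty 1.5
```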