
# Recommended llama.cpp parameters for different models

The following are the llama.cpp settings I use to run different models efficiently.

## Gemma3

**Gemma3 27b**

```bash
./llama-cli --model bartowski/google_gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-Q8_0.gguf \
  --ctx-size 131072 --temp 1.0 --repeat-penalty 1.0 --min-p 0.01 --top-k 64 --top-p 0.95
```
> **Note:** According to the Unsloth team, the recommended `--min-p` is 0.00 (optional), but 0.01 works well in llama.cpp.
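The same sampler settings carry over if you prefer serving the model over HTTP instead of using the CLI; a minimal sketch, assuming the `llama-server` binary from the same llama.cpp build (the port is an arbitrary choice):

```bash
# Sketch: Gemma3 27b behind llama-server with the sampler settings above;
# --port 8080 is an arbitrary example value, not a recommendation.
./llama-server --model bartowski/google_gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-Q8_0.gguf \
  --ctx-size 131072 --temp 1.0 --repeat-penalty 1.0 --min-p 0.01 --top-k 64 --top-p 0.95 \
  --port 8080
```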

## Microsoft/Phi-4-reasoning

**Microsoft Phi-4 Reasoning Plus**

```bash
./llama-cli --model bartowski/microsoft_Phi-4-reasoning-plus-GGUF/microsoft_Phi-4-reasoning-plus-Q8_0.gguf \
  --ctx-size 32768 --temp 0.8 --top-k 50 --top-p 0.95 --reasoning-format deepseek
```
> **Note:** For more complex queries, set `--predict 32768` to allow for a longer chain of thought (CoT), as shown below.
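Concretely, that variant just appends `--predict` to the command above (a sketch; all other flags unchanged):

```bash
# Same Phi-4 Reasoning Plus invocation, with --predict 32768 added to leave
# room for a longer chain of thought on harder queries.
./llama-cli --model bartowski/microsoft_Phi-4-reasoning-plus-GGUF/microsoft_Phi-4-reasoning-plus-Q8_0.gguf \
  --ctx-size 32768 --temp 0.8 --top-k 50 --top-p 0.95 --reasoning-format deepseek \
  --predict 32768
```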


## Mistralai/Devstral-Small-2505

**Devstral-Small-2505**

```bash
./llama-cli --model bartowski/Devstral-Small-2505-GGUF/Devstral-Small-2505-Q8_0.gguf \
  --ctx-size 131072 --temp 0.15
```
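For non-interactive coding tasks, you can feed the prompt from a file instead of chatting; a minimal sketch, where `prompt.txt` is a hypothetical placeholder for your own prompt file:

```bash
# Sketch: one-shot run reading the prompt from a file; prompt.txt is a
# placeholder, the model and sampler flags match the command above.
./llama-cli --model bartowski/Devstral-Small-2505-GGUF/Devstral-Small-2505-Q8_0.gguf \
  --ctx-size 131072 --temp 0.15 --file prompt.txt
```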


## Qwen/Qwen3

### Thinking

**Qwen3 Thinking**

```bash
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf --jinja \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --ctx-size 40960 --predict 32768
```
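Since Qwen3's thinking mode wraps its reasoning in `<think>` tags like DeepSeek R1, the same reasoning-format flag used in the Phi-4 example above should apply; a sketch under that assumption:

```bash
# Sketch: same Qwen3 thinking-mode command, assuming the deepseek reasoning
# format also parses Qwen3's <think> blocks so the CoT is split from the answer.
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf --jinja \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --ctx-size 40960 --predict 32768 \
  --reasoning-format deepseek
```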

### Non-thinking

**Qwen3 Non-thinking**

```bash
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf --jinja \
  --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0 --ctx-size 40960 --predict 32768
```
> **Note:** The Qwen team suggests setting `--presence-penalty` between 0 and 2 to reduce endless repetitions, but adds that higher values may occasionally cause language mixing and a slight drop in model performance; see the sketch below.
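As an illustration of that suggestion, the non-thinking command with a mid-range penalty might look like this (1.5 is an arbitrary example value, not a Qwen recommendation):

```bash
# Sketch: non-thinking Qwen3 with a presence penalty to curb repetition;
# 1.5 is an arbitrary pick from the 0-2 range the Qwen team mentions.
./llama-cli --model bartowski/Qwen_Qwen3-8B-GGUF/Qwen_Qwen3-8B-Q8_0.gguf --jinja \
  --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0 --ctx-size 40960 --predict 32768 \
  --presence-penalty 1.5
```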