
Qwen 3 llama.cpp tips and tricks

To disable thinking mode, append /no_think to your prompt (you can also place it in the system prompt):
>>> Write your prompt here /no_think

Use -ot ".ffn_.*_exps.=CPU" to offload all MoE expert layers to the CPU. This effectively lets you fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regular expression to keep more layers on the GPU if you have spare VRAM; a sketch of how the pattern matches tensor names follows.
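As a minimal illustration of how this works (the tensor names below are hypothetical, but follow the blk.N.ffn_*_exps naming used by MoE GGUF files), --override-tensor / -ot does a regex search against each tensor name and pins matches to the given buffer:

```python
import re

# The PATTERN half of -ot "PATTERN=CPU". The dots in the original pattern
# are unescaped (they match any character), but the match still works.
pattern = re.compile(r".ffn_.*_exps.")

# Hypothetical tensor names in the style of a Qwen3 MoE GGUF:
tensors = [
    "blk.0.attn_q.weight",          # attention weight -> stays on GPU
    "blk.0.ffn_up_exps.weight",     # MoE expert -> matches, pinned to CPU
    "blk.47.ffn_down_exps.weight",  # MoE expert -> matches, pinned to CPU
]
for name in tensors:
    print(f"{name:30s} -> {'CPU' if pattern.search(name) else 'GPU'}")
```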

Qwen3's native context is 32,768 tokens; to extend it with YaRN (4 × 32,768 = 131,072 tokens), pass the RoPE-scaling flags:

```bash
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

Example: running Qwen3-235B-A22B with llama-cli, offloading the MoE experts to the CPU:

```bash
./llama.cpp/llama-cli \
--model unsloth/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-Q2_K_XL.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.6 \
--min-p 0.0 \
--top-p 0.95 \
--top-k 20 \
-no-cnv \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
```

(-no-cnv disables interactive conversation mode, so the raw --prompt is run once.)

Sampling Parameters:


Thinking mode
For thinking mode (enable_thinking=True), use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.


Non-thinking mode
For non-thinking mode (enable_thinking=False), we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
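Where enable_thinking actually lives: a minimal sketch using the Hugging Face tokenizer's chat template (the model ID below assumes the official Qwen3 repo; any Qwen3 checkpoint works the same way):

```python
from transformers import AutoTokenizer

# Qwen3's chat template accepts enable_thinking as a keyword argument;
# with False it renders the non-thinking prompt variant.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
text = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # pair with the non-thinking sampling settings above
)
print(text)
```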


For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
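To tie these together, here is a minimal sketch of sending the thinking-mode settings to a running llama-server through its OpenAI-compatible endpoint (the host, port, and prompt are placeholders; top_k and min_p are llama-server extensions to the OpenAI schema):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write your prompt here"}],
        "temperature": 0.6,       # thinking-mode settings from above
        "top_p": 0.95,
        "top_k": 20,              # llama-server-specific sampling fields
        "min_p": 0.0,
        "presence_penalty": 1.5,  # optional, 0-2, to curb repetition
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```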

Info from Reddit:

For Qwen3-30B-A3B (PowerShell):

```powershell
& "C:\llama-cpp\llama-server.exe" `
--host 127.0.0.1 --port 9045 `
--model "C:\llama-cpp\models\Qwen3-30B-A3B.Q8_0.gguf" `
--n-gpu-layers 99 --flash-attn --slots --metrics `
--ubatch-size 512 --batch-size 512 `
--presence-penalty 1.5 `
--cache-type-k q8_0 --cache-type-v q8_0 `
--no-context-shift --ctx-size 32768 --n-predict 32768 `
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 `
--repeat-penalty 1.1 --jinja --reasoning-format deepseek `
--threads 5 --threads-http 5 --cache-reuse 256 `
--override-tensor 'blk\.([0-9]*[02468])\.ffn_.*_exps\.=CPU' `
--no-mmap `
--api-key 1234
```

Note that the quantized V cache (--cache-type-v q8_0) requires --flash-attn, which this command already enables.

A simpler catch-all is -ot exps=CPU, which matches any tensor whose name contains exps (i.e., all the MoE expert tensors).

And for Qwen3-235B-A22B, split across two GPUs (--tensor-split 0.5,0.5) with some experts kept on the CPU:

```bash
/app/llama-server \
--port 9045 --flash-attn --slots --metrics -ngl 99 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--no-context-shift \
--ctx-size 32768 \
--n-predict 32768 \
--temp 0.5 --top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1.05 --presence-penalty 2.0 \
--jinja --reasoning-format deepseek \
--model /models/Qwen3-235B-A22B.i1-IQ3_M.gguf \
--threads 23 \
--threads-http 23 \
--cache-reuse 256 \
--main-gpu 0 \
--tensor-split 0.5,0.5 \
--override-tensor '([3-8]+).ffn_.*_exps.=CPU'
```

For 30B-A3B I'm using: --override-tensor "blk\.([0-9]*[02468])\.ffn_.*_exps\.=CPU" (keeps the experts of every even-numbered block on the CPU).

Reported speeds for different override patterns ("99" means -ngl 99):

| Override pattern | -ngl | Speed |
| --- | --- | --- |
| (none, baseline) | 46 | 6.86 t/s |
| `\.\d*[0369]\.(ffn_up\|ffn_gate)=CPU` | 99 | 7.76 t/s |
| `\.\d*[03689]\.(ffn_up\|ffn_gate)=CPU` | 99 | 6.96 t/s |
| `\.\d*[0369]\.(ffn_up\|ffn_down)=CPU` | 99 | 8.02 t/s, 7.95 t/s |
| `\.\d*[0-9]\.(ffn_up)=CPU` | 99 | 6.4 t/s |
| `\.(5[6-9]\|6[0-3])\.(ffn_*)=CPU` | 55 | 7.6 t/s |
| `\.(5[3-9]\|6[0-3])\.(ffn_*)=CPU` | 99 | 10.4 t/s |
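To see which blocks a pattern like the fastest one above actually pins to the CPU, a small sketch (assuming blk.N.ffn_* tensor names and, hypothetically, a 64-block model):

```python
import re

# The fastest pattern from the table: blocks 53-63 keep their FFN
# tensors on the CPU; everything else goes to the GPU.
pattern = re.compile(r"\.(5[3-9]|6[0-3])\.(ffn_*)")

matched = [n for n in range(64) if pattern.search(f"blk.{n}.ffn_up.weight")]
print(matched)  # [53, 54, ..., 63]
```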