Qwen 3 llama.cpp tips and tricks
To disable thinking, append /no_think to the end of your prompt (you can also place it in the system prompt):

>>> Write your prompt here /no_think
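The same toggle works over llama-server's OpenAI-compatible API. A minimal sketch, assuming a server running locally on the default port 8080 (the message text is a placeholder; Temperature 0.7 / TopP 0.8 match the non-thinking sampling settings below):

```
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "/no_think"},
      {"role": "user", "content": "Write your prompt here"}
    ],
    "temperature": 0.7,
    "top_p": 0.8
  }'
```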
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE expert layers to the CPU! This effectively allows you to fit all non-expert layers on a single GPU, improving generation speed. You can customize the regex to keep more of the expert layers on the GPU if you have spare VRAM; see the sketch below.
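For example, reusing the even-layer pattern reported further down this page, you can offload only the experts of even-numbered blocks and keep the odd-numbered ones on the GPU (a sketch; the model path is a placeholder):

```
# Experts of even-numbered blocks go to CPU; odd-numbered blocks stay on GPU.
./llama.cpp/llama-cli \
  --model your-model.gguf \
  --n-gpu-layers 99 \
  -ot 'blk\.([0-9]*[02468])\.ffn_.*_exps\.=CPU'
```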
Qwen 3's native context window is 32,768 tokens. To extend it, enable YaRN scaling; a rope scale of 4 over the 32,768-token original context gives roughly 131,072 tokens:

```
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```
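As a rule of thumb, rope-scale is the desired context divided by the original context, so a 65,536-token target needs a scale of 2 (2 x 32,768 = 65,536). A sketch, where ... stands for your other flags:

```
llama-server ... --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768 --ctx-size 65536
```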
A full example command for Qwen3-235B-A22B (UD-Q2_K_XL) with all MoE experts offloaded to the CPU:

```
./llama.cpp/llama-cli \
    --model unsloth/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-Q2_K_XL.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --seed 3407 \
    --prio 3 \
    --temp 0.6 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
```
Sampling parameters:

- Thinking mode (enable_thinking=True): Temperature=0.6, TopP=0.95, TopK=20, MinP=0. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
- Non-thinking mode (enable_thinking=False): Temperature=0.7, TopP=0.8, TopK=20, MinP=0.

For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, a higher value may occasionally result in language mixing and a slight decrease in model performance.
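A minimal sketch of passing the thinking-mode settings through llama-server's OpenAI-compatible endpoint (the host, port, message text, and presence_penalty value of 1.5 are assumptions; top_k and min_p are llama-server extensions to the OpenAI schema):

```
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write your prompt here"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
    "presence_penalty": 1.5
  }'
```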
Info from Reddit:
& "C:\llama-cpp\llama-server.exe" `
--host 127.0.0.1 --port 9045 `
--model "C:\llama-cpp\models\Qwen3-30B-A3B.Q8_0.gguf" `
--n-gpu-layers 99 --flash-attn --slots --metrics `
--ubatch-size 512 --batch-size 512 `
--presence-penalty 1.5 `
--cache-type-k q8_0 --cache-type-v q8_0 `
--no-context-shift --ctx-size 32768 --n-predict 32768 `
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 `
--repeat-penalty 1.1 --jinja --reasoning-format deepseek `
--threads 5 --threads-http 5 --cache-reuse 256 `
--override-tensor 'blk\.([0-9]*[02468])\.ffn_.*_exps\.=CPU' `
--no-mmap
--api-key 1234
Since -ot patterns are regexes matched against tensor names, a simple shorthand that sends every expert tensor to the CPU is:

```
-ot exps=CPU
```
Another setup, running Qwen3-235B-A22B split evenly across two GPUs with a subset of expert layers on the CPU:

```
/app/llama-server \
    --port 9045 --flash-attn --slots --metrics -ngl 99 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --no-context-shift \
    --ctx-size 32768 \
    --n-predict 32768 \
    --temp 0.5 --top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1.05 --presence-penalty 2.0 \
    --jinja --reasoning-format deepseek \
    --model /models/Qwen3-235B-A22B.i1-IQ3_M.gguf \
    --threads 23 \
    --threads-http 23 \
    --cache-reuse 256 \
    --main-gpu 0 \
    --tensor-split 0.5,0.5 \
    --override-tensor '([3-8]+).ffn_.*_exps.=CPU'
```
For 30B A3B I'm using: --override-tensor "blk\.([0-9]*[02468])\.ffn_.*_exps\.=CPU" (note the flag is --override-tensor, not --overridetensors).

Benchmarks of different patterns (baseline, 46 GPU layers offloaded and no override: 6.86 t/s):

- \.\d*[0369]\.(ffn_up|ffn_gate)=CPU with 99 GPU layers: 7.76 t/s
- \.\d*[03689]\.(ffn_up|ffn_gate)=CPU with 99 GPU layers: 6.96 t/s
- \.\d*[0369]\.(ffn_up|ffn_down)=CPU with 99 GPU layers: 8.02 t/s and 7.95 t/s
- \.\d*[0-9]\.(ffn_up)=CPU with 99 GPU layers: 6.4 t/s
- \.(5[6-9]|6[0-3])\.(ffn_*)=CPU with 55 GPU layers: 7.6 t/s
- \.(5[3-9]|6[0-3])\.(ffn_*)=CPU with 99 GPU layers: 10.4 t/s
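The last pattern was the fastest in that report. A composite sketch of using it in a full invocation (the model path and other flags are placeholders; the block indices in the pattern must actually exist in your model, so adapt them to its layer count):

```
# Hypothetical: serve a model with the experts of blocks 53-63 kept on CPU.
/app/llama-server \
    --model /models/your-model.gguf \
    -ngl 99 --flash-attn \
    --ctx-size 32768 \
    --override-tensor '\.(5[3-9]|6[0-3])\.(ffn_*)=CPU' \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
```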