Getting close to 100K context on 32GB VRAM with Qwen3.6-27B at Q8
Not really a tutorial, more a write-up of my attempts at getting higher context on the Q8 quant of Qwen3.6-27B with 32GB VRAM.
Disclaimer: Not in-depth research. Crowd wisdom suggests that Qwen is more tolerant of model quantization, but my experience suggests otherwise. I have nothing quantitative to back this up, only my personal experience using it for vibe coding on a couple of personal projects (which aren't very big, but I have been working on them for a few weeks).
Context: I can run Q8 at ~60K context easily and found that it works better than Q6 or Q5 (purely subjective experience). But I can easily get 128K context with Q5 and an unquantized KV cache, so I wanted to see how far I could push Q8.
System: 5090 with 64GB system RAM. Remote server running headless Ubuntu.
After some trial and error, I found that the following configurations work. Some notes:
- VRAM is right at the edge, so in long coding sessions you may need to lower the context size to free up a bit more space.
- The benchmark I'm using is just for token inference speed. Nothing more.
- The -b and -ub options help shave roughly 100MB of VRAM.
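Since everything here hinges on a few hundred MB of headroom, here is a minimal sketch for reading free VRAM before and after a run. It assumes the NVIDIA driver's `nvidia-smi` tool is on the PATH and that the VRAM figures quoted for each option are free memory; the function names are made up for illustration.

```python
import subprocess


def parse_free_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB free) per GPU from nvidia-smi CSV output."""
    return [
        int(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    ]


def query_free_mib() -> list[int]:
    """Ask nvidia-smi for free VRAM per GPU, in MiB (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_free_mib(result.stdout)
```

Calling `query_free_mib()` once after the server has loaded and again after the benchmark gives the two headroom numbers to compare.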
Option 1: 95K context, KV: Q8_0 and Q8_0, VRAM when starting: 230MB, VRAM after bench: 90MB
```bash
build/bin/llama-server \
  -m ~/myp/models/bartowski_Qwen_Qwen3.6-27B-Q8_0.gguf \
  --temp 0.6 \
  --top_p 0.95 \
  --top_k 20 \
  --min_p 0.0 \
  --repeat-penalty 1.0 \
  --presence-penalty 0.0 \
  -c 95000 \
  -t 16 \
  -ngl 99 \
  --flash-attn on \
  --host 0.0.0.0 --port 8080 \
  --no-mmproj \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --parallel 1 \
  -kvo \
  -ctk q8_0 \
  -ctv q8_0 \
  -b 1024 \
  -ub 256
```
```
python3 mtp_bench.py
code_python       pred= 192 draft= 183 acc= 145 rate=0.792 tok/s=141.6
code_cpp          pred= 192 draft= 214 acc= 137 rate=0.640 tok/s=121.9
explain_concept   pred= 192 draft= 225 acc= 134 rate=0.596 tok/s=115.6
summarize         pred= 192 draft= 176 acc= 146 rate=0.830 tok/s=146.0
qa_factual        pred= 192 draft= 198 acc= 141 rate=0.712 tok/s=131.4
translation       pred= 192 draft= 221 acc= 135 rate=0.611 tok/s=117.3
creative_short    pred= 192 draft= 256 acc= 126 rate=0.492 tok/s=101.5
stepwise_math     pred= 192 draft= 192 acc= 142 rate=0.740 tok/s=134.3
long_code_review  pred= 192 draft= 213 acc= 137 rate=0.643 tok/s=120.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1878,
  "total_draft_accepted": 1243,
  "aggregate_accept_rate": 0.6619,
  "wall_s_total": 15.41
}
```
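For anyone wondering how the aggregate relates to the per-prompt rows, here is a small sketch that recomputes it from the Option 1 numbers. It assumes the rate is simply accepted draft tokens divided by proposed draft tokens; that is inferred from the output matching, not taken from the mtp_bench.py source.

```python
# (draft tokens proposed, draft tokens accepted) per prompt, Option 1 run.
rows = {
    "code_python": (183, 145),
    "code_cpp": (214, 137),
    "explain_concept": (225, 134),
    "summarize": (176, 146),
    "qa_factual": (198, 141),
    "translation": (221, 135),
    "creative_short": (256, 126),
    "stepwise_math": (192, 142),
    "long_code_review": (213, 137),
}


def accept_rate(draft: int, accepted: int) -> float:
    """Fraction of proposed draft tokens that the main model accepted."""
    return accepted / draft if draft else 0.0


total_draft = sum(d for d, _ in rows.values())
total_accepted = sum(a for _, a in rows.values())

# Pooled ratio over all prompts, not the mean of the per-prompt rates.
aggregate = accept_rate(total_draft, total_accepted)
per_prompt = {name: round(accept_rate(d, a), 3) for name, (d, a) in rows.items()}

print(f"total_draft={total_draft} total_accepted={total_accepted}")
print(f"aggregate_accept_rate={aggregate:.4f}")
```

The totals and the pooled ratio line up with the Aggregate block, so prompts that draft more tokens (like creative_short) weigh more heavily in the aggregate than they would in a plain average.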
Option 2: 105K context, KV: Q8_0 and Q5_1, VRAM when starting: 320MB, VRAM after bench: 180MB
```bash
build/bin/llama-server \
  -m ~/myp/models/bartowski_Qwen_Qwen3.6-27B-Q8_0.gguf \
  --temp 0.6 \
  --top_p 0.95 \
  --top_k 20 \
  --min_p 0.0 \
  --repeat-penalty 1.0 \
  --presence-penalty 0.0 \
  -c 105000 \
  -t 16 \
  -ngl 99 \
  --flash-attn on \
  --host 0.0.0.0 --port 8080 \
  --no-mmproj \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --parallel 1 \
  -kvo \
  -ctk q8_0 \
  -ctv q5_1 \
  -b 1024 \
  -ub 256
```
```
python3 mtp_bench.py
code_python       pred= 192 draft= 181 acc= 145 rate=0.801 tok/s=142.0
code_cpp          pred= 192 draft= 220 acc= 136 rate=0.618 tok/s=119.8
explain_concept   pred= 192 draft= 246 acc= 128 rate=0.520 tok/s=105.1
summarize         pred= 192 draft= 176 acc= 146 rate=0.830 tok/s=146.0
qa_factual        pred= 192 draft= 202 acc= 140 rate=0.693 tok/s=128.8
translation       pred= 192 draft= 245 acc= 129 rate=0.526 tok/s=106.0
creative_short    pred= 192 draft= 248 acc= 128 rate=0.516 tok/s=104.7
stepwise_math     pred= 192 draft= 197 acc= 141 rate=0.716 tok/s=131.2
long_code_review  pred= 192 draft= 220 acc= 135 rate=0.614 tok/s=116.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1935,
  "total_draft_accepted": 1228,
  "aggregate_accept_rate": 0.6346,
  "wall_s_total": 15.84
}
```
Option 3: 115K context, KV: Q8_0 and Q4_0, VRAM when starting: 290MB, VRAM after bench: 150MB.
```bash
build/bin/llama-server \
  -m ~/myp/models/bartowski_Qwen_Qwen3.6-27B-Q8_0.gguf \
  --temp 0.6 \
  --top_p 0.95 \
  --top_k 20 \
  --min_p 0.0 \
  --repeat-penalty 1.0 \
  --presence-penalty 0.0 \
  -c 115000 \
  -t 16 \
  -ngl 99 \
  --flash-attn on \
  --host 0.0.0.0 --port 8080 \
  --no-mmproj \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --parallel 1 \
  -kvo \
  -ctk q8_0 \
  -ctv q4_0 \
  -b 1024 \
  -ub 256
```
```
python3 mtp_bench.py
code_python       pred= 192 draft= 186 acc= 144 rate=0.774 tok/s=138.7
code_cpp          pred= 192 draft= 183 acc= 145 rate=0.792 tok/s=142.6
explain_concept   pred= 192 draft= 215 acc= 136 rate=0.633 tok/s=119.7
summarize         pred= 192 draft= 175 acc= 146 rate=0.834 tok/s=145.9
qa_factual        pred= 192 draft= 196 acc= 141 rate=0.719 tok/s=131.6
translation       pred= 192 draft= 230 acc= 133 rate=0.578 tok/s=113.1
creative_short    pred= 192 draft= 229 acc= 133 rate=0.581 tok/s=113.1
stepwise_math     pred= 192 draft= 181 acc= 145 rate=0.801 tok/s=142.3
long_code_review  pred= 192 draft= 213 acc= 137 rate=0.643 tok/s=120.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1808,
  "total_draft_accepted": 1260,
  "aggregate_accept_rate": 0.6969,
  "wall_s_total": 14.93
}
```
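To see why the three options land at about the same VRAM edge, here is a rough sketch comparing their relative KV cache cost. It uses ggml's block layouts (32 elements per block: 34 bytes for Q8_0, 24 for Q5_1, 18 for Q4_0) and assumes K and V store the same number of elements per token, so the model-specific factors cancel out. It ignores compute buffers and any non-attention state, so treat it as a ratio, not a byte count.

```python
# Bytes per 32-element block in ggml's formats (f16 shown for reference).
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_1": 24, "q4_0": 18}
BLOCK_ELEMS = 32


def bytes_per_element(cache_type: str) -> float:
    """Average storage cost of one cached value for the given cache type."""
    return BLOCK_BYTES[cache_type] / BLOCK_ELEMS


def relative_kv_cost(ctx: int, k_type: str, v_type: str) -> float:
    """Context length times bytes for one K element plus one V element.

    Layer count and KV width are left out on purpose: they are the same
    for all options, so only the ratio between results is meaningful.
    """
    return ctx * (bytes_per_element(k_type) + bytes_per_element(v_type))


options = {
    "Option 1": (95_000, "q8_0", "q8_0"),
    "Option 2": (105_000, "q8_0", "q5_1"),
    "Option 3": (115_000, "q8_0", "q4_0"),
}

baseline = relative_kv_cost(*options["Option 1"])
for name, (ctx, k_type, v_type) in options.items():
    cost = relative_kv_cost(ctx, k_type, v_type)
    print(f"{name}: ctx={ctx} K={k_type} V={v_type} relative={cost / baseline:.2f}")
```

Under these assumptions the three configurations come out within about 8% of each other, which fits with all of them leaving only a couple hundred MB of headroom.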