Qwen3.6 27B on a 5090: tok/s distribution over 6.4k samples after tuning MTP/cache settings

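Why it's recommended

This entry covers an update to programming tools or coding capabilities, and is useful for developers assessing workflow changes and reusable value.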

Spent a while tuning llama.cpp for Qwen3.6 27B on a 9800X3D / 64GB / 5090 box and wanted to share the real distribution instead of just a headline number, since averages hide a lot.

Ran with q8 KV cache, 192k context, MTP draft=10, spec-draft-p-min=0.5, batch/ubatch 512. Logged 6,454 samples across a mixed agentic coding + debugging + doc session over roughly 20 hours. Peak bucket sits at 120-130 tok/s, mean 140.7 tok/s, median 134.9 tok/s, with a long tail up to 233 tok/s.
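For anyone who wants to build the same kind of summary from their own logs, here is a minimal sketch of the bucketing and summary stats. It assumes you have already pulled per-request tok/s values into a plain list of floats; the `summarize` helper, the 10 tok/s bucket width, and the sample values in the demo are all made up for illustration and are not the script or data behind the numbers above.

```python
from collections import Counter
from statistics import mean, median


def summarize(samples, bucket_width=10):
    """Summarize tok/s samples: mean, median, max, and the fullest bucket.

    `samples` is a list of tok/s floats (hypothetical input format).
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Map each sample to the lower edge of its bucket, e.g. 127.3 -> 120.
    buckets = Counter(int(s // bucket_width) * bucket_width for s in samples)
    # Fullest bucket wins; ties go to the lower bucket.
    peak_low, peak_count = max(buckets.items(), key=lambda kv: (kv[1], -kv[0]))
    return {
        "n": len(samples),
        "mean": mean(samples),
        "median": median(samples),
        "max": max(samples),
        "peak_bucket": (peak_low, peak_low + bucket_width),
        "peak_count": peak_count,
        "histogram": dict(sorted(buckets.items())),
    }


if __name__ == "__main__":
    # Illustrative values only, not real measurements.
    demo = [121.0, 125.5, 128.0, 134.9, 150.0, 233.0]
    stats = summarize(demo)
    print(stats["peak_bucket"], stats["mean"], stats["median"])
```

A mean sitting above the median, as in the run above, is what a long right tail looks like in these stats: a few very fast samples pull the average up while the typical request stays near the peak bucket.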

Worth noting: the hybrid attention/SWA cache handling in llama.cpp still isn't perfect for this model. If you see prompt reprocessing warnings in your logs, that's why. Happy to share launch flags if anyone wants to compare setups.

Original keywords: #distribution #settings #sample #tuning #cache #qwen3

View original: reddit.com

Single source; not yet cross-verified