Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090


First of all, a huge thank you to the r/LocalLLaMA community and the 3090 club. This benchmark started from your shared recipes...

These are my findings on my hardware (Xeon E5-2666v3, 64GB RAM, single RTX 3090 24GB) comparing 5 engines (3 llama.cpp forks + mainline + Lucebox) across two quantizations of the same model.

I used the bench script from https://github.com/noonghunna/club-3090/tree/master and two simple scripts that build long prompts from enwik8.
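For anyone who wants to reproduce the long-context fills, here is a minimal sketch of how such a prompt builder could look. It is not the script used for these runs: the function name, the file path argument, and the 4-characters-per-token estimate are all assumptions for illustration, and the real token count depends on the tokenizer.

```python
from pathlib import Path

# Assumption: roughly 4 characters per token for English text. The real
# ratio depends on the tokenizer, so the resulting length is approximate.
CHARS_PER_TOKEN = 4


def build_long_prompt(corpus_path: str, target_tokens: int, question: str) -> str:
    """Return corpus text of about target_tokens tokens, followed by a question."""
    text = Path(corpus_path).read_text(encoding="utf-8", errors="ignore")
    budget = target_tokens * CHARS_PER_TOKEN
    filler = text[:budget]
    # Cut at the last space so the filler does not end in the middle of a word.
    cut = filler.rfind(" ")
    if cut > 0:
        filler = filler[:cut]
    return f"{filler}\n\n{question}"
```

Calling it with a local copy of the corpus and a target such as 72000 or 128000 gives a prompt in the right ballpark. Check the actual token count reported by the server before comparing runs.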

Summary Table

Sorted by fork → speculative type. Key metrics: decode_TPS (code & narrative), TTFT, VRAM usage, and context consistency (generation speed degradation when moving from 72k to 128k filled context).

| Fork / Engine | Speculative Type | Model / Quant | Code TPS | Narr. TPS | TTFT | VRAM (MiB) | Gen 72k (TPS) | Gen 128k (TPS) | Deg. (72k→128k) |
|---|---|---|---|---|---|---|---|---|---|
| ik_llama (ubergarm config) | MTP n_max=4 | Qwen3.6-27B-IQ4_KS | 89.2 | 63.9 | 361ms | 22304 | 34.6 | 23.5 | −32.1% |
| ik_llama + ngram | ngram+MTP | Qwen3.6-27B-IQ4_KS | 87.8 | 58.6 | 341ms | 20508 | 32.1 | 24.1 | −24.9% |
| ik_llama (Standard config) | MTP n_max=2 | Qwen3.6-27B-IQ4_KS | 73.1 | 61.7 | 357ms | 20208 | 33.8 | 25.4 | −24.8% |
| mainline llama.cpp | MTP n_max=1 | Qwen3.6-27B-Q4_K_M | 64.7 | 52.5 | 288ms | 21354 | 33.4 | 31.2 | −6.6% |
| Spiritbuun | MTP | Qwen3.6-27B-Q4_K_M | 59.7 | 45.7 | 294ms | 22066 | 34.8 | 31.5 | −9.5% |
| beellama | DFlash (Draft GGUF) | Qwen3.6-27B-Q4_K_M | 96.8 | 45.6 | 504ms | 20814 | 22.9* | 27.1 | −41.3% |
| Spiritbuun | DFlash | Qwen3.6-27B-Q4_K_M | 66.9 | 30.4 | 300ms | 23356 | — | — | — |
| LUCEBOX** | DFlash (TQ3 KV) | Qwen3.6-27B-Q4_K_M | 32.6 | 32.5 | 448ms | 20680 | 27.0 | — | — |

* beellama: The 72k run (22.9 DP) was an outlier due to the experimental KV cache configuration (q5_0/q4_1), stabilizing at 27.1 DP upon reaching 128k.

** Degradation calculated relative to baseline performance in short context.
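The degradation column for most rows can be recomputed from the Gen 72k and Gen 128k columns. The sketch below assumes the plain relative-change formula; the function name is mine. It agrees with the table to within about 0.1 percentage point for the five rows listed. The beellama row is left out because its 72k run is flagged as an outlier and its listed figure does not follow from this formula.

```python
def degradation_pct(tps_before: float, tps_after: float) -> float:
    """Relative change in generation speed, in percent (negative means slower)."""
    return (tps_after - tps_before) / tps_before * 100.0


# (Gen 72k, Gen 128k) pairs copied from the summary table.
runs = {
    "ik_llama (ubergarm config)": (34.6, 23.5),
    "ik_llama + ngram": (32.1, 24.1),
    "ik_llama (Standard config)": (33.8, 25.4),
    "mainline llama.cpp": (33.4, 31.2),
    "Spiritbuun MTP": (34.8, 31.5),
}

for name, (gen_72k, gen_128k) in runs.items():
    print(f"{name}: {degradation_pct(gen_72k, gen_128k):+.1f}%")
```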

ik_llama — The fork that does "everything"

Fork of llama.cpp with native MTP support, merge-qkv, recurrent checkpoints, and multi-backend speculative decoding. Tested on IQ4_KS quant (by ubergarm).

ik_llama + MTP+ngram (ngram-mod + mtp)

Fast code generation. Combines ngram drafts (n_max=4, size 16) with MTP (n_max=3). Code hits 87.8 decode tokens/sec, about 36% above mainline llama.cpp's 64.7.

- VRAM: 20508 MiB (82% GPU utilization)

- Context degradation: −25% (32.1→24.1 gen_tps). Notable drop when context fills.
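For readers new to ngram drafting, here is a rough sketch of the general idea: find an earlier occurrence of the most recent tokens and propose whatever followed it as the draft, which the main model then verifies. This is a generic illustration, not ik_llama's implementation. The function name is hypothetical, and the mapping of its parameters to the n_max and size settings above is an assumption.

```python
def ngram_draft(tokens: list[int], ngram_size: int, n_max: int) -> list[int]:
    """Propose up to n_max draft tokens by matching the trailing n-gram
    against an earlier occurrence in the same token sequence."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + n_max]
    return []  # No match: fall back to another drafter or plain decoding.


# The trailing pair [1, 2] appeared earlier, followed by 3, 4, 1.
draft = ngram_draft([1, 2, 3, 4, 1, 2], ngram_size=2, n_max=3)
```

This kind of lookup costs almost nothing and tends to pay off on repetitive text such as code, which fits the gap between the code and narrative numbers above.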

ik_llama + MTP (ubergarm tuned config)

Best narrative speed: 63.9 TPS, highest in the entire benchmark. Code sits at 89.2 TPS.

- Extra config: -muge --merge-qkv -mtprot iq4_ks -cram 32768 --slot-save-path /root/slot --ctx-checkpoints 32

- VRAM: 22304 MiB. Higher VRAM due to slot checkpoints.

- Context degradation: −32% (34.6→23.5 gen_tps). Worst 72k→128k drop across all setups.
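To put the VRAM numbers in perspective, a quick headroom calculation helps. The sketch assumes the 24GB card exposes 24576 MiB in total; the actual usable amount reported by the driver may be slightly lower.

```python
# Assumption: a 24 GB card exposes 24 * 1024 = 24576 MiB in total.
TOTAL_VRAM_MIB = 24 * 1024


def vram_headroom_mib(used_mib: int, total_mib: int = TOTAL_VRAM_MIB) -> int:
    """Free VRAM left after the reported usage, in MiB."""
    return total_mib - used_mib


# Reported usage for the two ik_llama configs discussed above.
ubergarm_headroom = vram_headroom_mib(22304)
ngram_mtp_headroom = vram_headroom_mib(20508)
extra_for_ubergarm = 22304 - 20508
```

Under that assumption the ubergarm config leaves a bit over 2 GiB free, and it uses 1796 MiB more than the ngram+MTP config.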
