Source: reddit.com · Developer community · RSS

I benchmarked 13 models at 65K-128K context to find out what actually matters for agentic workloads

I benchmarked 13 models at 65K-128K context to find out what actually matters for agentic workloads — prefill dominates everything, and KV head count beats parameter count

I've been running local LLMs for agentic workflows (tool use, coding agents, RAG) and kept seeing people obsess over tg128 (token generation speed) as the headline performance metric. So I ran a structured long-context benchmark to figure out what actually matters when your context window is full. The answer surprised me.

Setup

- GPU: RX 7900 XT 20GB (Vulkan backend, RADV/Mesa)

- Backend: llama.cpp / llama-bench (build 9860)

- Flags: -ngl 99 (GTT spill), -fa on, -ub 2048 -b 16384, ASPM=performance, bare TTY to free VRAM

- 13 models: 5 dense, 6 MoE, 1 Mamba2 hybrid, 1 MLA MoE — ranging from 5GB to 18GB

- 3 KV cache tiers: Q8_0 K / Q4_0 V (aggressive), Q8_0 K / Q8_0 V (symmetric), F16 (baseline)

- Context sizes: 512, 4K, 16K, 65K, 131K — both pure prefill (pp) and prompt+gen (pg)

- Full run took ~21 hours across two sessions
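As a rough illustration of the setup above, the flags can be assembled into a llama-bench invocation like the sketch below. The post does not give the full command line, so several details here are assumptions: the model path is hypothetical, the context tiers are taken to be exact powers of two (4096, 16384, 65536, 131072), the KV cache tier is set with -ctk / -ctv, and -n 128 is used for the tg128 test. The snippet only builds and prints the command; it does not run it.

```python
import shlex

# Hypothetical model path; the post does not list file names.
MODEL_PATH = "models/example.gguf"

# Assumed exact token counts for the "512, 4K, 16K, 65K, 131K" tiers.
PROMPT_SIZES = [512, 4096, 16384, 65536, 131072]


def build_llama_bench_cmd(model_path, cache_k="q8_0", cache_v="q8_0"):
    """Assemble a llama-bench argv list for one KV cache tier."""
    return [
        "llama-bench",
        "-m", model_path,
        "-ngl", "99",       # offload all layers, as in the post
        "-fa", "on",        # flash attention, as in the post
        "-ub", "2048",      # micro-batch size, as in the post
        "-b", "16384",      # batch size, as in the post
        "-ctk", cache_k,    # K cache type (assumed way of selecting the tier)
        "-ctv", cache_v,    # V cache type (assumed way of selecting the tier)
        "-p", ",".join(str(n) for n in PROMPT_SIZES),
        "-n", "128",        # tg128 decode test
    ]


cmd = build_llama_bench_cmd(MODEL_PATH)
print(shlex.join(cmd))
```

Swapping cache_v to "q4_0", or both cache types to "f16", would cover the other two KV cache tiers under the same assumptions.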

Full prefill speed results (Q8_0 K / Q8_0 V KV cache, tokens/sec)

If you just want the raw numbers, here's every model tested. pp = pure prompt processing (prefill), tg128 = token generation (decode). Sorted by pp131K.

| Model | Size | Type | pp512 | pp4K | pp16K | pp65K | pp131K | tg128 |
|---|---|---|---|---|---|---|---|---|
| Trinity-Mini | 16G | MoE 3B/26B | 2639 | 2924 | 2370 | 1419 | 923 | 150 |
| Granite-4.0-H-Small | 17G | Mamba2+MoE | 1115 | 1271 | 1220 | 1043 | 875 | 71 |
| Ornith-9B / Qwen3.5-9B | 6G | Dense | 2103 | 2220 | 1943 | 1274 | 873 | 92 |
| Qwen3.6-35B-A3B | 18G | MoE 3B/35B | 2184 | 2736 | 2227 | 1268 | 802 | 110 |
| Gemma-4-26B-A4B | 14G | MoE 4B/26B | 2523 | 2798 | 2076 | 1024 | 600 | 119 |
| North-Mini-Code | 15G | MoE 3B/30B | 2155 | 2187 | 1568 | 900 | 579 | 134 |
| Gemma-4-12B | 7G | Dense | 1492 | 1498 | 1145 | 595 | 350 | 66 |
| Qwen3.6-27B | 16G | Dense | 693 | 681 | 602 | 406 | 285 | 32 |
| Granite-4.1-8B | 5G | Dense | 1965 | 1807 | 1124 | 442 | 244 | 93 |
| Ministral-3-14B | 8G | Dense | 1419 | 1325 | 916 | 404 | 232 | 67 |
| Apriel-1.6-15B | 9G | Dense | 1332 | 1208 | 812 | 347 | 197 | 66 |
| Devstral-24B | 15G | Dense | 829 | 796 | 628 | 313 | --- | 42 |
| GLM-4.7-Flash | 16G | MoE (MLA) | 1822 | 1054 | 358 | --- | --- | --- |

A few things to note:

- Devstral-24B couldn't complete the 131K test (8 KV heads × 128 dim = 160 KB/token, so the KV cache alone is ~21GB at 131K).

- GLM-4.7-Flash crashed above 16K (MLA issue, see Finding 5).

- Ornith-9B is architecturally identical to Qwen3.5-9B.
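The 160 KB/token and ~21GB figures for Devstral-24B can be reproduced with a quick back-of-the-envelope calculation. This is a sketch under two assumptions the post does not state: a 40-layer model and an F16 cache (2 bytes per element). The function name is just for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache needed for one token across all layers."""
    # The factor of 2 covers the separate K and V tensors stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed: 40 layers and F16 (2 bytes/element); 8 KV heads x 128 dim is from the post.
per_token = kv_cache_bytes_per_token(
    n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2
)
total = per_token * 131072  # 131K context, assumed to be exactly 2**17 tokens

print(f"{per_token / 1024:.0f} KiB per token")   # 160 KiB per token
print(f"{total / 1e9:.1f} GB at 131K context")   # 21.5 GB at 131K context
```

Under those assumptions the cache alone exceeds the card's 20GB of VRAM before any weights are loaded. A quantized cache would shrink this roughly in proportion to its bytes per element.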

Finding 1: At 65K+ context, prefill is 94–99% of wall-clock time. tg128 is nearly irrelevant for short agentic outputs.

Here's the wall-clock breakdown for a real agentic query — 65K context in, 300 tokens out (typical tool-use response). Sorted by total time:

| Model | Type | Prefill | Decode | Total | Prefill % |
|---|---|---|---|---|---|
| Trinity-Mini (MoE 3B/26B) | MoE | 46.2s | 2.0s | 48.2s | 96% |
| Qwen3.6-35B-A3B (MoE) | MoE | 51.7s | 2.7s | 54.4s | 95% |
| Ornith-9B / Qwen3.5-9B | Dense | 51.4s | 3.3s | 54.7s | 94% |
| Gemma-4-26B-A4B (MoE) | MoE | 64.0s | 2.5s | 66.5s | 96% |
| Granite-4.0-H-Small (Mamba2) | Mamba2 | 62.8s | 4.2s | 67.1s | 94% |
| North-Mini-Code (MoE) | MoE | 72.8s | 2.2s | 75.0s | 97% |
| Gemma-4-12B | Dense | 110.2s | 4.5s | 114.7s | 96% |
| Granite-4.1-8B | Dense | 148.4s | 3.2s | 151.6s | 98% |
| Ministral-3-14B | Dense | 162.0s | 4.5s | 166.5s | 97% |
| Qwen3.6-27B | Dense | 161.4s | 9.3s | 170.7s | 95% |
| Apriel-1.6-15B | Dense | 188.9s | 4.6s | 193.5s | 98% |
| Devstral-24B | Dense | 209.5s | 7.2s | 216.6s | 97% |

Decode is 1–5% of the time you actually wait. If your agent makes a short tool call or writes a brief response, the only thing that matters is how fast you can process the context window.
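The breakdown above appears to follow from the throughput table: prefill time is context length divided by pp65K, and decode time is output length divided by tg128. A minimal sketch of that arithmetic, assuming 65K means 65536 tokens and that the tg128 rate holds at long context (in practice decode usually slows as the cache fills, so this is an optimistic estimate for decode):

```python
def wall_clock_breakdown(pp_tps, tg_tps, context_tokens=65536, output_tokens=300):
    """Estimate prefill/decode seconds and the prefill share of total time."""
    prefill_s = context_tokens / pp_tps   # time to process the prompt
    decode_s = output_tokens / tg_tps     # time to generate the reply
    total_s = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": total_s,
        "prefill_pct": 100 * prefill_s / total_s,
    }


# Trinity-Mini from the table: pp65K = 1419 t/s, tg128 = 150 t/s.
result = wall_clock_breakdown(pp_tps=1419, tg_tps=150)
print(
    f"prefill {result['prefill_s']:.1f}s, decode {result['decode_s']:.1f}s, "
    f"total {result['total_s']:.1f}s, prefill share {result['prefill_pct']:.0f}%"
)
```

Plugging in other rows gives values close to the table; small differences are expected because the published throughputs are rounded.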

This means benchmark reports that lead with tg128 are misleading for agentic use cases. pp65K / pp131K is the metric that matters. The pg(prompt, gen) blended metric is better but still obscures the split: a model with fast prefill and catastrophically slow decode can look mediocre on pg despite being excellent for short outputs.

Finding 2: KV head count is the dominant architectural factor for long-context prefill — not parameter count, not MoE vs dense

Prefill speed retention (% of pp4K speed) at increasing context, all models:

| Model | Size | KV Heads | pp4K | 16K | 65K | 131K | Type |
|---|---|---|---|---|---|---|---|
| Granite-4.0-H-Small | 17G | Mamba2* | 1271 | 96% | 82% | 69% | Mamba2+MoE |
| Qwen3.6-27B | 16G | 4×256 | 681 | 88% | 60% | 42% | Dense |
| Ornith-9B / Qwen3.5-9B | 6G | 4×128 | 2220 | 87% | 57% | 39% | Dense |
| Trinity-Mini | 16G | 4×128 | 2924 | 81% | 49% | 32% | MoE |
| Qwen3.6-35B-A3B | 18G | 4×128 | 2736 | 81% | 46% | 29% | MoE |
| Gemma-4-12B | 7G | 8×128 | 1498 | 76% | 40% | 23% | Dense |
| Gemma-4-26B-A4B | 14G | 4×256 | 2798 | 74% | 37% | 21% | MoE |
| North-Mini-Code | 15G | 4×128 | 2187 | 72% | 41% | 26% | MoE |
| Apriel-1.6-15B | 9G | 8×128 | 1208 | 67% | 29% | 16% | Dense |
| Ministral-3-14B | 8G | 8×128 | 1325 | 69% | 31% | 18% | Dense |
| Granite-4.1-8B | 5G | 8×128 | 1807 | 62% | 24% | 14% | Dense |
| Devstral-24B | 15G | 8×128 | 796 | 79% | 39% | --- | Dense |
| GLM-4.7-Flash | 16G | MLA (1×576) | 1054 | 34% | --- | --- | MoE (MLA) |

\* Granite-H-Small has 4 attention layers + 36 Mamba2 layers (recurrent state, no KV cache).
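The retention percentages are simply each long-context prefill speed divided by the same model's pp4K speed. A small sketch using three rows from the raw results table (the dictionary layout and function name are illustrative, not from the post):

```python
# Prefill throughput (tokens/sec) copied from the raw results table.
PREFILL_TPS = {
    "Granite-4.0-H-Small": {"pp4K": 1271, "pp16K": 1220, "pp65K": 1043, "pp131K": 875},
    "Qwen3.6-27B": {"pp4K": 681, "pp16K": 602, "pp65K": 406, "pp131K": 285},
    "Granite-4.1-8B": {"pp4K": 1807, "pp16K": 1124, "pp65K": 442, "pp131K": 244},
}


def retention(speeds, baseline="pp4K"):
    """Return each context tier's speed as a rounded % of the baseline tier."""
    base = speeds[baseline]
    return {
        tier: round(100 * tps / base)
        for tier, tps in speeds.items()
        if tier != baseline
    }


for model, speeds in PREFILL_TPS.items():
    print(model, retention(speeds))
```

For these three rows the output matches the table: Granite-4.0-H-Small keeps 96% / 82% / 69%, Qwen3.6-27B keeps 88% / 60% / 42%, and Granite-4.1-8B drops to 62% / 24% / 14%.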
