Follow-up: GLM-5.2 NVFP4 on four DGX Sparks — the MTP mystery is solved, and it's now ~24 tok/s at 128K context

This is a follow-up to my earlier post about running GLM-5.2 NVFP4 on 4x DGX Spark at 128K context. Short version of that post: 128K worked at ~15 tok/s with MTP1, and there was a painful tradeoff where you could have 128K context OR ~23 tok/s (DCP1 at 32K), but not both. I also flagged that MTP2/MTP3 acceptance collapse at DCP4 "really looks buggy" but that 30 hours of digging hadn't cracked it.

It was buggy. It's cracked. Tradeoff gone. Here's how it shook out:

TL;DR

| | old post (DCP4/128K/MTP1) | now (DCP4/128K/MTP3) | now (DCP4/128K/MTP4) |
|---|---|---|---|
| decode, short codegen (hot) | 14.5-15.2 tok/s | 22-23 tok/s | |
| MTP acceptance per position | 0.74 (MTP1 only) | 0.90 / 0.79 / 0.67 | |
| context | 131,072 | 131,072 | |
| hardware | 4x GB10 Spark + MikroTik RoCE | unchanged | |

Edit: prefill still ~475 tps; bs=3 decode ~48 tps.
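For anyone wanting to sanity-check the MTP3 acceptance figures above, here is a small arithmetic sketch. It assumes the three numbers are cumulative (unconditional) acceptance rates per draft position. That is my reading, not something the numbers themselves state. Under that assumption, dividing each by the previous one gives the conditional rate, and summing them gives the expected number of accepted draft tokens per step.

```python
# MTP3 per-position acceptance as quoted above.
# ASSUMPTION: these are cumulative (unconditional) rates, i.e. the chance
# that position k is accepted at all, not the rate given position k-1.
cumulative = [0.90, 0.79, 0.67]


def conditional_rates(cum):
    """Acceptance at position k given that position k-1 was accepted."""
    rates = []
    prev = 1.0
    for c in cum:
        rates.append(c / prev)
        prev = c
    return rates


def expected_tokens_per_step(cum):
    """One token from the target model plus the expected accepted drafts."""
    return 1.0 + sum(cum)


cond = conditional_rates(cumulative)
tokens = expected_tokens_per_step(cumulative)
print([round(r, 2) for r in cond])
print(round(tokens, 2))
```

With those inputs the conditional rates come out at roughly 0.90, 0.88 and 0.85, and the expected yield is about 3.36 tokens per step. That ignores the cost of running the draft steps, so it is an upper bound on speedup rather than a prediction.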

Yes, MTP4: the recursively reused single MTP layer is still conditionally accepting at ~0.84 by position 4, which mirrors what I see on my RTX 6000 Pro box, where MTP4 is also the peak. One config gotcha: `MAX_CUDAGRAPH_CAPTURE_SIZE` needs headroom above `num_speculative_tokens + 1` (the draft derives a smaller cap than the target; exactly N+1 fails startup with "No valid cudagraph sizes"). I run 10 for MTP4. I've seen occasional runs sag when host paging churns, so MTP3 is my conservative default and MTP4 the peak config.
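The capture-size rule of thumb is simple enough to encode as a preflight check. This is only a sketch: the function name is made up, and the only thing it knows is the rule stated above (strictly more than N+1). How much headroom is actually required is not something I have pinned down.

```python
def capture_size_has_headroom(max_capture_size: int,
                              num_speculative_tokens: int) -> bool:
    """True if the capture size is strictly above N + 1.

    Hypothetical helper: it only encodes the rule of thumb from the post,
    not whatever vLLM computes internally for the draft model.
    """
    return max_capture_size > num_speculative_tokens + 1


# MTP4 means num_speculative_tokens = 4, so N + 1 = 5.
ok_with_10 = capture_size_has_headroom(10, 4)   # the value I run
ok_with_5 = capture_size_has_headroom(5, 4)     # exactly N + 1
print(ok_with_10, ok_with_5)
```

Running something like this before launch is cheaper than finding out at startup that the draft has no valid cudagraph sizes.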

Same machines, same switch, same checkpoint, same 1.81 GB/rank KV budget. The entire gain is one missing line of configuration plumbing in vLLM, plus rebasing onto a newer upstream branch. The DCP1/32K compromise config is now pointless: DCP4 at full context beats it outright.

What the bug actually was

In my original post I wrote that acceptance looked like `0.9, 0.75^4, 0.6^4` and guessed at some rank-intersection effect. The exponent intuition was pointing at something real (the damage does scale with DCP world size), but the mechanism was better hidden than that, and the reason it survived 30+ hours of ablations is genuinely evil:

`SpeculativeConfig.create_draft_parallel_config()` builds the draft model's parallel config by copying fields from the target config, and `decode_context_parallel_size` is not one of the fields it copies. It silently defaults to 1. On the code path my stack uses, that value is consumed verbatim.
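The shape of that bug is easy to reproduce in a toy. The sketch below is not vLLM's code: `ToyParallelConfig` and both factory functions are hypothetical stand-ins that only mirror the pattern described above, a field-by-field copy that forgets one field which happens to have a harmless-looking default.

```python
from dataclasses import dataclass


@dataclass
class ToyParallelConfig:
    """Hypothetical stand-in for a parallel config, not vLLM's real class."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    decode_context_parallel_size: int = 1  # the silent default


def make_draft_config_buggy(target: ToyParallelConfig) -> ToyParallelConfig:
    # Copies fields one by one and forgets the DCP size, so the draft
    # quietly falls back to the dataclass default of 1.
    return ToyParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
        pipeline_parallel_size=target.pipeline_parallel_size,
    )


def make_draft_config_fixed(target: ToyParallelConfig) -> ToyParallelConfig:
    # Same copy with the one missing line added.
    return ToyParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
        pipeline_parallel_size=target.pipeline_parallel_size,
        decode_context_parallel_size=target.decode_context_parallel_size,
    )


target = ToyParallelConfig(tensor_parallel_size=4,
                           decode_context_parallel_size=4)
buggy = make_draft_config_buggy(target)
fixed = make_draft_config_fixed(target)
print(buggy.decode_context_parallel_size, fixed.decode_context_parallel_size)
```

Nothing raises and nothing warns in the buggy version, which is why a default of 1 is such an effective hiding place.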

So under TP4/DCP4, the MTP draft layer's KV cache, metadata, and sparse-indexer state were all DCP-sharded (the writer side runs under the target config), while the draft's attention thought it wasn't under DCP at all: no query all-gather, no LSE merge, and the global top-k indices were consumed as if the local quarter-cache were the whole cache. Tensor dumps showed draft forwards where three of four ranks selected nothing and emitted literal all-zero attention for their 48 of 64 heads.
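To make the "no LSE merge" part concrete, here is a scalar toy of what a context-parallel decode is supposed to do: each rank attends over its own KV shard, returns a partial output plus a log-sum-exp, and the partials are recombined with LSE weights. This is a from-scratch sketch with made-up scores and values and contiguous sharding for simplicity. It is not vLLM's kernel or its interleaving scheme.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted sum over one KV shard, plus that shard's log-sum-exp."""
    if not scores:
        return 0.0, float("-inf")  # an empty shard contributes nothing
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / denom
    return out, m + math.log(denom)


def merge_partials(partials):
    """Recombine per-shard (output, lse) pairs into full-cache attention."""
    global_max = max(lse for _, lse in partials)
    scale = [math.exp(lse - global_max) for _, lse in partials]
    total = sum(scale)
    return sum(s * out for s, (out, _) in zip(scale, partials)) / total


# Made-up attention scores and values for one query over 8 cached tokens.
scores = [0.1, 1.2, -0.3, 2.0, 0.7, -1.1, 0.4, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

full, _ = partial_attention(scores, values)
shards = [partial_attention(scores[i:i + 2], values[i:i + 2])
          for i in range(0, 8, 2)]           # 4 "ranks", 2 tokens each
merged = merge_partials(shards)
local_only = shards[0][0]                    # what skipping the merge gives
print(full, merged, local_only)
```

The merged result matches full attention to floating-point precision, while a rank that treats its local quarter as the whole cache gets a visibly different answer. The draft layer was in the second situation.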

Here's the evil part: the very next op after attention is o_proj, which is row-parallel — its TP all-reduce sums the four inconsistent per-rank results into one hidden state that is bit-identical on every rank. Every cross-rank divergence check I ran in the original investigation came back clean, because the corruption is laundered into consensus one op after it happens. And because the draft gets the target's hidden state as input, single-step MTP1 mostly survives on that signal (~0.75 acceptance), while the recursive steps 2-3 compound the garbage and die. That's the collapse curve from my first post.
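The laundering effect can be shown with a toy all-reduce. In the sketch below the per-rank lists stand in for each rank's partial o_proj output. The numbers are invented and `all_reduce_sum` is a plain-Python imitation of a sum all-reduce, not a real collective.

```python
def all_reduce_sum(per_rank):
    """Toy all-reduce: every rank ends up with the elementwise sum."""
    total = [sum(col) for col in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def ranks_agree(per_rank):
    """The kind of cross-rank divergence check that kept coming back clean."""
    return all(t == per_rank[0] for t in per_rank)


# Healthy case: each rank contributes a legitimate, different partial product.
healthy = [
    [0.25, -1.5, 0.75],
    [0.5, 0.25, -0.25],
    [-1.0, 0.5, 0.0],
    [0.75, 0.0, 1.0],
]
# Corrupted case: three ranks emitted all-zero attention, so their partial
# projections are all zero too.
corrupted = [healthy[0], [0.0] * 3, [0.0] * 3, [0.0] * 3]

healthy_out = all_reduce_sum(healthy)
corrupted_out = all_reduce_sum(corrupted)
print(ranks_agree(healthy_out), ranks_agree(corrupted_out))
print(healthy_out[0], corrupted_out[0])
```

Both cases pass the cross-rank check, because partial products are expected to differ before the reduce and are identical by construction after it. Only a comparison against a known-good reference exposes the corrupted sum.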

It also explains why the bug shrugged off every knob: KV interleave size, `ag_rs` vs `a2a` DCP comm backend, global vs rank-local top-k, CUDA graphs vs eager. None of them touch how the draft's parallel config is constructed. I tested all of them (identical acceptance curves to two decimal places) before giving up on config space and building a tensor tap instead.

How it got found

Method notes, since I know some people like gory details:

- Rebased the stack onto a much newer upstream branch (see below). Capacity reproduced exactly; MTP3 still collapsed. That killed "it's fixed upstream" and "it's my old fork."

- Burned four more boots falsifying the remaining config hypotheses (interleave/comm-backend/top-k-mode/eager). All identical. At that point the bug had to be in the compute, not the config surface.

- Wrote a small env-gated tap into the MLA decode path that dumps, per draft-layer forward: the post-allgather query, the top-k indices actually consumed, per-rank partial output + LSE, the merged output, the metadata, and the raw fp8 KV pages.
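For anyone who wants to build a similar tap, the gating pattern looks roughly like this. It is a generic sketch, not my actual patch: the environment variable name `DRAFT_TAP_DIR`, the `tap` function and the file naming are all made up for illustration, and it pickles plain Python objects where a real tap would serialize tensors with the framework's own save routine.

```python
import os
import pickle
from pathlib import Path

TAP_ENV = "DRAFT_TAP_DIR"  # hypothetical variable name


def tap(name, payload, step, rank):
    """Dump payload to disk if the tap is enabled, otherwise do nothing.

    Returns the written path, or None when the env var is unset, so the
    hot path costs one dictionary lookup when the tap is off.
    """
    out_dir = os.environ.get(TAP_ENV)
    if not out_dir:
        return None
    path = Path(out_dir) / f"step{step:06d}_rank{rank}_{name}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    return path


# Example call site inside a (hypothetical) draft-layer decode forward:
# tap("topk_indices", indices_as_list, step=step, rank=rank)
```

Keying the filename by step and rank makes it possible to line up the same forward across all four ranks afterwards, which is exactly the comparison that exposed the empty top-k selections.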
