Concurrency plus nvfp4 on Blackwell

~2000 tps in aggregate performing bulk captioning on images. Above is parsed from vllm log while a client runs 30 concurrent streams, each concurrent stream has 1 request with an image and prompt, then a 2nd request on the same stream (so 1st Q:A would be cached). Typical log line:

Engine 000: Avg prompt throughput: 1301.0 tokens/s, Avg generation throughput: 1924.0 tokens/s, Running: 30 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.8%, Prefix cache hit rate: 0.0%, MM cache hit rate: 50.1%

#!/usr/bin/env bash set -euo pipefail source /etc/vllm/env mkdir -p /mnt/sdb/vllm/logs exec > >(tee -a "/mnt/sdb/vllm/logs/$(basename "$0" .sh)_$(date +%Y%m%d_%H%M%S).log") 2>&1 export CUDA_VISIBLE_DEVICES=1 # use the Blackwell GPU vllm serve \ nvidia/Qwen3.6-35B-A3B-NVFP4 \ --served-model-name qwen36_35b_a3b \ --max-num-seqs 30 \ --max-model-len 36768 \ --gpu-memory-utilization 0.90 \ --enable-prefix-caching \ --limit-mm-per-prompt '{"video":0,"image":1}' \ --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 500000}' \ --trust-remote-code \ --host 0.0.0.0 \ --port "${VLLM_PORT}" \ --attention-backend FLASHINFER

This is running on a RTX Pro 6000 Blackwell, but I don't think I'm actually using nearly all the VRAM yet. A 5090 should be able to get close if your individual chats are not that long as to fit into VRAM. Maybe kv cache will evict and impact perf.

Here's another graph comparing to some other dense models as well using lmarena-ai/VisionArena-Chat as a test set:

The quanttrio is Qwen 3.5, the rest are all Qwen 3.5. 27B isdense, 35B is moe. Unsloth is ~26GB and nvidia is ~22GB, I believe because unsloth left more unquantized layers. nvidia 35b is 23.4GB.

I was actually a bit surprised that with concurrency the MOE was so far ahead, but running the Monte Carlo, about 53% (union of selected) experts are expected to be chosen per forward execution at c=24, or still only ~56% at q=0.95. Or ~61% at c=30. My initial gut was by 24 (tested in above graph) that a vast majority of experts would be chosen, thus making 35B MOE act more like a 35B dense model, but it is barely over half.

原始关键词#concurrency#blackwell#nvfp4#plus#on

查看原文reddit.com

单一来源，暂无交叉验证