Comparing local inference speeds across a few real setups people are running (3090 vs 5090 vs dual 6000)

Pulled together token rates from a few different local rigs people have reported running lately, just to get a sense of what's realistic at each hardware tier (source: a Discord group).

Qwen3.6 27B on a single 3090 (Q4/Q8 MTP, 128k ctx): ~50 tok/s generation, ~950 tok/s prompt processing

Qwen3.6 27B on a 5090 (Q6 MTP, tuned cache/batch settings): ~140 tok/s average

DeepSeek V4 Flash on dual RTX 6000 workstation cards (vLLM, full context + room for KV cache): ~80-100 tok/s
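
For a feel of what those rates mean in wall-clock terms, here's a rough back-of-envelope sketch. It assumes the simplest possible model (time = prompt tokens / prompt-processing rate + output tokens / generation rate) and ignores batching, prompt caching, and MTP acceptance swings. The 20,000-token prompt and 500-token reply are made-up workload numbers, not from the reports, and the dual 6000 figure uses the midpoint of the reported 80-100 range. Only the 3090 report included a prompt-processing rate, so prefill time is only estimated for that rig.

```python
# Reported rates in tokens per second. Only the 3090 report included a
# prompt-processing (prefill) figure, so the other rigs have prefill=None.
RIGS = {
    "3090, Qwen3.6 27B": {"decode": 50.0, "prefill": 950.0},
    "5090, Qwen3.6 27B": {"decode": 140.0, "prefill": None},
    # Assumption: midpoint of the reported 80-100 tok/s range.
    "dual 6000, DeepSeek V4 Flash": {"decode": 90.0, "prefill": None},
}


def estimate_seconds(prompt_tokens, output_tokens, decode_tps, prefill_tps=None):
    """Return (prefill_s, decode_s). prefill_s is None if no prefill rate is known."""
    prefill_s = prompt_tokens / prefill_tps if prefill_tps else None
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s


# Hypothetical workload: a 20k-token prompt and a 500-token reply.
PROMPT_TOKENS = 20_000
OUTPUT_TOKENS = 500

for name, rates in RIGS.items():
    prefill_s, decode_s = estimate_seconds(
        PROMPT_TOKENS, OUTPUT_TOKENS, rates["decode"], rates["prefill"]
    )
    prefill_txt = f"{prefill_s:.1f}s" if prefill_s is not None else "n/a"
    print(f"{name}: prefill {prefill_txt}, generation {decode_s:.1f}s")
```

Swap in your own prompt and reply sizes to see where the extra generation speed starts to matter for your workload.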

Interesting that the 3090 setup is still very usable for day-to-day coding work at a fraction of the cost of the higher-end rigs (granted it's properly cooled and cleaned, and you should also apply some new thermal paste on the GPU chip). Sounds like the difference is more about scope of task (smaller asks vs. sic-the-whole-project-on-it) than the 3090 being outright unusable. The jump to dual 6000s buys you a much bigger model, not necessarily more speed.

But then again, pricing is also so fucked that the older 3090 seems more reasonable.

Original post: reddit.com

Single source; not yet cross-verified.