For RAG specifically, prefill speed matters more than decode, which is why Strix Halo struggles for interactive use

Seeing a lot of "what hardware for local RAG" threads lately, and the framing that keeps getting missed is: decode tok/s is not the bottleneck for RAG.

RAG queries stuff thousands of tokens of retrieved context into every prompt. On unified memory boxes like Strix Halo, prefill throughput lags way behind a discrete GPU even though decode speed on MoE models is perfectly fine (25-40 tok/s). A single 24GB discrete card chews through the same context in a few seconds; unified memory setups can leave you staring at a 20-60 second pause before the first token comes back.
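To put rough numbers on that, here's a back-of-envelope sketch. The prefill rates, prompt size, and output length below are made-up illustrative values, not benchmarks of any specific hardware; only the ~30 tok/s decode figure comes from the range above. Swap in your own measurements.

```python
def time_to_first_token(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tps


def total_latency(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Prefill time plus the time to generate the full answer."""
    return time_to_first_token(prompt_tokens, prefill_tps) + output_tokens / decode_tps


# Hypothetical workload: 8k tokens of retrieved context, 300-token answer.
PROMPT_TOKENS = 8000
OUTPUT_TOKENS = 300

# Assumed (prefill tok/s, decode tok/s) pairs, for illustration only.
SETUPS = {
    "unified memory (assumed)": (200.0, 30.0),
    "discrete GPU (assumed)": (2000.0, 30.0),
}

if __name__ == "__main__":
    for name, (prefill_tps, decode_tps) in SETUPS.items():
        ttft = time_to_first_token(PROMPT_TOKENS, prefill_tps)
        total = total_latency(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps, decode_tps)
        print(f"{name}: first token after {ttft:.1f}s, full answer after {total:.1f}s")
```

With identical decode speed, those assumed prefill rates put the wait before the first token at about 40 seconds versus about 4 seconds. The decode number never enters into it.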

If your work is more batch-style, you're more than fine. But if you're constantly tweaking, you need something else.

Practical takeaway if you're budget constrained: pick a board with a free PCIe slot so you can drop in a discrete card later just to offload prefill, rather than assuming unified memory alone will feel good for interactive RAG.


Source: reddit.com
