A "Can You Run It" calculator for local LLMs (factors in Quantization & KV Cache)

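Why it's recommended

This entry covers progress in generation capability or on-device inference, and is useful for tracking model efficiency, deployment barriers, and application opportunities.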

If you run local models, you know the headache of figuring out whether a new model will instantly OOM your setup, especially once you start messing with different quantizations and massive context windows.

The tool is the "Can My PC Run It? Local AI Checker" on TheAITechPulse, and it's a remarkably handy utility for taking the guesswork out of local hosting.

Here is what you can configure:

- Hardware: It covers a massive range of current hardware, from older RTX 3060s up to the new RTX 5090s, plus Radeon cards and Apple Silicon (up to the M4/M5 Max and M2 Ultra).

- Models: It is up to date with recent drops, including DeepSeek V3, Llama 3.3 70B, Llama 4 Scout, and Qwen 2.5.

- Variables: You can select your Quantization (from Q4 up to uncompressed FP16) and your desired Context Window (from 8k up to 128k tokens).

The Substance (How it calculates): Instead of arbitrary recommendations, it actually shows its work.

- VRAM: Calculated as parameter count × bytes per parameter (set by the quantization) + KV cache (based on context size) + ~0.6 GB overhead.

- Speed: Estimated by taking (memory bandwidth × ~70% efficiency) ÷ active model weight size. (Though obviously, this will vary slightly depending on whether you use Ollama, llama.cpp, or vLLM.)
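If you want to sanity-check the numbers by hand, the two formulas above can be sketched in a few lines of Python. This is only a rough approximation under stated assumptions, not the tool's actual code: the post does not say how the KV cache term is computed, so the sketch uses the standard transformer estimate (2 × layers × KV heads × head dim × context tokens × bytes per element). The function names, the 0.5 bytes per parameter figure for Q4, the FP16 KV cache, the decimal GB (1e9 bytes) convention, and the 1000 GB/s bandwidth are all illustrative assumptions. The layer and head counts are the published Llama 3 70B configuration.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1e9 bytes): parameter count x bytes per parameter."""
    return params_billion * bytes_per_param


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache in GB. Assumed formula, not confirmed by the tool.

    The leading 2 accounts for one K and one V tensor per layer;
    bytes_per_elem=2.0 assumes an FP16 cache.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9


def vram_gb(params_billion: float, bytes_per_param: float,
            kv_gb: float, overhead_gb: float = 0.6) -> float:
    """Total VRAM estimate: weights + KV cache + ~0.6 GB overhead (from the post)."""
    return weights_gb(params_billion, bytes_per_param) + kv_gb + overhead_gb


def tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float,
                      efficiency: float = 0.70) -> float:
    """Decode speed estimate: (bandwidth x ~70% efficiency) / active weight size."""
    return bandwidth_gb_s * efficiency / active_weights_gb


if __name__ == "__main__":
    # 70B model at Q4, approximated as 0.5 bytes per parameter (assumption;
    # real Q4 formats carry some extra bits per weight for scales).
    kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=8192)
    total = vram_gb(70, 0.5, kv)
    # 1000 GB/s is a placeholder bandwidth, not a specific card.
    speed = tokens_per_second(1000, weights_gb(70, 0.5))
    print(f"KV cache: {kv:.2f} GB, total VRAM: {total:.2f} GB, ~{speed:.0f} tok/s")
```

With these assumptions, the 70B example needs 35 GB for weights alone, so it will not fit on a single 24 GB or 32 GB card without offloading. For mixture-of-experts models, pass only the active weight size to `tokens_per_second`, since the speed formula divides by active weights rather than the full model.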

If you’re trying to figure out if you can squeeze a 70B model onto your rig at Q4, or if you want to see exactly how much VRAM a 128k context window is going to eat up on DeepSeek, this saves a lot of napkin math.

Link to the tool here: Can My PC Run It?

Hope this helps some of you avoid a few out-of-memory errors!

Topic tag: on-device inference

Original keywords: #quantization #calculator #factors #cache #local #llms

View original: reddit.com

Single source; not yet cross-verified.