How I'm using local models from real-world coding

推荐理由

这条记录涉及编程工具或代码能力更新，适合开发者评估工作流变化和可复用价值。

Just want to share since after many attempts over the past year, I finally have a setup I kinda like and does useful work for me.

I only have 32GB of RAM and a 4070 8GB (laptop), just very ordinary hardware. I found that Qwen3.6-35B-A3B runs reliably at about 15 tokens per second*, which is slow but enough to do useful work while I do other things.

I treat this local model as a "small coding agent", only capable of doing very well-scoped tasks.

For deeper code review, task creation and organization, I currently use GLM 5.2 on openrouter. It costs under 1$ to have this much smarter model comprehensively look at my codebase and generate a detailed task plan for Qwen3.6 to execute on. This means the setup is not 100% local. It's about a 90%-10% split local-cloud, but it's dirt cheap to run.

Concretely, I run pi-coding-agent and llama-server** (from llama.cpp). I review every change Qwen3.6 produces. When I notice the small model gets stuck on some aspects of coding, I do a post-mortem with it to determine where its knowledge gaps lie and I add useful tips to a README file that the next agent picks up on. This really helps, you can see code quality improve and the model not getting stuck as much.

Feel free to ask questions.

* on battery or low-power charging. At full power, around 19 t/s.

** llama-server config:

llama-server -m "C:\***\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-IQ4_NL_XL.gguf" -c 100000 -fa on -t 20 -b 4096 -ub 4096 --no-mmap --jinja -ctk q8_0 -ctv q8_0 -ngl 99 --n-cpu-moe 38 --no-mmproj --chat-template-kwargs '{"preserve_thinking": true}' --temp 1.0 --top-p 0.95 --top-k 64

主题标签模型发布端侧推理

原始关键词#coding#models#local#using#world#real

查看原文reddit.com

单一来源，暂无交叉验证