agent-smith update: my Claude Code plugin that offloads work to free models. gpt-oss:20b just swept my eval harness twice and became my trusted app-builder!
I posted a while back about "agent-smith", a Claude Code skill that sends the heavy drafting to free models (Gemini free tier, local Ollama) so Claude's tokens go to judgment instead of grunt work. Shipped a big update this week, with some results worth sharing.
For context, I run a local eval gym: hidden-test-graded tasks covering code generation, structured extraction, repo edits, and full app builds in a sandboxed tool loop. A model has to pass twice consecutively to earn "trusted" for a capability. OpenAI's gpt-oss:20b walked in cold and swept all 14 tasks. Twice. It's the first model ever to earn trusted on agentic app builds. It built working CLIs, a CSV tool, and an HTTP API from spec, passing hidden tests, at 13GB on a MacBook Pro M3 with 36GB RAM. It also passed the exact tasks my other models consistently fail.
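If the "pass twice consecutively" rule is hard to picture, here's a minimal sketch of how a gate like that could work. This is not the harness code: `CapabilityRecord`, `record_run`, and the task names are made up for illustration, and it assumes a run only counts when every task for that capability passes its hidden tests, and that trust sticks once earned.

```python
from dataclasses import dataclass

# Hypothetical names throughout; the real harness may track this differently.
REQUIRED_CLEAN_SWEEPS = 2  # "pass twice consecutively" from the post


@dataclass
class CapabilityRecord:
    """Tracks one model's standing on one capability (e.g. app builds)."""
    consecutive_sweeps: int = 0
    trusted: bool = False

    def record_run(self, task_results: dict[str, bool]) -> None:
        """Update the streak from one full run (task name -> passed hidden tests)."""
        # Assumption: a run counts only if it has tasks and every one passed.
        swept = bool(task_results) and all(task_results.values())
        # Any failed task resets the streak, so the two passes must be back to back.
        self.consecutive_sweeps = self.consecutive_sweeps + 1 if swept else 0
        if self.consecutive_sweeps >= REQUIRED_CLEAN_SWEEPS:
            # Assumption: trust is not revoked by later failures in this sketch.
            self.trusted = True


record = CapabilityRecord()
record.record_run({"cli": True, "csv_tool": True, "http_api": True})
record.record_run({"cli": True, "csv_tool": True, "http_api": True})
print(record.trusted)  # True
```

The streak reset is the part that matters: one lucky pass followed by a failure gets a model nowhere.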
It did lose the design head-to-head to gemma4:26b on a blind-judged rubric. The failure mode is worth knowing if you use reasoning models for code: commented-out debug prints, dead branches, docstrings describing exceptions the code never raises. The structure was actually better than gemma4's on the API design task. Discipline, not so much.
what's new in the plugin:
- agentic sandbox builds: smith_agent.py runs a local model in a tool loop (list/read/write/run/finish) in a scratch dir until its own verification passes. point it at a ticket-style task with a seeded test, get working code back.
- local vision: --file shot.png --backend ollama → gemma4 reads screenshots, error dialogs, and charts. spot-evaluated: exact text fidelity on window-sized images, including hex error codes. caveat: on tall full-page captures it confidently invents small text (brand names, button labels), so tile your screenshots.
- batch mode: one prompt over a manifest of files, per-item outputs, one summary back. zero orchestration cost per item.
- usage ledger: every run logs one JSON line; a report script shows what your fleet actually did (runs, finish rates, failures).
- generic openai socket: --backend openai --base-url groq hits any OpenAI-compatible endpoint. groq's free tier hosts gpt-oss-120b and it's stupid fast. (caveat: free cloud tiers commonly train on your data, so keep private work local.)
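For anyone curious what the agentic sandbox loop in the first bullet looks like in shape, here's a rough sketch. It is not the code in smith_agent.py: the action dict format, `run_tool_loop`, `resolve_inside`, and `scripted_model` are all hypothetical, and the "model" is a scripted stand-in so the example runs without Ollama. The one idea it tries to show is that "finish" only ends the loop if the seeded test actually passes.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical action format: {"tool": "list" | "read" | "write" | "run" | "finish", ...}
Action = dict
Model = Callable[[str], Action]  # takes the last observation, returns the next action


def resolve_inside(workdir: Path, rel: str) -> Path:
    """Resolve `rel` under `workdir`, rejecting paths that escape the scratch dir."""
    target = (workdir / rel).resolve()
    if not target.is_relative_to(workdir.resolve()):
        raise ValueError(f"path escapes sandbox: {rel}")
    return target


def run_tool_loop(model: Model, task: str, workdir: Path, verify_cmd: list[str],
                  max_steps: int = 20) -> bool:
    """Drive `model` through tool calls in `workdir` until verification passes."""
    observation = task
    for _ in range(max_steps):
        action = model(observation)
        tool = action.get("tool")
        try:
            if tool == "list":
                observation = "\n".join(sorted(p.name for p in workdir.iterdir()))
            elif tool == "read":
                observation = resolve_inside(workdir, action["path"]).read_text()
            elif tool == "write":
                resolve_inside(workdir, action["path"]).write_text(action["content"])
                observation = "written"
            elif tool == "run":
                # Note: this only sets cwd; it is not real process isolation.
                proc = subprocess.run(action["cmd"], cwd=workdir, capture_output=True,
                                      text=True, timeout=60)
                observation = proc.stdout + proc.stderr
            elif tool == "finish":
                # The model only gets to stop once the seeded check passes.
                check = subprocess.run(verify_cmd, cwd=workdir, capture_output=True,
                                       text=True, timeout=60)
                if check.returncode == 0:
                    return True
                observation = "verification failed:\n" + check.stdout + check.stderr
            else:
                observation = f"unknown tool: {tool}"
        except (OSError, ValueError, KeyError) as exc:
            observation = f"error: {exc!r}"  # feed the error back to the model
    return False  # ran out of steps without a passing verification


def scripted_model() -> Model:
    """Stand-in for a local model: replays a fixed list of actions."""
    steps = iter([
        {"tool": "finish"},  # premature finish; verification should reject it
        {"tool": "write", "path": "add.py",
         "content": "def add(a, b):\n    return a + b\n"},
        {"tool": "finish"},
    ])
    return lambda observation: next(steps)


with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp)
    # The seeded test the task ships with.
    (scratch / "test_add.py").write_text("from add import add\nassert add(2, 3) == 5\n")
    ok = run_tool_loop(scripted_model(), "implement add(a, b) in add.py", scratch,
                       [sys.executable, "test_add.py"])
    print(ok)  # True
```

Swapping `scripted_model()` for a function that calls a local model and parses its reply into an action dict is where the real work is. The step cap is there so a model that never converges fails the task instead of spinning forever.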
The core loop hasn't changed: the model drafts, Claude verifies. Every model I've tested, winners included, has shipped at least one bug that review caught. The point isn't finding a perfect model; it's a pipeline where imperfection gets caught before it counts.
I hope this is useful to some of you, in our quest to compute more cheaply!
repo (MIT): https://github.com/negativetime/agent-smith-plugin
install: /plugin marketplace add negativetime/agent-smith-plugin
then /plugin install agent-smith@agent-smith-marketplace