I benchmarked full tool catalog vs ranked catalog on a local model: 8% → 77% accuracy


Been running agents locally for a while and kept hitting the same issue: the more tools I added, the worse the model got at picking the right one. So I finally benchmarked it properly.

Setup: qwen3.5-class model on an M4 MacBook, 100 tools in the catalog. One run with the full catalog every turn, one where I ranked the tools per query (BM25 over plain text) and only passed the relevant ones.
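For anyone who wants to see what "BM25 over plain text" can look like, here's a minimal sketch in pure Python. It is not the Ratel implementation: the function names (`tokenize`, `bm25_rank`), the `k1`/`b` defaults, the `top_k` cutoff, and the example tools are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics, so get_weather -> get, weather."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, tools, top_k=5, k1=1.5, b=0.75):
    """Rank tools (a dict of name -> description) against a query with Okapi BM25.

    Returns up to top_k tool names, best match first. Tools that share no
    term with the query are dropped entirely.
    """
    if not tools:
        return []
    names = list(tools)
    # Each "document" is the tool name plus its plain-text description.
    docs = [tokenize(f"{name} {tools[name]}") for name in names]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many tool descriptions each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    scored = []
    for name, doc in zip(names, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scored.append((score, name))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical catalog, only for illustration.
example_tools = {
    "get_weather": "Get the current weather forecast for a city",
    "send_email": "Send an email message to a recipient",
    "search_files": "Search local files by name or content",
    "create_event": "Create a calendar event with a date and time",
}
print(bm25_rank("what's the weather in Berlin", example_tools, top_k=2))
```

Only the tools returned by `bm25_rank` would then go into the prompt for that turn, instead of all 100 descriptions.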

Results:

- Full catalog: ~8% task accuracy

- Ranked: ~77%

- Tokens: -57%
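To make the two metrics concrete, here's a rough sketch of how numbers like these could be computed from per-task records. The record shape (`(correct, prompt_tokens)` pairs) and the helper names are assumptions, and the toy values below are made up for the example, not taken from the benchmark.

```python
def summarize(runs):
    """Return (accuracy, total_prompt_tokens) for a list of (correct, prompt_tokens) pairs."""
    accuracy = sum(1 for correct, _ in runs if correct) / len(runs)
    total_tokens = sum(tokens for _, tokens in runs)
    return accuracy, total_tokens


def relative_change(baseline, new):
    """Signed relative change, e.g. -0.57 means 57% fewer than the baseline."""
    return (new - baseline) / baseline


# Toy records, purely illustrative.
full_catalog = [(False, 1000), (True, 1000), (False, 1000), (False, 1000)]
ranked = [(True, 400), (True, 500), (False, 400), (True, 300)]

full_acc, full_tokens = summarize(full_catalog)
ranked_acc, ranked_tokens = summarize(ranked)
token_change = relative_change(full_tokens, ranked_tokens)
print(f"accuracy: {full_acc:.0%} -> {ranked_acc:.0%}, tokens: {token_change:+.0%}")
```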

Same weights, same machine, same prompts. The only difference was how many tool descriptions the model had to read past before choosing. At 20-30 tools it barely matters; past ~100 it falls apart. The model isn't getting dumber, it's just drowning.

The ranking is deliberately simple: no embeddings, no extra LLM call. It's part of an open-source project (Ratel) I help build; the benchmark's here if you want to run it on your own setup: https://github.com/ratel-ai/ratel-bench

Anyone else seeing similar jumps (or different thresholds) with local models?


Original post: reddit.com
