Ran a classic (medieval Europe) fantasy RP/agentic benchmark across 8 local models: Qwen3.6-27B held up better than its size suggests

Why it's recommended

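This entry covers progress in generative capability or on-device inference, and is useful for tracking model efficiency, deployment barriers, and application opportunities.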

Threw together a benchmark suite (quest completion, scene endings, item/time tracking, character detection, storytelling, drafting) and ran it across 8 models people talk about a lot on here. Judged with an external LLM grader; N varies per category (shown on the chart).

Overall pass rates: gemma-4-31B on top at 87%, Qwen3.6-27B close behind at 82%, then gemma-4-12B at 80%, followed by a pretty steep drop-off to the smaller/looser models in the 55-70% range. But oh well, that's expected.

The interesting part to me wasn't the top line, it's how uneven the sub-scores are: some models that look fine on "completing quests" fall apart on "NPC thoughts" or "summarizing quests," which never shows up if you only look at overall %. Curious if others have seen the same category-level cliffs on their own evals.
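For anyone who wants to check their own evals for the same thing, here's a rough sketch of how you could flag category-level cliffs from raw pass/fail grades. None of this is the actual benchmark code: the model names, the rows in `RESULTS`, and the 0.25 gap threshold are all made up for illustration, and the overall rate here is pooled across every graded item, so categories with a larger N weigh more.

```python
from collections import defaultdict

# Hypothetical graded results as (model, category, passed) rows.
# Placeholder data for illustration only, not numbers from the benchmark.
RESULTS = (
    [("model_a", "quest completion", True)] * 4
    + [("model_a", "NPC thoughts", True)] * 1
    + [("model_a", "NPC thoughts", False)] * 3
    + [("model_b", "quest completion", True)] * 3
    + [("model_b", "quest completion", False)] * 1
    + [("model_b", "NPC thoughts", True)] * 3
    + [("model_b", "NPC thoughts", False)] * 1
)


def pass_rates(results):
    """Pass rate per (model, category) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passes, total]
    for model, category, passed in results:
        counts[(model, category)][0] += int(passed)
        counts[(model, category)][1] += 1
    return {key: passes / total for key, (passes, total) in counts.items()}


def overall_rates(results):
    """Pooled pass rate per model across all graded items."""
    counts = defaultdict(lambda: [0, 0])  # model -> [passes, total]
    for model, _category, passed in results:
        counts[model][0] += int(passed)
        counts[model][1] += 1
    return {model: passes / total for model, (passes, total) in counts.items()}


def cliffs(results, gap=0.25):
    """List (model, category, rate, overall) where a category rate sits
    at least `gap` below that model's overall rate. `gap` is an arbitrary
    threshold chosen for this sketch."""
    overall = overall_rates(results)
    return sorted(
        (model, category, rate, overall[model])
        for (model, category), rate in pass_rates(results).items()
        if overall[model] - rate >= gap
    )


if __name__ == "__main__":
    for model, category, rate, overall in cliffs(RESULTS):
        print(f"{model}: {category} at {rate:.0%} vs {overall:.0%} overall")
```

With small N per category a single failure moves the rate a lot, so treat anything this flags as a prompt to go read the transcripts rather than as a verdict.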

Topic tags: model release, on-device inference

Original keywords: #benchmark #suggests #agentic #classic #fantasy #medival

View original: reddit.com

Single source; not yet cross-verified.