Local benchmarks with an RTX 3090 - Qwen3.6 27b vs Ornith
Hey folks. I've been frustrated by how difficult it is to get an idea of how good each new model (or fine-tune) is, and I've not been satisfied with the one-off "draw a pelican riding a bike" style tests that we often fall back on. New models or model variants that can run locally on my RTX 3090 almost never get proper benchmark coverage from anyone but the folks who make them. Lately, I wanted to see how Ornith 35b compared to Qwen3.6 27b.
So I've been playing around with inspect-ai and a bunch of standard benchmarks that are available in their inspect-evals package. I'd like to be able to run a complete set of benchmarks on a new model overnight, and have some broad indication of how they compare in the morning. I'm not there yet, but I wanted to share the benchmarks I've run so far comparing Qwen3.6 27b (Q4_K_M), Gemma4 26B A4B QAT (Q4_0), and Ornith1.0 35B MoE (Q4_K_M). I am still running on LM Studio at the moment, so I ran the benchmarks below on lmstudio-community provided models, except Ornith, which I got from the deepreinforce-ai account.
TLDR
I tested all three on benchmarks with a limited number of samples (100) and aggressive limits. I expected Ornith to be nearly as good as Qwen3.6 27b at coding tasks, but not quite. Since it is a fine-tune, I also expected it to be worse on general knowledge and grounding. But the final picture wasn't quite that clear. It was as good as or better than Qwen 27b in a little under half of cases, and worse the rest of the time. It claims to be best at agentic tasks though, and I haven't managed to successfully run most of the agentic benchmarks.
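With only 100 samples per benchmark, small score gaps are hard to tell apart from sampling noise. Here is a rough sketch of how wide the uncertainty is, using a Wilson score interval. It assumes each sample is an independent pass/fail trial, which is a simplification and not something inspect-ai reports in this form; the function name is my own.

```python
import math

def wilson_interval(score: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy measured on n samples.

    Assumption: samples are independent pass/fail trials (binomial model).
    """
    denom = 1 + z * z / n
    centre = (score + z * z / (2 * n)) / denom
    half = z * math.sqrt(score * (1 - score) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# mmlu_0_shot scores from this post, 100 samples each.
for name, score in [("Qwen3.6 27b", 0.88), ("Ornith1.0 35b", 0.91)]:
    low, high = wilson_interval(score, 100)
    print(f"{name}: {score:.2f} (roughly {low:.2f} to {high:.2f})")
```

The two intervals overlap heavily, so a 0.03 gap at this sample size is better read as "about the same" than as a clear win.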
Specifics of each benchmark follow with some notes, along with my thoughts on how painful it has been trying to run these benchmarks locally.
General Knowledge and Reasoning
Qwen takes the best (or joint best) score in 4 / 6 benchmarks.
Ornith takes the best (or joint best) in 3 / 6 benchmarks.
Something about the MMLU benchmark didn't like Gemma. It timed out in a lot of cases, but I haven't determined why. It could have been that it got stuck endlessly looping, or it could have been something to do with how I configured the tasks. Take the Gemma scores on these cases with a pinch of salt.
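For anyone who wants to check the "best (or joint best)" counts, here is a small sketch that tallies them from the general knowledge and reasoning scores reported in this post. The helper name `count_best` is just something I made up for illustration.

```python
# Scores copied from the general knowledge and reasoning results in this post.
scores = {
    "gsm8k":         {"Gemma4 26b": 0.93, "Qwen3.6 27b": 0.96, "Ornith1.0 35b": 0.90},
    "ifeval":        {"Gemma4 26b": 0.93, "Qwen3.6 27b": 0.95, "Ornith1.0 35b": 0.91},
    "arc_easy":      {"Gemma4 26b": 1.00, "Qwen3.6 27b": 1.00, "Ornith1.0 35b": 0.98},
    "arc_challenge": {"Gemma4 26b": 0.97, "Qwen3.6 27b": 0.97, "Ornith1.0 35b": 0.98},
    "mmlu_0_shot":   {"Gemma4 26b": 0.54, "Qwen3.6 27b": 0.88, "Ornith1.0 35b": 0.91},
    "mmlu_5_shot":   {"Gemma4 26b": 0.50, "Qwen3.6 27b": 0.88, "Ornith1.0 35b": 0.88},
}

def count_best(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Count, per model, the benchmarks where it has the best or joint-best score."""
    wins: dict[str, int] = {}
    for results in scores.values():
        best = max(results.values())
        for model, value in results.items():
            if value == best:
                wins[model] = wins.get(model, 0) + 1
    return wins

wins = count_best(scores)
for model, count in wins.items():
    print(f"{model}: best or joint best in {count} / {len(scores)}")
```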
    # Static knowledge and reasoning.
    success, logs = eval_set(
        tasks=[
            gsm8k(),
            ifeval(),
            arc_easy(),
            arc_challenge(),
            mmlu_0_shot(cot=True),
            mmlu_5_shot(cot=True),
        ],
        log_dir="logs-know",
        **default_config,
        max_tokens=20000,
    )
| Benchmark | Gemma4 26b | Qwen3.6 27b | Ornith1.0 35b |
|---|---|---|---|
| gsm8k | 0.93 | 0.96 | 0.9 |
| ifeval | 0.93 | 0.95 | 0.91 |
| arc_easy | 1.0 | 1.0 | 0.98 |
| arc_challenge | 0.97 | 0.97 | 0.98 |
| mmlu_0_shot | 0.54 | 0.88 | 0.91 |
| mmlu_5_shot | 0.5 | 0.88 | 0.88 |

Grounding and Recall
Ornith takes the lead on drop, while all three models tied on needle in a haystack (NIAH). NIAH had to be limited to a max context of 100000 because prompt processing times for Qwen made running a fair test at higher contexts prohibitively time-consuming. I need to find more convenient benchmarks for local testing, or simply re-run them with more time to spend.
    # Grounding and recall
    success, logs = eval_set(
        tasks=[
            drop(),
            niah(max_context=100000),
        ],
        log_dir="logs-ground",
        **default_config,
        max_tokens=40000,
    )
| Benchmark | Gemma4 26b | Qwen3.6 27b | Ornith1.0 35b |
|---|---|---|---|
| drop | 0.932 | 0.947 | 0.952 |
| niah | 10.0 | 10.0 | 10.0 |

Code generation and data science
This is where I expected Ornith to shine. It matched Qwen on class_eval and edged it on ifevalcode (where every model scored close to zero), but Qwen was well ahead on DS-1000 and scicode. The scicode score was particularly disappointing. One positive over Gemma here: to get scicode working with Gemma I had to impose very heavy limits because it looped infinitely on most samples, and Ornith didn't have that problem.
    # Code generation and data science
    success, logs = eval_set(
        tasks=[
            ds1000(),
            class_eval(),
            scicode(),
            ifevalcode(samples_per_language=tasks_limit_per_eval // 10),  # 10 languages
        ],
        log_dir="logs-code",
        **default_config,
    )
| Benchmark | Gemma4 26b | Qwen3.6 27b | Ornith1.0 35b |
|---|---|---|---|
| DS-1000 | 0.34 | 0.66 | 0.48 |
| class_eval | 0.97 | 0.97 | 0.97 |
| scicode | 4.615 | 10.769 | 1.538 |
| ifevalcode | 0.03 | 0.00 | 0.03 |

Notes
Honestly, running these has been a bit of a nightmare. Gemma, in particular, had a tendency to loop infinitely. I had to re-configure and re-run the benchmarks with heavy limits to stop it from running forever. Additionally, prompt processing time on some of the tests was particularly bad. Changing some of these configs meant having to re-run the benchmarks all over again for it to be a fair comparison against the other models.
My aim was to be able to run a full suite of tests overnight, so I could have an idea of a model's capabilities in the morning. In reality, ifevalcode took 18 hours to run on its own with only 100 samples for Qwen3.6 27b. Here are some things I configured:
- 100 samples for each benchmark max.
- Max token limits to stop looping. This really needed to be different for each benchmark, as some genuinely seemed to need larger reasoning blocks.
- Initially I set timeouts, but this really screwed things up while I was running multiple samples at once. One heavy task would use up all the resources while another timed out without having been attempted.
- 1 task at a time, 1 connection max, 1 sandbox (docker instance) at a time.
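For reference, here is a rough sketch of what a shared `default_config` along those lines could look like. The post doesn't show the real one, so this is a hypothetical reconstruction: the model id and base URL are placeholders for a local LM Studio server, and the keys are my reading of the options `eval_set` accepts, so check them against your installed inspect-ai version. The `config_for` helper and `max_tokens_by_suite` are names I invented for illustration.

```python
# Hypothetical reconstruction of the shared settings described in the list above.
tasks_limit_per_eval = 100  # "100 samples for each benchmark max"

default_config = {
    "model": "openai/your-local-model-id",         # assumption: placeholder model id
    "model_base_url": "http://localhost:1234/v1",  # assumption: local LM Studio server address
    "limit": tasks_limit_per_eval,                 # cap on samples per benchmark
    "max_tasks": 1,                                # 1 task at a time
    "max_connections": 1,                          # 1 connection max
    "max_sandboxes": 1,                            # 1 sandbox (docker instance) at a time
}

# Per-suite token caps, matching the eval_set calls shown in this post.
# The code suite had no explicit max_tokens, so it is left out here.
max_tokens_by_suite = {"logs-know": 20000, "logs-ground": 40000}

def config_for(log_dir: str) -> dict:
    """Build the keyword arguments for one suite, adding its token cap if it has one."""
    cfg = dict(default_config, log_dir=log_dir)
    if log_dir in max_tokens_by_suite:
        cfg["max_tokens"] = max_tokens_by_suite[log_dir]
    return cfg

# Intended use (not run here, needs inspect-ai and a running local server):
# success, logs = eval_set(tasks=[...], **config_for("logs-know"))
print(config_for("logs-know"))
```

Keeping the token cap per suite rather than in the shared config makes it easier to raise the limit for one benchmark without having to re-run the others.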