Smaller is Better: Replacing GPT-4o-mini with a 7B Local Judge
I expected the 30B model to be the better judge. It wasn't.
When I set out to replace OpenAI's GPT-4o-mini as the judge for the Oolong benchmark, my plan was simple: use the biggest local model I had. Qwen3-coder at 30B parameters seemed like the obvious choice over Qwen2.5 at 7B. More parameters, better judgment, right?
The data told a different story. The 7B model achieved 90% agreement with GPT-4o-mini. The 30B model? Only 87%. The smaller model was the better substitute.
What is Oolong and Why Does It Need a Judge?
Oolong is a long-context benchmark from Bertsch et al. designed to test models on truly massive inputs. We're talking about contexts averaging 1.5 million characters (~400k tokens), with some examples reaching 10.5 MB of text (~2.6 million tokens). No model can fit this in context directly.
The benchmark uses a Recursive Language Model (RLM) approach: the model must chunk the context, delegate analysis to sub-LLMs, and synthesize answers. It's a test of whether models can orchestrate complex information retrieval.
But here's the challenge: evaluating these answers requires semantic understanding. The model might answer "spam" when the ground truth is "Label: spam". A string match says wrong; a human says correct. This is why Oolong uses GPT-4o-mini as a judge - it evaluates whether answers are semantically equivalent, not just string-identical.
The problem? Every evaluation requires an API call. Run 1,300 validation examples and you're looking at real costs and rate limits. For researchers iterating on methods, this adds up fast.
The Experiment: Finding a Local Replacement
I wanted to run Oolong evaluations fully locally using Ollama. The question was whether a local model could match GPT-4o-mini's judgment reliably enough to trust the results.
The setup was straightforward:
- Run 100 Oolong examples with GPT-4o-mini as judge
- Capture all the judge inputs (question, ground truth, model response)
- Re-judge the same examples with local Qwen models
- Compare agreement rates
I tested two candidates:
- qwen2.5:7b (4.7 GB) - the smaller option
- qwen3-coder:30b (18 GB) - the larger, "smarter" option
Results: Size Isn't Everything
| Judge Model | Agreement with GPT-4o-mini | Size |
| qwen2.5:7b | 90% | 4.7 GB |
| qwen3-coder:30b | 87% | 18 GB |
The smaller model won. But why?
Looking at the disagreement patterns revealed the answer:
qwen2.5:7b disagreements (10 total):
- Qwen says yes, GPT says no: 8 cases
- Qwen says no, GPT says yes: 2 cases
- Net leniency: +6
qwen3-coder:30b disagreements (13 total):
- Qwen says yes, GPT says no: 12 cases
- Qwen says no, GPT says yes: 1 case
- Net leniency: +11
The larger model is more lenient - it accepts answers that GPT-4o-mini rejects nearly twice as often. The smaller model's judgments are more conservative, closer to GPT's stricter standard.
The Leniency Bias
This matters for benchmarking. If you use qwen3-coder:30b as your judge, you'll see inflated scores. In my 100-example run:
- GPT-4o-mini accuracy: ~27%
- If judged by qwen2.5:7b: ~33% (+6 points)
- If judged by qwen3-coder:30b: ~38% (+11 points)
For comparing methods to each other, this bias cancels out - both will be inflated equally. But for comparing to published results that used GPT judges, the smaller model gives you numbers closer to reality.
Running Oolong with a Local Judge
Here's how to run Oolong fully locally:
# Using qwen2.5:7b as judge (90% GPT agreement)
USE_LOCAL_DIRECT_SANDBOX=1 python -m verifiers.scripts.eval oolong \
--model ollama-qwen3-coder \
--env-args '{"subset":"synth","split":"validation","judge_model":"qwen2.5:7b","judge_base_url":"http://localhost:11434/v1","judge_api_key_var":null}' \
--num-examples 100 \
--max-concurrent 1 \
--save-results
To validate your local judge against GPT, I wrote a comparison script:
# Compare judges on existing results
python scripts/compare_judges.py environments/oolong/outputs/evals/<run_id>
This re-runs all examples through both judges and reports agreement statistics.
When to Use Local Judges
Use a local judge when:
- Iterating quickly on methods (cost and rate limits matter)
- Running large-scale ablations (1000+ examples)
- Reproducibility is important (no API dependency)
- Comparing methods against each other (relative rankings preserved)
Stick with GPT-4o-mini when:
- Publishing results that will be compared to other papers
- Absolute accuracy numbers matter
- You need the strictest evaluation standard
The Takeaway
The counterintuitive finding here is that model size and judge quality aren't linearly related. The 7B model's more conservative judgments made it a better GPT substitute than the 30B model's lenient ones.
For those building local evaluation pipelines: don't assume bigger is better. Test your judge against the standard you're replacing, and look at the direction of disagreements, not just the rate.
If you are interested in high-performance RL simulations, check out our Reality project - a simulation framework built on the Madrona engine.
