Smaller is Better: Replacing GPT-4o-mini with a 7B Local Judge

I expected the 30B model to be the better judge. It wasn't.

When I set out to replace OpenAI's GPT-4o-mini as the judge for the Oolong benchmark, my plan was simple: use the biggest local model I had. Qwen3-coder at 30B parameters seemed like the obvious choice over Qwen2.5 at 7B. More parameters, better judgment, right?

The data told a different story. The 7B model achieved 90% agreement with GPT-4o-mini. The 30B model? Only 87%. The smaller model was the better substitute.

What is Oolong and Why Does It Need a Judge?

Oolong is a long-context benchmark from Bertsch et al. designed to test models on truly massive inputs. We're talking about contexts averaging 1.5 million characters (~400k tokens), with some examples reaching 10.5 MB of text (~2.6 million tokens). No model can fit this in context directly.

The benchmark uses a Recursive Language Model (RLM) approach: the model must chunk the context, delegate analysis to sub-LLMs, and synthesize answers. It's a test of whether models can orchestrate complex information retrieval.

But here's the challenge: evaluating these answers requires semantic understanding. The model might answer "spam" when the ground truth is "Label: spam". A string match says wrong; a human says correct. This is why Oolong uses GPT-4o-mini as a judge - it evaluates whether answers are semantically equivalent, not just string-identical.

The problem? Every evaluation requires an API call. Run 1,300 validation examples and you're looking at real costs and rate limits. For researchers iterating on methods, this adds up fast.

The Experiment: Finding a Local Replacement

I wanted to run Oolong evaluations fully locally using Ollama. The question was whether a local model could match GPT-4o-mini's judgment reliably enough to trust the results.

The setup was straightforward:

Run 100 Oolong examples with GPT-4o-mini as judge
Capture all the judge inputs (question, ground truth, model response)
Re-judge the same examples with local Qwen models
Compare agreement rates

I tested two candidates:

qwen2.5:7b (4.7 GB) - the smaller option
qwen3-coder:30b (18 GB) - the larger, "smarter" option

Results: Size Isn't Everything

Judge Model	Agreement with GPT-4o-mini	Size
qwen2.5:7b	90%	4.7 GB
qwen3-coder:30b	87%	18 GB

The smaller model won. But why?

Looking at the disagreement patterns revealed the answer:

qwen2.5:7b disagreements (10 total):

Qwen says yes, GPT says no: 8 cases
Qwen says no, GPT says yes: 2 cases
Net leniency: +6

qwen3-coder:30b disagreements (13 total):

Qwen says yes, GPT says no: 12 cases
Qwen says no, GPT says yes: 1 case
Net leniency: +11

The larger model is more lenient - it accepts answers that GPT-4o-mini rejects nearly twice as often. The smaller model's judgments are more conservative, closer to GPT's stricter standard.

The Leniency Bias

This matters for benchmarking. If you use qwen3-coder:30b as your judge, you'll see inflated scores. In my 100-example run:

GPT-4o-mini accuracy: ~27%
If judged by qwen2.5:7b: ~33% (+6 points)
If judged by qwen3-coder:30b: ~38% (+11 points)

For comparing methods to each other, this bias cancels out - both will be inflated equally. But for comparing to published results that used GPT judges, the smaller model gives you numbers closer to reality.

Running Oolong with a Local Judge

Here's how to run Oolong fully locally:

# Using qwen2.5:7b as judge (90% GPT agreement)
USE_LOCAL_DIRECT_SANDBOX=1 python -m verifiers.scripts.eval oolong \
  --model ollama-qwen3-coder \
  --env-args '{"subset":"synth","split":"validation","judge_model":"qwen2.5:7b","judge_base_url":"http://localhost:11434/v1","judge_api_key_var":null}' \
  --num-examples 100 \
  --max-concurrent 1 \
  --save-results

To validate your local judge against GPT, I wrote a comparison script:

# Compare judges on existing results
python scripts/compare_judges.py environments/oolong/outputs/evals/<run_id>

This re-runs all examples through both judges and reports agreement statistics.

When to Use Local Judges

Use a local judge when:

Iterating quickly on methods (cost and rate limits matter)
Running large-scale ablations (1000+ examples)
Reproducibility is important (no API dependency)
Comparing methods against each other (relative rankings preserved)

Stick with GPT-4o-mini when:

Publishing results that will be compared to other papers
Absolute accuracy numbers matter
You need the strictest evaluation standard

The Takeaway

The counterintuitive finding here is that model size and judge quality aren't linearly related. The 7B model's more conservative judgments made it a better GPT substitute than the 30B model's lenient ones.

For those building local evaluation pipelines: don't assume bigger is better. Test your judge against the standard you're replacing, and look at the direction of disagreements, not just the rate.

If you are interested in high-performance RL simulations, check out our Reality project - a simulation framework built on the Madrona engine.

Smaller is Better: Replacing GPT-4o-mini with a 7B Local Judge

What is Oolong and Why Does It Need a Judge?

The Experiment: Finding a Local Replacement

Results: Size Isn't Everything

The Leniency Bias

Running Oolong with a Local Judge

When to Use Local Judges

The Takeaway

Comments

More from this blog

The Building Blocks of an Agent Memory System

How InfoNCE Creates Exploration: The Hidden Engine of Contrastive RL

Contrastive RL: A Step-by-Step Guide to Learning Reachability

How wp.ScopedTimer Found My 12x Speedup

Command Palette

What is Oolong and Why Does It Need a Judge?

The Experiment: Finding a Local Replacement

Results: Size Isn't Everything

The Leniency Bias

Running Oolong with a Local Judge

When to Use Local Judges

The Takeaway

Comments

More from this blog