Open-Source AI Judges Beat GPT-5.2 at 15x Lower Cost Using DPO Fine-Tuning
Fine-tuned open-source large language models can now outperform OpenAI's GPT-5.2 at evaluating AI outputs, at a fraction of the cost. Together AI released research showing that a fine-tuned GPT-OSS 120B model achieved 62.63% accuracy on human preference alignment after Direct Preference Optimization (DPO) training, surpassing GPT-5.2's 61.62% baseline while running 14x faster and costing 15x less per token.
The findings matter for any organization running AI evaluation pipelines at scale. GPT-5.2 currently charges $1.75 per million input tokens and $14 per million output tokens. The fine-tuned GPT-OSS 120B? Just $0.15 and $0.60 respectively.
The Training Approach
Together AI used DPO, a technique introduced in 2023 that bypasses the complex reinforcement learning loops of traditional RLHF. Instead of training a separate reward model, DPO directly adjusts the language model's weights using preference pairs: one preferred response and one rejected response for each prompt.
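A minimal PyTorch sketch of the DPO objective helps make this concrete (illustrative code, not Together AI's implementation): the loss rewards the policy for widening the log-probability margin between the preferred and rejected response relative to a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities that the
    trainable policy (or the frozen reference model) assigns to the preferred
    ("chosen") or rejected response for each prompt.
    """
    # How much more likely each response became under the policy vs. the reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Widen the chosen-vs-rejected margin, scaled by beta;
    # no separate reward model is ever trained.
    margin = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(margin).mean()
```

Because the gradient flows only through these log-probability margins, the whole pipeline reduces to standard supervised-style fine-tuning on paired data.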
The training data came from RewardBench 2, a benchmark containing examples with human-labeled preferred and rejected responses across six categories: safety, factuality, math, precise instruction following, focus, and ties. From roughly 1,500 training examples, the team generated 5,407 preference pairs.
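The research does not spell out exactly how roughly 1,500 examples expand into 5,407 pairs; one common approach, sketched below with hypothetical field names, is to pair every human-preferred response with every rejected response for the same prompt.

```python
from itertools import product

def to_preference_pairs(example):
    """Expand one RewardBench 2-style example (hypothetical fields: 'prompt',
    'chosen', 'rejected') into prompt/chosen/rejected triples, one per
    chosen-rejected combination."""
    return [
        {"prompt": example["prompt"], "chosen": good, "rejected": bad}
        for good, bad in product(example["chosen"], example["rejected"])
    ]

# Example: one preferred and three rejected responses yield three pairs.
example = {
    "prompt": "What is 2 + 2?",
    "chosen": ["4"],
    "rejected": ["5", "3", "22"],
}
print(len(to_preference_pairs(example)))  # 3
```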
Training took just 1.5 hours for GPT-OSS 120B using LoRA (Low-Rank Adaptation) with a learning rate of 5e-6 over three epochs.
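For concreteness, a rough sketch of that setup using Hugging Face TRL and PEFT is shown below. This is not Together AI's published notebook: exact argument names vary across TRL versions, the dataset path and batch size are assumptions, and the beta value is not reported in the article.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Assumed: a JSONL file of prompt/chosen/rejected preference pairs.
train_ds = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

# LoRA keeps the update small: only low-rank adapter weights are trained.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Hyperparameters reported in the article: learning rate 5e-6, three epochs.
dpo_config = DPOConfig(
    output_dir="dpo-judge",
    learning_rate=5e-6,
    num_train_epochs=3,
    beta=0.1,                        # assumption; not reported in the article
    per_device_train_batch_size=1,   # assumption
)

trainer = DPOTrainer(
    model="openai/gpt-oss-120b",     # TRL accepts a model id or a loaded model
    args=dpo_config,
    train_dataset=train_ds,
    peft_config=lora_config,
)
trainer.train()
```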
Where Open Models Excel
The category-level breakdown reveals where fine-tuning delivered the biggest wins. GPT-OSS 120B after DPO beat GPT-5.2 on math evaluation by 10.3 percentage points and on focus (response quality assessment) by 6.3 points.
Safety evaluation proved easiest across all models, averaging 91.32% accuracy, which is unsurprising given these models undergo extensive safety training. Factuality detection hit 85.23%. The hardest category? Ties, where models averaged just 10.13% accuracy, highlighting how difficult it remains to recognize when competing responses are equally good.
One wrinkle: Qwen3 235B, which already beat GPT-5.2 out of the box at 62.63%, actually regressed slightly to 61.28% after fine-tuning. Not every model benefits from additional training, reinforcing that validation remains essential.
The Broader Implications
The "LLM-as-a-judge" paradigm has become standard for evaluating AI outputs at scale because judging is fundamentally simpler than generating. A model generating a response must juggle context, follow multi-step instructions, and synthesize information. Evaluating that response is a focused classification task.
This research suggests organizations can build evaluation pipelines using open-source models they control entirely—no API dependencies, full visibility into model behavior, and the ability to fine-tune for specific domains. The cost savings at production scale are substantial.
Together AI published the full methodology in a cookbook notebook for teams wanting to replicate the approach with their own preference data.