The GPT-5 Leak Analysis: What the Numbers Actually Show

Benchmark results leaked ahead of OpenAI's announcement. The gains are real, but where they're concentrated matters more than the headline numbers.

Internal benchmark results for GPT-5 surfaced on an engineering forum last week, and the AI internet immediately did what the AI internet does: treat the numbers as either a revolution or a disappointment, depending on which narrative fit better. The reality is more complicated and more interesting than either framing.

Here is what the leaked results actually show. GPT-5 scores approximately 94.3% on MMLU (Massive Multitask Language Understanding), up from GPT-4o's 88.7%. That's a 5.6 percentage point jump: significant, but not the kind of step change that suggests a qualitative threshold has been crossed. On SWE-bench, the software engineering benchmark, GPT-5 scores 73.2%, up from GPT-4o's 48.1%. That 25-point gain is far more notable, and it suggests the model has meaningfully better code generation and debugging capabilities.
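As a quick sanity check, here is the arithmetic behind the deltas quoted above; the scores are simply the numbers reported in the leak, not an independent evaluation:

```python
# Leaked scores as quoted in the text; this only recomputes the
# percentage-point deltas cited above.
reported = {
    "MMLU":      {"GPT-4o": 88.7, "GPT-5": 94.3},
    "SWE-bench": {"GPT-4o": 48.1, "GPT-5": 73.2},
}

for bench, scores in reported.items():
    delta = scores["GPT-5"] - scores["GPT-4o"]
    print(f"{bench}: {scores['GPT-4o']}% -> {scores['GPT-5']}%  (+{delta:.1f} points)")

# MMLU: 88.7% -> 94.3%  (+5.6 points)
# SWE-bench: 48.1% -> 73.2%  (+25.1 points)
```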

Where the Gains Are Concentrated

The interesting pattern in the benchmarks isn't the average performance—it's the variance across task types. GPT-5 shows outsized improvement on tasks that require multi-step reasoning with tool use: code execution, web search, file manipulation. On tasks that require pure reasoning without external grounding—mathematical proofs, abstract logic—GPT-5 is better than GPT-4o but the improvement is smaller and within the range of normal model progression.

This tracks with what OpenAI has been building toward: an agentic foundation model optimized for action, not just response. The capability investment has gone toward making the model reliably useful as a system that does things, not just a system that answers questions.
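To make "a system that does things" concrete, here is a minimal sketch of the kind of act-and-observe loop these tool-use tasks reward. Everything in it is illustrative: `call_model` is a scripted stand-in rather than a real API call, and the single `run_python` tool is a hypothetical example, not OpenAI's interface.

```python
import subprocess
import sys

def run_python(code: str) -> str:
    """Hypothetical tool: execute a short Python snippet and return its output."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=10)
    return proc.stdout or proc.stderr

TOOLS = {"run_python": run_python}

def call_model(messages):
    """Scripted stand-in for an LLM call: it first requests a tool,
    then answers once it has seen a tool observation."""
    if any(m["role"] == "tool" for m in messages):
        return {"content": f"Ran the code; it printed {messages[-1]['content'].strip()}."}
    return {"tool": "run_python", "args": {"code": "print(21 * 2)"}}

def agent(task: str, max_steps: int = 5) -> str:
    """Alternate between model calls and tool calls until the model answers."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool" not in reply:                       # final answer: stop acting
            return reply["content"]
        observation = TOOLS[reply["tool"]](**reply["args"])         # act
        messages.append({"role": "tool", "content": observation})   # observe
    return "step budget exhausted"

print(agent("What is 21 * 2? Verify by running code."))
```

Agentic evaluations reward reliability across many iterations of exactly this cycle, which is where the leaked numbers show the biggest jump.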

The Benchmark Methodology Problem

One thing the coverage has mostly missed: benchmark saturation is a real and growing problem in evaluating frontier models. MMLU, the most-cited general capability benchmark, is now heavily represented in training data across all frontier models, which means scores on it are increasingly measuring exposure rather than reasoning. A model that scores 94% on MMLU might genuinely be better at general reasoning—or it might have seen enough MMLU-adjacent content that the test is measuring memorization patterns.
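For a sense of what checking this looks like in practice, here is a toy version of the n-gram overlap test commonly used in model reports to flag eval items that appear in training data. The benchmark item, the scraped page, and the 8-token window are made-up illustrations, not OpenAI's actual decontamination pipeline:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All n-token windows in a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def looks_contaminated(eval_item: str, training_doc: str, n: int = 8) -> bool:
    """Flag the eval item if it shares any n-token span with a training document."""
    return bool(ngrams(eval_item, n) & ngrams(training_doc, n))

question = ("Which of the following functions is monotone increasing "
            "on the interval from 0 to 1?")
scraped_page = ("Free MMLU practice set. Q7: which of the following functions "
                "is monotone increasing on the interval from 0 to 1? Answer: ...")

print(looks_contaminated(question, scraped_page))  # True: the item likely leaked into pretraining data
```

When a widely republished benchmark overlaps with the training corpus at scale, high scores start to reflect recall as much as reasoning, which is exactly the problem the MMLU number runs into.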

SWE-bench is less saturated because it tests against real GitHub issues that postdate the models' training cutoffs, which makes the code generation numbers more credible. The math benchmarks (MATH and GSM8K) sit somewhere in between: contaminated enough to be suspicious, fresh enough to still be informative.

What the GPT-5 numbers actually suggest is a capable agentic model that has made meaningful progress on code and tool use, with more modest gains on pure reasoning tasks. That fits the OpenAI product strategy better than it fits the narrative of a qualitative leap.