Claude's Scientific Reasoning Now Matches PhD Level. That Number Is Climbing.
Anthropic's latest Claude achieves PhD-level performance on scientific benchmarks. The more important question is where it's still failing.
Anthropic's Claude Sonnet 4.6 now scores at the 87th percentile on the PhD-level scientific reasoning tasks in Humanity's Last Exam, a benchmark designed by researchers to resist the saturation that has compromised most standard AI evals. That's a meaningful number, and it has been climbing quarter by quarter with no sign of flattening. But what that number actually means, and what it doesn't, is worth examining carefully.
"PhD level" on a benchmark is not the same as PhD-level performance in a research lab. Benchmarks test specific task types—multiple choice, single-answer reasoning, structured problem decomposition—because those tasks are measurable. They don't test the ability to identify which problems are worth solving, to design experiments that control for confounders, or to recognize when an anomaly is actually a discovery rather than noise.
Where Claude Still Struggles
The gap between benchmark performance and real-world scientific reasoning shows up most clearly in multi-step research tasks. In internal evaluations Anthropic has shared with research partners, Claude handles individual stages of a research workflow well—literature review, hypothesis generation, experimental design in broad strokes. But it performs significantly worse when asked to manage the full arc of a research project where earlier decisions constrain later ones, where unexpected results require backtracking, and where institutional knowledge about what has failed before lives outside the literature.
The benchmark score reflects progress. The gap between that score and genuine research autonomy reflects how far these systems still have to go.
Implications for Research AI
The trajectory is what matters here. Twelve months ago, Claude scored at the 54th percentile on the same benchmark, 33 points below today's 87th. If that rate of improvement holds, PhD-level benchmark performance will look modest in another eighteen months. The consequence for research workflows is becoming concrete: AI as a research assistant that handles the mechanizable portions of scientific reasoning, freeing human researchers for the parts that require judgment about problems and priorities.
That shift isn't coming. It's already underway. Teams at DeepMind, MIT, and several biotech startups are running internal experiments with AI research assistants embedded in discovery workflows. The results are promising enough that the next round of NIH grant applications will likely include AI usage plans as standard practice.