
Claude's Scientific Reasoning Now Matches PhD Level. That Number Is Climbing.

Anthropic's latest Claude achieves PhD-level performance on scientific benchmarks. The more important question is where it's still failing.

By Jaclyn · April 17, 2026 · 5 min read


Anthropic's Claude Sonnet 4.6 now scores at the 87th percentile of PhD-level scientific reasoning tasks on Humanity's Last Exam, a benchmark designed by researchers to resist the saturation that has compromised most standard AI evals. That's a meaningful number, and it's climbing quarter by quarter in a way that suggests the trajectory isn't flattening. But what that number actually means, and doesn't mean, is worth examining carefully.

"PhD level" on a benchmark is not the same as PhD-level performance in a research lab. Benchmarks test specific task types (multiple choice, single-answer reasoning, structured problem decomposition) because those tasks are measurable. They don't test the ability to identify which problems are worth solving, to design experiments that control for confounders, or to recognize when an anomaly is actually a discovery rather than noise.

Where Claude Still Struggles

The gap between benchmark performance and real-world scientific reasoning shows up most clearly in multi-step research tasks. In internal evaluations Anthropic has shared with research partners, Claude handles individual stages of a research workflow well: literature review, hypothesis generation, experimental design in broad strokes. But it performs significantly worse when asked to manage the full arc of a research project where earlier decisions constrain later ones, where unexpected results require backtracking, and where institutional knowledge about what has failed before lives outside the literature.

The benchmark score reflects progress. The gap between that score and genuine research autonomy reflects how far these systems still have to go.

Implications for Research AI

The trajectory is what matters here. Twelve months ago, Claude scored at the 54th percentile on the same benchmark. The improvement rate, if it holds, means PhD-level benchmark performance will look modest in another eighteen months. What that means for research workflows is starting to become concrete: AI as a research assistant that handles the mechanizable portions of scientific reasoning, freeing human researchers for the parts that require judgment about problems and priorities.
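To make the "if it holds" arithmetic concrete, here is a rough sketch that extrapolates the two reported scores (54th percentile twelve months ago, 87th now) forward by eighteen months. The straight-line trend is an assumption made purely for illustration; the article does not say the improvement is linear, and the function name is hypothetical.

```python
def extrapolate_percentile(start, end, months_elapsed, months_ahead):
    """Project a percentile score forward, ASSUMING a linear trend.

    Returns (points gained per month, uncapped projection, projection
    capped at the 100th percentile, months until the cap is reached).
    """
    rate = (end - start) / months_elapsed      # percentile points per month
    raw = end + rate * months_ahead            # naive straight-line projection
    capped = min(raw, 100.0)                   # a percentile cannot exceed 100
    months_to_cap = (100.0 - end) / rate       # when the line hits the ceiling
    return rate, raw, capped, months_to_cap


# Figures reported in the article: 54th -> 87th percentile over 12 months.
rate, raw, capped, months_to_cap = extrapolate_percentile(54, 87, 12, 18)
print(f"rate: {rate:.2f} points/month")
print(f"18-month projection: {raw:.1f} uncapped, {capped:.1f} capped")
print(f"months until the 100th percentile: {months_to_cap:.1f}")
```

Under that assumption the straight line runs past the top of the scale in under five months, which is one way to read "will look modest": the benchmark would stop distinguishing models well before the eighteen months are up. Real score curves usually flatten near a ceiling, so this is an upper bound on the pace, not a forecast.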

That shift isn't coming. It's already underway. Labs at DeepMind, MIT, and a number of biotech startups are running internal experiments with AI research assistants embedded in discovery workflows. The results are promising enough that the next round of NIH grant applications will likely include AI usage plans as standard practice.

Anthropic's Claude Opus 4.6: The Scientific Reasoning Benchmark Leader - GigSoul
