GPQA Diamond
Fundamentals
A benchmark of 198 graduate-level multiple-choice questions in physics, biology, and chemistry, designed to resist answering via internet search and to require genuine PhD-level expertise.
GPQA Diamond (Graduate-Level Google-Proof Q&A, Diamond subset) is a challenging AI evaluation benchmark consisting of 198 multiple-choice questions in biology, physics, and chemistry. The questions are written by PhD-holding domain experts and are specifically designed to be "Google-proof": skilled non-experts score only about 34% even with unrestricted internet access, while PhD-level domain experts achieve roughly 65-70%.
The Diamond subset is the highest-quality tier of the broader GPQA dataset. A question qualifies for the Diamond set only if both expert annotators answered it correctly while the majority of non-expert annotators answered incorrectly. This filtering ensures the questions genuinely require deep domain expertise rather than surface-level knowledge or effective search skills.
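In code, that selection rule reduces to a simple predicate. The following is a minimal sketch; the function and its inputs are illustrative, not taken from the GPQA release:

```python
def qualifies_for_diamond(expert_correct: list[bool],
                          non_expert_correct: list[bool]) -> bool:
    """Diamond filter: every expert annotator answered correctly,
    and a majority of non-expert annotators answered incorrectly."""
    all_experts_right = all(expert_correct)
    non_expert_wrong = sum(not c for c in non_expert_correct)
    majority_wrong = non_expert_wrong > len(non_expert_correct) / 2
    return all_experts_right and majority_wrong

# Example: both experts right, only 1 of 3 non-experts right -> qualifies
print(qualifies_for_diamond([True, True], [True, False, False]))  # True
```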
GPQA Diamond has become one of the standard benchmarks for evaluating frontier AI models on scientific reasoning. Top models now score above 80%, surpassing the roughly 70% achieved by the average human expert. This crossover matters for AI safety research: once models exceed human expert performance on specialized knowledge tasks, it becomes increasingly difficult for humans to verify whether model outputs are correct, a core challenge for scalable oversight.
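For a concrete sense of how such accuracy figures are produced, here is a minimal evaluation loop. It assumes the Hugging Face dataset id `Idavidrein/gpqa` with the `gpqa_diamond` config and its column names (verify against the actual dataset card), and uses a placeholder `ask_model` function standing in for any model API:

```python
import random
from datasets import load_dataset

def ask_model(prompt: str) -> str:
    """Placeholder: query your model and return 'A', 'B', 'C', or 'D'."""
    raise NotImplementedError

def evaluate_gpqa_diamond(seed: int = 0) -> float:
    # Dataset id, config, and column names are assumptions based on the
    # public GPQA release on Hugging Face.
    ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]
    rng = random.Random(seed)
    correct = 0
    for row in ds:
        choices = [row["Correct Answer"], row["Incorrect Answer 1"],
                   row["Incorrect Answer 2"], row["Incorrect Answer 3"]]
        rng.shuffle(choices)  # randomize answer position to avoid bias
        letters = "ABCD"
        gold = letters[choices.index(row["Correct Answer"])]
        options = "\n".join(f"{l}) {c}" for l, c in zip(letters, choices))
        prompt = f"{row['Question']}\n{options}\nAnswer with A, B, C, or D."
        if ask_model(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(ds)  # 198 questions in the Diamond split
```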
Last updated: February 26, 2026