GPQA Diamond
Fundamentals
A benchmark of 198 graduate-level multiple-choice questions in physics, biology, and chemistry, designed to resist answering via internet search and to require genuine PhD-level expertise.
GPQA Diamond (Graduate-Level Google-Proof Q&A, Diamond subset) is a challenging AI evaluation benchmark consisting of 198 multiple-choice questions in biology, physics, and chemistry. The questions are written by PhD-holding domain experts and are specifically designed to be "Google-proof": skilled non-experts score only about 34% even with unrestricted internet access, while PhD-level domain experts achieve roughly 65-70%.
The Diamond subset is the highest-quality tier of the broader GPQA dataset. A question qualifies for the Diamond set only if both expert annotators answered it correctly while the majority of non-expert annotators answered incorrectly. This filtering ensures the questions genuinely require deep domain expertise rather than surface-level knowledge or effective search skills.
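In code, that selection rule reduces to a simple predicate. The following is a minimal sketch; the function and its inputs are illustrative, not taken from the GPQA release:

```python
def qualifies_for_diamond(expert_correct: list[bool],
                          non_expert_correct: list[bool]) -> bool:
    """Diamond filter: every expert annotator answered correctly,
    and a majority of non-expert annotators answered incorrectly."""
    all_experts_right = all(expert_correct)
    non_expert_wrong = sum(not c for c in non_expert_correct)
    majority_wrong = non_expert_wrong > len(non_expert_correct) / 2
    return all_experts_right and majority_wrong

# Example: both experts right, only 1 of 3 non-experts right -> qualifies
print(qualifies_for_diamond([True, True], [True, False, False]))  # True
```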
GPQA Diamond has become one of the standard benchmarks for evaluating frontier AI models on scientific reasoning. Top models now score above 80%, surpassing the roughly 70% achieved by the average human expert. This crossover matters for AI safety research: once models exceed human expert performance on specialized knowledge tasks, it becomes increasingly difficult for humans to verify whether model outputs are correct, a core challenge for scalable oversight.
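For a concrete sense of how such accuracy figures are produced, here is a minimal evaluation loop. It assumes the Hugging Face dataset id `Idavidrein/gpqa` with the `gpqa_diamond` config and its column names (verify against the actual dataset card), and uses a placeholder `ask_model` function standing in for any model API:

```python
import random
from datasets import load_dataset

def ask_model(prompt: str) -> str:
    """Placeholder: query your model and return 'A', 'B', 'C', or 'D'."""
    raise NotImplementedError

def evaluate_gpqa_diamond(seed: int = 0) -> float:
    # Dataset id, config, and column names are assumptions based on the
    # public GPQA release on Hugging Face.
    ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]
    rng = random.Random(seed)
    correct = 0
    for row in ds:
        choices = [row["Correct Answer"], row["Incorrect Answer 1"],
                   row["Incorrect Answer 2"], row["Incorrect Answer 3"]]
        rng.shuffle(choices)  # randomize answer position to avoid bias
        letters = "ABCD"
        gold = letters[choices.index(row["Correct Answer"])]
        options = "\n".join(f"{l}) {c}" for l, c in zip(letters, choices))
        prompt = f"{row['Question']}\n{options}\nAnswer with A, B, C, or D."
        if ask_model(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(ds)  # 198 questions in the Diamond split
```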
Last updated: February 26, 2026