In July 2025, artificial intelligence crossed a threshold that many researchers had assumed was still years away: solving five of the six problems on the International Mathematical Olympiad (IMO) at a level equivalent to a gold medal. Within days of each other, Google DeepMind and OpenAI both announced that their large-language-model-based reasoning systems had achieved gold-medal performance under the same competition format faced by the world’s top high-school mathematicians at this year’s contest on the Sunshine Coast, Australia.
What Happened
The IMO is the most prestigious mathematics competition for pre-university students. Contestants sit two 4.5-hour papers on consecutive days, writing rigorous proofs for six problems in total, each marked out of seven points for a maximum score of 42. Google DeepMind reported that an “advanced version” of its Gemini model, operating end-to-end in natural language, scored 35 of those 42 points, enough to clear the gold-medal cut-off. The company’s announcement, detailed on the DeepMind research blog, emphasised that the solutions were graded by IMO coordinators under the same rules applied to human competitors.
Days earlier, OpenAI had publicised its own result via researcher Alexander Wei, claiming that an unreleased general-purpose reasoning model had also scored 35 points. Unlike DeepMind, OpenAI did not submit its solutions to the IMO’s official graders, instead arranging external evaluation by former medallists, a methodological difference that has drawn scrutiny in coverage by Nature and other outlets. Both companies’ systems failed to solve Problem 6, which only five of the more than 600 human contestants solved completely.
Why This Result Is Different
AI has been chipping away at olympiad-level mathematics for several years. In 2024, DeepMind’s specialised systems AlphaProof and AlphaGeometry 2 reached silver-medal performance, but they relied on having the problems translated into the formal proof language Lean, and the hardest problems took days of computation. The 2025 result is qualitatively different: the models worked directly in natural language, within the same time constraints as human competitors, and without bespoke geometry or algebra engines bolted on.
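For readers who have not seen Lean, the sketch below gives a sense of what a formalised statement and machine-checked proof look like. It is a toy, olympiad-flavoured inequality invented for illustration, not a problem from either contest, and it assumes a recent Lean 4 toolchain with the Mathlib library installed.

```lean
import Mathlib

-- Toy example only: a simple inequality, formalised and machine-checked.
-- Neither the statement nor the proof comes from DeepMind or OpenAI.
theorem two_mul_le_sq_add_sq (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  -- The whole argument is that (a - b)² ≥ 0; `nlinarith` derives the goal from it.
  nlinarith [sq_nonneg (a - b)]
```

In the 2024 pipeline, a problem had to exist in roughly this formal shape before AlphaProof could search for a proof, and the output was a Lean term rather than prose; the 2025 systems instead read the English problem statements and produced natural-language write-ups for graders.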
That distinction matters for the broader trajectory of machine reasoning. Mathematicians such as Fields Medallist Terence Tao have long argued that olympiad problems, while difficult, occupy a structured corner of mathematical reasoning — a point Tao reiterated in commentary posted to his Mastodon account following the announcements, where he cautioned against extrapolating from contest performance to research-level mathematics. Olympiad questions have clean statements, known-to-be-tractable solutions, and a fixed time horizon. Frontier research mathematics has none of those properties.
Reactions from the Mathematical Community
Reactions have been mixed. Some researchers see the results as a watershed for automated theorem proving and a sign that AI-assisted proof tools could soon become routine collaborators for working mathematicians. Others are more cautious, noting that neither company has released model weights, the prompts used, or full transcripts of their systems’ reasoning, making independent verification impossible. The IMO board itself reportedly asked AI labs to delay public announcements until after the official human results had been celebrated, a request OpenAI was criticised for ignoring.
There are also concerns about evaluation integrity. Because IMO problems are written fresh each year, there is no risk of the systems having seen the exact questions in training data — but coaching strategies, similar problems, and stylistic conventions are abundantly available online. Researchers writing for outlets including MIT Technology Review have urged the field to develop benchmark protocols that more closely resemble blind scientific peer review.
What to Watch Next
The immediate question is whether either system, or a successor, can be deployed to assist genuine mathematical research — for example by suggesting lemmas, checking proofs in formal systems like Lean or Coq, or exploring conjectures in areas such as combinatorics and number theory. DeepMind has hinted at integrating “Deep Think” capabilities into consumer-facing Gemini products later in 2025, while OpenAI says its IMO model is a research artefact not slated for near-term release. The next IMO, in 2026, will likely see formal AI participation tracks, and the mathematical community will be watching closely to see whether this year’s leap was a one-off or the start of a new normal in human–machine collaboration on hard problems.
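To make the proof-checking part of that workflow concrete, here is a hypothetical sketch of what a machine-verified lemma looks like in Lean 4 with Mathlib. The lemma, its name, and the numbers are invented for illustration; nothing below comes from either lab’s system.

```lean
import Mathlib

-- Hypothetical illustration of the collaboration loop described above:
-- a person (or a model) proposes a small lemma, and the proof assistant
-- either certifies the argument or pinpoints the gap.
theorem sq_ge_four_of_two_le (n : ℕ) (h : 2 ≤ n) : 4 ≤ n ^ 2 := by
  calc
    4 = 2 ^ 2 := by norm_num
    _ ≤ n ^ 2 := Nat.pow_le_pow_left h 2

-- If any step were unjustified, Lean would reject the proof rather than let a
-- plausible-looking but wrong argument through; tactics such as `exact?` can
-- also suggest library lemmas that close an open goal.
```

The appeal for research mathematics is this division of labour: a model can propose candidate lemmas and proof steps, while the proof assistant’s kernel guarantees that whatever survives checking is correct.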


