A harder kind of math test for AI
A new benchmark called SOOHAK is trying to measure something many popular AI evaluations miss: whether a model can reason through genuinely difficult mathematics and whether it knows when it should refuse to answer. According to reporting by The Decoder, the benchmark was built by a consortium of 64 mathematicians from groups including Carnegie Mellon University, EleutherAI, and Seoul National University.
SOOHAK contains 439 original handwritten tasks. The collection is split into a 340-problem “Challenge” set aimed at graduate and research-level mathematics, and a 99-problem “Refusal” set made up of intentionally flawed problems that contain contradictions or lack enough information for a clear answer. That second section is the more unusual one. It tests whether a model can identify that a task is unsound instead of confidently producing a result anyway.
The benchmark’s creators also tried to reduce the chance that models had already seen the material during training. The Decoder reports that every problem was written from scratch rather than pulled from textbooks or competition archives. Contributors included professors, PhD students, postdocs, and International Mathematical Olympiad medalists, and they were required to confirm they did not use AI assistance while drafting the questions.
Research-level math remains a clear weakness
The reported results show that advanced models still struggle badly once the problems move beyond familiar contest-style territory. On the Challenge set, Google’s Gemini 3 Pro scored 30%, followed by GPT-5 variants at 26%. Claude Opus 4.5 dropped to 10%, while open-weight systems including Kimi-2.5, Qwen3-235B, and GPT-OSS-120B stayed below 15%.
The headline is not that one model narrowly leads another. It is that none of them is consistently strong on this class of unpublished, research-level work. The Decoder reports that 124 of the 340 Challenge problems, roughly 36% of the set, were not solved by any model. That suggests the ceiling on frontier mathematical reasoning is still much lower than recent public narratives around olympiad-level performance may imply.
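For readers who want to see how a "no model solved it" count can be derived, here is a minimal sketch. It assumes each model's results are available as a set of solved problem IDs. The function name, the model names, and the toy data are all hypothetical and are not taken from the benchmark's own tooling, which the reporting does not describe.

```python
from typing import Dict, Iterable, Set


def unsolved_by_all(all_ids: Iterable[str],
                    solved_by_model: Dict[str, Set[str]]) -> Set[str]:
    """Return the problem IDs that no model solved (hypothetical helper)."""
    # Union of everything at least one model got right.
    solved_by_any = set().union(*solved_by_model.values())
    return set(all_ids) - solved_by_any


# Toy data, invented purely for illustration.
problems = ["p1", "p2", "p3", "p4", "p5"]
results = {
    "model_a": {"p1", "p2"},
    "model_b": {"p2", "p3"},
}

stuck = unsolved_by_all(problems, results)
share = len(stuck) / len(problems)  # fraction of problems nobody solved
```

On the reported figures, the same ratio is 124 / 340, or about 36% of the Challenge set.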
The easier companion set, SOOHAK-Mini, paints a different picture. There, top systems cluster much closer together and post substantially higher scores. The sharp drop appears only when the tasks move into less standardized, less pre-digested material. The benchmark authors, according to The Decoder, argue that this may expose weaker transfer to niche unpublished problems, especially among open-weight models.
The refusal problem may matter as much as the solving problem
The benchmark’s most consequential contribution may be its refusal section. In real use, an AI system is not only judged by how often it gets an answer right. It is also judged by whether it recognizes when a request is malformed, contradictory, or impossible to answer from the given information. SOOHAK treats that as a first-class capability.
Here too, the results were weak. The Decoder reports that even the best model remained below 50% on recognizing unsolvable problems. That means leading systems still often prefer to guess rather than identify a missing assumption or contradiction. In practice, that behavior is more dangerous than a visible arithmetic mistake because it can sound authoritative while being structurally wrong.
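To make the distinction between solving and refusing concrete, the sketch below scores the two subsets separately. It is an illustration under stated assumptions, not the benchmark's actual grader: the `REFUSE` sentinel, the `Item` type, the exact-match comparison, and the toy items are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

REFUSE = "REFUSE"  # hypothetical marker for "this problem is unsound"


@dataclass(frozen=True)
class Item:
    item_id: str
    subset: str            # "challenge" or "refusal"
    answer: Optional[str]  # None for intentionally flawed problems


def is_correct(item: Item, response: str) -> bool:
    # A refusal item is passed only by declining to answer;
    # a challenge item is passed only by matching the reference answer.
    if item.subset == "refusal":
        return response == REFUSE
    return response == item.answer


def score(items: List[Item], responses: Dict[str, str]) -> Dict[str, float]:
    """Return accuracy per subset so the two skills are not averaged together."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for item in items:
        totals[item.subset] = totals.get(item.subset, 0) + 1
        # A missing response counts as a wrong answer.
        if is_correct(item, responses.get(item.item_id, "")):
            hits[item.subset] = hits.get(item.subset, 0) + 1
    return {subset: hits.get(subset, 0) / n for subset, n in totals.items()}


# Toy items and responses, invented for illustration.
items = [
    Item("c1", "challenge", "42"),
    Item("c2", "challenge", "7"),
    Item("r1", "refusal", None),
    Item("r2", "refusal", None),
]
responses = {"c1": "42", "c2": "9", "r1": REFUSE, "r2": "13"}
scores = score(items, responses)
```

Reporting the two accuracies side by side keeps a confident guess on a flawed problem from hiding inside an overall average.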
This is a recurring pattern in AI evaluation. As models improve on familiar benchmarks, those benchmarks can stop reflecting the hardest remaining failures. SOOHAK appears designed to push the field away from leaderboards dominated by coverage and memorization and toward tests of abstraction, novelty, and epistemic restraint.
Why this benchmark stands out
- It uses original tasks rather than recycled textbook or contest material.
- It separates ordinary problem solving from refusal behavior.
- It focuses on research-level difficulty instead of only school or olympiad math.
- It highlights that strong performance on easier benchmark sets does not necessarily transfer upward.
If the reported results hold up under wider scrutiny, SOOHAK could become a useful counterweight to increasingly saturated math evaluations. For developers, it points to two unresolved problems: frontier models still hit a wall on unfamiliar high-level mathematics, and they still too often answer when they should stop and explain why no answer is possible.
That combination matters well beyond math. Systems that cannot reliably distinguish solvable from unsolvable requests are likely to make the same kind of error in law, science, engineering, and policy analysis. SOOHAK does not just ask whether AI can solve harder problems. It asks whether AI can recognize the limits of what it knows.
This article is based on reporting by The Decoder.
Originally published on the-decoder.com







