Benchmark
Open access
Theoretical physics benchmark (TPBench)—a dataset and study of AI reasoning capabilities in theoretical physics
Daniel J H Chung
Zhiqi Gao
Yurii Kvasiuk
Tianyi Li
Moritz Münchmeyer
Maja Rudolph
Frederic Sala
and
Sai Chaitanya Tadepalli
Published 2 September 2025
© 2025 The Author(s). Published by IOP Publishing Ltd
Machine Learning: Science and Technology, Volume 6, Number 3
Focus on ML and the Physical Sciences
Citation: Daniel J H Chung et al 2025 Mach. Learn.: Sci. Technol. 6 030505
DOI: 10.1088/2632-2153/adfcb0
Daniel J H Chung
AFFILIATIONS
Department of Physics, University of Wisconsin-Madison, Madison, WI, United States of America
Zhiqi Gao
AFFILIATIONS
Department of Computer Science, University of Wisconsin-Madison, Madison, WI, United States of America
Yurii Kvasiuk
AFFILIATIONS
Department of Physics, University of Wisconsin-Madison, Madison, WI, United States of America
Tianyi Li
AFFILIATIONS
Department of Physics, University of Wisconsin-Madison, Madison, WI, United States of America
Moritz Münchmeyer
AFFILIATIONS
Department of Physics, University of Wisconsin-Madison, Madison, WI, United States of America
NSF-Simons AI Institute for the Sky (SkAI), Chicago, IL, United States of America
EMAIL
muenchmeyer@wisc.edu
Author notes
Author to whom any correspondence should be addressed.
Maja Rudolph
AFFILIATIONS
Data Science Institute (DSI), University of Wisconsin-Madison, Madison, WI, United States of America
Frederic Sala
AFFILIATIONS
Department of Computer Science, University of Wisconsin-Madison, Madison, WI, United States of America
Sai Chaitanya Tadepalli
AFFILIATIONS
Department of Physics, Indiana University, Bloomington, IN, United States of America
Dates
Received
6 April 2025
Revised
7 July 2025
Accepted
18 August 2025
Published
2 September 2025
Peer review information
Method: Single Anonymous
Revisions: 1
Screened for originality?
Yes
2632-2153/6/3/030505
Abstract
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics (TP), focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While current state-of-the-art models are still of limited use for researchers, our results show that AI-assisted TP research may become possible in the near future. We discuss the main obstacles towards this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the data set and score distribution are available on the website of the dataset, tpbench.org.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Automated mathematical reasoning at research level with AI in theoretical physics (TP) may now be within reach. Novel large language model (LLM)-based AI systems, powered by improved AI reasoning techniques at training and inference time, are potentially powerful tools for the TP community. If substantial parts of the theoretical research process could be performed by AI, this would significantly accelerate progress in TP. If AI could act as a fast, reliable and skilled research assistant that can perform theoretical calculations and solve mathematical problems, human researchers could cover substantially more theoretical ground, evaluate more ideas for their promise, and thus make more theoretical discoveries. Even without super-human intelligence, an AI 'craftsman' would allow humans to outsource tedious calculation work and to focus more on creative aspects of the theoretical research process.
Recent advancements in LLMs have allowed models to solve progressively more difficult tasks that require abstract mathematical reasoning. While high-school level math competition benchmarks like MATH [ ] are almost saturated by current models, the focus has recently turned to graduate-level and research-level mathematics. A main data set in this domain, the recently introduced FrontierMath [ ], which contains research-level difficulty problems, is still mostly unsolved by frontier models. In TP, which also requires extensive abstract mathematical reasoning, there has been comparatively less work than in mathematics. Existing benchmarks which include physics, such as JEEBench [ ], OlympiadBench [ ] and PhysicsQA [ ], cover mostly high-school-level problems from college entrance exams or competitions. There is little existing work on mathematical reasoning for TP at graduate or research level. An exception is [ ], where the authors evaluate the performance of LLMs for symbolic calculations in quantum many-body physics, although in the narrow context of a specific physical setting. Very recently, the Humanity's Last Exam (HLE) dataset [ ] appeared as a multi-domain benchmark that includes problems from TP. We provide a more complete list of available data sets in section 5.1.
In the present work, we build a data set to test TP reasoning skill over a broad range of difficulty. We aim to answer the following questions:
How good is the current state-of-the-art AI for problem-solving in TP? Are existing models useful for research-level reasoning?
What are the most common failure modes? For example, do models perform correct reasoning but fail mostly at algebra (at which LLMs are known to perform poorly)?
To answer these questions, we created a new benchmark data set, TPBench, of TP problems of varying degrees of difficulty, from advanced undergraduate to research level. Our problems are novel, in the sense that they do not come from public problem collections (see section 2.5 for detailed comments). For graduate-level and research problems we focus in particular on problems from high-energy physics and cosmology. An important property of our data set is that it provides a continuum of problem difficulty, from easy to research level, which few mathematical data sets do. This allows us to compare the performance of different models over a wide spectrum of difficulty. We invite the reader to skip ahead to appendix C to get an impression of the difficulty of these problems. Before discussing our data set in detail, we begin with some general remarks about reasoning for TP and its relation to AI models.
Differences between reasoning in math and TP.
Because TP is extremely broad and math is arguably even broader, any summary discussion of the differences between mathematical and physics reasoning will admit many counterexamples. Nevertheless, at the level of modern graduate and research work in physics and mathematics, several aspects typically stand out.
Mathematical reasoning tends to focus on establishing exact, broad statements constructed within a rigid logical framework, while TP reasoning mostly deals with approximate, narrower statements constructed within a logical framework in which some of the less quantitatively relevant details are left unspecified but 'most likely' can be filled in such that the statements can be made arbitrarily precise if desired. This difference naturally stems from the different goals of each discipline: a commonly accepted goal of TP is to model nature, while a commonly accepted goal of mathematics is to construct nontrivial, beautiful true statements connecting surprisingly disparate ideas [15]. The emphasis on rigidity is what naturally leads to the format of theorems and proofs in mathematics, while the emphasis on quantitative modeling has allowed the Standard Model of particle physics to make successful predictions despite the evolving nature of its underlying mathematical structure.
TP reasoning primarily relies on techniques of direct computation, while mathematical reasoning more often uses indirect techniques such as contradiction and induction. More explicitly, TP computations often utilize algorithmic methods in calculus, linear algebra, complex analysis, differential equations, differential geometry, and group representation theory.
TP reasoning often focuses on derivations of formulas whose parametric dependences as well as the overall normalization are implicitly defined in a narrow domain of physical relevance. For example, if one writes down a quantum field theory (QFT) Lagrangian and computes observables, the coupling constants with conventional normalization cannot be a large number such as 1000 since such theories are expected to have the field degrees of freedom reorganize into a different effective theory. However, the exact parametric range of validity for the coupling constant is left implicit. This is in contrast with much of mathematical reasoning, where parametric ranges are precisely defined. This makes TP reasoning quite efficient at the expense of imprecision in the domain of validity.
TP typically focuses on approximations whose quantitative uncertainties are often left unspecified. For example, one of the most popular computational techniques in TP is perturbation theory, a type of asymptotic expansion, which often has a zero radius of convergence, and because there is often no exact computation to compare to, there is no rigorous quantitative estimate of uncertainties in most cases. One typically understands the estimate of the uncertainty to be the next order contribution in perturbation theory. Researchers also implicitly understand that there are non-perturbative contributions such as instantons which have an exact representation of zero in perturbation theory that can become important in certain instances.
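The zero-radius-of-convergence point can be made concrete with a classic textbook example (our illustration, not taken from the paper): the Euler–Stieltjes integral and its 'perturbative' series in a small coupling g,

```latex
S(g) \;=\; \int_0^\infty \frac{e^{-t}}{1+gt}\,dt
\;\sim\; \sum_{n=0}^{\infty} (-1)^n\, n!\, g^n .
```

The factorial growth of the coefficients makes the series diverge for every g ≠ 0, yet truncating it at its smallest term approximates the integral well, with an error of the order of the first omitted term, exactly the practical uncertainty estimate described above. The part of S(g) missed by any truncation vanishes to all orders in g, analogous to the instanton contributions mentioned below.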
These properties make TP an exciting testbed for AI reasoning models, one that has not been extensively explored until very recently, perhaps because models were not powerful enough to warrant it.
Generating novel research ideas/problems in TP.
Novel research in TP, as in all fields of science, is usually incremental, and novel research ideas are combinations or further developments of prior work. For example, once a novel method has been invented, it can often be applied to many different problems. Indeed, Feynman advised keeping a list of favorite problems, and checking whether any newly learned technique could be useful for one of these problems [16]. Experienced researchers have an advantage over students at generating interesting research because their knowledge base is much larger and more interconnected. Indeed, what makes a research-level question different from a classroom question is often the novelty and connection with existing knowledge, not the reasoning difficulty. It seems very plausible that machine learning models, with their ability to ingest vast amounts of knowledge during training or inference, could be particularly strong at finding promising combinations of novel results and techniques. A recent study in NLP research [17] found that LLM research ideas are rated more novel (but slightly less feasible) by human experts than human expert ideas. Experienced researchers are also able to judge whether a mathematical result is interesting or surprising and deserves further investigation. Such 'theoretical taste' may be beyond existing AI models. With our data set, we are not currently aiming to test these aspects of theoretical research.
Reasoning abilities required to solve research problems in TP.
Researchers (consciously or unconsciously) have a number of techniques or heuristics to solve theoretical problems. A famous collection of problem-solving techniques and advice is George Polya's book How to Solve It [18], which lists about 50 heuristics with suitable examples in mathematics. Techniques include decomposing the problem, finding a related problem, generalization, and many less obvious ones. Most researchers have a more limited toolkit than Polya, and many novel papers are somewhat straightforward combinations of reasoning steps contained in previous works. A main difficulty in this case is to understand this prior work and be able to recall and connect it when needed. Of course, insights are also often re-discovered independently. When solving a hard problem, researchers may try many different paths or heuristics, jump back and forth in their reasoning chain, analyze examples, answer subquestions, clear up their misunderstandings, read related literature, etc. In principle, given a large enough context window for prior thoughts and unlimited inference time, LLMs may be able to perform such very long thought processes, but currently available models (with a public reasoning chain) do not show very deep thought processes in our experience.
Technical (calculation) abilities required to solve research problems in TP.
Once a mathematical reasoning step has been proposed, it needs to be executed correctly. This step is in principle straightforward but error-prone for most humans. For example, one may decide to Taylor expand an expression to third order, perform a Gaussian integral, re-arrange terms, or even just multiply numbers. LLMs are well known to perform poorly at such tasks, but this problem can in principle be fixed by using computer algebra systems, if they can work with the required mathematical objects (which, however, is often not the case in TP).
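As an illustration (ours, not part of the paper's pipeline), routine steps of exactly this kind can be delegated to a computer algebra system; the snippet below uses standard SymPy calls:

```python
# Illustrative sketch: delegating routine calculation steps to a CAS.
import sympy as sp

x, a = sp.symbols('x a', positive=True)

# Taylor expansion to third order: exp(x)*cos(x) around x = 0
# gives 1 + x - x**3/3 + O(x**4).
cubic = sp.series(sp.exp(x) * sp.cos(x), x, 0, 4).removeO()

# Gaussian integral over the real line:
# integral of exp(-a*x**2) dx from -oo to oo equals sqrt(pi/a).
gauss = sp.integrate(sp.exp(-a * x**2), (x, -sp.oo, sp.oo))

print(cubic, gauss)
```

A third-order expansion or a Gaussian integral is precisely where an LLM is prone to slip, while a CAS executes it reliably, provided the objects involved are ones the CAS can represent.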
Observations from our evaluation.
We list some observations from our experiments, which we discuss in more detail in the following sections.
Progress has been very rapid with the most recent models. When we initiated this project, GPT-4o [19] (released in May 2024) was state-of-the-art and unable to solve almost any TP problem beyond undergraduate level. When the o1-preview model [20] (released in September 2024) appeared, it could solve many easy graduate-level problems, but rarely any harder ones. The o3-mini series [21] (released in January 2025) is able to solve about half of our advanced graduate-level problems and even a few research problems. Nevertheless, as we will see, research problems involving long mathematical arguments are generally unsolved.
Symbolic calculation mistakes. Existing models are known to perform poorly at mathematical calculations (see e.g. [22]), which could be performed correctly with a computer algebra system such as SymPy or Mathematica. Such wrong intermediate results then lead to incorrect follow-up reasoning. It should be noted that humans tend to make similar mistakes in calculations, but are often able to spot them on revisiting. We made an initial attempt to encourage symbolic verification with Python, which we describe in section 3.3, but found that it barely improved results. Better symbolic tool integration would be very beneficial for TP reasoning.
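A minimal version of such a symbolic check (our illustration, not the paper's actual prompt tooling) asks whether a claimed intermediate identity holds:

```python
# Illustrative sketch: verifying a claimed intermediate identity with SymPy.
import sympy as sp

x = sp.symbols('x')

# Suppose a model claims the power-reduction identity
#   sin(x)**4 == (3 - 4*cos(2*x) + cos(4*x)) / 8.
claimed_lhs = sp.sin(x)**4
claimed_rhs = (3 - 4*sp.cos(2*x) + sp.cos(4*x)) / 8

# If the simplified difference is 0, the identity is confirmed;
# a nonzero residue would flag the step for re-derivation.
difference = sp.simplify(claimed_lhs - claimed_rhs)
print(difference)
```

Such checks only help for objects a CAS can represent; as noted above, for many TP expressions (tensors, covariant derivatives, functional integrals) this is exactly what is missing.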
Logical mistakes and lack of information about uncertainty. LLMs are generally poor at self-correcting [23] and typically cannot provide very useful information about where they are uncertain [24]. Many techniques have been proposed to mark mistakes (such as asking a different model to verify) [25–28], and for mathematical reasoning it would be particularly important to improve and include them. For lengthy reasoning chains, logical errors are a significant problem because human experts often need to perform solutions in detail themselves before being able to spot errors. Humans, in contrast, are often aware of where in a derivation they are uncertain, and can ask for help or investigate further themselves.
The paper is organized as follows. In section 2 we discuss the properties of our data set, including the origin of problems and our approach to verification and grading. In section 3 we benchmark popular closed-source and open-source models on this data set. In section 4 we analyze the output of these models in more detail, and categorize their failure modes. In section 5 we discuss related work. Finally, in section 6 we discuss future directions to improve AI-based reasoning in TP.
2. Properties of TPBench
2.1. Overview
We have curated a dataset of problems and associated solutions in main areas of TP. For research-level problems we currently focus on high-energy theory and cosmology, the main expertise of the authors. Problems in our collection should have the following properties (similar to FrontierMath [ ]):
1.
The problem is well-posed and the solution to the problem is unambiguous. An expert in the field, after reading the solution, should not have any objections.
2.
The problem is original. The solution to the problem cannot be easily found in the existing literature.
3.
The answer should be auto-verifiable. This is easily achieved for numerical answers or simple algebraic expressions, but more difficult for tensor expressions. We discuss this property further below.
4.
It should not be possible to guess the answer or remember it from the literature, despite a wrong reasoning chain.
It is hard to strictly enforce all these conditions in TP, as we discuss further below. Problem originality and the possibility to guess the answer can be judged differently by different researchers. For this reason we also provide metadata for each problem individually. We point out potential shortcomings in instances where we are aware of them. We include problems of varying degrees of difficulty, from undergraduate to graduate and to research problems. Naturally, research problems are more difficult to create, especially when requiring the answers to be novel and unpublished. Furthermore, more difficult problems are often more novel than easier problems (since the space of possible problems grows rapidly with their complexity). We discuss the aspect of novelty of our problems in more detail below, as well as individually in the problem metadata. We also make sure that our problems do not contain steps where a human would need a calculator to solve them (e.g. no floating point operations).
We now discuss the attributes of our data set in more detail, including their statistical distribution. We aim to enlarge and diversify the data set further in the future. We also provide ten sample problems in appendix C, and we encourage the reader to browse them to get an impression of the whole data set.
2.2. Problem statistics
The dataset is categorized into five difficulty levels: 1—easy undergrad, 2—undergrad, 3—easy grad, 4—grad, and 5—research. This classification ensures that the dataset can accommodate a wide range of use cases, from introductory studies to cutting-edge research challenges. The distribution of problems across these difficulty levels is detailed in table 1. For difficulty levels 1–4 this means that the problem could appear in a homework problem or exam for students. For level 5, the problem could appear as a nontrivial step in a publication: i.e. our research-level problems are sub-problems that would constitute part of a publication, and are not by themselves large enough to constitute an entire publication. Solving level 4 and 5 problems would make models useful for theoretical research, but would not mean that models could write their own publishable papers (by a significant margin). Indeed, one of the most important steps in TP research is establishing why a particular question is important and organizing a string of level-5-type steps to answer that question. Future iterations of this data set could include more open-ended research problems, more reminiscent of a research publication.
Table 1. Distribution of problems by difficulty level.

Difficulty level        Number of problems   Percentage
1—Easy undergrad        8                    14.0%
2—Undergrad             13                   22.8%
3—Easy grad             11                   19.3%
4—Grad/easy research    14                   24.6%
5—Research              11                   19.3%
The problems in the dataset span specialized domains, including cosmology, high energy theory, and general relativity. The less difficult problems span a wide area including astrophysics, electromagnetism, quantum mechanics, statistical mechanics, and classical mechanics. This domain-specific focus ensures the dataset's relevance to theoretical research related to the fundamental laws of nature, while the less difficult problems allow us to establish as a baseline what a successful AI performance looks like. Table 2 provides an overview of the distribution of problems by domain. In the future, we aim to include problems from other domains of TP, such as condensed matter theory.
Table 2. Distribution of problems by domain. The 'Other' category includes astrophysics, electromagnetism, quantum mechanics, statistical mechanics, and classical mechanics. Many problems are in between areas. For example, some Cosmology problems could also be classified as High Energy Theory.

Domain                Number of problems   Percentage
Cosmology             19                   33.3%
High energy theory    18                   31.6%
General relativity    4                    7.0%
Other                 16                   28.1%
The dataset includes problems from various sources, in particular unpublished research, private coursework, and recently published research papers. Almost half of the problems are novel (e.g. most of the level 3, 4, and 5 problems), having been created specifically for this dataset, while others draw on course-related material of the authors. A small number of problems have been taken from very recent publications (e.g. [29]).
2.3. Auto-verification of solutions
To automate the evaluation pipeline, we developed a system inspired by how coding competitions validate their results. We introduced the requirement that the final answer to each problem be provided as a Python callable with a specified signature. We then developed a simple automatic (not LLM-based) grading agent that, given the model's answer and the correct solution, extracts the code, then creates and executes a consistency-check script. This approach allows for efficient evaluation of algebraic answers and automatically ensures that equivalent correct answers are classified as such. Additionally, it is flexible enough to verify answers involving a variety of special functions or answers that involve several outputs. In some problems, the natural system of units (ℏ = c = 1) is specified in the prompt, while in other cases we pass constants of nature as function arguments to be unit agnostic. Alternatively, we could have adopted other automatic verification strategies. We could have provided numerical test cases in the prompt, but this would have led to lengthy problem statements, floating-point operations, and much less flexibility. Another option is to consider multiple-choice answers, but this would make it easier to guess the answer without detailed understanding. Yet another possibility is to use another LLM as a grading agent and instruct it to compare the given solution to the true one. However, we found that this approach is very error-prone and LLMs are often not able to check mathematical equivalence of expressions (see below).
Our proposed scheme gives the flexibility to check a variety of classes of answers exactly. The verification process consists of three components:
1.
Code extraction:
The system extracts Python functions from both the model’s solution and the expert solution.
2.
Test case execution:
Both functions are executed with identical test inputs across multiple parameter combinations.
3.
Output comparison:
Results are compared numerically with appropriate tolerances for floating-point arithmetic.
Each problem in our dataset is accompanied by a comprehensive set of test cases, carefully designed to probe both the physical validity and mathematical correctness of solutions. These test cases span different parameter ranges (e.g. negative or complex arguments where appropriate), to ensure thorough verification.
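The comparison step can be sketched as follows; the function names and tolerance are our own illustrative choices, not TPBench's actual implementation:

```python
# Illustrative sketch of test-case execution and output comparison:
# run model and reference callables on shared inputs and compare numerically.
import math
import random

def outputs_match(model_fn, reference_fn, test_inputs, rel_tol=1e-6):
    """Return True if the two callables agree on every test input,
    up to floating-point tolerance; runtime errors count as failure."""
    for args in test_inputs:
        try:
            got = model_fn(*args)
        except Exception:
            return False
        if not math.isclose(got, reference_fn(*args), rel_tol=rel_tol):
            return False
    return True

# Two algebraically equivalent answers are graded as the same solution.
reference = lambda a, b: (a + b)**2
model     = lambda a, b: a*a + 2*a*b + b*b
inputs = [(random.uniform(0.1, 10.0), random.uniform(0.1, 10.0))
          for _ in range(25)]
print(outputs_match(model, reference, inputs))  # True
```

Comparing at sampled inputs is what makes the scheme robust to superficially different but equivalent algebraic forms, which an LLM grader often fails to recognize.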
To illustrate this approach, consider the following undergraduate-level example:
Problem statement:
A photon with energy ℏω scatters on an electron at rest at angle θ in the electron's reference frame. Find the angular frequency ω′ of the scattered photon.
Answer requirements:
Provide the answer in the form of a Python function with the specified signature.
Model answer:
This example demonstrates several key aspects of our auto-verification approach. First, the problem statement is clear and unambiguous, requiring a specific physical quantity (ω′) to be calculated. Second, the answer requirements explicitly specify the expected format of the solution, including the function signature and parameter types. This standardization enables automated testing across different parameter regimes. Third, the model answer provides both the analytical expression and its implementation in Python code, allowing for direct numerical verification.
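For concreteness, a solution function for this problem might look like the following; the signature is hypothetical (the benchmark specifies its own required signature in the answer requirements), but the physics is the standard Compton result ω′ = ω / (1 + (ℏω / m_e c²)(1 − cos θ)):

```python
# Hypothetical answer function; the actual TPBench signature may differ.
import math

def omega_prime(omega, theta, m_e, c, hbar):
    """Angular frequency of the Compton-scattered photon.

    Standard result: omega' = omega / (1 + (hbar*omega/(m_e*c**2)) * (1 - cos(theta))).
    Constants of nature are passed as arguments to stay unit agnostic.
    """
    return omega / (1.0 + (hbar * omega / (m_e * c**2)) * (1.0 - math.cos(theta)))

# Forward scattering (theta = 0) leaves the frequency unchanged;
# backscattering (theta = pi) with hbar*omega = m_e*c**2 gives omega/3.
print(omega_prime(1.0, 0.0, 1.0, 1.0, 1.0))      # 1.0
print(omega_prime(1.0, math.pi, 1.0, 1.0, 1.0))  # ~0.3333
```

An answer in an equivalent but rearranged algebraic form would produce the same numerical outputs and therefore pass the automatic check.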
Furthermore, our verification system incorporates several safeguards to ensure reliable evaluation:
Timeout mechanisms:
Each function execution is limited to a maximum runtime of 30 s. This prevents infinite loops based on the model’s incorrect reasoning while allowing sufficient time for complex calculations.
Error handling:
The system catches and classifies runtime exceptions, including syntax errors and memory issues. Invalid solutions are automatically flagged as incorrect.
Parameter space coverage:
Test cases are generated to cover different regimes of the parameter space while maintaining numerical stability.
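A minimal Unix-only version of the timeout safeguard can be sketched with `signal.alarm` (our simplification; a production grader would more likely isolate each run in a separate process):

```python
# Illustrative sketch of a timeout safeguard (Unix only, main thread only).
import signal

class ExecutionTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise ExecutionTimeout()

def run_with_timeout(fn, args, timeout_s=30):
    """Return fn(*args), or None if it exceeds the time budget."""
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)          # deliver SIGALRM after timeout_s seconds
    try:
        return fn(*args)
    except ExecutionTimeout:
        return None                  # flag the run as failed
    finally:
        signal.alarm(0)              # always cancel any pending alarm

def busy_loop():
    while True:                      # stands in for a model answer that hangs
        pass

print(run_with_timeout(pow, (2, 10), timeout_s=5))   # 1024
print(run_with_timeout(busy_loop, (), timeout_s=1))  # None
```

Returning None rather than raising lets the grading loop record the attempt as incorrect and move on to the next problem.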
While our verification system works well for many problems, certain TP problems present challenges:
Tensor expressions:
Problems involving abstract tensor expressions (e.g. the metric $g_{\mu\nu}$ or the Riemann tensor $R_{\mu\nu\rho\sigma}$) often have multiple equivalent representations due to symmetries. For instance, the Riemann tensor $R_{\mu\nu\rho\sigma}$ exhibits several symmetries, including certain index permutations and the Bianchi identity.
Differential expressions:
Verification of expressions involving derivatives, especially of fields, presents special challenges. Derivative expressions must satisfy constraints such as the product rule, chain rule, and metric compatibility, and must recognize the group representation of the field the derivative acts on: e.g. the covariant derivative $D_\mu \phi$ has a different elementary calculus expression than $D_\mu \psi$, even for the same gauge group and symbol $D_\mu$, if $\phi$ and $\psi$ transform in different representations of the group. Indeed, in situations where the intermediate result is a differential equation in a system with gauge invariances, knowing whether two sets of differential equations (here the differential equation itself being the solution to the physics-related problem) are physically equivalent can become nontrivial.
Integral expressions:
Integral expressions would be even more difficult to check numerically than differential expressions. Furthermore, they share many of the same verification challenges as differential expressions in terms of equivalence classes. For the typical case of fields vanishing at infinity, there are also equivalences up to total derivative terms: e.g. $\int d^4x\, \partial_\mu\phi\, \partial^\mu\phi = -\int d^4x\, \phi\, \Box\phi$.
Manifolds:
Furthermore, in cases where the solution to the problem is a manifold (often expressed as a metric), there is an infinite number of different equivalent algebraic expressions depending upon the coordinates used. An example of this can be seen in asking for a non-compact, static, spherically symmetric, asymptotically flat vacuum solution to the Einstein equations which has a Komar mass of $M$. A more abstract related situation of a difficult-to-identify equivalence class is when two quantum field theories can be mapped to one another by integrating in and out different degrees of freedom (which abstractly covers the situation of renormalization-group equivalence as well).
Although this list covers the cases most common for TP problems in the literature, it can be extended depending on the classes of mathematical objects that need to be covered. The obvious common theme is the wealth of equivalence classes that the verification system needs to be aware of if it is to be generally applicable.
In our current data set, we only include problems where the above issues do not occur, i.e. where the final answer is an algebraic expression without tensors, derivatives, integrals, or manifolds. Of course, these objects do occur in the solution, but not in the final answer. In the future, it would be interesting to develop auto-verifiers for expressions involving these more general mathematical objects listed above. We have reserved a number of such problems for future iterations of the dataset that would be useful for testing dedicated more general verification codes.
2.4. AI-based holistic grading of the entire solution
In addition to auto-verification, we also employ AI-based grading. In this process, the grader model has access to both the expert-labeled solution and the LLM-generated solutions from a separate model, and is tasked with assigning grades. This approach mirrors how a human teaching assistant grades homework, where partial credit is given for correct reasoning steps, even if the final solution is incorrect. Moreover, holistic grading can identify instances where a solution arrives at the correct answer using incorrect reasoning, which occurs in a small number of our problems. While holistic grading is conceptually preferred, we observe significant disagreement between different grader models, as well as with human graders.
2.5. Novelty and difficulty of our problems
Most of the problems presented here are constructed based on those given in standard courses as well as unpublished research-related notes. For example, the solution to the research-level problem 'One pole problem' (see appendix C.1), without steps explained, is given in a footnote of [30]. Most of the research-level problems would be readily doable by a good TP graduate student, and some of these are not much different from hard problems in graduate courses whose problems and solutions can be found publicly. However, we have made significant efforts to construct or modify problem statements so that the answers cannot be found by web search. Most of the research-level problems use typical or not-too-atypical notation to simulate a research setting, although this may facilitate literature recall (rather than reasoning) by the model.
The difficulty of a problem can vary along different axes, i.e. problems may be easy or hard for different reasons. We aimed to provide a sampling of this space:
Some of the problems are difficult for a human researcher because they may not know that a similar problem has already been solved in the literature. Indeed, almost all solution techniques used in literature evolve over time incrementally as people build upon results of previous related computations. This gives LLMs an advantage for many problems, especially if the problem statement makes it clear what literature knowledge is required (which we try to avoid). Fortunately, publications often omit minor reasoning steps, and asking the model for detailed mathematical derivation can thus reveal such literature memory. For examples of models solving difficult problems by using ‘superhuman literature knowledge’ see section
4.6
. Indeed, a key challenge in constructing this data set was to avoid this phenomenon as much as possible, to reveal true reasoning.
Another obvious and often-encountered difficulty in research is simply the accuracy of routine calculus and algebraic manipulations. The probability of errors increases with the number of steps needed to reach the answer, as well as with the number of variables involved. LLMs currently do not perform very well on such long calculations.
TP problems tied to a truly physical setting (e.g. an experimental setting) contain a larger number of seemingly disorganized variables, in contrast with more formal TP problems, which contain a well-organized set of variables (typically exploiting group-theoretic structure). Some of our problems have been designed specifically to test whether the AI can reason with a seemingly disorganized set of variables.
Some of the problems have been given with a great deal of contextual information (such as the ‘One pole problem’ in appendix
C.1
), but others require a much more contextual interpretation (e.g. appendix
C.2
). In some sense, such ‘less specific’ problems are similar in difficulty to the problems requiring literature recall. If the LLM pattern-matches the words in the problem to solution patterns in the literature, it can be deemed to have understood the context.
The ‘One pole problem’ (appendix C.1) also tests diagrammatic reasoning skills, which are slightly more abstract than Feynman rules. It belongs to a small number of problems in our data set where humans would use the help of diagrams to reason through them, and their expert solutions sometimes contain diagrams, usually in the
TikZ
LaTeX format. More generally, graphical languages such as TikZ (particularly with its Feynman diagrammatic extension TikZ-Feynman [
31
] and other such extensions) might be a good language with which to develop an LLM’s graphical reasoning skills because of its efficiency in capturing the mathematical content of the diagrams.
2.6. Public and private data set and data leakage concerns
We make 10 of our problems and solutions public (see appendix
and
tpbench.org
), two for each difficulty level, so that they can be used to understand the data set, develop inference algorithms, and examine failure modes. Naturally, these problems will become part of future training data. To address this challenge, we also keep a large part of our data set private, currently about 50 problems. If you would like to evaluate your model on our private data set, please contact the authors directly.
Guaranteeing that private data does not end up in future training data is challenging. OpenAI, which we have used extensively, adds user-interface chats to its training data but does not add API calls. Correspondingly, we have generally used API calls for querying problem solutions. However, in early phases of this project, some problems were run in the user interface. In future iterations of this project, we will emphasize data leakage control further, especially for research-level problems. We note that a small number of research problems is sufficient to evaluate significant model progress, as long as data set leakage control and originality are flawless for these problems. For our current problem set, we only enforce that problems (and especially solutions) are not publicly accessible online. Furthermore, we took particular care that expert solutions to problems were never passed to the ChatGPT user interface, where they could be added to future training data.
3. Model performance evaluation
In this section, we evaluate the performance of several leading models on our dataset, TPBench, across five different difficulty levels, ranging from undergraduate to research-level problems. Closed-source models include OpenAI GPT-4o, o1, and o3-mini [
19
21
]. Open-source models that we were able to run locally on our hardware include small and intermediate-sized Llama 3.1, Qwen 2.5, and Qwen-QwQ, an experimental LLM focused on advancing reasoning, developed by the Qwen team [
32
34
]. We also include the recent open-source reasoning model DeepSeek (DS) R1 [
35
] and its base-model DS V3 [
36
] which we ran via the
Together AI
API. Finally, we tried to solve a subset of our research problems with OpenAI’s Deep Research, including the problem in appendix
C.1
, primarily to spot solutions that could be found online. Deep Research was not able to solve any of these research problems. We believe our subset of models is representative of the spectrum of current LLM capabilities.
We provide the prompts for inference in the appendix
. The complete model answers from all models, for the public problems, can be found on the
tpbench.org
website. The evaluation considers two grading schemes:
answer-only
and
holistic.
Answer-only (auto-verified) evaluation
. In the answer-only evaluation, models are tasked with producing a final answer to the problem, and correctness is assessed by whether the model’s answer matches the expected correct solution. This evaluation process is fully automated as described in section
2.3
, with the correctness of the answer validated through numeric verification by the program.
Holistic AI-based grading.
In the holistic grading approach, we assess the reasoning process and the steps taken by the models. A separate LLM is provided with the problem statement, the expert solution, and the model’s solution. It then evaluates the model’s answer on a grading scale ranging from A to D. This grading scale accounts not only for the correctness of the final answer but also for the quality of reasoning, intermediate steps, and the overall approach. Holistic grading is more lenient with minor errors or missing intermediate steps, and it provides partial credit for well-reasoned solutions even if the final answer is incorrect.
Each of these two choices is imperfect on its own. The first may count as correct a solution that contains two or more mutually canceling mistakes, or one that arrived at the correct answer through inconsistent or false reasoning. If the task is to evaluate reasoning in challenging problem solving, a binary grading system may not be satisfactorily representative. The second has the disadvantage of being somewhat arbitrary in how grades for partial correctness are assigned. Our core results use the answer-only scheme.
3.1. Results for auto-verified solutions
We begin by discussing the answer-only results, which are the key empirical results of this paper. Our results are obtained using zero-shot reasoning where the model is given the problem statement and expected to reason through it without any prior examples. In fact, few-shot learning can degrade general performance in reasoning models [
35
]. We have experimented with prompt optimization, but found no significant differences (see appendix
for our prompts).
Table 3 presents the performance of each model across various difficulty levels, ranging from easy undergraduate problems (Level 1) to research-level problems (Level 5). The table reports the percentage of problems solved by each model. The columns labeled ‘avg@5’ give the average score across five attempts, while the ‘best@5’ columns give the average score of the best attempt out of five. We visualize the ‘average of five’ solution percentage in figure 1 (strong models) and figure 2 (common open-source models). Finally, for our public problems, the individual results of the models are given in the appendix. For example, we include one level 5 research problem that top models can solve and one that they cannot.
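The two metrics can be illustrated with a small, hypothetical score matrix (the data below is made up for illustration; it is not from our evaluation runs):

```python
import numpy as np

# Hypothetical results for one difficulty level: rows are problems,
# columns are the 5 attempts; entries are 1 (auto-verified correct) or 0.
attempts = np.array([
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
])

# avg@5: mean score over all attempts of all problems.
avg_at_5 = attempts.mean()

# best@5: best attempt per problem, averaged over problems.
best_at_5 = attempts.max(axis=1).mean()

print(avg_at_5)   # 8/15 ~ 0.533
print(best_at_5)  # 2/3  ~ 0.667
```

best@5 always upper-bounds avg@5; a large gap between the two indicates high run-to-run variance on the same problems.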
Figure 1.
Accuracy of SOTA models by difficulty level.
Note:
‘high’ in brackets indicates reasoning effort.
Figure 2.
Accuracy of common open-source models by difficulty level.
Table 3. Fraction of problems solved for each difficulty for each model. Levels: 1 = easy undergrad, 2 = undergrad, 3 = easy grad, 4 = grad, 5 = research.

| Model | L1 avg@5 | L1 best@5 | L2 avg@5 | L2 best@5 | L3 avg@5 | L3 best@5 | L4 avg@5 | L4 best@5 | L5 avg@5 | L5 best@5 |
| GPT-4o | 0.75 (0.12) | 0.88 | 0.86 (0.17) | 1.00 | 0.25 (0.16) | 0.45 | 0.09 (0.13) | 0.29 | 0.00 (0.00) | 0.00 |
| o1 (high) | 0.85 (0.05) | 0.88 | 0.97 (0.04) | 1.00 | 0.76 (0.24) | 1.00 | 0.34 (0.13) | 0.50 | 0.18 (0.07) | 0.27 |
| o3-mini (high) | 0.97 (0.05) | 1.00 | 1.00 (0.00) | 1.00 | 0.87 (0.13) | 1.00 | 0.57 (0.09) | 0.64 | 0.15 (0.12) | 0.27 |
| DeepSeek-R1 | 0.95 (0.06) | 1.00 | 0.98 (0.03) | 1.00 | 0.76 (0.23) | 0.91 | 0.49 (0.20) | 0.64 | 0.07 (0.08) | 0.18 |
| DeepSeek-V3 | 0.72 (0.15) | 0.88 | 0.80 (0.23) | 1.00 | 0.29 (0.29) | 0.64 | 0.11 (0.06) | 0.21 | 0.00 (0.00) | 0.00 |
| Llama-3.1-8B | 0.30 (0.06) | 0.38 | 0.18 (0.20) | 0.46 | 0.02 (0.04) | 0.09 | 0.00 (0.00) | 0.00 | 0.00 (0.00) | 0.00 |
| Llama-3.1-70B | 0.45 (0.36) | 0.88 | 0.52 (0.22) | 0.77 | 0.11 (0.11) | 0.27 | 0.04 (0.06) | 0.14 | 0.00 (0.00) | 0.00 |
| Qwen2.5-7B | 0.10 (0.11) | 0.25 | 0.40 (0.21) | 0.62 | 0.04 (0.07) | 0.18 | 0.00 (0.00) | 0.00 | 0.00 (0.00) | 0.00 |
| Qwen2.5-72B | 0.60 (0.11) | 0.75 | 0.42 (0.23) | 0.77 | 0.24 (0.16) | 0.36 | 0.04 (0.06) | 0.14 | 0.00 (0.00) | 0.00 |
| QwQ-32B | 0.62 (0.21) | 0.75 | 0.60 (0.27) | 0.92 | 0.07 (0.15) | 0.36 | 0.01 (0.03) | 0.07 | 0.00 (0.00) | 0.00 |

Note: the number in brackets is the average standard deviation of model attempts per problem.
For the top models, o1, o3-mini and DS R1, undergraduate problems (levels 1 and 2) are now essentially solved, with performance of 95% to 100% for the oX models. For easy graduate problems (level 3), the performance is around 80%. For our level 4 graduate problems, some of which could appear in research investigations, the best models o1 and o3-mini solve around 50%, with o3-mini slightly beating o1. Research problems are mostly unsolved at this stage with a score around 15%. o1 slightly beats o3-mini here, which may be due to it having a larger literature knowledge to draw on.
Among mid-range models, GPT-4o and DS-V3 perform similarly. They are between one and two difficulty levels less capable than the top models, and are essentially unable to solve problems above the easy graduate level. Finally, lower-parameter public models, which have the advantage that researchers can run them on individual GPUs, cannot solve problems above the undergraduate level. We also provide further model evaluation statistics on the website, including a unified model score over all difficulties.
3.2. Results for holistic AI-based grading
Table 4 presents the results for the holistic AI-based grading, which involves assigning letter grades (A to D) based on the quality of reasoning and the correctness of the solution. This grading is not limited to the final answer but considers the overall approach taken by the model in solving the problem. We used GPT-4o, currently a mid-range model, as the grader; we chose it for cost-efficiency reasons, and in the future we intend to use the most powerful available model as a grader. The model was provided the grading prompt (appendix
), the expert solution, and the model solution to grade, similar to the way a human teaching assistant would work.
Table 4.
Letter grade received for different models.
Columns: per-grade counts for difficulty levels 1-Easy undergrad, 2-Undergrad, 3-Easy grad, 4-Grad, and 5-Research.
Models: GPT-4o, o1 (high), o3-mini (high), DeepSeek-R1, DeepSeek-V3, Llama-3.1-8B, Llama-3.1-70B, Qwen2.5-7B, Qwen2.5-72B, QwQ-32B.
(The individual grade-count entries of this table could not be recovered from the source and are omitted here.)
Note:
the number of attempts per level equals 5 shots times the number of problems in the level (see table
).
The models’ performances are shown across the five difficulty levels. The letter grades represent the models’ ability to produce correct solutions while demonstrating sound reasoning. An ‘A’ indicates an excellent solution with minimal to no errors, a ‘B’ suggests a good solution with minor mistakes, a ‘C’ indicates a solution with significant flaws, and a ‘D’ represents a fundamentally incorrect solution.
In principle, the holistic grading system provides insights into the models’ reasoning capabilities beyond just final correctness. However, we find some difficulties with holistic grading as we now describe. This is consistent with results showing that LLM-as-a-judge approaches have considerable bias [
37
].
Table 5 and the corresponding bar chart in figure 3 summarize how the automatically verified results (Correct vs. Incorrect) align with the letter grades (A to D) assigned by the AI-based holistic grading. For A-graded solutions, a large fraction (80.1%) aligns with the auto-grader’s correct verification. By contrast, B- and C-graded solutions show substantially lower correctness rates (16.3% and 4.9%, respectively). In the D category, an overwhelming 99.5% fail the auto-grader’s check, indicating that both holistic assessment and numeric verification typically reject these solutions.
Figure 3.
Stacked bar chart showing the number of solutions verified as correct (green) versus incorrect (red) across each letter grade.
Table 5. Grade verification results. Percentages in parentheses indicate the distribution of verification outcomes within each grade category.

| Grade | Correct | Incorrect | Total |
| A | 880 (82.2%) | 190 (17.8%) | 1070 |
| B | 61 (43.6%) | 79 (56.4%) | 140 |
| C | 74 (6.4%) | 1075 (93.6%) | 1149 |
| D | 5 (1.0%) | 486 (99.0%) | 491 |
| Total | 972 (34.1%) | 1878 (65.9%) | 2850 |
Note:
The total number 2850 results from 5 attempts for each of the 57 problems in the data set across 10 models.
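The row percentages of table 5 can be recomputed directly from the raw counts; the snippet below does this (the A–D labels are used here as assumed in our grading scale):

```python
# Recompute the per-grade row percentages of table 5 from the raw counts.
# Keys are the letter grades; values are (correct, incorrect) counts.
counts = {
    'A': (880, 190),
    'B': (61, 79),
    'C': (74, 1075),
    'D': (5, 486),
}
percentages = {
    grade: round(100 * ok / (ok + bad), 1)
    for grade, (ok, bad) in counts.items()
}
print(percentages)  # {'A': 82.2, 'B': 43.6, 'C': 6.4, 'D': 1.0}
```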
Overall, there is a strong correlation between higher letter grades and positive verification outcomes, which validates that the quality assessed by the AI-based grading system generally corresponds to the auto-grader’s numeric correctness checks. At the same time, deviations exist in each category. For example, nearly 20% of A-graded solutions fail the numeric check, often because the AI holistic grader failed to correctly determine whether two answer expressions are equivalent, typically when the expressions are overly complex. Conversely, a small fraction of lower-graded (C or D) responses may be mathematically correct in their final form yet insufficiently justified in intermediate steps, causing the holistic grader to assign a low grade despite correct numerical output.
Our findings illustrate that automatic verification and holistic AI-based grading are generally consistent: higher-quality solutions are confirmed as correct more frequently, while lower-quality solutions often fail numeric checks. Our current GPT-4o grader, however, has significant shortcomings. By cross-checking the grader against human grading, we find that LLM grading works reasonably well for solutions of low difficulty (levels 1 to 2), but is not reliable at levels 4 or 5. It seems likely that GPT-4o is not strong enough to understand the logic of these higher-difficulty problem solutions. Even when the grading model is as strong as the solver model, the success of holistic grading could be limited: LLMs are generally not very good at correcting their own results, as has been studied for example in [
23
]. In the present work, we thus focus on the auto-verifier results, and leave detailed exploration of holistic grading to future work.
3.3. Augmenting inference with python to reduce algebraic mistakes
We experimented with instructing models to break down calculations into smaller steps and verify these with python. Using a code interpreter was previously found to be beneficial in reducing algebraic mistakes in calculations (e.g. [
38
]). Our approach was based on the
MathChat
[39
] framework and prompt tuning. We instructed the model to write python (particularly
SymPy
) code for each calculation step and verify its result using this code. In a few cases, for low difficulty problems, our approach was able to spot and correct mistakes. However, more often the approach disrupted the reasoning chain and led to worse results. For complicated problems, models struggled to identify steps that can be checked with
SymPy
. We note that our problems do not include floating point calculations, where verification would be straightforward, but require more complicated algebraic operations. Recently, the FrontierMath paper [
] included a set of prompts to encourage LLMs to verify with python, but noted that advanced models barely made use of this possibility. While human theorists do sometimes check their results with computer algebra systems, especially
Mathematica
, this process is not straightforward, and there is likely limited existing training data for this approach. We aim to experiment with few-shot inference or fine-tuning in the future, showing the model handcrafted examples of
SymPy
or
Mathematica
verification in the prompt. Since our current
MathChat
-based results are not stable, we chose to defer this direction to future work.
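A minimal sketch of the kind of step verification we prompted for: the model states an intermediate identity, then confirms it with SymPy. The derivative identity below is a hypothetical intermediate step chosen for illustration, not one of our problems.

```python
import sympy as sp

x = sp.symbols('x')

# A claimed intermediate step: the model asserts that
#   d/dx [ x * exp(-x**2) ] = (1 - 2*x**2) * exp(-x**2)
lhs = sp.diff(x * sp.exp(-x**2), x)
rhs = (1 - 2 * x**2) * sp.exp(-x**2)

# simplify(lhs - rhs) == 0 confirms the step symbolically.
assert sp.simplify(lhs - rhs) == 0
```

The difficulty we observed is not in writing such checks, but in getting models to decompose a long derivation into steps that are individually checkable in this way.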
4. Failure mode analysis
We now discuss common classes of mistakes. We present a few examples highlighting the various types of errors that the LLMs make while attempting to solve problems in TP. We broadly classify these errors into four classes as shown below. Our examples mostly draw from GPT-4o and o1 model results.
4.1. Background knowledge of the model
Background knowledge is a strength of LLMs. Problem authors were impressed by models’ ability to recall relevant mathematical definitions that were not included in the problem but are known to practicing researchers. This ability makes it much easier in principle to solve problems than with a computer algebra system like Mathematica. For example, consider the level 5 cosmology problem from appendix
C.2
User:
In cosmology, large-scale cosmological dark-matter halo fields are biased tracers of the underlying Gaussian matter density
. Assume we have a sample
. We simulate a halo number density field by taking
, where bare number density
and bare bias
are specified constants. What is the bias of the sampled halo field? Derive an equation to evaluate the bias which depends on the bare bias and the variance in each pixel.
While well-defined for a cosmologist, the problem does not define the mathematical quantities in detail, and would be hard to interpret by a non-cosmologist. Advanced models correctly recalled the required definitions and generally set up the problem correctly.
However, while LLMs generally recall key definitions of various sub-fields of TP, they frequently encounter difficulties in accurately recalling more detailed mathematical information, as illustrated by the following two examples.
In one of the solutions to an undergraduate QM problem, the QwQ model incorrectly retrieves information about the Clebsch–Gordan coefficients. Specifically, it claims
From standard tables or textbooks, the Clebsch–Gordan coefficients are:
For
= 1,
The correct values of these coefficients are
for
and
states respectively.
In the following snippet, generated from a model answer, the GPT-4o model incorrectly identifies the standard eigenstates of a particle in a 1-D infinite potential well (
) from existing results
and
4.2. Algebraic mistakes
A major challenge for models is to perform correct algebraic calculations. Consider the following relatively easy math problem that appears as an individual step in one of our problem solutions.
User:
Determine the leading real term of the expression
for real
and
Expert solution:
The correct series expansion up to leading real and imaginary terms is
We attempted this problem multiple times with o1 and o3-mini. In most of their responses, the LLMs did not expand the exponential term beyond second order in
, falsely assuming that the leading real term must be proportional to
. In its best attempt, it expands up to quartic order in
, but fails to accurately combine the various terms to compute the coefficient of
. This is a good example of the promise of combining LLMs with computer algebra systems. If we add the prompt ‘Write and execute
SymPy
code to evaluate the expression’, models can generate Python code that calculates the correct expression. We have therefore tried to encourage Python usage, as discussed in section
3.3
, however with limited initial success.
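A sketch of the kind of SymPy usage such a prompt elicits is shown below. The expression is a hypothetical stand-in (the problem's actual expression is not reproduced here): the point is that a series expansion to sufficient order, followed by extraction of the real part, is mechanical for a computer algebra system.

```python
import sympy as sp

x = sp.symbols('x', real=True)

# Hypothetical stand-in expression: expand to sufficient order in x
# and isolate the real part of the resulting polynomial.
expr = sp.exp(sp.I * x) / (1 - x**2)
poly = sp.series(expr, x, 0, 5).removeO()
real_part = sp.expand(sp.re(poly))
print(real_part)
```

For this stand-in, the real part through fourth order is 1 + x²/2 + 13x⁴/24; a model that truncated the exponential at second order would miss the quartic coefficient entirely.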
Algebraic mistakes are numerous, and often occur even in very simple calculations. Models often simply forget mathematical terms in an expression from one calculation step to the next. For example, the following is an arithmetic evaluation by GPT-4o, in which it spuriously drops a factor of the imaginary number
Similar cases of forgotten
factors, minus signs or constants occur frequently in many problem solution attempts.
Mathematical identities are also often applied incorrectly. For example, in the following case GPT-4o fails to implement the vector triple product
correctly and writes
More powerful reasoning models tend to make less frequent ‘simple’ math mistakes such as the following:
o3-mini:
After performing the
integrals (using the standard
prescription so that
where the second integral erroneously contains a negative sign. It would be interesting to compare model performance on a set of automatically generated simple calculations typical for TP.
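The vector triple product that GPT-4o misapplied, a × (b × c) = b(a·c) − c(a·b), can be checked numerically in a few lines (a generic check on random vectors, not tied to the specific expressions in the model's solution):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 3))  # three random 3-vectors

# BAC-CAB rule: a x (b x c) = b (a . c) - c (a . b)
lhs = np.cross(a, np.cross(b, c))
rhs = b * np.dot(a, c) - c * np.dot(a, b)
assert np.allclose(lhs, rhs)
```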
4.3. Logical mistakes
We frequently observed that LLMs struggle to accurately account for the validity and applicability of advanced mathematical concepts, such as incorrectly applying theorems, misinterpreting definitions, or failing to recognize the limitations of certain mathematical techniques. Consider the following mathematical problem, which we will use to discuss logical errors made by the oX series models in their attempted solution.
User:
By Taylor expanding the integrand, find a
cubic polynomial approximation to the integral
that achieves a 90% or better accuracy when
lies in the interval
Expert solution:
First, we note that the integrand remains finite in the limit
→ 0, as the singular term proportional to
cancels out. To assess the validity of a series expansion, we estimate the radius of convergence near the boundary points
= 0 and
= 1. This analysis shows that the integrand is convergent within the interval
when expanded about
= 1. Consequently, we perform a Taylor expansion around
= 1 retaining terms up to
Integrating over the interval
, we obtain the approximate solution:
In the limit
→ 0, our approximate expression evaluates to
which achieves approximately
accuracy compared to the numerical result
. We note that this integral cannot be evaluated exactly using Mathematica or Maple software.
When the above problem was given to the o1 and o3-mini models, they demonstrated the following logical errors in their reasoning:
1.
The model begins by identifying that the integrand is finite at
= 0. However, it fails to recognize that the radius of convergence for the integrand around
= 0 is 0. This oversight leads to an improper application of approximations beyond the valid domain. Subsequently, the model factorizes the integral as
Assuming
, the model proceeds to perform a Taylor expansion of the integrand around
= 0 and evaluates the second integral
up to cubic order in
. However, the Taylor expansion is only valid within the radius of convergence, and this restriction is not respected, rendering the approximation potentially invalid.
2.
To estimate the constant value of the first integral
, the model imposes the boundary condition at
= 1, equating
The model substitutes the cubic-order Taylor expansion solution for
derived in the previous step into the right-hand side of this equation. This substitution constitutes a significant logical error in its reasoning, as the cubic-order approximation was determined only for
, a fact that the model seemed to know but failed to implement. Extending this local approximation to
= 1, far beyond its domain of validity, leads to an erroneous evaluation of
Interestingly, the o3-mini model demonstrates two critical flaws: it not only arrives at logically inconsistent conclusions but occasionally also confidently hallucinates the claim
, failing to furnish a coherent proof despite repeated prompting.
In a different problem involving particle physics, the GPT-4o model was asked to determine the effective mass of a spin 1/2 particle with action
The models failed to understand and reason out that the parameter
alone does not define the physical mass. The pseudoscalar
-term must be included, corresponding to a chiral contribution to the mass.
For a much more basic example of failed logic, consider an example from QwQ. In one of our undergraduate electrodynamics problems, it produced the following expression followed by faulty and rather incomplete reasoning:
But
is perpendicular to both
and
, which suggests that
must be perpendicular to
for this equation to hold.
We found that advanced reasoning models such as o3-mini generally do not make such easy mistakes on undergraduate level physics problems. However, for difficult problems, they often oversimplify the problem due to a lack of detailed understanding. For instance, while solving the Level-5 problem detailed in appendix
C.1
, o3-mini and other advanced reasoning models approximate the scale factor
by expanding it linearly around the transition point,
, not realizing that the pole is far from the transition point and thus one needs to apply
. For further details, we refer the readers to the expert solution detailed in appendix
C.1.
4.4. Hallucinations
Lastly, we present two instances where the LLM models generated new rules to obtain solutions that match with existing results in the literature. The following expression generated by GPT-4o represents an arithmetical hallucination error:
The model performed the above arithmetic steps since it needed to determine the imaginary component of
. With this goal in mind, it carried out the above ‘illogical’ mathematical step, inventing new arithmetic rules to justify its approach of obtaining the imaginary component from a wrong answer.
Another example, from our problems, of o1 hallucinating non-existent rules to justify its approach is the following excerpt from its solution:
We can write
Note that
Therefore, to avoid vanishing of the trace, we need to consider that the
matrices need to be kept separate. Instead, we should expand the trace without combining the projectors.
There is no mathematical rule by which an apparent zero can (or must) be avoided by separating the terms and adding them back later. The LLM invented this ‘rule’ because it was working with incorrect expressions yet nevertheless arrived at a correct solution, which in this case it was able to guess or recall (in only one of several attempts).
4.5. Performance of pre-o-series models
In our experience, models that are not explicitly trained for reasoning (i.e. before the oX series) can assist researchers in reasoning through a simple problem, but with significant shortcomings. Consider the following easy mathematical subproblem that appeared in one of our recent works [
29
] in the context of cosmology, which we show here in simplified notation.
User:
Assume
, and
are vectors, and
is a symmetric positive-definite matrix. Let
and
be real numbers. I want to minimize
under the constraints
and
. Solve this for
, if possible.
Expert solution:
We minimize
subject to the constraints
and
. The Lagrangian
for this optimization problem is defined as:
Taking the gradient of
with respect to
and setting it to zero yields:
which we can solve for
as:
The constraint equations are:
Plugging the solution for
into the constraint equations gives
which is of form
The above linear system is solved by (assuming the inverse exists, e.g. the two bias vectors are not co-linear):
We then substitute the solution for
and
back into
using equation (
).
This is a typical problem that GPT-4o and Llama-3 generally solve correctly, with a correct mathematical derivation, although sometimes with a wrong numerical factor. It seems certain that this problem was in the training data of the models. Nevertheless, it already saves researchers time to get answers to similar problems without manual labor. In particular, for matrix algebra problems, existing computer algebra systems are not very strong or user-friendly in our experience. However, the fact that models are very error-prone limits their usefulness significantly. If every step needs to be checked in detail, the time saved can be minimal, or a wrong result can even confuse the user. Of course, human solutions can also have this property, depending on the skill and carefulness of the researcher.
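The Lagrange-multiplier route of the expert solution can be sketched numerically. We assume a quadratic objective of the form ½xᵀAx with two linear constraints u·x = α and v·x = β (the problem's exact objective and symbol names are not reproduced here, so this form is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # symmetric positive-definite matrix
u, v = rng.standard_normal((2, n))   # two non-collinear constraint vectors
alpha, beta = 1.0, -2.0

# Stationarity of L = 1/2 x^T A x - lam*(u.x - alpha) - mu*(v.x - beta)
# gives x = A^{-1}(lam*u + mu*v); the constraints then reduce the problem
# to a 2x2 linear system for the multipliers (lam, mu).
Ainv_u = np.linalg.solve(A, u)
Ainv_v = np.linalg.solve(A, v)
G = np.array([[u @ Ainv_u, u @ Ainv_v],
              [v @ Ainv_u, v @ Ainv_v]])
lam, mu = np.linalg.solve(G, np.array([alpha, beta]))
x = lam * Ainv_u + mu * Ainv_v

# Both constraints hold, and the gradient condition A x = lam*u + mu*v is met.
assert np.isclose(u @ x, alpha) and np.isclose(v @ x, beta)
assert np.allclose(A @ x, lam * u + mu * v)
```

The 2×2 system G is invertible precisely when the two constraint vectors are not collinear, mirroring the caveat in the expert solution.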
4.6. Performance of o1, o3-mini, and DS reasoning models
From evaluating the output solutions generated by advanced LLM models such as o1, o3-mini and DS, we observe that these models exhibit significantly stronger reasoning capabilities compared to other LLMs tested in our study. Notably, these models can perform more difficult algebraic manipulations, identify different components of a problem, and connect them with established concepts in the literature. This ability allows them to make meaningful progress in research-level problems, including those from topics such as QFT and String theory, by pinpointing key aspects of the question and recalling relevant background knowledge.
However, these models still struggle with detailed and systematic logical reasoning. When tasked with solving our Level-4 and Level-5 problems, they often perform well in the initial phase of problem solving, demonstrating promising insights. Yet, for problems requiring extensive calculations combined with step-by-step logical rigor (e.g. loop integrals in QFT, tensor manipulations in general relativity) and systematic justification of the assumptions, their performance deteriorates significantly. Our analysis of multiple solutions suggests that when intermediate steps become too complex, the models (including DS) often resort to literature memory from pre-training rather than performing detailed calculations. Rather than explicitly detailing intermediate steps, the models often present only their final answer, recalling related literature knowledge without references or resorting to vague assertions such as ‘after a lengthy (but straightforward) calculation’ or ‘a short calculation shows’. While the full CoT of the o-series models is not public, we have no evidence that the models genuinely perform the relevant calculations internally in these cases.
As an illustrative example, when asked to compute the one-loop anomalous magnetic moment of a fermion (e.g. [
41
]) including a contribution from a heavy scalar coupling, the model resorted to recalling existing solutions seen during pre-training rather than explicitly solving. However, it failed to recognize that the Yukawa interaction Lagrangian provided in our problem statement contained an additional factor of
, which may deviate from the conventions in the literature. Consequently, its final answer overlooked this crucial modification. In a similar manner, when presented with the task of solving the Level-4 problem in appendix
C.4
, all advanced models (oX, DS-R1) initiate their response by articulating their interpretation of the problem statement and correctly identifying its connection to the standard supersymmetric transformations within the free Wess–Zumino model, as extensively documented in the literature. Subsequently, these models produce their final solution from memory. However, a consistent error emerges across all responses: the absence of the critical ‘negative sign’ as seen in the solution given in equation (
108
).
o3-mini:
The well-known and consistent choice is
with the Hermitian conjugate
This is verified by checking that the variations of all terms in
under the full set of SUSY transformations (including the ones for
and
) cancel (up to a total derivative).
While this might appear to be a minor discrepancy, it originates from a fundamental aspect of the problem. Specifically, the sign convention utilized in our given problem statement likely differed from the convention commonly adopted in the literature (for instance refer to section 5.2 in [
42
]) used within the models’ training samples, thereby necessitating a corresponding modification in the final transformation rule. This seemingly subtle yet conceptually significant detail indicates a potential cognitive limitation in these AI models, reflecting an over-reliance on memorized patterns rather than a systematic, first-principles approach, as well as a failure to validate the appropriateness of the retrieved solution within the context of the specific problem statement.
As another example of literature memory, one of the problems in TPBench involves solving a nonlinear differential equation in a manner similar to how Chandrasekhar presents the Kerr solution in [
43
]. The number of steps needed to reach the answer is large, and the derivation is complicated. Such recall problems are expected to be solvable by an AI due to its vast knowledge of the literature. Indeed, in one of the attempts the AI did recognize the literature and wrote an answer to this problem, but even in that instance it did not reason through the problem, merely stating:
o3-mini:
In fact, after a (lengthy) calculation one finds that the only solution (consistent with the field equations and the asymptotic condition) is
These inconsistencies suggest that the models’ solutions often fluctuate based on how their internal sampling mechanism recalls (pre-)training data, rather than adhering to a logically coherent problem-solving strategy. This underscores a fundamental issue: unlike a proficient researcher, who would maintain logical consistency across different attempts, these models exhibit uncertainty in their outputs, lacking a clear measure of confidence in their solutions. Such limitations, together with the opaque structure of the training and inference process (especially of closed-source models), present obstacles to their applicability in research settings. It appears that successfully solved high-difficulty problems often benefit from the very deep and interconnected literature memory of these models, together with their ability to translate this knowledge to the problem setting. While this ability is useful for research, it may not be sufficient to create novel TP results without human assistance. In summary, current model performance perhaps resembles a student with superhuman literature knowledge but low intellectual rigor and technical expertise.
5. Related work
Despite significant advances in the mathematical reasoning capabilities of LLMs, accurately solving reasoning problems in specialized domains, such as TP, remains a persistent challenge. In math reasoning, the landscape of existing benchmarks has been instrumental in the evaluation of LLM reasoning capabilities and the development of more robust and interpretable reasoning strategies. We review related benchmarks in section
5.1
as well as common strategies for eliciting more accurate reasoning from LLMs in section
5.2.
5.1. Mathematical reasoning benchmarks
Recent progress in LLMs has enabled these models to tackle increasingly complex tasks that demand high-level abstract mathematical reasoning. A significant body of work has focused on datasets for mathematical reasoning at the middle-school (e.g. [
44
]), high-school (e.g. [
]), or undergraduate level (e.g. [
45
]), which often cover arithmetic, geometry, or math word problems. Other benchmarks are focused on theorem proving [
46
48
]. For example, the recently introduced
PutnamBench
46
] provides a collection of formalized theorems from the Putnam competition, while
MiniF2F
47
] and
FIMO
48
] offer datasets of formalized proof problems drawn from competitions like the IMO, AIME, and AMC. In addition,
ProofNet
49
] comprises both natural language and formalized theorem statements and proofs at the undergraduate level. Complementary to these are natural language datasets that feature problems of varying difficulty [
44
], as well as benchmarks like GPQA diamond [
50
], which are designed to be hard. Even more recently, the
HLE
dataset [
] is an industry-curated, multi-domain benchmark that includes very challenging problems, among them some from TP. However, problems in HLE are constrained to numerical-answer or multiple-choice formats, there is no spectrum of difficulty, and the benchmark is not specifically designed to probe reasoning capabilities in TP.
While lower difficulty math benchmarks such as
MATH
] have nearly been mastered by current LLMs, the
FrontierMath
] dataset, which includes research-level problems curated by working mathematicians, remains largely unsolved.
FrontierMath
spans a range of difficulties from high-school to research level and features properties like auto-verifiability and rich metadata, design principles we have also incorporated into TPBench. However, Glazer
et al
] provide limited information about the difficulty distribution and the specifics of the problems that have been solved by advanced models.
In the realm of physics, which also demands extensive abstract mathematical reasoning, the focus has been predominantly on high-school level challenges as seen in datasets such as
JEEBench
],
OlympiadBench
], and
PhysicsQA
]. Beyond undergraduate-level problems, very little work has addressed mathematical reasoning for TP. One notable exception is [
], which examines symbolic calculations, albeit within the narrow context of a specific class of quantum many-body physics problems.
Our new dataset, TPBench, addresses the gap in TP reasoning benchmarks beyond the undergraduate level. TPBench encompasses problems ranging from undergraduate to research level, with research problems reflecting challenges typical of those found in TP publications (rather than representing entire publications in themselves). Importantly, TPBench is designed to be independent of industry control, ensuring that the TP research community has access to a reasoning benchmark that is not susceptible to leakage into future training data. We look forward to sharing this dataset with collaborators under appropriate data leakage controls.
5.2. Reasoning capabilities of LLMs
Despite the remarkable fluency of LLMs in generating human-like text, their capacity to perform reliable multi-step reasoning remains a challenge [
51
]. Many LLMs still struggle with complex arithmetic and logical inference tasks. In this section, we review state-of-the-art methods, spanning both training-time and inference-time techniques that have been developed to boost the reasoning capabilities of LLMs.
Training-time methods for improved reasoning.
Training-time methods encompass all strategies where pre-trained language models are fine-tuned or otherwise modified to improve their reasoning capabilities. The most popular approaches in this category rely on either supervised fine-tuning [
52
54
], or reinforcement learning [
35
52
] (or both [
35
]). In supervised fine-tuning [
55
58
], high-quality reasoning chains are curated and used to fine-tune models to display more accurate reasoning behavior. Chen
et al
59
] demonstrate that self-play fine-tuning can improve model reasoning.
Inference-time methods for improved reasoning.
Test-time methods aim to improve reasoning capabilities either by designing prompts that elicit good reasoning behavior or by building reasoning systems that prompt the LLM repeatedly to arrive at a solution in a systematic way. The most popular strategy for prompting LLMs to reason is chain-of-thought [
60
], where the prompt includes instructions to ‘think step-by-step’. This is a type of test-time approach [
61
63
], as it typically leads to longer token sequences generated by the LLM. The default prompt (see appendix
) we use to evaluate various LLMs on TPBench is a customized variation of chain-of-thought: it includes the tips from Polya’s famous manual ‘How to solve it’ [
18
] which was originally intended to teach students how to solve mathematical problems. Related advances include prompting the model to break down the problem into simpler subproblems [
64
66
], or seeking abstractions [
67
]. Other prompting strategies encourage models to self-verify [
68
69
], self-improve [
59
70
], or iteratively refine their answer [
71
72
].
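Schematically, an iterative-refinement loop of the kind cited above can be sketched in a few lines. This is a minimal illustration, not any specific system from these references; the `llm` callable is a hypothetical stand-in for a model API that maps a prompt string to a response string.

```python
def refine(llm, problem, rounds=2):
    """Iterative refinement sketch: request a solution, then repeatedly
    ask the model to critique and correct its previous attempt.
    `llm` is a hypothetical callable: prompt string -> response string."""
    answer = llm(f"Solve step by step:\n{problem}")
    for _ in range(rounds):
        answer = llm(
            "Review the following solution for errors, then output a "
            f"corrected version.\nProblem:\n{problem}\nSolution:\n{answer}"
        )
    return answer

# A stub LLM that just counts calls, to show the control flow:
calls = []
stub = lambda prompt: (calls.append(prompt) or f"attempt-{len(calls)}")
final = refine(stub, "Integrate x^2 from 0 to 1.")  # three calls in total
```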
Other strategies to elicit reasoning behavior involve the generation of multiple reasoning chains which can then be sampled from (as in best-of-
73
]) or combined via majority voting or by ensuring self-consistency [
74
]. Methods that improve reasoning through planning [
66
75
78
] roll out multiple reasoning chains hierarchically and explore the space with Monte-Carlo Tree Search [
79
]. The success of these methods depends on how the different reasoning chains are evaluated and can be achieved either through other language models [
76
] or through external tools, e.g. [
77
]. Tool usage in reasoning is explored next.
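In their simplest form, the sampling strategies above reduce to a majority vote over canonicalized final answers. A minimal sketch, assuming the answers have already been canonicalized (e.g. reduced to a normal form by an auto-verifier):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among sampled reasoning chains,
    together with its vote share. Ties break by first occurrence."""
    tally = Counter(answers)
    answer, votes = tally.most_common(1)[0]
    return answer, votes / len(answers)

# Five sampled chains; three agree on the same final expression:
ans, share = majority_vote(["2*pi", "pi", "2*pi", "2*pi", "4*pi"])
```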
Verifiers and tool usage.
Another avenue for boosting the performance of LLMs is by allowing tool usage [
80
81
] either during the reasoning phase [
82
], or to verify intermediate reasoning steps [
77
] and solutions [
22
]. Verifiers and tools are compatible both with training-time and test-time methods. Since each of the problems in TPBench has an auto-verifier, one could consider giving the LLM under evaluation access to the auto-verifier to test if, by using it, it can achieve better results.
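As a rough illustration of what numerical auto-verification can look like (a sketch, not the TPBench implementation), a candidate implementation can be compared against a reference at randomly drawn inputs:

```python
import math
import random

def numerically_equivalent(candidate, reference, n_trials=100,
                           rel_tol=1e-6, seed=0):
    """Test whether two single-argument implementations agree, to a
    relative tolerance, at randomly drawn sample points. A real
    verifier would fix the argument ranges per problem."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        x = rng.uniform(0.1, 10.0)
        if not math.isclose(candidate(x), reference(x), rel_tol=rel_tol):
            return False
    return True

# Equivalent and inequivalent candidates for a reference f(x) = 2 x^4:
ok = numerically_equivalent(lambda x: 2.0 * x**4, lambda x: x**4 * 2.0)
bad = numerically_equivalent(lambda x: 2.0 * x**3, lambda x: x**4 * 2.0)
```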
6. Discussion
We developed the dataset TPBench to test the TP reasoning capabilities of AI models. The core of our work is a set of novel, uncontaminated problems, designed to detect true reasoning rather than memorization. However, as we discussed, scientific reasoning is always based on existing works and methods, and there is no sharp transition between true reasoning and memorization (consider, for example, logically equivalent problems, or logically similar problems with minor modifications). It is clear from our benchmark results that reasoning models vastly outperform non-reasoning models, and that these models are capable of some degree of reasoning. We note that our problems were not constructed to match a particular target error rate (o3-mini and DS R1 appeared after most problems were finalized), but rather to reflect real problems encountered by theoretical physicists at each career level. Our TP reasoning results are consistent with studies from more general benchmarks, and illustrate the speed of progress in AI. The most advanced models are able to solve some problems at graduate level, but are not yet capable of solving most research-level problems. While advanced models demonstrate remarkable proficiency in algebraic and conceptual problem-solving, they struggle with structured logical reasoning and transparent step-by-step calculations, particularly in complex, research-level problems. Their reliance on literature recall without verification or referencing, and their lack of consistency in detailed reasoning, remain key limitations in their problem-solving capabilities. We discussed these shortcomings and summarized common failure modes.
Progress has been rapid, even during the creation of this data set. If models could solve level 5 research problems consistently, their impact on TP would be substantial. However, even then, AI models could not perform independent research without further developments. We now discuss some future directions related to our work that could make LLMs more powerful for TP research.
Updates to the TPBench data set and scoreboard.
We will update the scoreboard as novel SOTA models appear. Results will be published on the website of the data set
tpbench.org
. The website also contains additional model evaluation metrics, which assign a unified model score over all difficulties. We aim to add more problems to the data set in the future, both public and private. It would be particularly interesting to design more research problems that are clearly outside of the training data. This could be achieved by curating research problems specifically from the newest arXiv publications, appearing after the current knowledge cutoff of existing models. We invite interested researchers to contribute new problems and collaborate on future TPBench updates (see website for details).
Automatic problem scraping from publication archives.
To improve inference methods specifically for TP, for example by reinforcement learning of reasoning chains (e.g. DeepSeek R1 [
35
]), it would be important to have a large collection of verifiable problems. If problems could be extracted automatically from publications, perhaps after a training-data cutoff, this would allow generation of training data at industrial scale without human labor. An initial exploration of LLM-based problem extraction from papers has revealed that this is difficult in TP, because calculations are often spread over the paper and it is not clear to the model what information is needed to state the problem and what the answer is. The structure is more obvious in mathematical papers that clearly mark theorems and proofs (e.g. with LaTeX tags); however, those are more difficult to auto-verify. Nevertheless, this is an exciting direction for future work, especially since large industry labs keep their training data for reasoning models private.
Automatic verification for non-algebraic expressions.
We were somewhat constrained in our choice of problems by the criterion of auto-verifiability. Many TP results can be written in superficially different but equivalent ways, and models are not currently good at judging the equivalence of expressions. Large collections of verifiable problems are also important for reinforcement learning-based training of reasoning models; see e.g. the recent DS R1 [
35
]. Generating stronger verifiers that work for a wider class of problems is a very interesting direction for future work, where theoretical and computational physics domain expertise is valuable. Some challenges were listed in section
2.3
. We note that results in TP are often symbolic expressions, which are more suited for auto-verification than mathematical proofs (which need to be checked by proof assistants).
Improving reasoning methods for TP.
We have reviewed methods to improve reasoning capabilities of LLMs in section
5.2
. It is clear from our experiments that a significant gain could be obtained if tools such as
SymPy
or
Mathematica
were used consistently to check symbolic calculations where possible. Few-shot learning or fine-tuning could be used to improve models’ ability to call symbolic software packages. However, many TP calculations require specific packages and do not come with much training data. Further, the human TP research process involves reading publications and looking up results or methods when needed. References are also used to spot mistakes in calculations by comparing to known published results where possible. While LLMs can parse literature with techniques such as RAG [
83
], to our knowledge this has not been demonstrated to lead to performance gains in mathematical reasoning. The fact that models cannot point to a specific source for mathematical statements lowers their trustworthiness. Finally, inference methods that provide more information about uncertainty in individual steps would be particularly beneficial for difficult TP problems. This would pave the way for trustworthy, automated TP research assistants that reliably solve some aspects of a problem, but then ask for help for the parts they are uncertain about.
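A SymPy-based check of symbolic calculations, of the kind suggested above, could look as follows. This is a minimal sketch assuming SymPy is available; note that `simplify` reaching zero is sufficient but not necessary for equivalence, so a negative result is inconclusive.

```python
import sympy as sp

def symbolically_equal(expr_a, expr_b):
    """Return True if the difference of two SymPy expressions
    simplifies to zero; a False result is inconclusive, since
    simplify may fail to normalize some equivalent forms."""
    return sp.simplify(expr_a - expr_b) == 0

x = sp.symbols('x')
same = symbolically_equal(sp.sin(x)**2 + sp.cos(x)**2, 1)
different = symbolically_equal(sp.sin(2 * x), 2 * sp.sin(x))
```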
Diagrammatic and spatial reasoning.
Theoretical physicists like to reason using spatial diagrams such as Feynman diagrams or drawing integration contours. In principle, such diagrams can be encoded in some formal language and multi-modality for spatial reasoning may not be necessary. For example, some of our problem solutions include Feynman diagrams or integration contours encoded with the
TikZ
LaTeX library (e.g. figure
). For some of our problems, humans would have trouble reasoning through them without the ability to draw on a scratchpad. It would be interesting to see whether multi-modality and spatial reasoning could make models stronger. Visualizing the problem (e.g. ‘running an example in your head’) is a common strategy and could be particularly powerful for models to develop truly novel ideas.
Figure 4.
The original contour in blue is deformed into the orange contour in the lower half complex plane of
. The large radius arcs have vanishing contributions, and one-pole approximation has been taken. The upper green and purple boundaries correspond to where integrations over any arcs extended beyond this boundary would not converge. The dashed horizontal curve is parallel to the real axis. The red squiggly line is the branch cut at
Training reasoning models on TPBench.
While we designed TPBench for the evaluation of the reasoning capabilities of LLMs, it would also be very interesting to curate a dataset for supervised fine-tuning or reinforcement learning purposes. While we expect fine-tuning to increase the TP-specific reasoning capabilities of LLMs, it is equally important to avoid data leakage, i.e. to prevent problems that are later used for evaluation from seeping into the training data. For this reason we choose not to publish all of our problems in TPBench at this time. Instead, we encourage researchers who wish to have their models evaluated on TPBench to reach out to us.
Open-ended research problems.
If models could solve well-posed problems such as the research problems in our collection reliably, this would speed up TP research projects considerably. However, a large part of research consists of arriving at well-posed problems, which are interesting to answer and can be answered. It could be possible to design more open-ended tasks, where the goal is to ‘derive interesting results’ based on some set of initial constraints or observations. The AI model could suggest assumptions to include or drop, design its own problem statements, and attempt to judge the importance of its results (develop ‘theoretical taste’). It would be exciting and challenging to set up such a more open-ended benchmark.
Community efforts by the TP community.
With reasoning models being developed primarily by industry, usually with proprietary and closed data sets, it is important to consider how the open research community can contribute to AI driven TP reasoning. It now seems possible that AI models will be able to do significant theoretical research within a few years. The TP community should work towards the goal that such research remains open and accessible, rather than being performed exclusively at a few select industry labs. While pre-training may be financially inaccessible to publicly funded research, supervised fine-tuning, reinforcement learning, and algorithm development require more moderate resources. As an example, the community could build data sets for both TP reasoning training and benchmarking that are available to both the community and AI labs (with some data leakage control). These could also include examples of tool usage such as Mathematica. A large community-curated data set of verifiable TP problems would in particular allow supervised fine-tuning and Reinforcement Learning specifically for TP. Our data set is a first step in that direction. We hope that this work will contribute to engaging theoretical physicists in this exciting research direction.
Acknowledgments
We thank Kendrick Smith and Matthew Johnson for discussions. M M and D J H C acknowledge the support by the U.S. Department of Energy, Office of Science, Office of High Energy Physics under Award Number DE-SC0017647. M M also acknowledges the support by the National Science Foundation (NSF) under Grant Number 2307109 and the Wisconsin Alumni Research Foundation (WARF). F S is grateful for the support of the NSF under CCF2106707 and the Wisconsin Alumni Research Foundation (WARF).
Data availability statement
Some of the data we use is available here:
. The private data set needs to stay private to avoid leakage into pre-training data of future reasoning models. The data that support the findings of this study are available upon reasonable request from the authors.
Appendix A: Summary of problem data
For each problem we collect the following data.
Problem title
: Up to one sentence describing the problem.
Problem statement
: The problem statement in LaTeX.
Problem solution
: The full solution to the problem in LaTeX.
Public problem
: yes/no.
Auto-verifiable
: yes/no. All problems in the data set for this publication are auto-verifiable.
Auto-verifier instructions
: Instructions how to output the solution for the auto-verifier. See section
2.3
Domain of TP
: e.g. High energy theory.
Difficulty level
: 1–5
Authors
: The contributors of the problem and solution.
Reviewers
: The reviewers of the problem and solution.
Problem origin and novelty
: How closely existing published work contains the solution (only above undergraduate level).
Problem ID
: Unique problem ID in our catalog.
Problem version
: In some cases there may be errors or ambiguities in a problem. For this case we track a version number.
Variation of a different problem
: In the future, we aim to provide minor modifications of existing problems to check the stability of the reasoning chain (as opposed to memorization). Default: No
Date problem was added to the data set
: Allows us to track new problems. Format: 01/31/2025.
Author comments
: Any additional comments the author has about the problem.
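For concreteness, the per-problem metadata above could be represented as a record like the following. This is an illustrative sketch only; the field names and example values are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TPBenchProblem:
    # Illustrative schema mirroring the fields listed in appendix A;
    # names and defaults are hypothetical, not the actual catalog format.
    title: str                       # up to one sentence
    statement: str                   # LaTeX
    solution: str                    # LaTeX
    public: bool
    auto_verifiable: bool
    verifier_instructions: str
    domain: str                      # e.g. "High energy theory"
    difficulty: int                  # 1-5
    authors: List[str] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)
    origin_and_novelty: str = ""
    problem_id: str = ""
    version: int = 1                 # bumped when errors are fixed
    variation_of: Optional[str] = None
    date_added: str = ""             # format: 01/31/2025
    comments: str = ""

p = TPBenchProblem(
    title="Boosted parabolic trajectory",
    statement=r"Problem statement in LaTeX ...",
    solution=r"Full solution in LaTeX ...",
    public=True,
    auto_verifiable=True,
    verifier_instructions="Implement the required Python function.",
    domain="Classical mechanics",
    difficulty=1,
)
```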
Appendix B: Prompts
B.1. Prompts to query problem solutions
We used two different system prompts to initialize the LLMs, as well as a unique user prompt to query individual solutions.
Simple system prompt
Our simple system prompt only specifies the required output format and encourages complete calculations.
System:
You are a mathematical problem-solving assistant specializing in TP.
Input problems will be provided in LaTeX format, and you must provide your solutions in LaTeX format as well.
Please provide detailed step-by-step solutions and clearly mark your final answer with ‘Final answer:’ at the end.
When writing equations, ensure proper LaTeX formatting including appropriate equation environments and mathematical notation.
Extended system prompt with CoT advice
Our extended system prompt includes additional problem-solving advice inspired by Polya’s book ‘How to Solve It’ [
18
]. We have used this system prompt as our default. However, we did not find a systematic difference between these prompts as illustrated in table
for a subset of problems.
System:
You are a mathematical problem-solving assistant specializing in TP.
Input problems will be provided in LaTeX format, and you must provide your solutions in LaTeX format as well.
Please provide detailed step-by-step solutions and clearly mark your final answer with ‘Final answer:’ at the end.
When writing equations, ensure proper LaTeX formatting including appropriate equation environments and mathematical notation.
Please follow a structured and logical approach. Here are your key steps for solving any problem:
1. Understand the problem:
- Identify the unknown, the given data, and the conditions.
- Evaluate if the conditions are sufficient, redundant, or contradictory.
- Break down and analyze the different parts of the condition.
2. Devise a plan:
- Explore connections between the data and the unknown.
- If necessary, consider auxiliary problems to bridge gaps.
- Reflect on whether you have encountered similar problems or solutions before.
- Look for related problems, theorems, or methods that might apply.
- Consider simplifying or reformulating the problem to make it more accessible.
- Use definitions and explore analogous, general, or special cases.
3. Carry out the plan:
- Execute your solution step by step, ensuring each step is clear and logically valid.
- Confirm the correctness of each step and justify your reasoning.
For each problem, ensure clarity, logical rigor, and consistency. You may iterate to refine and improve your solution.
Table 6.
Performance comparison for different system prompts, using the GPT-4o model, on a subset of problems.
avg@5
best@5
Difficulty level
Ext
Std
Ext
Std
Ext
Std
Ext
Std
Ext
Std
Ext
Std
1-Easy undergrad
20
23
0.72
0.76
0.8
0.8
2-Undergrad
22
17
0.88
0.68
1.0
1.0
3-Easy grad
10
14
12
0.16
0.24
0.4
0.6
User prompt
User:
Problem: problem[“problem_details”][“Problem Statement”]
IMPORTANT SOLUTION REQUIREMENTS:
1. You MUST FIRST solve this problem using mathematical reasoning and symbolic calculations:
- Use proper mathematical notation and symbols
- Arrive at a final symbolic mathematical expression
2. ONLY AFTER completing the mathematical solution:
- Convert your final mathematical expression into Python code
- The code must satisfy these requirements: problem[“problem_details”][“Answer Requirements”]
Code Format Requirements:
1. Your solution MUST include the final executable Python code as required by the ‘Answer Requirements’
2. You MUST wrap the final Python code between ```python and ``` tags
3. Ensure the code is complete and can run independently
4. The code should NOT contain ANY externally defined variables, including physical constants.
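On the harness side, the required fenced block can be pulled out of a model response with a few lines of standard-library code. This is a sketch of the extraction step under the format above, not the actual TPBench harness.

```python
import re

FENCE = "`" * 3  # the literal triple-backtick fence marker

def extract_python_block(response: str) -> str:
    """Return the body of the last fenced python code block in a model
    response. The last block is taken because models sometimes emit
    scratch snippets before the final required implementation."""
    pattern = FENCE + r"python\s*\n(.*?)" + FENCE
    blocks = re.findall(pattern, response, flags=re.DOTALL)
    if not blocks:
        raise ValueError("no fenced Python block found in response")
    return blocks[-1].strip()

reply = (
    "Final answer: $x^2$\n"
    + FENCE + "python\ndef f(x):\n    return x * x\n" + FENCE
)
code = extract_python_block(reply)
```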
B.2. Prompts to query grading of solutions
System prompt
System:
You are a grader for machine learning model solutions of TP problems. I will provide you with a correct expert solution to the problem for your reference, and a model solution for you to grade.
Grade solutions using
grades where:
= Excellent: The solution is mathematically equivalent to the expert solution, even if the symbolic expression differs (e.g. terms are arranged differently). The solution includes all necessary steps and the reasoning in each step is correct. Different but valid solution methods are acceptable.
= Good with minor issues: Generally correct solution with small errors such as: arithmetic mistakes that do not affect the main approach, missing intermediate steps, or minor notation issues. The problem was correctly understood and the reasoning of the solution is generally correct.
= Significant issues but partially correct: Shows basic understanding but has major flaws such as: incorrect application of formulas, missing crucial steps, or computational errors that lead to wrong final answer. The approach has some merit despite errors.
= Incorrect or major issues: Fundamentally flawed approach, completely incorrect calculations, or missing essential components. Shows little to no understanding of the mathematical concepts involved.
When comparing final answers, verify that the equations or expressions are mathematically equivalent (e.g.
is equivalent to
). Always format your notes using LaTeX notation for mathematical expressions. Provide evaluation in compact JSON format with only ‘grade’ and ‘notes’ fields. Format all mathematical expressions in your notes using LaTeX notation (e.g. $x^2$, $\frac{1}{2}$, $\sqrt{x}$).
User prompt
User:
Compare the following model solution’s detailed steps with the expert solution, along with the code verification result, which checks the equivalence of two expressions numerically, and evaluate its correctness.
Expert Solution:
expert_solution
Model Solution
model_solution
Code Verification result:
code_verification_result
Format your response as JSON with the following structure:
“grade”: “
”,
“notes”: “your notes here with LaTeX math notation”
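Parsing the grader's reply is then straightforward. The sketch below tolerates stray prose around the JSON object; the helper name and the example reply are hypothetical, not part of the actual grading pipeline.

```python
import json
import re

def parse_grader_reply(reply: str) -> dict:
    """Extract the {'grade': ..., 'notes': ...} object from a grader
    reply, grabbing the outermost braces in case the model wrapped
    the JSON in extra text."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in grader reply")
    result = json.loads(match.group(0))
    if not {"grade", "notes"} <= result.keys():
        raise ValueError("grader reply missing required fields")
    return result

out = parse_grader_reply(
    'Evaluation: {"grade": "3", "notes": "Equivalent up to a '
    '$\\\\frac{1}{2}$ factor."}'
)
```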
Appendix C: Public problems and solutions
We list ten public sample problems along with their solutions. AI model results for these problems are available on the dataset website
tpbench.org
. Table
summarizes the performance of different AI models on these problems, covering a range of topics and difficulty levels from Level 1 (L1) to Level 5 (L5). The scores indicate the average accuracy over the 5 attempts of each model.
Table 7.
Model average scores by problem.
Problem ID                                      Llama-70B  GPT-4o  R1    o1    o3-mini
Boosted parabolic trajectory (L1)               0.60       1.00    1.00  1.00  1.00
Blackbody in dimensions (L1)                    0.20       0.40    1.00  1.00  1.00
A 3-State QM Problem (L2)                       0.40       0.80    1.00  1.00  1.00
Dark matter capture as a function of time (L2)  0.60       1.00    1.00  1.00  1.00
Slow-roll inflation (L3)                        0.00       0.00    1.00  1.00  1.00
Scalar particle scattering (L3)                 0.00       0.40    0.80  0.40  0.40
SHO vacuum entanglement (L4)                    0.00       0.00    0.80  0.00  1.00
SUSY-symmetry (L4)                              0.00       0.00    0.00  0.00  0.00
Bias of a sampled halo field (L5)               0.00       0.00    0.60  1.00  0.80
One-pole problem (L5)                           0.00       0.00    0.00  0.00  0.00
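The avg@5 scores reported in table 7 correspond to simple aggregations over repeated attempts per problem, which can be sketched as:

```python
def avg_at_k(scores):
    """Average grade over k attempts at one problem (1.0 = fully correct)."""
    return sum(scores) / len(scores)

def best_at_k(scores):
    """Best single attempt over k attempts at one problem."""
    return max(scores)

attempts = [1.0, 0.0, 1.0, 1.0, 0.0]  # five attempts, three correct
avg5 = avg_at_k(attempts)
best5 = best_at_k(attempts)
```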
C.1. Level 5—one-pole problem
Problem statement
Consider the conformally coupled scalar field
in curved spacetime
where the Ricci scalar is
and
satisfies the differential equation
with
a finite positive number, the Θ function having the steplike behavior
and
being the comoving proper time related to
through
The boundary condition for the differential equation (in comoving proper time) is
In the limit that
, using the steepest descent approximation starting from the dominant pole
(with
) of the integrand factor
, compute the Bogoliubov coefficient magnitude
approximated as
for particle production where the dispersion relationship given by
with
. Use a one pole approximation which dominates in this limit.
Answer requirements
Provide the answer in the form of the
verbatim
code. Implement the following function.
def abs_beta(k:float, a_e:float, m:float, H_I:float) -> float: pass
Comments about the problem
This is an example of a difficult problem from QFT in curved spacetime, dealing with gravitational particle production, that appears out of reach of current models. This is part of a published research work and the solution, without steps explained, is given in a footnote of [
30
], but would be difficult to locate (in fact we tried, without success, with OpenAI’s Deep Research).
Solution
To find the pole of
, we need
from the given differential equation
Integrating from time
, we find
for
. In other words, this scale factor
behaves as a typical coherent oscillations spacetime minus the oscillatory effects. Hence, note that for
, the scale factor can be approximated as
for
(where
is the corresponding conformal time for
) where we see by matching
with
and
, we can write
for times much larger than
. This means that at time
, we have
(where the Hubble expansion rate is
) which gives
for
where the choice of
controls the approximation error proportional to positive power of
. Since
, we can approximate
= 0 to be equivalent to
. In other words, when we analytically continue and consider the poles of the integrand, we will consider only the region with
Next, note the pole of
is at
defined by
which means
where
is an integer. We see that
for
. We also see that
have negative
which are in the region that we excised with the
discussed above. That means we can consider either
. We will see below that one of these poles is irrelevant.
Equation (
) tells us that
With the steepest descent technique starting from the pole of
, we write after analytically continuing
where
is the pole of
and
is the part obtained from the steepest descent. The factor in the integrand of equation (
) is therefore
which implies
in equation (
24
) is
where
Deforming the integration contour as shown in figure
allows us to rewrite this as
where the
is the orange part of the contour in the lower half plane.
To define the contour, one must understand the complex values of
. To this end, let
where the imaginary part generically is nonvanishing. The branch points are given by equations (
22
) which gives
which says
To deform the contour, we need regions where the arcs with large radius do not contribute to the integral. Note that if we define
, we have
making the exponent in
which is damped only if
For the case of equation (
21
), we need
for one choice of
. For the choice of
= 3, we can choose the arc regions to be
and another arc region to be
with a branch cut at
Choosing
= 3, we find the steepest descent contour shown in orange in figure
. The left contour is
and the right contour is at
, along which
gives a damped exponential in equation (
26
). Hence, the integral is
where in the first line we have introduced a regulator
The final piece in equation (
24
) is
Use the expansion
where Φ is real and
is purely imaginary. We take the path to be along the real axis until
and then integrate in the imaginary
direction:
This gives
Now, note from equation (
14
), we can compute
where we used equation (
10
). Equation (
24
) then becomes
C.2. Level 5—bias of a sampled halo field
Problem statement
In cosmology, large-scale cosmological dark-matter halo fields are biased tracers of the underlying Gaussian matter density
. Assume we have a sample
. We simulate a halo number density field by taking
, where bare number density
and bare bias
are specified constants. What is the bias of the sampled halo field? Derive an equation to evaluate the bias which depends on the bare bias and the variance in each pixel.
Answer requirements
Provide the answer in the form of the
verbatim
code. Implement the following function.
# let b_in stand for the bare bias
def b_eff(sigma: float, b_in:float) -> float: pass
Comments about the problem
This is an example of a cosmology research problem that advanced reasoning models solve correctly. This may be because the calculation is similar to existing calculations in the literature. However, it is a genuine research problem, which we solved independently for an upcoming cosmology publication. The problem requires retrieving some background knowledge, such as the definition of the matter power spectrum in cosmology.
Solution
The solution to this question involves some domain knowledge, parts of which were given in the problem statement, some approximations motivated by that domain knowledge, and some mathematical calculation. The domain knowledge is very basic and should be known to anyone in the field. The approximations are intuitive and mostly inspired by the domain knowledge. Following Pólya, we can organize the solution as follows:
Understand the problem.
The number density of halos
) is defined as
The overdensity is defined as
Linear bias is defined in terms of Fourier-transformed quantities:
This is an approximation that holds on sufficiently large scales (small
).
) and
) are Gaussian random fields with zero mean and their variance depends only on the magnitude of the wave-vector
The quantity
) is called the power spectrum and is defined as
It immediately follows that
We are given the expression in real space. In real space, the quantity
) is also a Gaussian random field:
Quantity
is called a two-point (real-space) correlation function and is defined as
This quantity is sufficiently small when
. We are asked to find the expression for
in the equation
, given the real-space expression for the number density
) in terms of real-space sample of
).
Devise a plan.
The key point in solving this problem is that the real-space correlation function for halos
should also be equal to
. We want to calculate that correlation function. It should be expressed in terms of
and
. We expect to be able to calculate these expectations since they are the expectations of functions of the Gaussian random variables. We are given the pixel variance
. How does it connect to the other quantities we know? In principle, that is also part of the domain knowledge, but it can also be deduced from the definitions already given. A discretized version of the correlation function is
When
, it becomes the pixel variance
As an aside, we could have given, instead of
the quantity
that is a common description of a cosmological dark-matter field. In that case, from the definitions of
and
we could have deduced that
. Then we pick the ensemble of all the pixels at a given fixed large distance
. The key is to recognize that it is fully described by a correlated bivariate Gaussian distribution.
with a covariance
In general, the integrals from the expectation values are cumbersome, but we should expect some simplifications from the fact that
is small and we can Taylor-expand the pdf.
Carry out the plan.
It’s more convenient to define
and
, and
—a correlated bivariate Gaussian pdf—then
We note that
The quantity
is the actual mean number density:
Here,
is the standard normal pdf. As expected, it does not depend on the correlation
, but only on
and
, just as the marginal of a correlated 2D Gaussian distribution is a 1D Gaussian independent of the cross-correlation. To linear order in
So that the two-point function neatly factorizes:
Substituting the results for
and
in the equation for
, we can read off the bias:
All that is left is to calculate the expectations. One can evaluate for
For
< 0 it is, however,
So we conclude that the latter expression is valid for all
. Similarly, one can show that
where
is the standard normal cdf. Finally, one obtains
Note: We also accept solutions as correct if they omit the
around the bias, since halo bias is usually positive.
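The final formula can be sketched in code. The sketch below assumes the sampled field follows the clipped linear model n(x) proportional to max(0, 1 + b_in * delta(x)); this model and the resulting expression b_eff = b_in * Phi(x) / (Phi(x) + |b_in| * sigma * phi(x)), with x = 1/(|b_in| * sigma), are assumptions consistent with the normal cdf appearing in the derivation above, and should be checked against the displayed equations.

```python
import math

def b_eff(sigma: float, b_in: float) -> float:
    """Effective bias of the sampled halo field.

    Assumes (hypothetically) the clipped linear model
    n(x) = n_bar * max(0, 1 + b_in * delta(x)), for which
    b_eff = b_in * Phi(x) / (Phi(x) + |b_in| * sigma * phi(x)),
    with x = 1 / (|b_in| * sigma).
    """
    x = 1.0 / (abs(b_in) * sigma)
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))          # standard normal cdf
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    return b_in * Phi / (Phi + abs(b_in) * sigma * phi)
```

In the limit sigma -> 0 the clipping is irrelevant and b_eff reduces to b_in, as it should.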
C.3. Level 4—SHO vacuum entanglement
Problem statement
Consider a coupled simple harmonic oscillator governed by the Hamiltonian
If the ground state is
and the operator
is the vacuum density matrix partially traced over the
components (satisfying
), i.e.
which is an operator acting on a reduced Hilbert space, compute
which involves the trace over
states.
Answer requirements
Provide the answer in the form of the
verbatim
code. Implement the following function
def entropy(k:float,g:float,m:float)->float: pass
Comments about the problem
This problem, whose solution can be found in [
84
] (with less detailed reasoning steps), has been rephrased in a pedagogical manner for graduate-level physics courses. It is a well-known question in quantum entanglement research, and the best performing LLMs are capable of solving it accurately, perhaps at least partially due to memorization.
Solution
Diagonalize the original Hamiltonian
One easily finds
diagonalizes the Hamiltonian such that in the
basis, it is
The ladder operators are
which allows one to rewrite the Hamiltonian as
In this basis, we denote the ground state as
Hence we have found
. We know that the wave function in the
coordinates is the product of well known simple harmonic oscillator solutions:
where
making this a convenient basis to work with. Note
where we used the completeness of the basis, equations (
67
) and (
68
), and the usual delta function normalization of the position basis. This and a similar relation for
imply
This means
The partial trace is defined through the following contraction of a (2, 2) tensor to a (1, 1) tensor:
Integrating over
, we find
Next, to identify the matrix, use
to write
Change basis to energy with a new effective frequency
where
are the well known oscillator wave functions and
still has to be chosen. One can show by carrying out the integrals that the matrix is diagonalized if
This gives
where
where we used
Simplify:
Since we want to evaluate
we compute
Hence, we arrive at
where
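The result can be sketched numerically. The sketch assumes the Hamiltonian takes the standard two-oscillator form H = (p1^2 + p2^2)/(2m) + (k/2)(x1^2 + x2^2) + (g/2)(x1 - x2)^2 (the normalization is an assumption), with normal-mode frequencies sqrt(k/m) and sqrt((k + 2g)/m), and uses the well-known entanglement-entropy formula of [84].

```python
import math

def entropy(k: float, g: float, m: float) -> float:
    """Entanglement entropy of one oscillator in the two-oscillator ground state.

    Assumes H = (p1^2 + p2^2)/(2m) + (k/2)(x1^2 + x2^2) + (g/2)(x1 - x2)^2
    (a hypothetical normalization), giving normal-mode frequencies
    w_plus = sqrt(k/m) and w_minus = sqrt((k + 2g)/m).
    """
    w_p = math.sqrt(k / m)
    w_m = math.sqrt((k + 2.0 * g) / m)
    # reduced density matrix eigenvalues are p_n = (1 - xi) * xi**n with
    xi = ((math.sqrt(w_m) - math.sqrt(w_p)) / (math.sqrt(w_m) + math.sqrt(w_p))) ** 2
    if xi == 0.0:
        return 0.0  # uncoupled oscillators: pure reduced state, zero entropy
    return -math.log(1.0 - xi) - xi / (1.0 - xi) * math.log(xi)
```

For g = 0 the oscillators decouple and the entropy vanishes, a useful sanity check.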
C.4. Level 4—SUSY-symmetry
Problem statement
Consider the theory
where
is a 2-component Weyl spinor while
and
are complex scalar fields. Suppose you want to make the following infinitesimal transformation a symmetry of this theory:
along with
and
where
is a spacetime-independent infinitesimal fermionic parameter inducing the transformation. Find the transformation rule
and
for the action associated with
to remain invariant.
Answer requirements
Provide the answer in the form of the
verbatim
code. Implement the following function
from math import sqrt
from typing import Tuple

def find_delta_phi(eta: float, xi: float, bar_eta: float, bar_xi: float) -> Tuple[float, float]:
    """Returns the SUSY transformation rules for phi and its Hermitian
    conjugate: a tuple (delta_phi, delta_phi_dagger)."""
    pass
Comments about the problem
This problem is situated in advanced QFT within the framework of supersymmetry (SUSY). It involves analyzing how bosonic and fermionic fields transform under an infinitesimal SUSY transformation and requires knowledge and careful application of Grassmann variables and the associated algebra. Such topics are typically encountered in advanced graduate-level physics courses. Note that the Hermiticity of
matrix convention as well as the metric convention of
is implicit in the statement of the problem (the latter inherent in the kinetic minus the potential form of the Lagrangian).
Solution
Denoting the variation
as
, we write
Integrating by parts, we find (denoting with equality an equivalence up to total derivative terms)
Integrate by parts the first two terms to eliminate the
matrices using the identity
again denoting with equality an equivalence up to total derivative terms, and we are using the standard notation
and
. To make the remainder cancel, we solve
yielding
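As a sketch of the requested function, treating each spinor contraction as a single number in line with the scalar signature above, and assuming the conventional normalization delta_phi = sqrt(2) * eta . xi (the factor of sqrt(2) is suggested by the sqrt import in the answer requirements, but is an assumption):

```python
from math import sqrt
from typing import Tuple

def find_delta_phi(eta: float, xi: float, bar_eta: float, bar_xi: float) -> Tuple[float, float]:
    """SUSY variation of the scalar field and its Hermitian conjugate,
    with each spinor contraction represented by one number (hypothetical
    normalization): delta_phi = sqrt(2) * eta * xi,
    delta_phi_dagger = sqrt(2) * bar_eta * bar_xi."""
    return (sqrt(2.0) * eta * xi, sqrt(2.0) * bar_eta * bar_xi)
```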
C.5. Level 3—slow-roll inflation
Problem statement
For the action
where
and
are constants, derive and solve (integrate) the equation of motion for the field
assuming slow-roll inflation and initial condition
Answer requirements
Provide the answer in the form of the
verbatim
code. Implement the following function
import numpy as np
def phi(q: float, M_p: float, phi_0: float, V_0: float, t: np.ndarray)->np.ndarray: pass
Comments about the problem
This problem lies in the field of cosmology, particularly in inflationary cosmology, and involves studying the dynamics of a scalar field (inflaton) driving the accelerated expansion of the early Universe, before the ‘hot Big Bang’. It is typically encountered in specialized graduate-level courses in cosmology and requires familiarity with field theory in an expanding spacetime.
Solution
The equation of motion is
For the slow-roll inflation, the following must hold:
Hence, we have
Slow-roll approximation also implies
so we need to solve the following ODE:
Performing the integration and solving for
we get
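A sketch of the requested function, assuming a monomial potential V(phi) = V_0 * phi**q (the explicit form of the potential is an assumption here):

```python
import numpy as np

def phi(q: float, M_p: float, phi_0: float, V_0: float, t: np.ndarray) -> np.ndarray:
    """Slow-roll solution assuming (hypothetically) V(phi) = V_0 * phi**q.

    Slow roll gives  dphi/dt = -q * M_p * sqrt(V_0 / 3) * phi**((q - 2) / 2),
    which integrates to
    phi(t) = [phi_0**((4-q)/2) - (q*(4-q)/2) * M_p * sqrt(V_0/3) * t]**(2/(4-q)).
    """
    p = (4.0 - q) / 2.0
    return (phi_0 ** p - q * p * M_p * np.sqrt(V_0 / 3.0) * t) ** (1.0 / p)
```

For q = 4 the integral is logarithmic and this closed form does not apply.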
C.6. Level 3—scalar particle scattering
Problem statement
Consider
What is the differential cross section
for
in the CM frame accurate to
? Express your final answer in terms of Mandelstam variables.
Answer requirements
Provide the answer in the form of the
verbatim
code. Implement the following function.
def dsigma_domega(lam: float, s_m: float, p_m: float, u_m: float, m1: float, m2: float) -> float: pass
Comments about the problem
This is a question from QFT. It involves calculating the differential cross section for a process where two
particles annihilate into two
particles. Such problems are typically encountered in graduate-level particle physics courses and require familiarity with perturbative field theory and the use of Mandelstam variables to express scattering amplitudes.
Solution
The amplitude for this process is
In the CM frame, energy conservation gives
A standard formula for differential cross section gives
Since in the CM frame, we know
The final result is
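The kinematic step can be sketched independently of the amplitude. The helper below (a hypothetical name, not the answer-requirement function) implements the standard CM-frame two-body formula, taking |M|^2 as an input rather than fixing the interaction:

```python
import math

def dsigma_domega_cm(M2: float, s: float, m1: float, m2: float) -> float:
    """Standard CM-frame formula for a 2 -> 2 process with equal-mass pairs
    in the initial (m1) and final (m2) states:
        dsigma/dOmega = |M|^2 / (64 * pi^2 * s) * p_f / p_i,
    with p_i = sqrt(s/4 - m1^2) and p_f = sqrt(s/4 - m2^2).
    M2 stands for |M|^2, which must come from the amplitude above.
    """
    p_i = math.sqrt(s / 4.0 - m1 ** 2)
    p_f = math.sqrt(s / 4.0 - m2 ** 2)
    return M2 / (64.0 * math.pi ** 2 * s) * (p_f / p_i)
```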
C.7. Level 2—dark matter capture as a function of time
Problem statement
Suppose
is the capture rate of dark matter in an astrophysical body. Let
be the dark matter annihilation rate per effective volume. Then an approximate Boltzmann equation governing the number
of dark matter particles in the astrophysical body is
If initially,
, what is
) as a function of time?
Answer requirements
Provide the answer in the form of the
verbatim
code. Implement the following function.
def answer(C: float, C_A: float, t: float) -> float: pass
Comments about the problem
This problem mainly belongs to astrophysics, specifically involving dark matter dynamics in celestial bodies. It is typically encountered in advanced undergraduate or graduate-level courses and requires knowledge of differential equations and kinetic theory. This type of analysis is also important for understanding dark matter detection and its astrophysical implications.
Solution
We can integrate by quadrature.
We can express the integrand as a sum of two fractions:
Integrating, we find
where
is an integration constant. Setting the boundary condition
= 0 at
= 0, we find
We find the solution
Note that it is easy to check that it reaches the obvious steady state in the limit
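The resulting closed form can be implemented directly; it solves dN/dt = C - C_A * N^2 with N(0) = 0:

```python
import math

def answer(C: float, C_A: float, t: float) -> float:
    """Solution of dN/dt = C - C_A * N**2 with N(0) = 0:
        N(t) = sqrt(C / C_A) * tanh(sqrt(C * C_A) * t),
    which saturates at the steady state sqrt(C / C_A) for t >> 1 / sqrt(C * C_A).
    """
    return math.sqrt(C / C_A) * math.tanh(math.sqrt(C * C_A) * t)
```

One can verify by substitution: N' = C * sech^2 equals C - C_A * N^2, since C - C * tanh^2 = C * sech^2.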
C.8. Level 2—a 3-state QM problem
Problem statement
The Hamiltonian of a three-level system is given as
where
is real. The state of the system at time
= 0 is (in this basis)
What is the expectation value of the energy at time
Answer requirements
Provide the answer in the form of
verbatim
code. Implement the following function
def expectation_value(A: float, E_a:float, E_b:float, t:float) -> float: pass
Comments about the problem
This problem belongs to quantum mechanics, focusing on multi-level quantum systems found in areas like quantum optics or molecular physics. It is typically encountered in advanced undergraduate or early graduate-level courses and requires knowledge of linear algebra, time evolution, and the calculation of expectation values in quantum mechanics.
Solution
The eigenstates are easily found to be
and
with corresponding energies
. Let us denote them as
and
. Given state
is decomposed as
, the expectation of energy stays constant:
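A numerical sketch illustrates the main point, that the energy expectation is independent of t for a time-independent Hamiltonian. Both the matrix form of H and the initial state below are illustrative assumptions, marked in comments:

```python
import numpy as np

def expectation_value(A: float, E_a: float, E_b: float, t: float) -> float:
    """Energy expectation at time t for an assumed three-level system.

    The Hamiltonian matrix and initial state are hypothetical; the result
    being t-independent holds for any Hermitian H and any initial state.
    """
    H = np.array([[E_a, 0.0, A],
                  [0.0, E_b, 0.0],
                  [A, 0.0, E_a]])                   # assumed Hermitian Hamiltonian
    psi0 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)  # assumed initial state
    w, V = np.linalg.eigh(H)
    # evolve: |psi(t)> = V exp(-i w t) V^dagger |psi(0)>
    psi_t = V @ (np.exp(-1j * w * t) * (V.conj().T @ psi0))
    return float(np.real(psi_t.conj() @ H @ psi_t))
```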
C.9. Level 1—blackbody in
dimensions
Problem statement
Assume we live in a 4+1 dimensional spacetime. How does the total energy density of a black body scale with temperature
? Find the exponent
in the expression
Answer requirements
Provide the answer in the form of
verbatim
code. Implement the following function
def answer() -> float: pass
Comments about the problem
This problem lies in the realm of statistical mechanics and thermodynamics applied to higher-dimensional spacetimes, a topic typically encountered at the undergraduate level in theoretical physics.
Solution
The density of states scales as
in D spatial dimensions giving
scaling for the total energy density. Hence,
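The answer-requirement function is immediate:

```python
def answer() -> float:
    """In D spatial dimensions the density of states scales as omega**(D - 1),
    so the total energy density u ~ integral of omega**D / (exp(omega/T) - 1)
    scales as T**(D + 1).  In 4+1 dimensional spacetime, D = 4, hence n = 5."""
    D = 4  # spatial dimensions in 4+1 dimensional spacetime
    return float(D + 1)
```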
C.10. Level 1—boosted parabolic trajectory
Problem statement
Consider a situation where a space-probe very briefly fires its rockets while passing a planet of mass
at periapsis, its nearest point to the planet. Suppose that the probe is on a parabolic trajectory and at periapsis, when travelling at velocity
, the burn results in a boost of δ
. What will be its speed once it escapes the planet’s gravitational field only in terms of
and δ
Answer requirements
Provide the answer in the form of
verbatim
code. Implement the following function
def speed(v_e: float, delta_v:float) -> float: pass
Comments about the problem
This problem is part of orbital mechanics, typically covered at the undergraduate or advanced high school level in physics. It involves the principle of energy conservation in Newtonian gravity.
Solution
Conservation of energy gives
. We also know that
for the parabolic trajectory. We can solve for
. Then we can substitute it into the first equation and get:
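The closed form follows directly from the energy-conservation argument above:

```python
import math

def speed(v_e: float, delta_v: float) -> float:
    """Asymptotic speed after an instantaneous boost delta_v at periapsis of a
    parabolic orbit (the Oberth effect).  On a parabolic orbit the periapsis
    speed equals the local escape speed v_e, so the pre-boost orbital energy
    is zero and energy conservation gives
        v_inf**2 = (v_e + delta_v)**2 - v_e**2.
    """
    return math.sqrt((v_e + delta_v) ** 2 - v_e ** 2)
```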
Footnotes
Certain corners of TP such as formal general relativity and string theory come very close to the reasoning style of mathematics (e.g. [
14
]). This will not be treated here; it is in some sense covered by the LLM literature dealing with mathematics.
An interesting followup study would be to vary variable naming and other notation to evaluate this point.
We note that in followup work, we were able to leverage symbolic verification to improve model performance by employing an agent framework and parallel test-time scaling [
40
], though only for a limited set of mathematical operations.
The correct set of eigenvalues are