https://iopscience.iop.org/article/10.1088/2632-2153/adfcb0

…data set to test TP reasoning skills over a broad range of difficulty. We aim to answer the following questions: How good is current state-of-the-art AI at problem-solving in TP? Are existing models useful for research-level reasoning? What are the most common failure modes? …