AI Models Learn to Think, Improve Math
A new framework helps language models refine their problem-solving abilities significantly.
Large Language Models (LLMs) can get better at "higher-order thinking" (HOT), meaning complex skills such as analyzing and creating, when they receive feedback, much as students improve their homework through iterative critique.
Research Objectives
Researchers set out to investigate two main questions:
- Can LLMs enhance their HOT skills when solving math word problems (MWPs)?
- How do different LLMs perform across various thinking levels?
Together, these questions cover both how well LLMs reason at each thinking level and how much they improve with feedback.
Introducing the THINK Evaluation System
The study utilized a novel multi-agent evaluation system called THINK.
THINK comprises:
- Six distinct agent programs, each representing a different level of thinking.
- One overarching evaluation agent responsible for holistic assessment.
The researchers tested seven LLMs, including GPT-4O and MISTRAL-8B-IT, on a set of 120 math problems. Some of these problems were sourced from social media, and others were newly generated by GPT-4O.
The LLMs followed an iterative process:
- Generated initial problem solutions.
- Received structured feedback from the THINK system.
- Revised their work, focusing on "five keys" such as math concepts and clear narratives.
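The generate, critique, and revise cycle described above can be pictured as a simple loop. The following is only a minimal sketch, not the authors' implementation: the level names assume Bloom's revised taxonomy (the article itself names only remembering, applying, analyzing, and creating), and the stub agents, the `threshold` of 85, and the `max_rounds` limit are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: six thinking levels named after Bloom's revised taxonomy.
LEVELS = ["remembering", "understanding", "applying",
          "analyzing", "evaluating", "creating"]


@dataclass
class Feedback:
    scores: Dict[str, float]  # one score per thinking level
    overall: float            # holistic score from the evaluation agent
    comments: List[str]       # structured critique handed back to the model


def evaluate(draft: str,
             level_agents: Dict[str, Callable[[str], Tuple[float, str]]],
             holistic_agent: Callable[[str, Dict[str, float]], float]) -> Feedback:
    """Run every level agent on the draft, then the holistic agent."""
    scores: Dict[str, float] = {}
    comments: List[str] = []
    for level, agent in level_agents.items():
        score, comment = agent(draft)
        scores[level] = score
        comments.append(f"{level}: {comment}")
    return Feedback(scores, holistic_agent(draft, scores), comments)


def refine(generate, revise, level_agents, holistic_agent,
           threshold: float = 85.0, max_rounds: int = 3):
    """Generate a draft, then alternate feedback and revision."""
    draft = generate()
    history: List[Feedback] = []
    for _ in range(max_rounds):
        feedback = evaluate(draft, level_agents, holistic_agent)
        history.append(feedback)
        if feedback.overall >= threshold:  # hypothetical stopping rule
            break
        draft = revise(draft, feedback)
    return draft, history


# Toy stand-ins for real LLM calls: each revision adds 15 points.
def make_stub_agent(level: str):
    def agent(draft: str) -> Tuple[float, str]:
        score = min(100.0, 60.0 + 15.0 * draft.count("[revised]"))
        return score, f"stub comment for {level}"
    return agent


level_agents = {level: make_stub_agent(level) for level in LEVELS}


def holistic(draft: str, scores: Dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)


final, history = refine(lambda: "draft",
                        lambda d, fb: d + " [revised]",
                        level_agents, holistic)
print([fb.overall for fb in history])  # [60.0, 75.0, 90.0]
```

In this toy run the loop stops on the third evaluation, once the stubbed overall score passes the assumed threshold. In a real setup each agent and the `generate` and `revise` callables would wrap model calls.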
Key Findings
The study revealed several critical insights:
- LLMs perform well on simpler tasks, such as "remembering" facts.
- However, they struggle more with "applying" knowledge. For example, GPT-4O scored 86.92 in remembering but dropped to 76.71 in applying.
- Proprietary (non-open-source) models, such as GPT-4O, generally outperformed open-source alternatives.
- Crucially, receiving iterative feedback significantly helped LLMs improve their deeper reasoning skills.
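To put the GPT-4O example in concrete terms, the gap between the two reported scores can be computed directly. This is a small illustrative calculation using only the two figures quoted above. The article does not state the scoring scale, so the relative drop is simply the ratio of the reported numbers.

```python
# Scores for GPT-4O as reported in the article.
remembering = 86.92
applying = 76.71

# Absolute gap in score points, rounded to two decimals.
gap = round(remembering - applying, 2)

# Drop relative to the remembering score, as a percentage.
relative_drop = round(100 * (remembering - applying) / remembering, 1)

print(gap)            # 10.21
print(relative_drop)  # 11.7
```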
"By making models 'think-aloud' through iterative critique, THINK offers a scalable, principled approach for the community to both measure and advance LLM cognition," the authors stated. "This paves the way for more robust reasoning capabilities in educational and real-world applications."
Limitations and Future Directions
The researchers acknowledge certain limitations:
- The study used a specific set of flawed math problems, which might restrict the types of errors identifiable.
- The evaluation method, while efficient, may not capture every facet of reasoning quality.
Future research aims to:
- Explore more diverse problem sets.
- Develop more comprehensive evaluation methods.
The results suggest that LLMs can improve their thinking skills through a structured, feedback-driven process, a step toward machines that reason and create more reliably.
Reference
Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski. "THINK: Can Large Language Models Think-aloud?" arXiv preprint arXiv:2505.20184v1 (2025).