AI Models Learn to Think, Improve Math

A new framework uses iterative feedback to help language models refine their problem-solving abilities.

Large Language Models (LLMs) can improve at "higher-order thinking" (HOT), meaning complex skills like analyzing and creating, when they receive feedback, much as students improve their homework through iterative critique.

Research Objectives

Researchers set out to investigate two main questions:

Can LLMs enhance their HOT skills when solving math word problems (MWPs)?
How do different LLMs perform across various thinking levels?

Together, these questions cover both whether feedback improves LLM reasoning and how performance varies by model and by thinking level.

Introducing the THINK Evaluation System

The study utilized a novel multi-agent evaluation system called THINK.

THINK comprises:

Six distinct agent programs, each representing a different level of thinking.
One overarching evaluation agent responsible for holistic assessment.
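The structure above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the level names assume the revised Bloom's taxonomy (the article names only some of them), and `LevelAgent`, `holistic_evaluate`, `run_think`, and `toy_scorer` are hypothetical names. A simple mean stands in for the holistic agent, whose actual method the article does not describe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: six levels from the revised Bloom's taxonomy.
LEVELS = ["remembering", "understanding", "applying",
          "analyzing", "evaluating", "creating"]


@dataclass
class LevelAgent:
    """One agent per thinking level (hypothetical interface)."""
    level: str
    score_fn: Callable[[str], float]  # returns a score from 0 to 100

    def evaluate(self, solution: str) -> float:
        return self.score_fn(solution)


def holistic_evaluate(level_scores: Dict[str, float]) -> float:
    """Stand-in for the overarching agent: the mean of the level scores."""
    return sum(level_scores.values()) / len(level_scores)


def run_think(solution: str, agents: List[LevelAgent]) -> dict:
    """Collect one score per level, then a single overall score."""
    level_scores = {agent.level: agent.evaluate(solution) for agent in agents}
    return {"levels": level_scores, "overall": holistic_evaluate(level_scores)}


def toy_scorer(solution: str) -> float:
    """Placeholder scorer: 10 points per word, capped at 100."""
    return min(100.0, len(solution.split()) * 10.0)


agents = [LevelAgent(level, toy_scorer) for level in LEVELS]
report = run_think("Add 3 apples and 4 apples to get 7 apples", agents)
```

In a real system each `score_fn` would presumably wrap a call to an LLM prompted for that level; the placeholder scorer only keeps the sketch runnable.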

The researchers tested seven LLMs, including GPT-4O and MISTRAL-8B-IT, on a set of 120 math problems. Some of these problems were collected from social media, and others were newly generated by GPT-4O.

The LLMs followed an iterative process:

Generated initial problem solutions.
Received structured feedback from the THINK system.
Revised their work, focusing on "five keys" such as math concepts and clear narratives.
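The three steps above form a generate, critique, revise loop, sketched below. The function names, the score threshold, and the round limit are all assumptions for illustration, since the article does not state the stopping rule. The toy critic checks only two of the "five keys" (concepts and narrative).

```python
from typing import Callable, List, Tuple


def refine(problem: str,
           generate: Callable[[str], str],
           critique: Callable[[str], Tuple[float, List[str]]],
           revise: Callable[[str, List[str]], str],
           threshold: float = 80.0,   # assumed stopping score
           max_rounds: int = 3        # assumed round limit
           ) -> Tuple[str, List[float]]:
    """Generate a solution, then alternate critique and revision."""
    solution = generate(problem)
    history = []
    for _ in range(max_rounds):
        score, feedback = critique(solution)
        history.append(score)
        if score >= threshold:
            break
        # Note: a revision made in the final round is returned unscored.
        solution = revise(solution, feedback)
    return solution, history


# Toy stand-ins for the model and the evaluator (hypothetical).
def toy_generate(problem: str) -> str:
    return "draft answer"


def toy_critique(solution: str) -> Tuple[float, List[str]]:
    missing = [key for key in ("concept", "narrative") if key not in solution]
    return 100.0 - 30.0 * len(missing), missing


def toy_revise(solution: str, feedback: List[str]) -> str:
    # Address one piece of feedback per round.
    return f"{solution} [{feedback[0]} added]"


final, scores = refine("toy problem", toy_generate, toy_critique, toy_revise)
print(scores)  # [40.0, 70.0, 100.0]
```

With these toy functions the score rises each round as one missing element is addressed, which mirrors the improvement pattern the study reports without reproducing its actual scoring.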

Key Findings

The study revealed several critical insights:

LLMs perform well on simpler tasks, such as "remembering" facts.
However, they struggle more with "applying" knowledge.
- Example: GPT-4O scored 86.92 in remembering but dropped to 76.71 in applying.
Proprietary (non-open-source) models, such as GPT-4O, generally outperformed open-source alternatives.
Crucially, receiving iterative feedback significantly helped LLMs improve their deeper reasoning skills.
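To put the GPT-4O example above in perspective, the gap between the two levels can be computed directly. The snippet uses only the two scores quoted in this article. The relative figure is a derived illustration, not a number reported by the authors.

```python
remembering = 86.92  # GPT-4O score on "remembering", as quoted above
applying = 76.71     # GPT-4O score on "applying", as quoted above

drop = remembering - applying        # absolute gap in score points
relative_drop = drop / remembering   # gap as a fraction of the higher score

print(f"{drop:.2f} points, {relative_drop:.1%}")  # 10.21 points, 11.7%
```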

"By making models 'think-aloud' through iterative critique, THINK offers a scalable, principled approach for the community to both measure and advance LLM cognition," the authors stated. "This paves the way for more robust reasoning capabilities in educational and real-world applications."

Limitations and Future Directions

The researchers acknowledge certain limitations:

The study used a specific set of flawed math problems, which may limit the types of errors the system can identify.
The evaluation method, while efficient, may not capture every facet of reasoning quality.

Future research aims to:

Explore more diverse problem sets.
Develop more comprehensive evaluation methods.

The results suggest that LLMs can improve their higher-order thinking skills through a structured, iterative feedback process, a step toward models that reason and create more reliably.

Reference

Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski. "THINK: Can Large Language Models Think-aloud?" arXiv preprint arXiv:2505.20184v1 (2025).