AI Models Learn to Think, Improve Math

A new framework helps language models refine their problem-solving abilities significantly.

Large Language Models (LLMs) can improve their "higher-order thinking" (HOT), meaning complex skills such as analyzing and creating, when they receive feedback, much as students improve their homework through iterative critique.

Research Objectives

Researchers set out to investigate two main questions:

  • Can LLMs enhance their HOT skills when solving math word problems (MWPs)?
  • How do different LLMs perform across various thinking levels?

Together, these questions probe both how well LLMs reason at each level and how much they can improve.


Introducing the THINK Evaluation System

The study utilized a novel multi-agent evaluation system called THINK.

THINK comprises:

  • Six distinct agent programs, each representing a different level of thinking.
  • One overarching evaluation agent responsible for holistic assessment.
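
One way to picture this layout is as a panel of per-level scorers plus a single holistic scorer. The sketch below is illustrative only and is not the authors' implementation. The six level names assume Bloom's taxonomy (the article itself names only remembering, applying, analyzing, and creating), the 0 to 100 scale is an assumption, and `ThinkPanel`, `make_stub_agent`, and the use of a plain mean for the holistic score are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumption: the six levels follow Bloom's taxonomy. The article only
# names remembering, applying, analyzing, and creating explicitly.
LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

# Hypothetical interface: a level agent maps a piece of work to a score
# on an assumed 0-100 scale.
LevelAgent = Callable[[str], float]


@dataclass
class ThinkPanel:
    """Six per-level agents plus one holistic evaluation step (sketch)."""

    level_agents: Dict[str, LevelAgent]

    def score_levels(self, work: str) -> Dict[str, float]:
        # Ask every level agent to score the same piece of work.
        return {level: agent(work) for level, agent in self.level_agents.items()}

    def holistic(self, work: str) -> float:
        # Placeholder for the overarching evaluation agent: a plain mean
        # of the level scores. The real agent may work quite differently.
        scores = self.score_levels(work)
        return sum(scores.values()) / len(scores)


def make_stub_agent(fixed_score: float) -> LevelAgent:
    # Stub that ignores its input; a real agent would call an LLM.
    return lambda work: fixed_score


# Stub scores 70, 75, 80, 85, 90, 95 for the six levels.
panel = ThinkPanel(
    {level: make_stub_agent(70.0 + 5.0 * i) for i, level in enumerate(LEVELS)}
)
level_scores = panel.score_levels("a candidate math word problem")
overall = panel.holistic("a candidate math word problem")
```

In practice each stub would be replaced by a prompt to an LLM acting as the agent for that level.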

The researchers tested seven LLMs, including GPT-4O and MISTRAL-8B-IT, on a set of 120 math problems. Some of these problems were sourced from social media, and others were newly generated by GPT-4O.

The LLMs followed an iterative process:

  1. Generated initial problem solutions.
  2. Received structured feedback from the THINK system.
  3. Revised their work, focusing on "five keys" such as math concepts and clear narratives.
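
The three steps above can be sketched as a simple feedback loop. This is a minimal illustration under stated assumptions, not the procedure from the paper: the function `refine`, the stopping rule (`max_rounds` and `target`), and the toy generate, critique, and revise functions are all hypothetical.

```python
from typing import Callable, List, Tuple


def refine(
    generate: Callable[[], str],
    critique: Callable[[str], Tuple[float, str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,      # assumed cap on feedback rounds
    target: float = 85.0,     # assumed score at which the loop stops
) -> Tuple[str, List[float]]:
    """Generate work, then alternate critique and revision."""
    work = generate()                      # step 1: initial attempt
    history: List[float] = []
    for _ in range(max_rounds):
        score, feedback = critique(work)   # step 2: structured feedback
        history.append(score)
        if score >= target:
            break
        work = revise(work, feedback)      # step 3: redo the work
    return work, history


# Toy stand-ins so the loop can run without any model calls.
def toy_generate() -> str:
    return "draft"


def toy_critique(work: str) -> Tuple[float, str]:
    # Each "+" marks one completed revision and is worth 10 points.
    return 60.0 + 10.0 * work.count("+"), "clarify the math concept"


def toy_revise(work: str, feedback: str) -> str:
    return work + "+"


final, scores = refine(toy_generate, toy_critique, toy_revise, max_rounds=5)
```

With the toy functions, the recorded scores rise by 10 points per round until they pass the target, which mirrors the idea that each round of critique should leave the work measurably better.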

Key Findings

The study revealed several critical insights:

  • LLMs perform well on simpler tasks, such as "remembering" facts.
  • However, they struggle more with "applying" knowledge.
    • Example: GPT-4O scored 86.92 in remembering but dropped to 76.71 in applying.
  • Proprietary (non-open-source) models, such as GPT-4O, generally outperformed open-source alternatives.
  • Crucially, receiving iterative feedback significantly helped LLMs improve their deeper reasoning skills.
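
To put the GPT-4O example in concrete terms, the gap between the two scores can be computed directly. The snippet assumes both scores are on the same scale, so that subtracting them is meaningful; the variable names are illustrative.

```python
# GPT-4O scores quoted in the findings above.
remember_score = 86.92
apply_score = 76.71

# Absolute gap in points, and the gap relative to the "remembering" score.
drop = remember_score - apply_score
relative_drop = drop / remember_score

print(f"{drop:.2f} points, {relative_drop:.1%} relative")
# prints "10.21 points, 11.7% relative"
```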

"By making models 'think-aloud' through iterative critique, THINK offers a scalable, principled approach for the community to both measure and advance LLM cognition," the authors stated. "This paves the way for more robust reasoning capabilities in educational and real-world applications."


Limitations and Future Directions

The researchers acknowledge certain limitations:

  • The study used a specific set of flawed math problems, which might restrict the types of errors identifiable.
  • The evaluation method, while efficient, may not capture every facet of reasoning quality.

Future research aims to:

  • Explore more diverse problem sets.
  • Develop more comprehensive evaluation methods.

The framework suggests that LLMs can improve their higher-order thinking skills through structured, iterative feedback, a step toward models that reason and create more reliably.


Reference

Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski. "THINK: Can Large Language Models Think-aloud?" arXiv preprint arXiv:2505.20184v1 (2025).