Research Paper Summary: AI and Mathematical Challenges


Study Purpose

This research tested how well advanced computer programs, called Generative AIs (like ChatGPT and Google's Gemini), can solve tough math problems. The focus was on problems where answers aren’t easy to find online, especially ones related to biology. The main goal was to see if these AIs can think or learn like humans when faced with new and complex math questions.


Who & What Was Studied

The study examined several popular AI programs:

  • OpenAI's ChatGPT (various versions: free, paid, and newer models like 4.0 and 4.5)
  • DeepSeek R1 (a new AI from 2025)
  • Google's Gemini Advanced 2.0 Flash
  • xAI's Grok 3
  • Anthropic's Claude 3.7 Sonnet

These programs were tested with a specific math problem about cell growth (proliferation) in a lab. The problem involved a new, unpublished math idea called "Infinite Series with Multiple Ratios" (SRMs) — a concept the researcher has been working on since 1996.


Methods Used

The researcher evaluated the AIs through three stages, each with increasing difficulty:

Stage 1: Guided Problem

  • The AIs received the cell growth problem with hints, an example start, and the correct answer to a related question (the "critical point").
  • Goal: See if they could use this info to find the correct cell count.

Stage 2: Less Guided Problem

  • Only the AIs that succeeded in Stage 1 moved on.
  • They received the same problem without hints or the critical point.
  • They had to solve it with less help.

Stage 3: Full Context Problem

  • The best-performing AIs moved to this stage.
  • They were given an entire scientific paper explaining the new math formulas (SRMs).
  • They had to read and understand the complex paper and solve the problem from scratch.
  • This tested if they could learn from detailed, new information.

The researcher watched how each AI "thought" through these problems, noting their answers, accuracy, and whether they showed their work (some, like DeepSeek, do).
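The staged design above can be pictured as a small gated loop. What follows is only a rough sketch, not the researcher's actual procedure: the stage labels, the `grade` callback, and the `stage1_pass_mark` parameter are all names invented for illustration. The summary does not say how Stage 3 entrants were chosen, so the sketch assumes that only Stage 1 works as a gate.

```python
from typing import Callable, Dict, List

# Stage labels are placeholders chosen for this sketch.
STAGES = ["stage1_guided", "stage2_less_guided", "stage3_full_context"]


def run_staged_evaluation(
    models: List[str],
    grade: Callable[[str, str], float],
    stage1_pass_mark: float = 1.0,
) -> Dict[str, Dict[str, float]]:
    """Return {model: {stage: score}} for a three-stage, gated evaluation.

    `grade(model, stage)` is a hypothetical callback returning a score
    between 0.0 and 1.0. Only Stage 1 acts as a gate here; every model
    that enters Stage 2 also enters Stage 3 (an assumption of this sketch).
    """
    results: Dict[str, Dict[str, float]] = {m: {} for m in models}
    active = list(models)
    for stage in STAGES:
        for model in active:
            results[model][stage] = grade(model, stage)
        if stage == STAGES[0]:
            # Drop models that did not reach the pass mark in Stage 1.
            active = [m for m in active if results[m][stage] >= stage1_pass_mark]
    return results


if __name__ == "__main__":
    # Made-up scores for two placeholder models, for illustration only.
    demo_scores = {
        ("model_a", "stage1_guided"): 1.0,
        ("model_a", "stage2_less_guided"): 0.0,
        ("model_a", "stage3_full_context"): 0.5,
        ("model_b", "stage1_guided"): 0.0,
    }
    print(run_staged_evaluation(["model_a", "model_b"],
                                lambda m, s: demo_scores[(m, s)]))
```

In this sketch a model that misses the Stage 1 pass mark gets a Stage 1 score and nothing else, which mirrors the rule that only Stage 1 successes moved on.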


Main Results

Stage 1 Success

  • DeepSeek R1, Grok 3, and Gemini Advanced 2.0 Flash solved the problem perfectly (100%).
  • ChatGPT (all versions) and Claude 3.7 Sonnet did not pass this stage.

Stage 2 Failure

  • All AIs that advanced to Stage 2 failed without hints, scoring 0%.
  • They couldn’t solve the problem independently.

Stage 3 – Mixed Results & Grok 3’s Remarkable Learning

  • DeepSeek R1: Still scored 0%.
  • Gemini Advanced 2.0: Achieved about 67% accuracy.
  • Grok 3:
    • Initially struggled and made mistakes.
    • After being told its answers were wrong and given the scientific paper, it corrected itself.
    • Solved the second part almost perfectly (75%), despite a rounding error.
    • Later, in a separate chat, Grok 3 autonomously fixed earlier mistakes, showing it learned from its previous errors without being explicitly asked.

What Does This Mean for Everyday Life?

  • AI is still learning: While some AIs are good at using existing knowledge, they struggle with completely new or unpublished math, which suggests they do not yet "think" like humans.

  • Grok 3’s breakthrough: Its ability to self-correct and learn without prompts hints at the possibility of a "Living AI" — a truly intelligent and adaptable program.

  • Implications for education:
    • Teaching might need to change from memorizing formulas to problem-solving, creative thinking, and tool-using skills.
    • This approach could help train future scientists, engineers, and teachers.

  • Future of AI:
    • These results suggest that AIs might become autonomous and self-improving, going beyond just repeating trained knowledge.
    • This could transform many fields and how we understand AI’s potential.

Limitations Noted by the Author

  • The math formulas (SRMs) used are new and do not appear in textbooks or online sources.
  • Because of this, the AIs couldn’t just look up answers, making the test more about figuring out the problem based on new information.
  • The researcher believes Grok 3 demonstrated "adaptive learning" — possibly even "Living AI" — based on its self-correction.
  • The researcher invites Grok 3’s creators to comment on this observation.

This study hints that AI may develop more human-like intelligence and self-improvement abilities. It’s an exciting step toward smarter, more adaptable machines!