The KazMMLU Benchmark: Bridging AI's Localized Intelligence Gap

What if the world’s most advanced artificial intelligence systems are essentially "lost in translation" when they cross the border into Central Asia? While a chatbot might compose a perfect sonnet in English, it often falters when faced with the nuances of a high school chemistry exam in Almaty or a university law quiz in Astana.

Many existing benchmarks for large language models rely on data translated from English, which strips away the cultural and educational context specific to a country. This matters for the 14 million Kazakh speakers and the millions more who use Russian for administration: an AI that cannot master local knowledge cannot truly serve the public.

The Dataset: KazMMLU

To address this, researchers have unveiled KazMMLU, a rigorous new dataset of 23,000 multiple-choice questions. Its purpose is to push AI through the gauntlet of Kazakhstan’s national curriculum and professional standards.

The benchmark exposes a "digital divide" in machine intelligence: the gap between global AI capabilities and localized, cultural understanding.

Performance Results

The Overall Hierarchy

Two frontier models led the pack:

  • DeepSeek V3 achieved a peak accuracy of 76.9%.
  • GPT-4o followed closely at 76.6%.

Among open-source models, critical for local developers:

  • Gemma-2-27B-IT set the standard at 57.4%.
  • Llama3.1-70B achieved 56.2%.

Key Discoveries & Anomalies

The data uncovers several critical patterns that define the current limits of AI for Kazakhstan.

The Linguistic Bias

Models consistently performed better in Russian than in Kazakh.

  • Example: DeepSeek V3 reached 81.4% in Russian at the high school level but dropped significantly when tested in Kazakh.
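A per-language comparison like this one comes down to scoring each multiple-choice answer and grouping the results by language. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code; the record fields (`language`, `answer`, `predicted`) and the toy records are assumptions made for illustration and are not drawn from KazMMLU.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy in percent} for scored multiple-choice records.

    Each record is a dict holding the gold answer letter, the model's
    predicted letter, and a grouping field such as the question language.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        if rec["predicted"] == rec["answer"]:
            correct[group] += 1
    return {group: 100.0 * correct[group] / total[group] for group in total}


# Toy records invented for illustration only (hypothetical, not KazMMLU data).
records = [
    {"language": "ru", "answer": "A", "predicted": "A"},
    {"language": "ru", "answer": "C", "predicted": "C"},
    {"language": "ru", "answer": "B", "predicted": "D"},
    {"language": "ru", "answer": "D", "predicted": "D"},
    {"language": "kk", "answer": "A", "predicted": "A"},
    {"language": "kk", "answer": "B", "predicted": "C"},
    {"language": "kk", "answer": "C", "predicted": "C"},
    {"language": "kk", "answer": "D", "predicted": "A"},
]

scores = accuracy_by_group(records, "language")
print(scores)
```

The same function can group by any other field, such as education level or subject, by passing a different `key`.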

There is also a "reasoning trigger": models across the board performed better when prompted in English, even when answering questions about Kazakhstan. This suggests the AI's "brain" is still tethered to its English-centric training.

The Instruction Tuning Paradox

The effect of instruction tuning varied dramatically by model size:

  • Smaller models like Llama3.1-8B saw a +4.9% accuracy gain.
  • The larger Llama3.1-70B actually saw its accuracy drop by 7.9%.

Calibration & Negation Sensitivity

The study found a high calibration correlation (r > 0.9) between model confidence and accuracy, meaning that the more confident a model was in an answer, the more likely that answer was to be correct.
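One common way to arrive at a figure like r is to compute the Pearson correlation between a model's confidence and its observed accuracy. The sketch below assumes predictions have already been grouped into confidence bins; this summary does not say how the authors computed their correlation, so the binning and the toy numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical per-bin values: average model confidence in each bin and
# the fraction of answers in that bin that were actually correct.
mean_confidence = [0.30, 0.50, 0.70, 0.90]
observed_accuracy = [0.35, 0.48, 0.66, 0.88]

r = pearson_r(mean_confidence, observed_accuracy)
print(f"r = {r:.3f}")
```

A value of r close to 1 says that confidence and accuracy rise together; it does not by itself say the confidence values are numerically accurate, which would need a separate measure such as expected calibration error.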

However, they were easily tripped up by "Negation Sensitivity." In a subset of 2,554 questions, the presence of a simple "not" or "except" caused accuracy to drop.

  • Example: Llama3.1-70B's score in Reading Literacy fell from 57.1% to 50.0% when negations were present.
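A negation comparison of this kind can be approximated by flagging questions that contain a negation cue and scoring the two groups separately. The sketch below is an assumption-laden illustration: the English cue list, the record fields, and the toy questions are all hypothetical, and a real analysis of KazMMLU would need Kazakh and Russian cues and the authors' own subset definition.

```python
import re

# Hypothetical English cue list; whole-word match so "notable" is not flagged.
NEGATION_CUES = re.compile(r"\b(not|except)\b", re.IGNORECASE)


def has_negation(question):
    """Return True if the question text contains a negation cue word."""
    return bool(NEGATION_CUES.search(question))


def negation_split_accuracy(records):
    """Accuracy in percent for questions with and without negation cues."""
    counts = {True: [0, 0], False: [0, 0]}  # flag -> [correct, total]
    for rec in records:
        bucket = counts[has_negation(rec["question"])]
        bucket[1] += 1
        bucket[0] += int(rec["predicted"] == rec["answer"])

    def percent(correct, total):
        return 100.0 * correct / total if total else None

    return {
        "with_negation": percent(*counts[True]),
        "without_negation": percent(*counts[False]),
    }


# Toy records invented for illustration only (not KazMMLU data).
records = [
    {"question": "Which of these is not a noble gas?", "answer": "B", "predicted": "A"},
    {"question": "All of the following are rivers except:", "answer": "C", "predicted": "C"},
    {"question": "Which city is the capital of Kazakhstan?", "answer": "A", "predicted": "A"},
    {"question": "What is the chemical symbol for gold?", "answer": "D", "predicted": "D"},
]

split = negation_split_accuracy(records)
print(split)
```

Comparing the two numbers per subject, as in the Reading Literacy example above, shows where negation costs a model the most.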

Limitations & The Path Forward

Despite these insights, the researchers admit the benchmark has its limits.

The benchmark is strictly text-based, ignoring visual data common in modern exams. Furthermore, specialized university-level questions were predominantly available in Russian, leaving a gap in our understanding of high-level Kazakh reasoning.

Key Takeaway: Bridging this intelligence chasm will require more than just translation; it demands AI built from the ground up with the "localized signal" of Kazakhstan in mind.


Reference: Togmanov, M., Mukhituly, N., Turmakhan, D., et al. (2025). KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan. arXiv:2502.12829v2 [cs.CL].