In the High-Stakes World of Robotics, We Have Long Forced Machines to Think in the Short Term

Traditional Deep Reinforcement Learning (DRL) relies on a mathematical "discount factor," a weight between 0 and 1 that makes a robot prize immediate rewards over distant ones and keeps the sum of future rewards finite. But for a robot designed for "continuing tasks," like walking for hours or managing a power grid, this creates a mismatch between the training objective and the real goal. The robot is trained to maximize a discounted return that favors the near term, even though its actual mission is to perform consistently well over an infinite horizon.
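To make the mismatch concrete, here is a minimal Python sketch comparing the two objectives on two made-up reward streams. The streams, the function names, and the discount value of 0.9 are assumptions chosen for illustration; none of them come from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def average_reward(rewards):
    """Mean reward per step over the same sequence."""
    return sum(rewards) / len(rewards)


# Two hypothetical 1000-step reward streams (illustrative only).
early_burst = [1.0] * 10 + [0.0] * 990   # strong start, then nothing
steady_pace = [0.0] * 10 + [0.5] * 990   # slow start, then consistent reward

for name, stream in (("early_burst", early_burst), ("steady_pace", steady_pace)):
    print(name,
          "discounted:", round(discounted_return(stream, 0.9), 3),
          "average:", round(average_reward(stream), 3))
```

Under the discounted objective the early burst scores higher, while the per-step average favors the steady stream, which is the behavior a continuing task actually calls for.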

The Breakthrough: RVI-SAC

Researchers from the Tokyo Institute of Technology have now bridged this gap with RVI-SAC, a new algorithm that abandons the discount factor in favor of the "average reward criterion."

The Problem RVI-SAC Solves

This matters because it sidesteps the "Goldilocks" problem of robot training. Usually, if you set the discount rate too low, the robot is shortsighted; if you set it too close to 1.0 to approximate the long term, training often becomes unstable. RVI-SAC removes this sensitive hyperparameter entirely, allowing robots to learn complex movements like walking or swimming without a discount rate to tune.
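A rough way to see the trade-off is the common rule of thumb that a discount rate gamma corresponds to an "effective horizon" of about 1 / (1 - gamma) steps, and that a constant reward of r per step has discounted value r / (1 - gamma). The sketch below only illustrates that arithmetic; the function names are hypothetical and the rule of thumb is standard background rather than a result from the paper.

```python
def effective_horizon(gamma):
    """Approximate number of steps that matter under discount rate gamma."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def constant_reward_value(reward_per_step, gamma):
    """Discounted value of receiving the same reward forever."""
    return reward_per_step * effective_horizon(gamma)


for gamma in (0.9, 0.99, 0.999):
    print(gamma, effective_horizon(gamma), constant_reward_value(1.0, gamma))
```

Pushing gamma toward 1 lengthens the horizon, but it also inflates the scale of the values the network has to fit, which is one plausible reason such settings are harder to train.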

Key Technical Innovations

The team achieved this by combining three components:

Core Algorithm Components

Relative Value Iteration (RVI): A foundational mathematical framework used for evaluating long-term performance without a discount.
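Delayed f(Q) Update: A stabilizing technique that updates the offset term f(Q) used in the RVI target gradually, instead of recomputing it from each noisy batch of samples. This keeps variance in the data from destabilizing the neural network and makes learning smoother.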
Automated Reset Cost (set at 10⁻³): Assigns "falling over" a specific, measurable price. Because a reset is treated as a penalized step rather than the end of an episode, the algorithm can treat a series of attempts as one continuous learning experience.
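The first two components can be illustrated in a tabular toy setting. The sketch below is a simplified, deterministic reading of relative value iteration on Q-values with a slowly tracked offset, not the paper's implementation: the two-state MDP, the choice of f(Q) as the mean of all Q-values, the step sizes, and every name in the code are assumptions made for illustration. RVI-SAC itself is sample-based and uses neural networks.

```python
# Tiny deterministic continuing MDP (hypothetical, for illustration only).
# States: 0 and 1. Actions: 0 = stay, 1 = switch to the other state.
REWARD = {(0, 0): 0.2, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}


def next_state(state, action):
    return state if action == 0 else 1 - state


def rvi_q_iteration(alpha=0.1, beta=0.01, sweeps=5000):
    """Synchronous RVI-style Q updates with a delayed estimate of f(Q)."""
    q = {sa: 0.0 for sa in REWARD}
    xi = 0.0  # slowly tracked stand-in for f(Q); assumption: f(Q) = mean of Q
    for _ in range(sweeps):
        new_q = {}
        for (s, a), r in REWARD.items():
            s2 = next_state(s, a)
            best_next = max(q[(s2, 0)], q[(s2, 1)])
            # Undiscounted target: subtract the offset instead of discounting.
            target = r - xi + best_next
            new_q[(s, a)] = q[(s, a)] + alpha * (target - q[(s, a)])
        # Delayed update: move xi a small step toward the current f(Q).
        f_q = sum(q.values()) / len(q)
        xi += beta * (f_q - xi)
        q = new_q
    return q, xi


q_values, reward_rate = rvi_q_iteration()
print("estimated average reward:", round(reward_rate, 4))
```

In this toy problem the best long-run behavior is to move to state 1 and stay there, earning 1.0 per step, and the tracked offset should settle close to that value. No discount factor appears anywhere in the update.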

Performance and Results

Across trials with 10 random seeds, RVI-SAC performed as well as or better than discounted SAC on the complex locomotion tasks below.

Benchmark Performance

Swimmer: RVI-SAC matched the performance of traditional SAC with its discount rate set closest to 1, the most far-sighted (yet notoriously unstable) setting.
Ant, Walker2d, and Humanoid: RVI-SAC outperformed the widely used "Soft Actor-Critic" (SAC) algorithm on these more demanding benchmarks.

The Path Forward and Current Limitations

Although RVI-SAC is a major step forward, the path to perfectly tireless robots still has hurdles. The researchers openly discuss the current boundaries of their work.

Areas for Future Research

Theoretical Proof: The team proved that their approach converges in simple "tabular" settings, but a full theoretical proof for more complex, non-linear neural networks is deferred to future work.
New Environments: While the algorithm mastered the MuJoCo locomotion tasks, its performance in pixel-based environments or discrete-action tasks remains an open question for the next generation of researchers.

Key Takeaway: As the authors noted, "while traditional SAC using a discount rate may be significantly impacted by the choice of discount rate, RVI-SAC using the average reward resolves this issue." This marks a significant step toward creating stable, long-term thinkers in the world of autonomous machines.

Reference: RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning. Yukinari Hisaki and Isao Ono. Tokyo Institute of Technology. Proceedings of the 41st International Conference on Machine Learning (ICML) 2024.