Rethinking Robot Training: Moving Beyond Discounted Rewards

What if the primary mathematical tool used to train robots today is actually a compromise that hinders their long-term potential? For decades, Deep Reinforcement Learning (DRL) has relied on "discounted rewards"—the idea that a reward tomorrow is worth slightly less than one today.

In continuing tasks such as walking or swimming, which have no natural endpoint, discounting biases the agent toward short-term gains, and engineers must hand-tune the discount rate for each task to get good performance.

The RVI-SAC Breakthrough

Researchers from the Tokyo Institute of Technology have proposed an alternative: RVI-SAC. The algorithm drops discounting in favor of the average reward criterion, which optimizes the long-run reward per step and is a more natural fit for tasks that continue indefinitely.
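
For reference, the two objectives can be written in their standard textbook forms. The notation here ($r_t$ for the reward at step $t$, $\pi$ for the policy, $\rho^\pi$ for the average reward) is assumed for illustration and is not quoted from the paper.

$$
J_\gamma(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]
\qquad \text{versus} \qquad
\rho^\pi = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\left[\sum_{t=0}^{T-1} r_t\right]
$$

The discounted objective weights a reward $t$ steps ahead by $\gamma^{t}$, while the average reward objective weights every step equally.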

Until now, average-reward methods have been notoriously difficult to stabilize in complex, high-dimensional environments.

Solving the "Tuning Trap"

In traditional frameworks like Soft Actor-Critic (SAC), setting the discount rate ($\gamma$) is a delicate balancing act. A value of 0.99 might work, while 0.999 could destabilize training, because value estimates grow as $\gamma$ approaches 1.

The study reports that RVI-SAC sidesteps this sensitivity because it has no discount rate to tune. Robots can learn robust behaviors without per-task adjustment of that hyperparameter.
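
To make the discount-rate sensitivity concrete, the short Python sketch below uses the standard rule of thumb that a discount rate $\gamma$ corresponds to an effective planning horizon of roughly $1/(1-\gamma)$ steps. The function names and the `r_max` parameter are illustrative assumptions, not part of SAC or of the paper.

```python
def effective_horizon(gamma: float) -> float:
    """Approximate number of steps over which rewards matter: 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def max_discounted_return(gamma: float, r_max: float = 1.0) -> float:
    """Upper bound on the discounted return when every reward is at most r_max."""
    return r_max * effective_horizon(gamma)


# Compare the two discount rates mentioned in the article.
for gamma in (0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)), round(max_discounted_return(gamma)))
```

Moving from 0.99 to 0.999 stretches the horizon from about 100 steps to about 1000, and the scale of the values the critic has to estimate grows by the same factor. That is one common explanation for why a seemingly small change in $\gamma$ can matter so much.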

Core Technical Innovations

The team's success stems from several key innovations that stabilized the average reward approach.

The Delayed $f(Q)$ Update

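Average reward learning subtracts a reference value, $f(Q)$, from every update target, and with neural networks that value can swing sharply from one step to the next. The Delayed $f(Q)$ Update acts as a temporal smoother: the target uses a slowly updated estimate of $f(Q)$ rather than its latest value, which keeps training from becoming unstable in complex environments.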
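
The idea can be illustrated in a tabular setting. In relative value iteration (RVI) style Q-learning, the update target is `r - f(Q) + max Q(s', .)`, and the sketch below replaces `f(Q)` with a slowly tracking estimate `xi`. Everything specific here is an assumption made for illustration: the two-state toy MDP, the choice of `f` as the mean of the Q-table, the step sizes `alpha` and `beta`, and the use of synchronous sweeps. It is not the paper's neural-network implementation.

```python
# Toy deterministic MDP (hypothetical): 2 states, 2 actions.
# Action 0 stays in place, action 1 switches state.
# Staying in state 0 pays 1, staying in state 1 pays 2, switching pays 0,
# so the best long-run average reward is 2 (move to state 1 and stay).
N_STATES, N_ACTIONS = 2, 2


def step(state: int, action: int) -> tuple[int, float]:
    """Return (next_state, reward) for the toy MDP."""
    if action == 0:
        return state, (1.0 if state == 0 else 2.0)
    return 1 - state, 0.0


def f(q: list[list[float]]) -> float:
    """Reference function f(Q); here simply the mean of all Q-values."""
    return sum(sum(row) for row in q) / (N_STATES * N_ACTIONS)


def train(sweeps: int = 5000, alpha: float = 0.1, beta: float = 0.01):
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    xi = 0.0  # delayed estimate of f(Q), used in place of f(Q) in the target
    for _ in range(sweeps):
        new_q = [row[:] for row in q]
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                s_next, r = step(s, a)
                target = r - xi + max(q[s_next])
                new_q[s][a] = q[s][a] + alpha * (target - q[s][a])
        q = new_q
        # Slow update: xi trails f(Q) instead of jumping to it every sweep.
        xi += beta * (f(q) - xi)
    return q, xi


q, xi = train()
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES)]
print(f"average-reward estimate: {xi:.3f}, greedy policy: {greedy}")
```

In this toy problem `xi` should settle near the optimal average reward of 2, with the greedy policy moving to state 1 and staying there. Because the toy MDP is deterministic, it shows only the mechanism; the smoothing matters most when `f(Q)` is a noisy estimate computed from mini-batches.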

Automatic Reset Cost Adjustment

The team also addressed the "robot fall" problem. The average reward criterion assumes a task that never ends, so an episode that terminates when the robot falls does not fit the framework. RVI-SAC treats a fall not as an ending but as a costly transition back to a starting state.

The size of that cost is adjusted automatically so that terminations stay at or below a target frequency of $1 \times 10^{-3}$, and the agent learns to avoid failure as part of one continuous life cycle.
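
One simple way to picture this adjustment is a feedback rule that raises the cost when falls are too frequent and lowers it otherwise. The Python sketch below is an assumption-laden illustration: the function names, the learning rate `lr`, and the specific update rule are hypothetical and are not taken from the paper; only the target frequency of $1 \times 10^{-3}$ comes from the text above.

```python
TARGET_RESET_FREQ = 1e-3  # target termination frequency quoted in the article


def penalized_reward(reward: float, terminated: bool, reset_cost: float) -> float:
    """Treat a fall as an ordinary transition that carries an extra cost."""
    return reward - reset_cost if terminated else reward


def update_reset_cost(reset_cost: float, reset_freq: float,
                      target: float = TARGET_RESET_FREQ, lr: float = 10.0) -> float:
    """Raise the cost when resets are too frequent, lower it otherwise (kept >= 0)."""
    return max(0.0, reset_cost + lr * (reset_freq - target))


# A policy that falls once every 200 steps exceeds the target, so the cost grows.
cost = 1.0
cost = update_reset_cost(cost, reset_freq=1 / 200)
print(cost)
```

The appeal of a rule like this is that the engineer specifies how often failure is tolerable rather than guessing how large the penalty should be for each environment.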

Experimental Validation

Testing Methodology

Environments: Tested across six Gymnasium MuJoCo environments, including notoriously difficult tasks like Humanoid and Swimmer.
Statistical Rigor: Each experiment was repeated with 10 random seeds to reduce the chance that results reflect a lucky run.

Key Results

Consistent Performance: RVI-SAC consistently matched or outperformed the best-tuned traditional SAC across all tested environments.
Swimmer Success: In the Swimmer environment, where standard SAC with $\gamma=0.99$ typically fails, RVI-SAC matched the performance of SAC with a carefully tuned discount rate.
Superior Stability: The new algorithm proved significantly more stable than other average-reward methods like ARO-DDPG.

Current Limitations & Future Work

Despite the breakthrough, the authors acknowledge important limitations and areas for future development.

Mathematical Foundations

The algorithm’s convergence is mathematically proven for simple "tabular" settings, but a formal proof for the complex neural networks used here remains an area for future work.

The Ergodicity Assumption

The framework assumes ergodicity: that the robot can eventually reach any state from any other state. This assumption may not hold in messy real-world environments, where some failures cannot be undone, which complicates deployment outside simulation.
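
The reachability condition described above can be checked directly on a small, hand-written transition graph. The Python sketch below is illustrative only: the graphs and function names are hypothetical, and it tests just the "every state can reach every other state" property, not the fuller technical conditions a formal analysis would require.

```python
from collections import deque


def reachable(graph: dict[int, list[int]], start: int) -> set[int]:
    """Breadth-first search: all states reachable from start (including start)."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def every_state_reaches_every_other(graph: dict[int, list[int]]) -> bool:
    """True if each state can eventually reach all states listed in the graph."""
    states = set(graph)
    return all(reachable(graph, s) == states for s in states)


# A cycle 0 -> 1 -> 2 -> 0 satisfies the condition.
loop = {0: [1], 1: [2], 2: [0]}
# State 2 is a trap (think of a robot that cannot get back up), so it does not.
with_trap = {0: [1], 1: [2], 2: [2]}

print(every_state_reaches_every_other(loop))
print(every_state_reaches_every_other(with_trap))
```

The second graph shows the kind of situation that worries practitioners: once the system enters a state it cannot leave, no long-run average over the whole state space is meaningful.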

The Big Picture: For now, RVI-SAC stands as a powerful new blueprint for autonomous systems. It suggests that the best way to teach a machine to move is to let it look at the big picture rather than just the next step.

Reference:
Hisaki, Y., & Ono, I. (2024). RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning. Proceedings of the 41st International Conference on Machine Learning (ICML). (arXiv:2408.01972v1)