RIZE: Adaptive Regularization for Imitation Learning

Adib Karimi, Mohammad Mehdi Ebadzadeh
Amirkabir University of Technology
Transactions on Machine Learning Research (TMLR) 2025

Abstract

We propose a novel Inverse Reinforcement Learning (IRL) method that mitigates the rigidity of fixed reward structures and the limited flexibility of implicit reward regularization. Building on the Maximum Entropy IRL framework, our approach incorporates a squared temporal-difference (TD) regularizer with adaptive targets that evolve dynamically during training, thereby imposing adaptive bounds on recovered rewards and promoting robust decision-making. To capture richer return information, we integrate distributional RL into the learning process. Empirically, our method achieves expert-level performance on complex MuJoCo and Adroit environments, surpassing baseline methods on the Humanoid-v2 task with limited expert demonstrations. Extensive experiments and ablation studies further validate the effectiveness of the approach and provide insights into reward dynamics in imitation learning.

Figure: RIZE achieves expert-level performance. Average performance over the last third of episodes across MuJoCo and Adroit tasks (10 expert demonstrations).

Method

RIZE extends implicit reward regularization in MaxEnt IRL by introducing adaptive targets and replacing the standard point-estimate Q-network with an Implicit Quantile Network (IQN) critic. The critic models the full return distribution $Z^\pi(s,a)$, while the policy is optimized against its expectation.
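
As a concrete reference point, below is a minimal PyTorch sketch of an IQN-style critic whose average over sampled quantile fractions yields the point estimate $Q^\pi(s,a)$ used in the equations that follow. The class name, architecture sizes, and cosine quantile embedding are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class IQNCritic(nn.Module):
    """Sketch of an IQN-style critic: maps (s, a, tau) to a return quantile Z_tau(s, a).

    Architecture sizes and the cosine quantile embedding are illustrative choices.
    """
    def __init__(self, state_dim, action_dim, hidden=256, n_cos=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.cos_embed = nn.Linear(n_cos, hidden)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.register_buffer("i_pi", torch.arange(1, n_cos + 1).float() * torch.pi)

    def forward(self, state, action, taus):
        # taus: (batch, n_taus) quantile fractions sampled from U(0, 1)
        h = self.body(torch.cat([state, action], dim=-1))                             # (B, H)
        phi = torch.relu(self.cos_embed(torch.cos(taus.unsqueeze(-1) * self.i_pi)))   # (B, N, H)
        return self.head(h.unsqueeze(1) * phi).squeeze(-1)                            # (B, N) quantiles

    def q_value(self, state, action, n_taus=32):
        # Q(s, a) = E[Z(s, a)], approximated by the mean over sampled quantiles.
        taus = torch.rand(state.size(0), n_taus, device=state.device)
        return self.forward(state, action, taus).mean(dim=1, keepdim=True)            # (B, 1)
```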

The implicit reward is defined as:

$$ R_Q(s,a) = Q^\pi(s,a) - \gamma V^\pi(s'), \qquad Q^\pi(s,a) = \mathbb{E}\left[Z^\pi(s,a)\right] $$

$$ V^\pi(s') = \mathbb{E}_{a' \sim \pi} \left[ Q^\pi(s',a') - \alpha \log \pi(a'|s') \right] $$
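
Given such a critic and a stochastic policy, the implicit reward follows directly from these definitions. The sketch below assumes the `IQNCritic.q_value` helper from above and a `policy(state)` call returning a sampled action with its log-probability; both interfaces are hypothetical.

```python
import torch

def soft_value(critic, policy, next_state, alpha):
    """Monte Carlo estimate of V^pi(s') = E_{a'~pi}[Q^pi(s', a') - alpha * log pi(a'|s')].

    Assumes `policy(s)` returns (sampled action, log-probability); a hypothetical interface.
    """
    next_action, log_prob = policy(next_state)
    q_next = critic.q_value(next_state, next_action)        # (B, 1)
    return q_next - alpha * log_prob.reshape(-1, 1)

def implicit_reward(critic, policy, state, action, next_state, done, alpha, gamma):
    """R_Q(s, a) = Q^pi(s, a) - gamma * V^pi(s').

    Masking the bootstrap at terminal states is a standard implementation detail
    not shown in the equation above.
    """
    q = critic.q_value(state, action)                       # (B, 1)
    v_next = soft_value(critic, policy, next_state, alpha)  # (B, 1)
    return q - gamma * (1.0 - done) * v_next
```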

We introduce a convex squared TD regularizer with adaptive targets $\lambda^{\pi_E}$ (expert) and $\lambda^\pi$ (policy):

$$ \Gamma(R_Q, \lambda) = \mathbb{E}_{\rho_E}[(R_Q - \lambda^{\pi_E})^2] + \mathbb{E}_{\rho_\pi}[(R_Q - \lambda^\pi)^2] $$

The full critic objective is:

$$ \mathcal{L} = \mathbb{E}_{\rho_E}[R_Q] - \mathbb{E}_{\rho_\pi}[R_Q] - \mathcal{H}(\pi) - c \cdot \Gamma(R_Q, \lambda) $$
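Putting the pieces together, a sketch of the (negated) critic objective on expert and policy batches is shown below. It treats the entropy term as a precomputed scalar and the adaptive targets $\lambda^{\pi_E}$, $\lambda^\pi$ as given; the function name and batching details are assumptions rather than the paper's code.

```python
def critic_objective(r_expert, r_policy, entropy, lam_expert, lam_policy, c):
    """Negated critic objective for gradient descent:

        L = E_rhoE[R_Q] - E_rhopi[R_Q] - H(pi) - c * Gamma(R_Q, lambda),
        Gamma = E_rhoE[(R_Q - lam_E)^2] + E_rhopi[(R_Q - lam_pi)^2].

    r_expert / r_policy are implicit rewards R_Q on expert and policy batches
    (e.g. from `implicit_reward` above); lam_expert / lam_policy are the adaptive targets.
    """
    reg = ((r_expert - lam_expert) ** 2).mean() + ((r_policy - lam_policy) ** 2).mean()
    objective = r_expert.mean() - r_policy.mean() - entropy - c * reg
    return -objective  # minimize the negation to ascend the objective
```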

Targets are optimized to minimize TD error, forming a feedback loop. This yields bounded rewards (Corollary 4.2):

$$ R_Q \in \left[ -\frac{1}{2c} + \min\{\lambda^{\pi_E}, \lambda^\pi\}, \; \frac{1}{2c} + \max\{\lambda^{\pi_E}, \lambda^\pi\} \right] $$
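To see where the $\tfrac{1}{2c}$ margin comes from, an informal first-order argument (not the paper's proof of Corollary 4.2) treats $R_Q$ pointwise as a free variable on expert and policy samples and sets the derivative of the objective to zero:

$$ \frac{\partial}{\partial R_Q}\left[ R_Q - c\,(R_Q - \lambda^{\pi_E})^2 \right] = 0 \;\Rightarrow\; R_Q = \lambda^{\pi_E} + \frac{1}{2c} \quad \text{(expert samples)} $$

$$ \frac{\partial}{\partial R_Q}\left[ -R_Q - c\,(R_Q - \lambda^{\pi})^2 \right] = 0 \;\Rightarrow\; R_Q = \lambda^{\pi} - \frac{1}{2c} \quad \text{(policy samples)} $$

Both stationary points lie inside the interval above, so the optimized reward cannot move more than $\tfrac{1}{2c}$ beyond the current adaptive targets.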

Unlike fixed-target methods, our adaptive bounds evolve with training, preventing reward drift. The IQN critic further stabilizes learning by modeling return stochasticity, enabling robust performance on high-DoF tasks like Humanoid-v2 and Hammer-v1.

Results

RIZE is evaluated on MuJoCo (HalfCheetah, Walker2d, Ant, Humanoid, Hopper) and Adroit (Hammer) tasks with 3 or 10 expert demonstrations. It achieves expert-level returns and outperforms IQ-Learn, LSIQ, SQIL, CSIL, and BC, with the largest gains on Humanoid-v2 and Hammer-v1. Ablations confirm that the adaptive targets and the IQN critic are both key to stable, high-performance imitation.

BibTeX

@article{karimi2025rize,
  title={RIZE: Adaptive Regularization for Imitation Learning},
  author={Karimi, Adib and Ebadzadeh, Mohammad Mehdi},
  journal={Transactions on Machine Learning Research},
  year={2025},
  url={https://openreview.net/forum?id=a6DWqXJZCZ}
}