TL;DR: We introduce the RLA World Model, which predicts future DINO tokens accurately and efficiently, enabling a minimalist world action model and visual RL carried out entirely inside the world model.
Given an input frame (from the validation set) at $t=0$, each world model predicts future frames at $t=h, t=2h, \dots$ from the input action chunks $a_{t:t+h}, a_{t+h:t+2h}, \dots$. We use the official action space for each task in IWS; for all ManiSkill tasks, the action space is simply the robot's joint angles.
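The rollout is autoregressive: each predicted state is fed back in with the next action chunk. Below is a minimal sketch of this loop; `world_model.predict` and the tensor shapes are illustrative assumptions, not the authors' actual API.

```python
import torch

def rollout(world_model, s0, action_chunks):
    """s0: (B, N, D) DINO tokens of the t=0 frame; action_chunks: (B, K, h, A)."""
    states = [s0]
    for k in range(action_chunks.shape[1]):
        a = action_chunks[:, k]                             # a_{t+kh : t+(k+1)h}
        states.append(world_model.predict(states[-1], a))   # \hat{s}_{t+(k+1)h}
    return torch.stack(states, dim=1)                       # (B, K+1, N, D) predicted tokens
```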
RLA-WM predicts future states by generating the residual latent action $z$ via flow matching and decoding $\hat{s}_{t+h}$ with the pre-trained RLA decoder $f_\text{dec}$.
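A single prediction step therefore integrates a learned flow to sample $z$ and then decodes. The sketch below assumes a velocity network `v_theta` trained with flow matching and uses simple Euler integration; the names `v_theta`, `f_dec`, `z_dim`, and `n_steps` are illustrative.

```python
import torch

@torch.no_grad()
def predict_next_state(v_theta, f_dec, s_t, a_chunk, z_dim, n_steps=10):
    """Sample z by integrating the learned flow, then decode the next state."""
    B = s_t.shape[0]
    z = torch.randn(B, z_dim, device=s_t.device)       # z_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):                           # Euler steps along dz/dtau = v_theta
        tau = torch.full((B,), i * dt, device=s_t.device)
        z = z + dt * v_theta(z, tau, s_t, a_chunk)     # velocity conditioned on (s_t, a)
    return f_dec(s_t, z)                               # \hat{s}_{t+h} = f_dec(s_t, z)
```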
We add a linear layer on top of a BC policy to predict the RLA $z$, allowing the policy to learn from videos for which actions and proprioceptive states are unavailable. Because RLA enables accurate prediction of $s_{t+h}$, this simple framework becomes a minimalist world action model that improves performance without additional inference cost or coupling to a video backbone.
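The sketch below shows how minimal this head is: one linear layer on the policy's features, trained with a regression loss against the $z$ inferred from consecutive frames. `RLAHead`, `video_only_loss`, and the target pipeline are assumed names for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RLAHead(nn.Module):
    """Single linear layer on top of the BC policy's features, predicting z."""
    def __init__(self, feat_dim: int, z_dim: int):
        super().__init__()
        self.to_z = nn.Linear(feat_dim, z_dim)   # the only added parameters

    def forward(self, policy_feat: torch.Tensor) -> torch.Tensor:
        return self.to_z(policy_feat)

def video_only_loss(head, policy_feat, z_target):
    # z_target: RLA inferred from consecutive frames (s_t, s_{t+h}) by the
    # pre-trained RLA encoder -- no actions or proprioception required.
    return F.mse_loss(head(policy_feat), z_target)
```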
We present the first demonstration of visual RL inside a world model learned only from offline videos, which we refer to as World Model-based Reinforcement Learning (WMRL).
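Schematically, WMRL collects transitions purely in imagination: the policy acts, the world model predicts the next DINO-token state, and a reward is computed on that predicted state. The names `reward_fn` and `world_model.predict` below are assumptions, and the RL update itself (e.g. PPO) is left abstract.

```python
import torch

@torch.no_grad()
def imagined_rollout(world_model, policy, reward_fn, s0, horizon):
    """Collect transitions entirely inside the learned world model."""
    s, transitions = s0, []
    for _ in range(horizon):
        a = policy(s)                           # action chunk a_{t:t+h}
        s_next = world_model.predict(s, a)      # imagined step, no simulator
        transitions.append((s, a, reward_fn(s_next), s_next))
        s = s_next
    return transitions                          # feed to any RL update, e.g. PPO
```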
@article{zhang2026learning,
  title={{Learning Visual Feature-Based World Models via Residual Latent Action}},
  author={Zhang, Xinyu and Xu, Zhengtong and Tao, Yutian and Wang, Yeping and She, Yu and Boularias, Abdeslam},
  journal={arXiv preprint arXiv:2605.07079},
  year={2026},
  eprint={2605.07079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}