Learning Visual Feature-Based World Models via Residual Latent Action

1Rutgers University 2Purdue University 3University of Wisconsin-Madison


TL;DR We introduce the RLA World Model, which predicts future DINO tokens accurately and efficiently, enabling a minimalist world action model and visual RL inside the world model.

Summary

  1. RLA is surprisingly easy to learn, yet ideal for modeling state evolution.
    • 1.1 RLA is predictive: decoder $f_\text{dec}$ accurately reconstructs $s_{t+h}$ from RLA $z$ and $s_t$ with plain transformers, no diffusion (see the sketch after this list).
    • 1.2 RLA is generalizable, even when trained on limited data.
    • 1.3 RLA encodes temporal progression in latent space.
  2. We build the RLA World Model (RLA-WM), a simple and efficient state-of-the-art world model.
  3. Applications
    • 3.1 RLA extends a BC policy to a minimalist world action model with just a linear layer, so it can learn directly from videos.
    • 3.2 We demonstrate visual RL inside our RLA-WM, learned from offline videos, without online interaction or handcrafted rewards.
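Below is a minimal PyTorch sketch of the RLA autoencoder idea in 1.1: an encoder compresses the transition $(s_t, s_{t+h})$ into a low-dimensional RLA $z$, and a plain transformer decoder $f_\text{dec}$ reconstructs $s_{t+h}$ from $s_t$ and $z$. The module names, layer counts, and pooling scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of an RLA-style autoencoder; names and sizes are assumptions.
import torch
import torch.nn as nn

class RLAAutoencoder(nn.Module):
    def __init__(self, token_dim=1024, z_dim=64, depth=4, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(token_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.to_z = nn.Linear(token_dim, z_dim)    # pool transition tokens into RLA z
        self.from_z = nn.Linear(z_dim, token_dim)  # broadcast z back into token space
        dec_layer = nn.TransformerEncoderLayer(token_dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, depth)

    def encode(self, s_t, s_th):
        # z summarizes how the DINO tokens evolve from s_t to s_{t+h}.
        h = self.encoder(torch.cat([s_t, s_th], dim=1))
        return self.to_z(h.mean(dim=1))            # (B, z_dim)

    def decode(self, s_t, z):
        # f_dec: reconstruct s_{t+h} from s_t and z with self-attention only, no diffusion.
        return self.decoder(s_t + self.from_z(z).unsqueeze(1))

    def forward(self, s_t, s_th):
        z = self.encode(s_t, s_th)
        return self.decode(s_t, z), z
```

Training would minimize a reconstruction loss such as $\|f_\text{dec}(s_t, z) - s_{t+h}\|^2$ over transition pairs.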
We use $512\times 512$ images, so the DINO features form a $1024\times 1024$ token matrix; yet with an RLA dimension of 2048, or even just 64, we reconstruct $s_{t+h}$ with high fidelity.
These robot-object interactions were never observed during RLA autoencoder training, yet RLA still reconstructs them well.
Interpolating between a Gaussian noise vector and RLA produces frames that approximate temporally intermediate states. This indicates that the RLA latent space inherently captures temporal progression.
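Given a trained autoencoder like the sketch above, this interpolation experiment takes only a few lines; `model`, `s_t`, and `s_th` are placeholders for the trained module and a DINO-token pair.

```python
# Hypothetical noise-to-RLA interpolation; assumes the RLAAutoencoder sketch above.
import torch

@torch.no_grad()
def interpolate_rla(model, s_t, s_th, steps=5):
    z = model.encode(s_t, s_th)                  # RLA of the true transition
    noise = torch.randn_like(z)                  # Gaussian anchor
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z_mix = (1 - alpha) * noise + alpha * z  # linear blend in latent space
        frames.append(model.decode(s_t, z_mix))  # decode an approximate intermediate state
    return frames
```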
We predict RLA to estimate state evolution, instead of absolute states.

Future Prediction

Given an input frame (from the validation set) at $t=0$, each world model predicts future frames at $t=h, t=2h, \dots$ from the input action chunks $a_{t:t+h}, a_{t+h:t+2h}, \dots$. We use the official action space for each task in IWS; the action space of all ManiSkill tasks is simply the robot's joint angles.
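The rollout is autoregressive: each predicted state is fed back as the next input. A minimal sketch, assuming a hypothetical `world_model.predict(state, action_chunk)` interface:

```python
# Hypothetical autoregressive rollout over action chunks a_{t:t+h}, a_{t+h:t+2h}, ...
def rollout(world_model, s0, action_chunks):
    states, s = [s0], s0
    for a_chunk in action_chunks:
        s = world_model.predict(s, a_chunk)  # predict s_{t+h} from s_t and the chunk
        states.append(s)
    return states
```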

RLA World Model

RLA-WM predicts future states by generating the residual latent action $z$ via flow matching and decoding $\hat{s}_{t+h}$ with the pre-trained RLA decoder $f_\text{dec}$.
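A minimal sketch of this two-step prediction, assuming a learned velocity field `v_theta(z_t, t, cond)` conditioned on the current state and action chunk; the linear probability path, Euler sampler, and step count are standard flow-matching choices, not confirmed details of the paper.

```python
# Hypothetical flow matching for the residual latent action z (conditioning,
# network signature, and solver are assumptions).
import torch

def flow_matching_loss(v_theta, z, cond):
    # Regress the velocity of a straight path from noise to the target RLA z.
    noise = torch.randn_like(z)
    t = torch.rand(z.shape[0], device=z.device)
    z_t = (1 - t[:, None]) * noise + t[:, None] * z
    target = z - noise                          # constant velocity along the path
    return ((v_theta(z_t, t, cond) - target) ** 2).mean()

@torch.no_grad()
def sample_rla(v_theta, cond, z_dim=64, steps=10, device="cpu"):
    # Integrate the learned field from noise to an RLA sample with Euler steps.
    z = torch.randn(cond.shape[0], z_dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0],), i * dt, device=device)
        z = z + dt * v_theta(z, t, cond)
    return z  # then decode: s_hat_{t+h} = f_dec(s_t, z)
```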

Learning from Actionless Videos

We add a linear layer on top of a BC policy to predict RLA $z$, which lets the policy learn from videos for which actions and proprioceptive states are unavailable. Because RLA enables accurate prediction of $s_{t+h}$, this simple framework becomes a minimalist world action model that improves performance without additional inference cost or coupling to a video backbone.
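A minimal sketch of this extension; `BCPolicy` with a `features` trunk and `action_head`, and the feature dimension, are hypothetical stand-ins for whatever BC architecture is used.

```python
# Hypothetical: one linear head turns a BC policy into a minimalist world action model.
import torch.nn as nn

class WorldActionModel(nn.Module):
    def __init__(self, bc_policy, feat_dim=512, z_dim=64):
        super().__init__()
        self.bc_policy = bc_policy                   # existing BC policy trunk + action head
        self.rla_head = nn.Linear(feat_dim, z_dim)   # the only added parameters

    def forward(self, obs):
        feat = self.bc_policy.features(obs)          # shared visual features
        return self.bc_policy.action_head(feat), self.rla_head(feat)
```

On demos with actions, both heads can be supervised; on actionless videos, only the RLA head is trained, e.g., with targets $z$ produced by the frozen RLA encoder from $(s_t, s_{t+h})$.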

| Method | PushT | Roll Ball | Pull Cube | Pull Cube w. Tool | Poke Cube | Avg SR ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| BC-ResNet | 3.6 | 42.0 | 33.6 | 7.6 | 49.2 | 27.2 |
| **+RLA (Ours)** | **15.2** | **43.8** | **43.6** | **12.0** | **63.6** | **35.6** |
Success rates (%) improve over the BC baseline by learning from actionless videos via RLA.

Visual Reinforcement Learning inside RLA-WM

We show the first demonstration of visual RL inside a world model learned only from offline videos, which we refer to as World Model-based Reinforcement Learning (WMRL).
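A minimal sketch of the imagined-rollout loop at the core of WMRL: the policy acts, RLA-WM predicts the next latent state, and a reward is computed in latent space. The goal-distance reward and all interfaces here are illustrative assumptions, not the paper's reward definition.

```python
# Hypothetical RL loop inside the world model: no real environment steps,
# no handcrafted reward (here, an assumed latent goal-distance reward).
def imagined_rollout(policy, world_model, s0, s_goal, horizon=10):
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s)                        # action chunk from the current policy
        s = world_model.predict(s, a)        # imagined next state via RLA-WM
        r = -(s - s_goal).pow(2).mean()      # latent-space reward (assumption)
        traj.append((s, a, r))
    return traj  # feed into any actor-critic / policy-gradient update
```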

| Method | XArm (Poke Cube) | UR10e (Roll Ball) | UR10e (PushT) |
| --- | --- | --- | --- |
| BC-ResNet | 89.9 | 65.5 | 17.2 |
| **WMRL (Ours)** | **95.9** | **73.1** | **20.7** |

Citation

@article{zhang2026learning,
  title={{Learning Visual Feature-Based World Models via Residual Latent Action}},
  author={Zhang, Xinyu and Xu, Zhengtong and Tao, Yutian and Wang, Yeping and She, Yu and Boularias, Abdeslam},
  journal={arXiv preprint arXiv:2605.07079},
  year={2026},
  eprint={2605.07079},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}