- on: June 8, 2026
- in: arXiv
MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation
- * shared first author
@article{chandratreya2026millivid,
title = { MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation },
author = { Chandratreya, Ishaan Preetam and
Charatan, David and
Van Hoorick, Basile and
Zakharov, Sergey and
Guizilini, Vitor and
Isola, Phillip and
Sitzmann, Vincent },
year = { 2026 },
journal = { arXiv },
}
Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space.
Our approach first pre-trains an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture consequential information such as scene layout and semantics, while finer levels add high-frequency appearance and texture. We then train a video diffusion model to generate these tokens using coarse-to-fine rollout, preserving long-range consistency in geometry and object permanence while spending less compute on less perceptually relevant details. We validate this approach on a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
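The sequence-length argument above can be made concrete with a little token arithmetic. The sketch below is purely illustrative: the per-level token counts in `LEVELS`, the context lengths, and the idea that coarser levels are kept for more frames than finer ones are all assumptions made for this example, not figures or design details taken from the paper.

```python
# Illustrative token-budget arithmetic for a multi-scale token hierarchy.
# All numbers here are hypothetical and chosen only to show the scaling.

def sequence_length(num_frames: int, tokens_per_frame: int) -> int:
    """Transformer sequence length needed to attend over num_frames frames."""
    return num_frames * tokens_per_frame

# Hypothetical tokens per frame, ordered from finest to coarsest level.
LEVELS = [256, 64, 16, 4]

def hierarchy_budget(context_frames: list[int]) -> int:
    """Total tokens when context_frames[i] frames are kept at LEVELS[i]."""
    if len(context_frames) != len(LEVELS):
        raise ValueError("need one context length per level")
    return sum(
        sequence_length(frames, tokens)
        for frames, tokens in zip(context_frames, LEVELS)
    )

# Single-scale baseline: 64 frames, all at the finest resolution.
flat_tokens = sequence_length(64, LEVELS[0])

# Assumed mixed context: a short fine-detail window, a long coarse window.
mixed_tokens = hierarchy_budget([4, 8, 16, 64])

print(f"single-scale context: {flat_tokens} tokens")
print(f"multi-scale context:  {mixed_tokens} tokens")
```

Under these assumed numbers, the coarsest level still covers all 64 frames while the total sequence is an eighth of the single-scale one, which is the kind of trade-off the abstract describes: long-range structure is carried by a few tokens per frame, and expensive fine detail is limited to a short window.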