on: June 8, 2026
in: arXiv

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

*: shared first author

@article{chandratreya2026millivid,
    title = { MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation },
    author = { Chandratreya, Ishaan Preetam and 
               Charatan, David and 
               Van Hoorick, Basile and 
               Zakharov, Sergey and 
               Guizilini, Vitor and 
               Isola, Phillip and 
               Sitzmann, Vincent },
    year = { 2026 },
    journal = { arXiv },
}


Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space.

Our approach first pre-trains an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture consequential information such as scene layout and semantics, while finer levels add high-frequency appearance and texture. We then train a video diffusion model to generate these tokens using coarse-to-fine rollout, preserving long-range consistency in geometry and object permanence while spending less compute on less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
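
To make the sequence-length argument concrete, the following is a minimal sketch of the token arithmetic only, not the MilliVid implementation. The per-level token counts, the context budget, and the helper names `frames_in_context` and `coarse_to_fine_schedule` are all hypothetical and chosen purely for illustration. It shows how many whole frames fit in a fixed transformer context at each level of a token hierarchy, listed coarsest first.

```python
def frames_in_context(tokens_per_frame: int, context_tokens: int) -> int:
    """Number of whole frames that fit in a fixed transformer context."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return context_tokens // tokens_per_frame


def coarse_to_fine_schedule(levels, context_tokens):
    """Return (tokens_per_frame, frames_covered) pairs, coarsest level first."""
    ordered = sorted(levels)  # fewest tokens per frame = coarsest level
    return [(t, frames_in_context(t, context_tokens)) for t in ordered]


# Hypothetical values for illustration only; not taken from the paper.
LEVELS = [256, 64, 16, 4]   # tokens per frame at each hierarchy level
CONTEXT_TOKENS = 4096       # assumed transformer sequence-length budget

schedule = coarse_to_fine_schedule(LEVELS, CONTEXT_TOKENS)
for tokens, frames in schedule:
    print(f"{tokens:>4} tokens/frame -> {frames:>5} frames in context")
```

Under these assumed numbers, the coarsest level spans 1024 frames in one context while the finest spans only 16, which is the trade-off the coarse-to-fine rollout relies on: long-range structure is handled where frames are cheap, and fine detail is handled over shorter windows.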