Massachusetts Institute of Technology
  • on: June 8, 2026
  • in: arXiv

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

  • Ishaan Preetam Chandratreya *
  • David Charatan *
  • Basile Van Hoorick
  • Sergey Zakharov
  • Vitor Guizilini
  • Phillip Isola
  • Vincent Sitzmann
* shared first author
@article{chandratreya2026millivid,
    title = { MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation },
    author = { Chandratreya, Ishaan Preetam and 
               Charatan, David and 
               Van Hoorick, Basile and 
               Zakharov, Sergey and 
               Guizilini, Vitor and 
               Isola, Phillip and 
               Sitzmann, Vincent },
    year = { 2026 },
    journal = { arXiv },
}

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space.

Our approach first pre-trains an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution down to only a handful of tokens per frame. The coarsest levels capture consequential information such as scene layout and semantics, while finer levels add high-frequency appearance and texture. We then train a video diffusion model to generate these tokens using coarse-to-fine rollout, which preserves long-range consistency in geometry and object permanence while spending less compute on perceptually less relevant details. We validate this approach on a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
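
To make the sequence-length argument concrete, the following is a rough token-count sketch, not the paper's implementation. All numbers are hypothetical: the abstract does not state how many levels there are, how many tokens each level uses per frame, or how many frames each level attends to. The names `Level`, `sequence_length`, and `LEVELS` are invented for illustration. The sketch assumes that coarse levels can afford a long temporal context because they carry few tokens per frame, while fine levels only see a short window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    tokens_per_frame: int  # assumed value, not taken from the paper
    context_frames: int    # assumed temporal window at this level


def sequence_length(level: Level) -> int:
    """Tokens a transformer would attend over at this level."""
    return level.tokens_per_frame * level.context_frames


# Hypothetical hierarchy: few tokens over many frames at the coarse end,
# many tokens over few frames at the fine end.
LEVELS = [
    Level("coarse", tokens_per_frame=4, context_frames=256),
    Level("medium", tokens_per_frame=32, context_frames=32),
    Level("fine", tokens_per_frame=256, context_frames=4),
]

# Per-level sequence lengths under the assumed hierarchy.
budget = {level.name: sequence_length(level) for level in LEVELS}

# Single-scale baseline for comparison: every one of the 256 frames
# kept at the full (assumed) latent resolution of 256 tokens per frame.
flat_length = 256 * 256

print(budget)
print("hierarchical total:", sum(budget.values()))
print("single-scale total:", flat_length)
```

Under these assumed numbers, each level attends over 1,024 tokens, against 65,536 for the single-scale baseline over the same 256 frames. The real savings depend on the level sizes and context lengths the model actually uses, which this page does not specify.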