Massachusetts Institute of Technology
  • Published: Dec. 20, 2025
  • Venue: arXiv

Large Video Planner

  • Boyuan Chen
  • Tianyuan Zhang
  • Haoran Geng
  • Kiwhan Song
  • William T. Freeman
  • Jitendra Malik
  • Russ Tedrake
  • Vincent Sitzmann
  • Yilun Du
@article{chen2025largevideoplanner,
    title   = {Large Video Planner},
    author  = {Chen, Boyuan and Zhang, Tianyuan and Geng, Haoran and
               Song, Kiwhan and Freeman, William T. and Malik, Jitendra and
               Tedrake, Russ and Sitzmann, Vincent and Du, Yilun},
    journal = {arXiv},
    year    = {2025},
}

LVP is a video foundation model for robotics that generates video plans and deploys them on robots.

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. In this work, we explore an alternative paradigm that uses large-scale video pretraining as the primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robot behavior. We curate an internet-scale video dataset of human activities and task demonstrations and train an open video model for generative robot planning, for the first time at foundation-model scale. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions.
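To make the deployment pipeline described above concrete (generate a video plan, then post-process it into robot actions), the sketch below wires the two stages together with dummy stand-ins. All function names, array shapes, and the 7-dimensional action space are hypothetical placeholders for illustration, not the released LVP interface; the action-extraction stand-in only hints at what a learned post-processing module (e.g., an inverse-dynamics-style model) would do.

import numpy as np

def generate_video_plan(observation, task_prompt, horizon=16):
    # Stand-in for the pretrained video model: it would predict future frames
    # conditioned on the current observation and a language task prompt.
    # Here we simply repeat the observation as a dummy placeholder.
    return np.stack([observation] * horizon, axis=0)

def extract_actions(frames, action_dim=7):
    # Stand-in for the post-processing step that converts consecutive
    # predicted frames into low-level robot actions. The zero actions
    # returned here are purely illustrative.
    return np.zeros((len(frames) - 1, action_dim))

# Example: plan from a dummy 64x64 RGB observation for a manipulation prompt.
obs = np.zeros((64, 64, 3), dtype=np.uint8)
frames = generate_video_plan(obs, "pick up the red block and place it in the bowl")
actions = extract_actions(frames)
print(frames.shape, actions.shape)  # (16, 64, 64, 3) (15, 7)

The point of the sketch is only the data flow: the video model plans in pixel space, and a separate module turns each pair of adjacent predicted frames into an action the robot can execute.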