- on: Dec. 20, 2025
- in: arXiv
Large Video Planner
@article{chen2025largevideoplanner,
  title   = { Large Video Planner },
  author  = { Chen, Boyuan and
              Zhang, Tianyuan and
              Geng, Haoran and
              Song, Kiwhan and
              Freeman, William T. and
              Malik, Jitendra and
              Tedrake, Russ and
              Sitzmann, Vincent and
              Du, Yilun },
  year    = { 2025 },
  journal = { arXiv },
}

LVP is a video foundation model for robotics that generates video plans and deploys them on robots.
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. In this work, we explore an alternative paradigm: large-scale video pretraining as the primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations and use it to train, at foundation-model scale, the first open video model for generative robot planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions.
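A minimal sketch of the plan-then-act pipeline described above, assuming a hypothetical `VideoPlanner` interface and a trivial placeholder for the action-extraction step (the released model, its conditioning, and its post-processing may differ):

```python
"""Sketch of a video-plan-to-action pipeline. All names here
(VideoPlanner, generate_plan, extract_actions) are illustrative
assumptions, not the released LVP API."""
import numpy as np


class VideoPlanner:
    """Stand-in for a pretrained video generation model (hypothetical)."""

    def generate_plan(self, first_frame: np.ndarray, task: str,
                      horizon: int = 16) -> np.ndarray:
        # A real model would predict future frames conditioned on the
        # current observation and the language task description; here we
        # simply tile the input frame as a placeholder plan.
        return np.repeat(first_frame[None], horizon, axis=0)


def extract_actions(frames: np.ndarray) -> np.ndarray:
    """Post-process a generated video plan into per-step action vectors.

    A frame-difference heuristic stands in for the paper's action
    extraction (e.g. an inverse-dynamics model would be used in practice).
    """
    diffs = frames[1:].astype(np.float32) - frames[:-1].astype(np.float32)
    # Reduce each frame-to-frame difference to a 1-D "action" value.
    return diffs.reshape(len(diffs), -1).mean(axis=1, keepdims=True)


if __name__ == "__main__":
    obs = np.zeros((64, 64, 3), dtype=np.uint8)   # current camera frame
    planner = VideoPlanner()
    plan = planner.generate_plan(obs, task="pick up the red cup")
    actions = extract_actions(plan)               # one action per transition
    print(plan.shape, actions.shape)              # (16, 64, 64, 3) (15, 1)
```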