Massachusetts Institute of Technology
  • on: Jan. 2, 2023
  • in: ICLR

Neural Groundplans: Persistent Neural Scene Representations from a Single Image

  • Prafull Sharma
  • Ayush Tewari
  • Yilun Du
  • Sergey Zakharov
  • Rares Ambrus
  • Adrien Gaidon
  • William T. Freeman
  • Frédo Durand
  • Joshua B. Tenenbaum
  • Vincent Sitzmann
@inproceedings{sharma2023neural,
    title = { Neural Groundplans: Persistent Neural Scene Representations from a Single Image },
    author = { Sharma, Prafull and 
               Tewari, Ayush and 
               Du, Yilun and 
               Zakharov, Sergey and 
               Ambrus, Rares and 
               Gaidon, Adrien and 
               Freeman, William T. and 
               Durand, Frédo and 
               Tenenbaum, Joshua B. and 
               Sitzmann, Vincent },
    year = { 2023 },
    booktitle = { ICLR },
}

We present a method to map 2D image observations of a scene to a persistent 3D scene representation, enabling novel view synthesis and disentangled representation of the movable and immovable components of the scene. Motivated by the bird’s-eye-view (BEV) representation commonly used in vision and robotics, we propose conditional neural groundplans, ground-aligned 2D feature grids, as persistent and memory-efficient scene representations. Our method is trained self-supervised from unlabeled multi-view observations using differentiable rendering, and learns to complete geometry and appearance of occluded regions. In addition, we show that we can leverage multi-view videos at training time to learn to separately reconstruct static and movable components of the scene from a single image at test time. The ability to separately reconstruct movable objects enables a variety of downstream tasks using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance-level segmentation, 3D bounding box prediction, and scene editing. This highlights the value of neural groundplans as a backbone for efficient 3D scene understanding models.
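
To make the idea of a ground-aligned 2D feature grid concrete, the following is a minimal sketch of how such a grid could be queried at arbitrary 3D points. The abstract does not specify the lookup, so everything here is an assumption for illustration: the function name `query_groundplan`, the choice of y as the vertical axis, the `extent` parameter, and the use of bilinear interpolation with the point's height appended to the sampled feature. The image encoder that would produce the grid and the decoder that would turn these features into density and color for differentiable rendering are not shown.

```python
import numpy as np

def query_groundplan(grid, points, extent=1.0):
    """Sample a ground-aligned feature grid at 3D points (illustrative sketch).

    grid:   (H, W, C) features laid out over the x-z ground plane, H, W >= 2.
    points: (N, 3) array of (x, y, z); y is assumed to be the vertical axis.
    extent: the grid is assumed to cover [-extent, extent] along x and z.
    Returns an (N, C + 1) array: interpolated features with the height appended.
    """
    H, W, _ = grid.shape
    # Project onto the ground plane: x maps to columns, z maps to rows.
    u = np.clip((points[:, 0] + extent) / (2 * extent) * (W - 1), 0, W - 1)
    v = np.clip((points[:, 2] + extent) / (2 * extent) * (H - 1), 0, H - 1)
    # Top-left corner of the enclosing cell, kept inside the grid.
    u0 = np.minimum(np.floor(u).astype(int), W - 2)
    v0 = np.minimum(np.floor(v).astype(int), H - 2)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    # Bilinear interpolation of the four surrounding grid features.
    feats = ((1 - du) * (1 - dv) * grid[v0, u0]
             + du * (1 - dv) * grid[v0, u0 + 1]
             + (1 - du) * dv * grid[v0 + 1, u0]
             + du * dv * grid[v0 + 1, u0 + 1])
    # The grid stores no vertical information, so the height is passed along.
    return np.concatenate([feats, points[:, 1:2]], axis=1)

# Toy grid whose single feature channel equals the column index.
grid = np.tile(np.arange(3, dtype=float)[None, :, None], (3, 1, 1))
points = np.array([[0.0, 0.5, 0.0],    # grid centre, height 0.5
                   [0.0, 2.0, 0.0],    # same ground location, higher up
                   [0.5, 0.0, 0.0]])   # halfway between columns 1 and 2
out = query_groundplan(grid, points)
```

In this sketch, points that share a ground location read the same grid feature and differ only in the appended height, which is one way a 2D grid can stand in for a 3D volume at a fraction of the memory.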