ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers

Jonáš Kulhánek, Czech Technical University in Prague
Erik Derner, Czech Technical University in Prague
Torsten Sattler, Czech Technical University in Prague
Robert Babuška, Delft University of Technology

Abstract

Novel view synthesis is a long-standing problem. In this work, we consider a variant of the problem where we are given only a few context views sparsely covering a scene or an object. The goal is to predict novel viewpoints in the scene, which requires learning priors. The current state of the art is based on Neural Radiance Fields (NeRFs). While achieving impressive results, NeRF-based approaches suffer from long training times as they require evaluating thousands of 3D point samples via a deep neural network for each image. We propose a simple 2D-only method that maps context views and a query pose to a new image in a single pass of a neural network. Our model uses a two-stage architecture consisting of a codebook and a transformer model. The codebook is used to embed individual images into a smaller latent space, and the transformer solves the view synthesis task in this more compact space. To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation. Experimental results on real-world scenes show that our approach is competitive with NeRF-based methods while not reasoning in 3D, and it is faster to train.

Figure: ViewFormer architecture overview.
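To make the two-stage design described above more concrete, the following is a minimal PyTorch-style sketch of such a pipeline. It is only an illustration of the idea, not the implementation used in the paper: all class names, layer choices, and hyperparameters are assumptions, the paper's branching attention is replaced by plain self-attention, and positional information is omitted for brevity.

# Minimal sketch of a codebook + transformer view-synthesis pipeline.
# All names, shapes, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class Codebook(nn.Module):
    """Stage 1: embeds each image into a short sequence of discrete codes (VQ-style)."""

    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        # Placeholder encoder/decoder; a real model would use a deeper CNN.
        self.encoder = nn.Conv2d(3, code_dim, kernel_size=16, stride=16)
        self.decoder = nn.ConvTranspose2d(code_dim, 3, kernel_size=16, stride=16)
        self.embedding = nn.Embedding(num_codes, code_dim)

    def encode(self, images):
        # images: (B, 3, H, W) -> code indices (B, T), one code per 16x16 patch
        feats = self.encoder(images).flatten(2).transpose(1, 2)   # (B, T, C)
        # Nearest codebook entry by squared Euclidean distance.
        dists = (feats.pow(2).sum(-1, keepdim=True)
                 - 2 * feats @ self.embedding.weight.t()
                 + self.embedding.weight.pow(2).sum(-1))
        return dists.argmin(dim=-1)                               # (B, T)

    def decode(self, codes):
        # codes: (B, T) -> reconstructed images (B, 3, H, W)
        feats = self.embedding(codes)                             # (B, T, C)
        side = int(feats.shape[1] ** 0.5)
        feats = feats.transpose(1, 2).reshape(feats.shape[0], -1, side, side)
        return self.decoder(feats)

class ViewTransformer(nn.Module):
    """Stage 2: maps context-view codes plus a query pose to the codes of the novel view."""

    def __init__(self, num_codes=1024, dim=256, num_layers=6, num_heads=8, pose_dim=7):
        super().__init__()
        self.token_embed = nn.Embedding(num_codes, dim)
        self.pose_embed = nn.Linear(pose_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, context_codes, query_pose, target_tokens=64):
        # context_codes: (B, N_ctx * T), query_pose: (B, pose_dim)
        tokens = self.token_embed(context_codes)                  # (B, N_ctx*T, dim)
        query = self.pose_embed(query_pose).unsqueeze(1)          # (B, 1, dim)
        query = query.expand(-1, target_tokens, -1)               # one slot per target code
        out = self.transformer(torch.cat([tokens, query], dim=1))
        return self.head(out[:, -target_tokens:])                 # (B, T, num_codes) logits

# Usage sketch: encode two context views, predict the codes of the queried
# viewpoint, and decode them back to an image.
codebook, model = Codebook(), ViewTransformer()
context = torch.rand(1, 2, 3, 128, 128)                           # two 128x128 context views
codes = codebook.encode(context.flatten(0, 1)).reshape(1, -1)     # (1, 2*64) code indices
query_pose = torch.rand(1, 7)                                     # e.g. position + quaternion
novel_codes = model(codes, query_pose).argmax(-1)                 # (1, 64) predicted codes
novel_image = codebook.decode(novel_codes)                        # (1, 3, 128, 128)

Training such a sketch would minimize a cross-entropy loss between the predicted code logits and the codes produced by the codebook for the ground-truth target view, while the codebook itself is trained separately on image reconstruction.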

Video

Common Objects in 3D dataset results

In this section, we present qualitative results for the Common Objects in 3D (CO3D) dataset [1]. The video shows the model trained on all ten classes used in [1]. Notice how the quality of the generated images improves as the context size increases.

ShapeNet dataset results

We also present results on the ShapeNet dataset, where ViewFormer is compared with the PixelNeRF method [4].

InteriorNet dataset results

In this section, we present qualitative results for the InteriorNet dataset [2]. Notice that some details in the images are lost because the model cannot represent the scene perfectly.

7-Scenes dataset results

The 7-Scenes dataset [3] was used primarily to evaluate camera pose estimation (results are given in the paper). However, we also show qualitative results for the novel view synthesis task. In this case, the model overfit the training data, which resulted in lower-quality generated images. We show images from all seven scenes.

References

[1] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.
[2] Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. InteriorNet: Mega-scale multi sensor photo-realistic indoor scenes dataset. In British Machine Vision Conference (BMVC), 2018.
[3] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In CVPR, 2013.
[4] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. PixelNeRF: Neural Radiance Fields from One or Few Images. In CVPR, 2021.

Citation

Please use the following citation:
@inproceedings{kulhanek2022viewformer,
  title={ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers},
  author={Kulh{\'a}nek, Jon{\'a}{\v{s}} and Derner, Erik and Sattler, Torsten and Babu{\v{s}}ka, Robert},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022},
}