In situ semantic reconstruction for spontaneous augmented reality


The thesis will take place in Nancy in the TANGRAM team, a joint team between Inria and the University of Lorraine, within the UMR 7503 Loria. It will be supervised by Marie-Odile Berger, research director at Inria, and Gilles Simon, HDR lecturer at the University of Lorraine.


We wish to take advantage of the semantic information now widely available in scenes via machine learning to obtain these reconstructions. Our approach is in line with semantic SLAM, which takes many forms (see [3]) and constructs 3D maps with semantic content, but it adds the idea of reconstructing a complete entity with semantic and topological properties. In this thesis, we wish to study two approaches. The first is bottom-up: it starts from 3D geometry and semantic information and develops elaborate fusion mechanisms. The second will try to build an entity directly from the information present in the scene by integrating the semantic and topological constraints found in urban constructions. The problem of how to formulate shape constraints and integrate them into a reconstruction during a temporal observation process will be at the heart of this thesis.

The first approach will build on the state of the art in semantic reconstruction operating on RGB-D images [4,5]. These methods fuse depth maps from a Kinect or a CNN (Convolutional Neural Network) to generate dense 3D semantic meshes. Geometry extraction is most often independent of semantic information extraction [4], but some authors take advantage of the correlations between the two types of information, depth and semantics, to jointly estimate geometry and semantics using an end-to-end CNN [5]. In this approach, topological priors should be integrated into the convolutional network generating the 3D mesh, in the spirit of [6], and geometric primitives should then be extracted from the resulting mesh.
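To make the idea of fusing depth and semantic observations concrete, here is a minimal Python sketch. It is not the method of [4] or [5]: it assumes a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy), a camera frame that coincides with the world frame, and a simple recursive Bayesian update of per-voxel class probabilities. The names SemanticVoxelMap, backproject and voxel_key are illustrative only.

```python
import math
from collections import defaultdict


def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with known depth into camera coordinates (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def voxel_key(point, voxel_size):
    """Quantize a 3D point to the integer index of the voxel that contains it."""
    return tuple(math.floor(c / voxel_size) for c in point)


class SemanticVoxelMap:
    """Sparse voxel grid storing one class-probability vector per observed voxel."""

    def __init__(self, voxel_size, num_classes):
        self.voxel_size = voxel_size
        self.num_classes = num_classes
        # Every voxel starts from a uniform prior over the classes.
        self.probs = defaultdict(lambda: [1.0 / num_classes] * num_classes)

    def fuse(self, point, class_probs):
        """Multiply the stored distribution by the new observation and renormalize."""
        key = voxel_key(point, self.voxel_size)
        posterior = [p * q for p, q in zip(self.probs[key], class_probs)]
        total = sum(posterior)
        if total > 0.0:
            self.probs[key] = [p / total for p in posterior]
        return key

    def label(self, key):
        """Return the index of the most probable class for a voxel."""
        probs = self.probs[key]
        return max(range(self.num_classes), key=probs.__getitem__)


# Three observations of the same surface point, e.g. from three frames.
voxel_map = SemanticVoxelMap(voxel_size=0.05, num_classes=3)
point = backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
key = voxel_map.fuse(point, [0.6, 0.3, 0.1])
voxel_map.fuse(point, [0.2, 0.7, 0.1])
voxel_map.fuse(point, [0.2, 0.7, 0.1])
fused_label = voxel_map.label(key)
```

In this toy run the first frame favours class 0, but the two later frames outweigh it, so the fused label is class 1. A real system would also transform each point by the estimated camera pose before quantizing it.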

The second approach will exploit RGB images associated with 2D semantic maps to obtain CAD models directly, without using a 3D mesh. Intensity gradients, semantic labels and vanishing points [7] will be used to detect semantic primitives and reconstruct them in 3D, starting from a first RGB image. As such primitives can be used immediately to compute the camera motion, the initial model can be amended as the user's movements reveal parts of the scene that were not visible at the beginning. This strategy is similar to the method of [8], whose manual steps will be automated, and to semantic visual SLAM techniques [3]. It makes it possible to control the extension of the model, taking into account topological constraints and uncertainty measures.
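As an illustration of the role of vanishing points, the sketch below computes the vanishing point defined by two image segments that are assumed to be projections of parallel 3D edges. It is only the textbook homogeneous-coordinates construction (a line is the cross product of two points, an intersection is the cross product of two lines), not the a-contrario detector of [7]. The function names and the segment coordinates are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line passing through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(seg_a, seg_b, eps=1e-9):
    """Intersect the lines supporting two segments.

    Returns (x, y) in pixels, or None when the two lines are parallel
    in the image, i.e. the vanishing point lies at infinity.
    """
    x, y, w = cross(line_through(*seg_a), line_through(*seg_b))
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Two segments whose supporting lines both pass through (100, 50).
vp = vanishing_point(((0.0, 0.0), (50.0, 25.0)), ((0.0, 100.0), (50.0, 75.0)))

# Two horizontal segments: their lines never meet in the image plane.
vp_at_infinity = vanishing_point(((0.0, 0.0), (10.0, 0.0)), ((0.0, 5.0), (10.0, 5.0)))
```

With noisy detections, more than two segments would normally be combined, for example through a least-squares or robust estimate, instead of relying on a single pair.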


The objective of this thesis is to allow a user to incrementally reconstruct, in three dimensions, an architectural environment by physically moving in the scene. The generated model will consist of surface or volume primitives (rectangles, boxes, ellipsoids, etc.), associated with semantic classes (ground, façade, building, tree, etc.). This type of representation is particularly well suited to CAD (Computer Aided Design), 3D computer graphics and GIS, where the objects manipulated must be both compact and informative. It is also of great interest for augmented reality (AR), since it facilitates the registration of the 3D model in video images [1,2] and the placement of virtual objects in relation to objects in the real scene.

In situ reconstruction has the advantage of allowing users to visually follow the course of operations, in accordance with the WYSIWYG (What You See Is What You Get) paradigm, and to influence it by choosing their movements around the parts of the scene to be modeled. It also opens the way to spontaneous AR, which must operate in environments that are discovered while being augmented. This concerns, for example, the collaborative exploration of places or the rapid prototyping of urban or interior design projects.


[1] Matthieu Zins, Gilles Simon, Marie-Odile Berger. Object-Based Visual Camera Pose Estimation From Ellipsoidal Model and 3D-Aware Ellipse Prediction. International Journal of Computer Vision, Springer Verlag, 2022.

[2] Antoine Fond, Marie-Odile Berger, Gilles Simon. Model-image registration of a building’s facade based on dense semantic segmentation. Computer Vision and Image Understanding, Elsevier, 2021, 206, pp. 103-185.

[3] Linlin Xia, Jiashuo Cui, Ran Shen, Xun Xu, Yiping Gao, Xinying Li. A survey of image semantics-based visual simultaneous localization and mapping: Application-oriented solutions to autonomous navigation of mobile robots. International Journal of Advanced Robotic Systems, May-June 2020: 1–17.

[4] Q.-H. Pham, B.-S. Hua, T. Nguyen, S.-K. Yeung. Real-time progressive 3d semantic segmentation for indoor scenes. IEEE WACV, 2019, pp. 1089–1098.

[5] Davide Menini, Suryansh Kumar, Martin R. Oswald, Erik Sandstrom, Cristian Sminchisescu, Luc Van Gool. A Real-Time Online Learning Framework for Joint 3D Reconstruction and Semantic Segmentation of Indoor Scenes. IEEE Robotics and Automation Letters, 2021.

[6] Qimin Chen, Vincent Nguyen, Feng Han, Raimondas Kiveris, Zhuowen Tu. Topology-Aware Single-Image 3D Shape Reconstruction. IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2020.

[7] Gilles Simon, Antoine Fond, Marie-Odile Berger. A-Contrario Horizon-First Vanishing Point Detection Using Second-Order Grouping Laws. European Conference on Computer Vision, Sep 2018, Munich, Germany. pp. 323-338.

[8] Gilles Simon, Marie-Odile Berger. Interactive Building and Augmentation of Piecewise Planar Environments Using the Intersection Lines. The Visual Computer, Springer Verlag, 2011, 27 (9), pp. 827-841.


Semantic SLAM, convolutional neural networks, 3D reconstruction, augmented reality