PhD thesis: Localization in a scene of objects

Project-team: MAGRIT (Nancy)

Advisors: Marie-Odile Berger, Gilles Simon

Start date: September 1st

Keywords: visual localization, machine learning, augmented reality

Scientific Context:

This PhD thesis addresses visual localization in large environments using objects as features. The targeted applications are in augmented reality, especially in urban or industrial contexts. In recent years, research on pose estimation has been dominated by convolutional neural networks (CNNs). Thanks to these methods, the pose of an object can now be inferred directly from its appearance instead of from identified individual surface points [3,4]. These approaches, however, require an accurate, textured 3D model for the learning stage. On the other hand, using objects rather than traditional key-points as features for pose computation has emerged recently [1,2]. Based on the automatic detection of objects in 2D images and on the approximation of their 3D shapes with boxes or ellipsoids, these methods are less sensitive to local changes in appearance and to the presence of repeated patterns. Although proofs of concept of such systems exist, the transition to real scenes is not straightforward.
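To make the ellipsoid approximation concrete, the sketch below projects a 3D ellipsoid into an image using the standard dual-quadric relation C* ~ P Q* Pᵀ from projective geometry. It is only an illustration under assumed conventions (pinhole camera, NumPy, hypothetical function names `ellipsoid_dual_quadric` and `project_ellipsoid`); it is not taken from the methods of [1,2].

```python
import numpy as np

def ellipsoid_dual_quadric(center, semi_axes, R=None):
    """Dual quadric (4x4) of an ellipsoid given its center, semi-axes and orientation R."""
    R = np.eye(3) if R is None else np.asarray(R, dtype=float)
    D = np.diag(np.append(np.square(np.asarray(semi_axes, dtype=float)), -1.0))
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(center, dtype=float)
    return T @ D @ T.T

def project_ellipsoid(P, Q_dual):
    """Project a dual quadric with the 3x4 camera matrix P.

    Returns the center and the semi-axes (ascending) of the image ellipse.
    Assumes the camera center lies outside the ellipsoid.
    """
    C = P @ Q_dual @ P.T          # dual conic, defined up to scale
    C = -C / C[2, 2]              # normalise so that C[2, 2] == -1
    center = -C[:2, 2]
    shape = C[:2, :2] + np.outer(center, center)
    semi_axes = np.sqrt(np.linalg.eigvalsh(shape))
    return center, semi_axes

# Illustrative values (assumptions): a unit sphere 5 m in front of the camera.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
Q = ellipsoid_dual_quadric(center=[0.0, 0.0, 5.0], semi_axes=[1.0, 1.0, 1.0])
ellipse_center, ellipse_axes = project_ellipsoid(P, Q)
```

In this special case the sphere sits on the optical axis, so the image ellipse is a circle centered on the principal point, which gives a quick sanity check of the formula.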

The objective of the thesis is to extend these methods to real, large environments where models are known only with limited accuracy and only relatively small image datasets are available. This work will mainly focus on techniques for rough re-localization, without any prior knowledge of the camera pose. This is a common practical case in GPS-denied environments.

Missions:

Our aim is to design robust object-based localization methods for real environments, either at a local level (i.e. when the pose is computed from one object) or at a global level (i.e. when a set of approximated objects is used for pose computation).

The following lines of research will be addressed:

  • At the local level, when one object is considered: a recent and promising trend in pose computation is to predict the 2D projections of the corners of a 3D bounding box (BB) of the object [3,4] using convolutional networks. In practice, accurate models are required during the training stage to generate images of the object with various backgrounds, so that the network is not influenced by the scene context. Extending such work to real datasets first requires studying how the choice of the BB influences the results and defining an appropriate way to choose it. Second, methods have to be defined to generate synthetic images and combine them with real images for training.

  • At an intermediate level, methods that take advantage of both object detection and classical key-point matching will be designed. A key difficulty is that the two kinds of features do not have the same accuracy. In the case of object detection, defining the accuracy of a detection is in itself a problem.

  • Currently, image-model association is based on a set of predefined object classes. Procedures for the automatic detection and reconstruction of prominent objects that can contribute to the robustness of pose computation will be another focus of this work.
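As an illustration of the local-level idea in the first item above, the sketch below shows how the 2D projections of the eight corners of a 3D bounding box relate to the camera pose, and how a pose can be recovered from them. It is a minimal sketch under stated assumptions: the 2D corners are simulated by projection rather than predicted by a network, the box is axis-aligned, the intrinsics K are known, and a plain direct linear transform (DLT) is used in place of the PnP solvers typically employed in [3,4]. All function names are hypothetical.

```python
import numpy as np

def box_corners(center, size):
    """Return the 8 corners (8x3) of an axis-aligned 3D bounding box."""
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    return np.asarray(center, dtype=float) + 0.5 * signs * np.asarray(size, dtype=float)

def project(K, R, t, X):
    """Pinhole projection of Nx3 world points to Nx2 pixel coordinates."""
    Xc = X @ R.T + t              # world frame -> camera frame
    uv = Xc @ K.T                 # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]

def pose_from_corners(K, X, uv):
    """Estimate (R, t) from >= 6 non-coplanar 3D-2D correspondences with a DLT."""
    n = len(X)
    # Remove the intrinsics: work in normalised image coordinates.
    xn = np.linalg.solve(K, np.column_stack([uv, np.ones(n)]).T).T
    Xh = np.column_stack([X, np.ones(n)])
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -xn[:, 0:1] * Xh
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -xn[:, 1:2] * Xh
    # The null vector of A is [R | t] up to scale and sign.
    M = np.linalg.svd(A)[2][-1].reshape(3, 4)
    if np.sum(Xh @ M[2]) < 0:     # points must lie in front of the camera
        M = -M
    U, S, Vt = np.linalg.svd(M[:, :3])
    R = U @ Vt                    # closest rotation to the scaled 3x3 block
    t = M[:, 3] / S.mean()
    return R, t

# Illustrative values (assumptions): a 1 x 2 x 3 m box seen from about 5 m.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.3
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.1, -0.2, 5.0])

corners_3d = box_corners([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
corners_2d = project(K, R_true, t_true, corners_3d)   # stand-in for network output
R_est, t_est = pose_from_corners(K, corners_3d, corners_2d)
```

With noise-free corners the pose is recovered almost exactly. With predicted corners, the recovered pose depends directly on how the BB was defined, which is one reason the choice of the BB matters.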


[1] M. Crocco, C. Rubino, and A. Del Bue. Structure from motion with objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[2] J. Li, D. Meger, and G. Dudek. Context-coherent scenes of objects for camera pose estimation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 655–660, Sep. 2017.

[3] M. Oberweger, M. Rad, and V. Lepetit. Making deep heatmaps robust to partial occlusions for 3d object pose estimation. CoRR, abs/1804.03959, 2018.

[4] B. Tekin, S. N. Sinha, and P. Fua. Real-time seamless single shot 6d object pose prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 292–301, June 2018.

Skills and profile:

Required qualification: Master's degree in computer science

Skills: computer vision, machine learning.

Additional information:

See for additional information of the activities of the team.

Supervision and contact:

