Author: stachnis

2019-12: Emanuele Palazzolo Defended His PhD Thesis


Mapping the environment to build a 3D model that represents it is traditionally carried out by trained personnel using measuring equipment such as cameras or terrestrial laser scanners. This process is often expensive and time-consuming. Using a robotic platform for this purpose can simplify the process and enables the use of 3D models for consumer applications or in environments inaccessible to human operators. However, fully autonomous 3D reconstruction is a complex task and is the focus of several open research topics.
In this thesis, we address some of the open problems in active 3D environment reconstruction. To solve this task, a robot should autonomously determine the best positions from which to record measurements and integrate these measurements into a model while exploring the environment. In this thesis, we first address the task of integrating the measurements from a sensor in real-time into a dense 3D model. Second, we focus on where the sensor should be placed to explore an unknown environment by recording the necessary measurements as efficiently as possible. Third, we relax the assumption of a static environment, which is typically made in active 3D reconstruction. Specifically, we target long-term changes in the environment and address how to identify them online with an exploring robot in order to integrate them into an existing 3D model. Finally, we address the problem of identifying and dealing with dynamic elements in the environment while recording the measurements.

In the first part of this thesis, we assume the environment to be static and solve the first two problems. We propose an approach to real-time 3D reconstruction using a consumer RGB-D sensor. A particular focus of our approach is its efficiency in terms of both execution time and memory consumption. Moreover, our method is particularly robust in situations where structural cues are insufficient. Additionally, we propose an approach to iteratively compute the next best viewpoint for the sensor so as to maximize the information obtained from the measurements. Our algorithm is tailored to micro aerial vehicles (MAVs) and takes into account the specific limitations of this kind of robot.
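The iterative next-best-view idea can be illustrated with a toy sketch. All names and the information-gain measure below are hypothetical illustrations of the principle, not the MAV-specific algorithm from the thesis: greedily pick the candidate viewpoint expected to observe the most still-unknown cells.

```python
# Toy next-best-view selection: greedily pick the candidate viewpoint
# that would observe the largest number of still-unknown cells.
# Illustration of the principle only, not the thesis algorithm.

def next_best_view(candidates, unknown_cells, visible_from):
    """candidates: list of viewpoint ids.
    unknown_cells: set of unexplored cell ids.
    visible_from: dict viewpoint -> set of cell ids it would observe."""
    def info_gain(v):
        return len(visible_from[v] & unknown_cells)
    return max(candidates, key=info_gain)

# Tiny example: viewpoint "b" sees the most unknown cells.
visible = {"a": {1, 2}, "b": {2, 3, 4}, "c": {4}}
unknown = {2, 3, 4}
print(next_best_view(["a", "b", "c"], unknown, visible))  # -> b
```

In an actual exploration loop, the robot would move to the selected view, integrate the new measurements, update the set of unknown cells, and repeat until no candidate offers sufficient gain.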

In the second part of this work, we focus on non-static environments and address the last two problems. We deal with long-term changes by proposing an approach that identifies the regions of a 3D model that have changed from a short sequence of images. Our method is fast enough to run online on a mapping robot, which can then focus its effort on the parts of the environment that have changed. Finally, we address the problem of mapping fully dynamic environments by proposing an online 3D reconstruction approach that identifies and filters out dynamic elements in the measurements.

In sum, this thesis makes several contributions in the context of robotic map building and dealing with change. Compared to the current state of the art, the approaches presented in this thesis allow for more robust real-time tracking of RGB-D sensors, including the ability to deal with dynamic scenes. Moreover, this work provides a new, more efficient viewpoint selection technique for MAV exploration, and an efficient online change detection approach operating on 3D models from images that is substantially faster than comparable existing methods. Thus, we have advanced the state of the art in the field with respect to both robustness and efficiency.

2019-09: Olga Vysotska defended her PhD Thesis

Olga Vysotska successfully defended her PhD thesis entitled “Visual Place Recognition in Changing Environments” at the Photogrammetry & Robotics Lab of the University of Bonn.


Localization is an essential capability of mobile robots, and place recognition is an important component of localization. Only with precise localization can robots reliably plan, navigate, and understand the environment around them. The main task of visual place recognition algorithms is to recognize, based on the visual input, whether the robot has previously seen a given place in the environment. Cameras are among the most popular sensors robots get information from: they are lightweight, affordable, and provide detailed descriptions of the environment in the form of images. Cameras have proven useful for a wide variety of emerging applications, from virtual and augmented reality to autonomous cars or even fleets of autonomous cars. All these applications need precise localization. Nowadays, state-of-the-art methods are able to reliably estimate the position of a robot using image streams. One major remaining challenge is localizing a camera given an image stream in the presence of drastic visual appearance changes in the environment. Visual appearance changes may have a variety of causes: camera-related factors, such as changes in exposure time; camera position-related factors, e.g., the scene being observed from a different position or viewing angle, or under occlusions; as well as factors that stem from natural sources, for example seasonal changes, different weather conditions, or illumination changes. These effects change the way the same place in the environment appears in the image and can lead to situations where it becomes hard even for humans to recognize the place. The performance of traditional visual localization approaches, such as FAB-MAP or DBoW, also decreases dramatically in the presence of strong visual appearance changes.

The techniques presented in this thesis aim at improving visual place recognition capabilities for robotic systems in the presence of dramatic visual appearance changes. To reduce the effect of visual changes on image matching performance, we exploit sequences of images rather than individual images. This becomes possible as robotic systems collect data sequentially and not in random order. We formulate the visual place recognition problem under strong appearance changes as a problem of matching image sequences collected by a robotic system at different points in time. A key insight here is the fact that matching sequences reduces the ambiguities in the data associations. This allows us to establish image correspondences between different sequences and thus recognize if two images represent the same place in the environment. To perform a search for image correspondences, we construct a graph that encodes the potential matches between the sequences and at the same time preserves the sequentiality of the data. The shortest path through such a data association graph provides the valid image correspondences between the sequences.
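The graph-based matching described above can be illustrated with a minimal sketch. The similarity matrix, the costs, and the plain Dijkstra search below are hypothetical stand-ins, not the descriptors or cost functions used in the thesis: each node pairs a query image with a reference image, edges only move forward in both sequences (which is what preserves sequentiality), and the cheapest path yields the image correspondences.

```python
import heapq

# Minimal sketch of sequence matching as a shortest path through a data
# association graph. Node (i, j) pairs query image i with reference
# image j; edge cost is 1 - similarity, so visually similar pairs are
# cheap to traverse. Toy values only.

def match_sequences(sim):
    """sim[i][j] in [0, 1]: similarity of query image i and reference
    image j. Returns the lowest-cost monotone path of (i, j) pairs."""
    n, m = len(sim), len(sim[0])
    start, goal = (0, 0), (n - 1, m - 1)
    dist = {start: 1.0 - sim[0][0]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if d > dist.get((i, j), float("inf")):
            continue  # stale heap entry
        for di, dj in ((1, 0), (0, 1), (1, 1)):  # forward moves only
            ni, nj = i + di, j + dj
            if ni < n and nj < m:
                nd = d + (1.0 - sim[ni][nj])
                if nd < dist.get((ni, nj), float("inf")):
                    dist[(ni, nj)] = nd
                    prev[(ni, nj)] = (i, j)
                    heapq.heappush(heap, (nd, (ni, nj)))
    # Reconstruct the path of correspondences.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

sim = [[0.9, 0.1, 0.1],
       [0.2, 0.8, 0.2],
       [0.1, 0.2, 0.9]]
print(match_sequences(sim))  # -> [(0, 0), (1, 1), (2, 2)]
```

The diagonal result reflects the intuition: when both sequences traverse the same route at similar speed, the cheapest path follows the high-similarity diagonal of the matrix.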

Robots operating reliably in an environment should be able to recognize a place in an online manner, not only after having recorded all the data beforehand. As opposed to collecting image sequences and then determining the associations between them offline, a real-world system should be able to make a decision for every incoming image. In this thesis, we therefore propose an algorithm that performs visual place recognition in changing environments in an online fashion between the query and the previously recorded reference sequences. For every incoming query image, our algorithm checks whether the robot is in the previously seen environment, i.e., whether there exists a matching image in the reference sequence, and whether the current measurement is consistent with previously obtained query images.

Additionally, to be able to recognize places in an online manner, a robot needs to recognize that it has left the previously mapped area, as well as relocalize when it re-enters the environment covered by the reference sequence. Thus, we relax the assumption that the robot always travels within the previously mapped area and propose an improved graph-based matching procedure that allows for visual place recognition in the case of partially overlapping image sequences.

To achieve long-term autonomy, we further increase the robustness of our place recognition algorithm by incorporating information from multiple image sequences collected along different overlapping and non-overlapping routes. This allows us to grow the coverage of the environment in terms of area as well as scene appearances. The reference dataset then contains more images to match against, which increases the probability of finding a matching image and can lead to improved localization. To deploy a robot that performs localization in large-scale environments over extended periods of time, however, collecting a reference dataset may be a tedious, resource-consuming, and in some cases intractable task. Avoiding an explicit map collection stage fosters faster deployment of robotic systems in the real world, since no map has to be collected beforehand. With our visual place recognition approach, the map collection stage can be skipped: due to its general formulation, we are able to incorporate information from a publicly available source, e.g., Google Street View, into our framework. This enables us to perform place recognition on already existing, publicly available data and thus avoid a costly mapping phase. In this thesis, we additionally show how to organize the images from such a publicly available source into sequences to perform out-of-the-box visual place recognition at city scale, without previously collecting the otherwise required reference image sequences.

All approaches described in this thesis have been published in peer-reviewed conference papers and journal articles. In addition to that, most of the presented contributions have been released publicly as open source software.

2019-07: Data Available: SemanticKITTI — A Dataset for Semantic Scene Understanding of LiDAR Sequences



With SemanticKITTI, we release a large dataset to propel research on laser-based semantic segmentation. We annotated all sequences of the KITTI Vision Odometry Benchmark and provide dense point-wise annotations for the complete 360 deg field of view of the employed automotive LiDAR. We propose three benchmark tasks based on this dataset: (i) semantic segmentation of point clouds using a single scan, (ii) semantic segmentation using sequences comprised of multiple past scans, and (iii) semantic scene completion, which requires anticipating the semantic scene in the future. We provide baseline experiments and show that there is a need for more sophisticated models to efficiently tackle these tasks.
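For readers who want to work with the data, the released format can be sketched as follows. In the SemanticKITTI documentation, each scan is an N x 4 float32 array (x, y, z, remission) and each label file stores one uint32 per point, with the semantic class in the lower 16 bits and the instance id in the upper 16 bits; the file paths and the helper names below are placeholders.

```python
import numpy as np

# Sketch of reading a SemanticKITTI scan/label pair. Per the dataset
# documentation: scans are N x 4 float32 (x, y, z, remission); labels
# are one uint32 per point, semantic class in the lower 16 bits and
# instance id in the upper 16 bits. Paths are placeholders.

def split_labels(labels):
    """labels: uint32 array -> (semantic, instance) arrays per point."""
    semantic = labels & 0xFFFF  # lower 16 bits: semantic class
    instance = labels >> 16     # upper 16 bits: instance id
    return semantic, instance

def load_scan(scan_path, label_path):
    points = np.fromfile(scan_path, dtype=np.float32).reshape(-1, 4)
    labels = np.fromfile(label_path, dtype=np.uint32)
    assert points.shape[0] == labels.shape[0]
    return points, split_labels(labels)

# Bit layout on a synthetic label: instance 7, semantic class 40.
lbl = np.array([(7 << 16) | 40], dtype=np.uint32)
sem, inst = split_labels(lbl)
print(int(sem[0]), int(inst[0]))  # -> 40 7
```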

J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences,” in Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), 2019.

2019-07: Code Available: Bonnetal – an easy-to-use deep-learning training and deployment pipeline for robotics by Andres Milioto

We have recently open-sourced Bonnetal, an easy-to-use deep-learning training and deployment pipeline for a suite of perception tasks that we have developed for our robots’ perception systems.

Bonnetal can pre-train popular CNN backbones on ImageNet for transfer learning (pre-trained weights for popular models are downloaded by default from our server, so learning never has to start from scratch), and it has fast decoders for real-time semantic segmentation. We have more applications in the internal pipeline that we will open-source within the framework as well, such as object detection, instance segmentation, keypoint/feature extraction, and more.

The key features of Bonnetal are:

  • The training interface is easy to use, even for a novice in machine learning,
  • The library of models for transfer learning requires significantly less training data and time for a new task and dataset, exploiting the knowledge about low-level geometry and texture that is already condensed in the pre-trained weights,
  • All architectures can be used with our C++ library, which also has a ROS wrapper so that you don’t have to code at all, and
  • All of the supported architectures are tested using NVIDIA’s TensorRT so that you can get that extra juice out of your Jetson or GPU, including fast inference tricks such as INT8 quantization and calibration (vs. standard, slower, 32-bit floating point).

This video shows a person-vs-background segmentation network using a MobileNetV2 architecture with a small Atrous Spatial Pyramid Pooling module, running quantized to INT8 for fast inference and achieving 200 FPS at VGA resolution on a single GPU.

Access the code on our lab’s GitHub:

2019-05: Code Available: ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras by Emanuele Palazzolo

ReFusion – 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals

ReFusion on github

Mapping and localization are essential capabilities of robotic systems. Although the majority of mapping systems focus on static environments, deployment in real-world situations requires them to handle dynamic objects. In this paper, we propose an approach for an RGB-D sensor that is able to consistently map scenes containing multiple dynamic elements. For localization and mapping, we employ an efficient direct tracking on the truncated signed distance function (TSDF) and leverage color information encoded in the TSDF to estimate the pose of the sensor. The TSDF is efficiently represented using voxel hashing, with most computations parallelized on a GPU. For detecting dynamics, we exploit the residuals obtained after an initial registration, together with the explicit modeling of free space in the model. We evaluate our approach on existing datasets and provide a new dataset showing highly dynamic scenes. These experiments show that our approach often surpasses other state-of-the-art dense SLAM methods. We make our dataset available together with ground truth for the trajectory of the RGB-D sensor, obtained with a motion capture system, and for the model of the static environment, obtained with a high-precision terrestrial laser scanner.
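The residual-based handling of dynamics can be illustrated with a toy sketch. The per-pixel threshold and the function below are hypothetical, not the actual GPU implementation: pixels whose registration residual remains large after alignment are flagged as likely dynamic and would be excluded from TSDF integration.

```python
import numpy as np

# Toy sketch of residual-based dynamics filtering in the spirit of
# ReFusion (illustration only, not the actual GPU implementation):
# after registering a frame against the model, pixels with a large
# residual are treated as dynamic. The threshold is illustrative.

def dynamic_mask(residuals, threshold=0.05):
    """residuals: per-pixel registration residual (e.g., the TSDF value
    at the back-projected point, in meters). True = likely dynamic."""
    return np.abs(residuals) > threshold

residuals = np.array([[0.01, 0.02],
                      [0.30, 0.01]])  # one outlier pixel
mask = dynamic_mask(residuals)
print(mask)  # only the 0.30 pixel is flagged as dynamic
```

In a full pipeline, the flagged pixels would simply be skipped when updating the TSDF, so moving objects do not corrupt the static model.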

If you use our implementation in your academic work, please cite the corresponding paper: E. Palazzolo, J. Behley, P. Lottes, P. Giguère, C. Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals, Submitted to IROS, 2019 (arxiv paper).


2019-03-11: Cyrill Stachniss receives AMiner TOP 10 Most Influential Scholar Award (2007-2017)

Cyrill Stachniss received the 2018 AMiner TOP 10 Most Influential Scholar Award (2007-2017) in the area of robotics. The AMiner Most Influential Scholar Annual List names the world’s top-cited research scholars from the fields of AI and robotics. The list is conferred in recognition of outstanding technical achievements with lasting contribution and impact on the research community. In 2018, the winners were among the most-cited scholars whose papers were published in the top venues of their respective subject fields between 2007 and 2017. Recipients are automatically determined by a computer algorithm deployed in the AMiner system that tracks and ranks scholars based on citation counts collected from top-venue publications. Specifically, the list for the field of robotics answers the question of who were the most-cited scholars between 2007 and 2017 in the ICRA and IROS conferences, which are identified as the top venues of this field.