Sim4CV – A Photo-Realistic Simulator for Computer Vision Applications
We present a photo-realistic training and evaluation simulator (Sim4CV) with extensive applications across various fields of computer vision. Built on top of the Unreal Engine, the simulator integrates full-featured, physics-based cars, unmanned aerial vehicles (UAVs), and animated human actors in diverse urban and suburban 3D environments. The simulator combines several state-of-the-art tracking algorithms with a benchmark evaluation tool, as well as a generic deep neural network (DNN) interface for real-time evaluation. It generates synthetic photo-realistic datasets with automatic ground-truth annotations to easily extend existing real-world datasets, and provides extensive synthetic data variety through its ability to reconfigure synthetic worlds on the fly using an automatic world generation tool.
Structured Depth Prediction
Learning to predict scene depth from RGB inputs is a challenging task for both indoor and outdoor robot navigation. In this work we address unsupervised learning of scene depth and robot ego-motion, where supervision is provided by monocular videos, as cameras are the cheapest, least restrictive, and most ubiquitous sensor for robotics. Previous work in unsupervised image-to-depth learning has established strong baselines in the domain. We propose a novel approach which produces higher quality results, is able to model moving objects, and is shown to transfer across data domains, e.g., from outdoor to indoor scenes. The main idea is to introduce geometric structure into the learning process by modeling the scene and the individual objects; camera ego-motion and object motions are learned from monocular videos as input. Furthermore, an online refinement method is introduced to adapt learning on the fly to unknown domains. The proposed approach outperforms all state-of-the-art approaches, including those that handle motion, e.g., through learned flow. Our results are comparable in quality to those that use stereo as supervision, and significantly improve depth prediction on scenes and datasets that contain substantial object motion. The approach is of practical relevance, as it allows transfer across environments, by transferring models trained on data collected for robot navigation in urban scenes to indoor navigation settings.
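The self-supervision signal in such monocular depth/ego-motion pipelines typically comes from a photometric reconstruction loss: a neighboring frame is warped into the target view using the predicted depth and camera motion, and the per-pixel difference to the real target frame is penalized. The following is a minimal sketch of that loss only (the warping step, and all names, are assumptions for illustration, not the paper's actual implementation):

```python
def photometric_l1_loss(warped, target):
    """Mean absolute per-pixel difference between a source frame warped
    into the target view (via predicted depth and ego-motion) and the
    actual target frame.  Both are 2D grayscale images as lists of rows.
    Minimizing this loss trains depth and pose networks without labels."""
    total, count = 0.0, 0
    for row_w, row_t in zip(warped, target):
        for w, t in zip(row_w, row_t):
            total += abs(w - t)
            count += 1
    return total / count
```

In practice this term is usually combined with a depth smoothness regularizer, and the paper's contribution of modeling individual objects adds per-object motion estimates to the warping step.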
Teaching UAVs to Race Using a Photo-Realistic Simulator
Automating the navigation of unmanned aerial vehicles (UAVs) in diverse scenarios has gained much attention in recent years. However, teaching UAVs to fly in challenging environments remains an unsolved problem, mainly due to the lack of training data. In this paper, we train a deep neural network to predict UAV controls from raw image data for the task of autonomous UAV racing in a photo-realistic simulation. Training is done through imitation learning with data augmentation to allow for the correction of navigation mistakes. Extensive experiments demonstrate that our trained network (when sufficient data augmentation is used) outperforms state-of-the-art methods and flies more consistently than many human pilots. Additionally, we show that our optimized network architecture can run in real-time on embedded hardware, allowing for efficient onboard processing critical for real-world deployment. From a broader perspective, our results underline the importance of extensive data augmentation techniques for improving robustness in end-to-end learning setups.
A Modular Approach Towards Autonomous Driving in Sim4CV
We present a novel and modular deep learning based approach towards autonomous driving. By using the deep network for pathway estimation only (thus decoupling it from the underlying car controls), we show that tasks such as lane change, obstacle avoidance, and guided driving become straightforward to implement. Furthermore, changes to the vehicle or its behavior can be applied easily on the controller side without changing the learned network. Our approach even works without any need for human-generated or hand-crafted training data (although manually collected data can be included if available), thus avoiding the high cost and tedious nature of manual data collection. We demonstrate the effectiveness of our approach by measuring its performance across diversely arranged environments and maps, showing that it can significantly outperform human drivers.
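The benefit of decoupling pathway estimation from control is that the controller can be a simple classical module. As a much-simplified illustration (the function name, gain, and angle convention are hypothetical, not the paper's controller), a proportional steering law can map a network-predicted waypoint direction to a steering command:

```python
def steering_from_waypoint(car_heading, waypoint_angle, gain=1.5):
    """Proportional controller: steer toward the predicted waypoint.
    The network only predicts waypoint_angle; this controller converts
    it into a steering command clamped to [-1, 1], so vehicle dynamics
    can change without retraining the network."""
    error = waypoint_angle - car_heading
    return max(-1.0, min(1.0, gain * error))
```

Swapping the vehicle then amounts to retuning `gain` (or replacing this with, e.g., a PID controller) while the learned perception network stays fixed.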
Fast Mitochondria Segmentation for Connectomics
In connectomics, neuroscientists create the wiring diagram of a mammalian brain by identifying synaptic connections between neurons in electron microscopy images. These images are acquired at nanometer resolution to clearly show intracellular structures. Importantly, this allows for the identification of mitochondria, which are the main energy providers of neurons and crucial for cell metabolism. Dysfunction in this process is a key factor in a variety of diseases, such as autism, making the quantitative analysis of mitochondria an important task. However, connectomics datasets can be petabytes in size, and manual processing is not feasible. We, therefore, present a fully automatic mitochondria detector with high accuracy and extremely fast processing times. We evaluate our method on multiple real-world connectomics datasets, including an improved version of the EPFL Hippocampus mitochondria detection benchmark. Our results show a Jaccard index of up to 0.90 with inference times below 16 ms per 512 × 512 image tile. This is faster than the acquisition time of modern electron microscopes, allowing mitochondria detection in real-time. Compared to previous work, our detector ranks first among real-time methods and third overall.
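The Jaccard index reported above is the standard intersection-over-union metric for binary segmentation masks; a minimal sketch of its computation (masks here flattened to 0/1 lists for simplicity):

```python
def jaccard_index(pred, gt):
    """Intersection-over-union of two binary masks, given as flat 0/1
    sequences of equal length.  Returns 1.0 for two empty masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0
```

A score of 0.90 thus means the predicted and ground-truth mitochondria masks overlap in 90% of their combined area.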
OIL: Observational Imitation Learning
Recent work has explored the problem of autonomous navigation by imitating a teacher and learning an end-to-end policy, which directly predicts controls from raw images. However, these approaches tend to be sensitive to mistakes by the teacher and do not scale well to other environments or vehicles. To this end, we propose Observational Imitation Learning (OIL), a novel imitation learning variant that supports online training and automatic selection of optimal behavior by observing multiple imperfect teachers. We apply our proposed methodology to the challenging problems of autonomous driving and UAV racing. For both tasks, we utilize the Sim4CV simulator, which enables the generation of large amounts of synthetic training data and also allows for online learning and evaluation. We train a perception network to predict waypoints from raw image data and use OIL to train another network to predict controls from these waypoints. Extensive experiments demonstrate that our trained network outperforms its teachers, conventional imitation learning (IL) and reinforcement learning (RL) baselines, and even humans in simulation.
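OIL's key idea, learning from the best of several imperfect teachers rather than cloning a single one, can be caricatured as follows. This is a deliberately simplified sketch, not the paper's algorithm: at each state, every teacher proposes an action, a scoring function rates the proposals, and only the highest-scoring action is imitated. The teachers and scoring function below are hypothetical toys.

```python
def select_best_action(state, teachers, score):
    """Query each (imperfect) teacher for an action in the current state
    and keep only the action that the scoring function rates highest;
    the learner is then trained to imitate that action alone."""
    actions = [teacher(state) for teacher in teachers]
    return max(actions, key=lambda a: score(state, a))

# Toy usage: teachers propose steering angles; the score prefers the
# proposal closest to a (hypothetical) reference waypoint direction.
teachers = [lambda s: 0.4, lambda s: 0.1, lambda s: -0.3]
score = lambda s, a: -abs(a - s["waypoint_angle"])
best = select_best_action({"waypoint_angle": 0.0}, teachers, score)
```

Because bad teacher actions are filtered out before training, the learned policy can exceed the performance of any individual teacher.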
Effective Information Extraction and Querying in Medical Case Databases
In this project, we introduce a new product to accelerate the adoption of AI in health care. In cooperation with the MGH & BWH Center for Clinical Data Science, we applied and optimized natural language processing methods specifically for medical patient reports collected in hospitals. Reports are written in free form and pose special challenges, such as highly specialized vocabulary, abbreviations, typing errors, and negated sentences. Standard approaches to key information extraction and querying, while working well in general use cases, typically fail on this kind of data. We developed a highly efficient search engine for medical databases that helps (i) clinicians retrieve relevant cases, aiding both clinical diagnostics and treatment, and (ii) machine learning engineers in the healthcare domain automatically extract relevant training and test data.
Guided Video Generation Using FCNs in an Adversarial Setup
In this project, we present a variety of deep learning based setups for text-to-image and text-to-image-sequence generation. Image sequence generation is a challenging task and an actively researched branch of computer vision, posing special challenges such as temporal coherency. To address this problem, we describe a variety of partial and complete solutions that we developed in three stages: (1) Text-to-image synthesis: using a traditional GAN, we generate a single image from a textual representation. (2) Text-plus-image-to-video synthesis: using a fully convolutional network, we generate an image sequence given a single frame and a description of the action taking place. (3) Inspired by the recent success of generative adversarial networks, we then also train this architecture in a truly adversarial setting. Throughout our work, we make use of different datasets. Primarily, we evaluated our approaches on our own synthetic datasets of increasing difficulty, before moving to natural images from a human action dataset. We also performed text-to-image synthesis experiments on the T-GIF dataset, but noticed that its high diversity and other issues make it rather unsuitable for video generation experiments.
An Orbit-Fitting Optimization for Detecting Near-Earth Asteroids
This report summarizes a multi-step method for identifying full orbit trajectories of asteroids in our solar system. The input to this method is a list of partial "tracklets," each a series of visible-light observations of an object in the night sky over the course of several hours. The output is a set of "orbital elements" that completely define the asteroid's trajectory through our solar system. Some steps of this method were completed in a prior project and are outside the scope of this paper; specifically, (i) the partitioning of the night sky into 768 spatial regions and monthly time windows, and (ii) the initial clustering of asteroids within those regions/windows into "asteroid slices." The steps that are within the scope of this paper are (a) the optimization of the six "motion parameters" of each cluster using a variety of optimization solvers, (b) the refinement of the earlier clusters using the optimized motion parameters, (c) the conversion of the motion parameters into time-independent "orbital elements," and (d) the meta-clustering of the asteroid slices from different time/space windows into final asteroid trajectories. Ultimately, our algorithm correctly identifies 69.9% of just over 21,000 asteroids in a labeled dataset, with an error rate of approximately 1% spurious asteroids. We hope to run this algorithm on a full, unsolved database of asteroid observations known as the Isolated Tracklet File (ITF).
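Step (a) is at heart a least-squares fit of a motion model to timed sky observations. As a much-simplified stand-in (fitting only a linear 1D motion with 2 parameters instead of the full nonlinear 6-parameter model, with hypothetical names), the closed-form solution looks like:

```python
def fit_linear_motion(times, positions):
    """Ordinary least-squares fit of position = x0 + v * t, a 2-parameter
    stand-in for the full 6-parameter motion model.  Returns (x0, v),
    minimizing the squared residuals to the observed sky positions."""
    n = len(times)
    mt = sum(times) / n
    mp = sum(positions) / n
    num = sum((t - mt) * (p - mp) for t, p in zip(times, positions))
    den = sum((t - mt) ** 2 for t in times)
    v = num / den
    return mp - v * mt, v
```

The residuals of such a fit also drive step (b): observations whose positions deviate strongly from the fitted motion can be dropped or reassigned when refining the clusters.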
Investigating Multiscale Class Activation Mapping for ConvNet-based Car Detection
In this paper we apply Class Activation Mapping (CAM) to the problem of car detection. For this, we perform weakly-supervised training of different Convolutional Neural Networks on large-scale datasets depicting cars and evaluate them on artificial and realistic image data. To also detect small object appearances, we combine CAM with a multiscale sliding-window approach. We show that (i) using CAM considerably improves localization ability compared to aggregating the results of a simple sliding-window classifier, and (ii) our multiscale sliding-window extension enhances the quality of localization results for small object appearances by increasing the mapping resolution, albeit at increased computational cost. The presented car detector is robust and performs well on both artificial and challenging realistic data, achieving an F1-score of 0.82 and 0.77 in our metric, respectively.
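The CAM heatmap itself is computed as a class-weighted sum of the final convolutional feature maps, using the weights that the global-average-pooling classifier learned for the class of interest (here, "car"). A minimal sketch of that weighted sum, with feature maps as plain nested lists:

```python
def class_activation_map(feature_maps, weights):
    """CAM for one class: sum the final conv-layer feature maps, each
    scaled by that class's weight from the global-average-pooling
    classifier.  High values indicate image regions that drove the
    class prediction.  feature_maps: list of 2D maps (lists of rows);
    weights: one scalar per map."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wc in zip(feature_maps, weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wc * fmap[i][j]
    return cam
```

The multiscale extension described above amounts to computing such maps over sliding windows at several scales and merging them, which raises the effective resolution of the heatmap at the price of extra forward passes.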