Robust Understanding of Street Scenes Using Computer Vision
Summer School and Workshop on Uncertainty in Computer Vision for Automated Driving
Deep neural networks achieve stunning performance in the semantic interpretation of street scenes.
Nevertheless, a number of safety-relevant questions remain under intensive research. In particular, this applies to the questions of
whether neural networks can successfully self-monitor their performance by quantifying the uncertainty of their predictions,
and what kind of data they need to learn this. In this workshop, we bring together experts on uncertainty in machine learning
and computer vision to present state-of-the-art research towards enhancing the safety of automated driving.
Figure: Visualization of Exemplary Methods for Prediction Uncertainty Quantification in Street Scenes
Relevant topics of interest for this workshop include (but are not limited to) the following:
Quantification of uncertainty
Detection of out-of-distribution samples
Resilience to adversarial attacks
Open world recognition
Domain shift detection
Cross-domain learning and robustness to real-world conditions
Uncertainty in Machine Learning -- Methods for Uncertainty Quantification
How Synthetic Training Powers Anomaly and Obstacle Detection in Traffic Scenes
Autonomous vehicles can encounter arbitrary obstacles in their paths, yet popular datasets label only a limited set of object types, such as cars, pedestrians and traffic signs.
With our most powerful detectors based on supervised learning, how do we detect that for which we have no examples? Many approaches attempt to generate semi-synthetic training sets in the hope of generalization.
I will present an overview of methods for detecting anomalous objects in traffic scenes, with a focus on their training process.
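As a rough illustration of how such semi-synthetic training sets are often built, the following minimal sketch pastes a masked out-of-distribution object into a street-scene image. All names and shapes here are hypothetical; real pipelines additionally apply blending, scaling and placement heuristics.

```python
import numpy as np

def paste_anomaly(image: np.ndarray, obj: np.ndarray, mask: np.ndarray,
                  y: int, x: int) -> np.ndarray:
    """Paste the masked pixels of `obj` into `image` at position (y, x).

    A crude cut-and-paste step for synthesizing anomaly training data;
    real pipelines typically add blending and placement heuristics.
    """
    out = image.copy()
    h, w = mask.shape
    region = out[y:y + h, x:x + w]   # view into the output image
    region[mask] = obj[mask]         # copy only the object's own pixels
    return out

scene = np.zeros((4, 4, 3), dtype=np.uint8)            # toy "street scene"
animal = np.full((2, 2, 3), 255, dtype=np.uint8)       # toy OOD object
animal_mask = np.array([[True, False], [True, True]])  # its segmentation mask
augmented = paste_anomaly(scene, animal, animal_mask, 1, 1)
```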
Leveraging Multi-Modality for Robust Scene Understanding
We explore how consistency between different sensor modalities or predictive tasks can be leveraged to ensure reliability and uncertainty
estimation in autonomous systems. For this, we need to be able to estimate uncertainty across a range of different predictive tasks, each
of which also exhibits different failure patterns. We then show how consistency between tasks can be leveraged to identify wrong
predictions. Finally, we provide an outlook on how multi-modality can also be used as a training signal, in order to improve systems
autonomously during their deployment.
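One simple way to operationalize cross-modal consistency, sketched here for assumed per-pixel class maps from two hypothetical branches (e.g. camera- and lidar-based segmentation), is to flag pixels where the modalities disagree as candidates for wrong predictions:

```python
import numpy as np

def disagreement_mask(seg_a: np.ndarray, seg_b: np.ndarray) -> np.ndarray:
    """Boolean mask of pixels where two per-pixel class maps differ.

    Disagreement between redundant predictors serves as a cheap proxy
    for unreliable predictions; real systems combine such checks with
    calibrated per-task uncertainty estimates.
    """
    return seg_a != seg_b

camera = np.array([[0, 1], [2, 2]])  # class ids from the camera branch
lidar  = np.array([[0, 1], [2, 0]])  # class ids from the lidar branch
suspect = disagreement_mask(camera, lidar)
```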
Enabling AI in Safety-Critical Automotive Products
The application of AI is a key enabler for innovative automotive functions such as highly automated driving. This is especially true for the
perception of the environment in complex urban traffic situations. Assuring the safety of functions that make use of AI-based algorithms is
crucial. Existing and established safety processes cannot directly be transferred to machine learning methods. This talk will present challenges
and AI-specific requirements for the development process and introduce concepts and exemplary methods to address these challenges.
This talk describes problems and solutions to big-data challenges in automotive camera projects with multiple petabytes of data.
Topics such as the logging setup in a single vehicle, data collection, fleet handling and uploading data to
cloud storage will be briefly addressed. The main part presents the AiBox solution developed by Aptiv. The AiBox
uses AI algorithms to analyze video, both in real time (in the vehicle) and on the storage cluster. It greatly
improves data analysts' work by automatically generating metadata, and it enables functions such as finding videos that are not
suitable for further development or matching videos by content.
Tracking and Retrieval of Out of Distribution Objects in Video Sequences
In this work we present two video test datasets for the novel computer vision (CV) task of out-of-distribution tracking (OOD tracking). Here, OOD objects are understood as objects with a semantic
class outside the semantic space of an underlying image segmentation algorithm, or as instances within the semantic space which nevertheless look decidedly different from the instances contained in the training data. OOD objects
occurring in video sequences should be detected in single frames as early as possible and tracked over their time of appearance for as long as possible. During their time of appearance, they should be segmented as precisely as possible.
We present the SOS dataset containing 20 video sequences of street scenes and more than 1000 labeled frames with up to two OOD objects. We furthermore publish the synthetic CARLA-WildLife dataset that consists of 26 video
sequences containing up to four OOD objects in a single frame. We propose metrics to measure the success of OOD tracking and develop a baseline algorithm that efficiently tracks the OOD objects. As an application that benefits
from OOD tracking, we retrieve OOD sequences from unlabeled videos of street scenes containing OOD objects.
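A baseline tracker of this kind can be sketched as greedy IoU association of OOD detections across frames. The function names and threshold below are illustrative, not the paper's actual implementation:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, thresh=0.3):
    """Greedily match previous-frame track boxes to current detections.

    Returns a dict mapping track id -> detection index; unmatched
    detections would start new tracks in a full tracker.
    """
    matches, used = {}, set()
    for tid, tbox in tracks.items():
        best, best_iou = None, thresh
        for j, dbox in enumerate(detections):
            if j not in used and iou(tbox, dbox) > best_iou:
                best, best_iou = j, iou(tbox, dbox)
        if best is not None:
            matches[tid] = best
            used.add(best)
    return matches
```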
SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation
The detection and localization of previously-unseen objects is of utmost importance for safety-critical applications
such as perception for automated driving, especially if such unknown objects appear on the road ahead. Our benchmark addresses two tasks:
Anomalous object segmentation, which considers any previously-unseen object category; and road obstacle segmentation, which focuses on any object
on the road, be it known or unknown. We provide two corresponding datasets together with a test suite performing an in-depth method analysis.
Gradient-Based Quantification of Epistemic Uncertainty for Deep Object Detectors
Deep neural networks have redefined the state of the art in object detection performance and are subject to ongoing development and innovation.
However, deep neural networks tend to be probabilistically unreliable in that they are often over-confident, which can have a negative effect on downstream tasks.
To improve confidence estimation, uncertainty quantification methods such as Monte-Carlo dropout and deep ensembles have been developed, which aim at capturing epistemic uncertainty in terms of probabilistic prediction variance.
We propose novel instance-wise uncertainty scores that are based on the network's self-learning gradient.
Our uncertainty scores can be used to great benefit to detect false positive predictions and lead to well-calibrated confidence estimates.
Moreover, we show that the combination with purely output-based methods leads to improvements over both individual models (gradient- and output-based) indicating mutual non-redundancy.
Object detection based on our confidence estimates gains probabilistic reliability and shows an overall improvement in detection performance and in stability with respect to prediction confidence thresholds.
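The core idea can be illustrated on a toy linear classifier: use the model's own prediction as a pseudo-label and take the norm of the resulting loss gradient as an uncertainty score. This is a minimal numpy sketch, not the detector-scale method from the talk:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gradient_uncertainty(W: np.ndarray, x: np.ndarray) -> float:
    """Self-learning gradient score for a linear classifier p = softmax(Wx).

    The predicted class is used as its own pseudo-label; the cross-entropy
    gradient w.r.t. W is (p - onehot) x^T, whose norm shrinks as the
    prediction becomes more confident.
    """
    p = softmax(W @ x)
    onehot = np.zeros_like(p)
    onehot[p.argmax()] = 1.0
    grad = np.outer(p - onehot, x)
    return float(np.linalg.norm(grad))

W = np.eye(2)
confident = np.array([4.0, 0.0])   # far from the decision boundary
ambiguous = np.array([1.0, 1.0])   # on the decision boundary
```

On the ambiguous input the softmax stays near uniform, so the residual (p - onehot) is large and the score rises; on the confident input it collapses toward zero.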
Improving Robustness under Domain Shift of Semantic Segmentation by Depth Estimation
Semantic segmentation of images is an important tool for environment perception. In recent years, deep neural networks have demonstrated
outstanding performance for this task. However, their performance is tied to the domain represented by the training data; thus, open-world
scenarios cause inaccurate predictions, which is hazardous in safety-relevant applications like automated driving. In this talk,
we present a method which improves a semantic segmentation prediction using monocular depth estimation by reducing the occurrence of
non-detected objects in presence of domain shift. To this end, a depth heatmap is inferred via a modified segmentation network that
generates foreground-background masks, operating in parallel to a given semantic segmentation network. Both segmentation masks are
aggregated with a focus on foreground classes (here road users) to reduce false negatives. To also decrease the occurrence of false
positives, a pruning based on uncertainty estimates is applied. The effectiveness of this approach is demonstrated by fewer non-detected
objects of most important classes and enhanced generalization to other domains compared to the basic semantic segmentation prediction.
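The aggregation step can be sketched as follows, with hypothetical class ids: wherever the depth-based foreground mask fires but the segmentation predicts background, the pixel is relabeled so that no road user is silently missed.

```python
import numpy as np

def fuse_with_depth(seg: np.ndarray, fg_mask: np.ndarray,
                    fg_classes: tuple, unknown_id: int = 255) -> np.ndarray:
    """Reduce false negatives by overriding background predictions.

    Pixels that the depth-based foreground-background network marks as
    foreground but the segmentation labels as background are set to a
    generic `unknown_id`, flagging a potentially missed object.
    """
    fused = seg.copy()
    background = ~np.isin(seg, fg_classes)
    fused[fg_mask & background] = unknown_id
    return fused

# Toy example: class 0 = road (background), 11/12 = person/rider (foreground).
seg = np.array([[0, 0], [11, 0]])
fg_mask = np.array([[False, True], [True, False]])
fused = fuse_with_depth(seg, fg_mask, fg_classes=(11, 12))
```

The uncertainty-based pruning of false positives described above would then operate on the pixels set to `unknown_id`.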
Domain Adaptation with Generative Adversarial Networks
Domain adaptation is of huge interest as labeling is an expensive and error-prone task, especially when labels are needed on pixel-level like in semantic segmentation.
Therefore, one would like to be able to train neural networks on synthetic domains, where data is abundant and labels are precise. However, these models often perform poorly
on out-of-domain images. To mitigate the shift in the input, image-to-image approaches can be used. However, standard image-to-image approaches that bridge the domain of deployment
with the synthetic training domain focus not on the downstream task but only on visual appearance. We therefore propose a "task-aware" version of a GAN in an image-to-image domain
adaptation approach. With the help of a small amount of labeled ground-truth data, we guide the image-to-image translation towards a more suitable input image for a semantic segmentation network trained
on synthetic data (a synthetic-domain expert).
We demonstrate a modular semi-supervised domain adaptation method for semantic segmentation by training a downstream task aware CycleGAN while refraining from adapting the synthetic semantic segmentation expert.
Robust Evaluation of Computer Vision for Autonomous Driving
Robustness is a key quality when choosing the best computer vision algorithms for autonomous driving. Traditional evaluations and
benchmarks focus on a single performance metric, and dataset bias can easily distort the robustness perceived on benchmarks versus
in actual real-life usage. This talk presents two projects which propose solutions to this gap: the WildDash 2 benchmark (WD2) and
the Robust Vision Challenge (RVC). WD2 uses hazard-aware testing to quantify robustness and negative test cases as a tool to evaluate
robustness in unexpected situations. The RVC's meta-ranking approach reduces the influence of individual dataset
bias and promotes algorithms that focus on generalization rather than on solving a single dataset.
Creating Synthetic and Real Data for Semantic Scene Understanding in Adverse Conditions
Adverse conditions such as night, fog, rain and snow are hindering the deployment of autonomous cars, as they present a big challenge to the
visual perception component of such systems due to the associated deterioration in the quality of the measured visual signals. In this talk,
we will review techniques both for generating synthetic datasets and constructing real datasets pertaining to adverse conditions, which can be
used for training as well as benchmarking semantic understanding algorithms on driving scenes. In particular, we will first present a class of
methods for generating partially synthetic data corresponding to adverse conditions in a physically based fashion from real clear-weather
counterparts by leveraging the respective underlying optical models. The synthesized data inherit the semantic annotations of their real
counterparts and are thus straightforward to use for adapting semantic segmentation and object detection models to the respective adverse
condition. In the second part of the talk, we will focus on the construction of real adverse-condition datasets and will introduce ACDC, the
Adverse Conditions Dataset with Correspondences for semantic driving scene understanding. We will analyze the specialized two-stage annotation
protocol that has been developed for creating the pixel-level semantic annotations of ACDC, which exploits privileged information and
distinguishes between intra-image regions of clear and uncertain semantic content. Thus, ACDC supports both standard semantic segmentation and
the new task of uncertainty-aware semantic segmentation and serves as a new benchmark for domain adaptation from normal to adverse conditions.
Automated Detection of Labeling Errors in Semantic Segmentation Datasets
In applications like automated driving, the acquisition of semantic segmentation labels for datasets by human labor is time-consuming, exhausting and
therefore error-prone. Labeling errors may affect the quality of deep learning models and the quality of benchmark results. In this talk, we present a
method based on deep learning and uncertainty quantification, which detects labeling errors in semantic segmentation datasets. In order to enable method
development in that field, we propose a benchmark based on a synthetic and a real dataset, along with an evaluation protocol for methods that detect
label errors in semantic segmentation datasets. Our method shows strong detection results on our benchmark while keeping the rate of falsely detected
labeling errors under control. Furthermore, we apply our method to a number of popular semantic segmentation datasets, report results on the precision
of our detection method for moderate sample sizes and present examples of labeling errors.
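A simple instance of this idea, sketched below with assumed array shapes, scores each pixel by how little probability a trained model assigns to the annotated class; high scores mark candidate label errors. The talk's method additionally employs uncertainty quantification and aggregation, which this sketch omits.

```python
import numpy as np

def label_error_scores(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-pixel label-error score: 1 minus the softmax probability that
    the model assigns to the annotated class.

    probs has shape (H, W, C); labels has shape (H, W).
    """
    h, w = labels.shape
    p_annotated = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return 1.0 - p_annotated

probs = np.array([[[0.9, 0.1], [0.2, 0.8]]])  # (1, 2, 2) softmax outputs
labels = np.array([[0, 0]])                   # second pixel likely mislabeled
scores = label_error_scores(probs, labels)
```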
Road Anomaly Detection by Generative and Discriminative Road Appearance Modelling
In this talk, I will discuss approaches to detecting unknown objects in the
context of autonomous driving that leverage both generative and discriminative
modeling principles. We formulate the problem of detecting unknown objects as an anomaly
detection task, assuming that the unknown stuff or anomalous object appearances
cannot be learned. The main driving ideas behind the approaches are derived
from the general properties of anomalous objects: (1) anomalies are not
known, i.e. not from a class that we could model since once something novel is observed
in training, it can be modeled and ceases to be an anomaly, and (2) anomalies
are unique, i.e. "not the same" as non-anomalous objects in the image.
We demonstrate the effectiveness of these approaches on several standard datasets, where
they achieve state-of-the-art results.
Cross-Domain Learning of Dense Prediction Models
Training semantic segmentation models on multiple datasets has sparked a lot of recent interest in the computer vision community.
This interest is motivated by an expensive annotation process and a desire to achieve proficiency across multiple visual domains.
However, established datasets have mutually incompatible labeling systems which do not promote principled inference in the wild.
We develop a method for seamless learning on datasets with overlapping classes based on partial labels and probabilistic loss. We
propose a principled approach for manual extraction of a universal taxonomy and a mapping function that connects our universal
label space to the dataset-specific label spaces. We further propose an automatic method for constructing universal taxonomies based
on the analysis of correlation matrices. Our method detects subset-superset relationships between dataset-specific labels, and supports
learning of sub-class logits by treating super-classes as partial labels. Our method achieves competitive within-dataset and
cross-dataset generalization, as well as the ability to learn visual concepts which are not separately labeled in any of the
training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections, MSeg
dataset collection and WildDash 2 benchmark.
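Treating a super-class as a partial label amounts to maximizing the summed probability of its sub-classes. A minimal sketch with a hypothetical universal label space:

```python
import numpy as np

def partial_label_nll(p: np.ndarray, label_set: list) -> float:
    """Negative log-likelihood of a partial label.

    `p` is a softmax over universal sub-classes; a dataset-specific
    super-class annotation is mapped to the set of its sub-classes,
    and the loss credits any probability mass inside that set.
    """
    return float(-np.log(p[label_set].sum()))

# Hypothetical universal taxonomy: 0 = car, 1 = truck, 2 = person.
# A coarse dataset labels both 0 and 1 simply as "vehicle".
p = np.array([0.5, 0.3, 0.2])
vehicle = [0, 1]
loss = partial_label_nll(p, vehicle)   # credits mass on car and truck alike
```

Because the loss never forces a choice between car and truck, sub-class logits can still be learned from other datasets that do distinguish them.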
Accurate future anticipation is of vital importance for reliable and timely decision making in autonomous systems. Dense semantic
forecasting anticipates future events in video by inferring pixel-level semantics of an unobserved future image. Feature-level
forecasting methods are applicable to various single-frame architectures and tasks. Our feature-level approach consists of two modules.
Feature-to-motion (F2M) module forecasts a dense deformation field that warps past features into their future positions.
Feature-to-feature (F2F) module regresses the future features directly and is therefore able to account for emergent scenery. The
compound F2MF model decouples the effects of motion from the effects of novelty in a task-agnostic manner. We aim to apply F2MF
forecasting to the most subsampled and the most abstract representation of the desired single-frame model. Our design takes advantage
of deformable convolutions and spatial correlation coefficients across neighboring time instants. We perform experiments on three
dense prediction tasks: semantic segmentation, instance-level segmentation, and panoptic segmentation. The results reveal
state-of-the-art forecasting accuracy across three dense prediction tasks.
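The F2M warping step can be illustrated with a nearest-neighbor gather in numpy. The actual model predicts a dense sub-pixel deformation field and applies it with deformable convolutions; this sketch uses integer offsets for clarity:

```python
import numpy as np

def warp_features(feat: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a (H, W, C) feature map by an integer (H, W, 2) offset field.

    Each output location (y, x) is filled with the past feature at
    (y - dy, x - dx), i.e. features are moved to their forecast positions.
    """
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(ys - flow[..., 0], 0, H - 1)
    src_x = np.clip(xs - flow[..., 1], 0, W - 1)
    return feat[src_y, src_x]

feat = np.arange(4, dtype=float).reshape(2, 2, 1)  # toy feature map
flow = np.zeros((2, 2, 2), dtype=int)
flow[1, 1] = (1, 1)   # forecast: feature at (0, 0) moves to (1, 1)
future = warp_features(feat, flow)
```

The F2F branch would instead regress `future` directly, and the compound model blends both outputs per location.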
Standard machine learning is unable to accommodate inputs that do not belong to the training distribution. The resulting models often
give rise to confident incorrect predictions which may lead to devastating consequences. This problem is especially demanding in the
context of dense prediction since input images may be only partially anomalous. Previous work has addressed dense anomaly detection
by training on mixed-content images. However, such an approach may produce over-optimistic performance due to overlap between training
and validation negatives. We extend this approach with synthetic negative patches generated by normalizing flow. The negative patches
are simultaneously trained to achieve high inlier likelihood and uniform discriminative prediction. We also propose to detect anomalies
according to a principled information-theoretic criterion which can be consistently applied through training and inference. The
resulting models achieve outstanding results in spite of minimal computational overhead and refraining from auxiliary negative data.
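As one simple information-theoretic score in this spirit (not necessarily the exact criterion of the talk), the entropy of the per-pixel discriminative prediction can serve for anomaly detection: it is maximal for the uniform prediction that the synthetic negatives are trained toward.

```python
import numpy as np

def entropy_score(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy of per-pixel class distributions, shape (..., C).

    Uniform (maximally uncertain) predictions score highest, so pixels
    pushed toward uniformity on negatives surface as anomalies.
    """
    return -np.sum(p * np.log(p + eps), axis=-1)

inlier  = np.array([0.97, 0.01, 0.02])   # confident in-distribution pixel
anomaly = np.array([1/3, 1/3, 1/3])      # uniform, anomalous pixel
```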
Revisiting One-Way Consistency for Semi-Supervised Semantic Segmentation
Semi-supervised learning is very important for practical deployments of deep models since it relaxes the dependence on labeled data. It
is especially interesting in the scope of dense prediction because pixel-level annotation requires significant effort. We consider
semi-supervised algorithms that enforce consistent predictions over perturbed unlabeled inputs. We study the advantages of perturbing
only one of the two model instances and computing the gradient only in the perturbed instance. We also propose a competitive perturbation
model as a composition of geometric warp and photometric jittering. We experiment on efficient models due to their importance for
real-time and low-power applications. Our experiments show advantages of (1) one-way consistency, (2) perturbing only the student branch,
and (3) the proposed perturbation model.
Efficient assessment of road infrastructure safety is essential for reducing traffic-related fatalities. We address the automatic
recognition of road safety attributes defined by the iRAP road safety standard. The task is formulated as a multi-class classification
of each iRAP attribute in video clips that correspond to 10-meter road segments. Our solution has two parts. The first is an efficient
convolutional multi-task model with shared features, which recognizes all attributes from multiple frames with a single forward pass
and learns in an end-to-end fashion. The second is a sequential model that works on the outputs of the convolutional model and considers
a larger context to learn the specific temporal behavior and annotation conventions of each individual attribute. We perform experiments
on a real-world dataset acquired along 2,300 km of public roads in Bosnia and Herzegovina, manually annotated for all attributes by
trained experts. A majority of the attributes in our dataset suffer from severe class imbalance, which we address with dynamically
weighted losses based on class frequencies and recall during training, and by using per-attribute macro-F1 scores for evaluation. We
adapt our approach to the Honda Scenes Dataset for traffic scene classification and achieve improvements over previous approaches across
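Per-attribute macro-F1 averages F1 over classes so that rare classes count equally in the metric. A self-contained sketch (scikit-learn's `f1_score(average='macro')` computes the same quantity):

```python
def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: unweighted mean of per-class F1 scores,
    so rare classes influence the metric as much as frequent ones."""
    f1s = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes

# Imbalanced toy attribute: class 1 is rare but its errors weigh equally.
score = macro_f1([0, 0, 0, 1], [0, 0, 1, 1], n_classes=2)
```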