Robust Understanding of Street Scenes Using Computer Vision!

Robust Understanding of Street Scenes Using Computer Vision

Summer School and Workshop on Uncertainty in Computer Vision for Automated Driving

Deep neural networks lead to stunning performance in semantic interpretation of street scenes. Nevertheless, a number of safety relevant questions still remain under intensive research. In particular this applies to the question, whether neural networks can successfully self-monitor their performance by quantifying the uncertainty of predictions, and which kind of data they need to learn this. In this workshop, we bring together experts on uncertainty in machine learning and computer vision to present state of the art research towards enhancing the safety of automated driving.

Figure: Visualization of Exemplary Methods for Prediction Uncertainty Quantification in Street Scenes

Relevant topics of interest for this workshop include (but are not limited to) the following:

Quantification of uncertainty
Detection of out-of-distribution samples
Resilience to adversarial attacks
Open world recognition
Domain shift detection
Domain adaptation
Cross-domain learning and robustness to real-world conditions
Sensor fusion and other types of redundancy

Organizers

Prof. Dr. Siniša Šegvić

University of Zagreb

Prof. Dr. Hanno Gottschalk

University of Wuppertal

Schedule Summer School

Wednesday, 09-21-2022

09:05-10:35

Hanno Gottschalk, University of Wuppertal

From Information to Outliers

10:35-11:00

Coffee Break

11:00-12:30

Matthias Rottmann, EPFL

Uncertainty in Machine Learning -- Notions of Uncertainty

12:30-13:45

Lunch break

13:45-15:15

Matthias Rottmann, EPFL

Uncertainty in Machine Learning -- Methods for Uncertainty Quantification

Schedule Workshop

Thursday, 09-22-2022

09:00-09:45

, EPFL

How Synthetic Training Powers Anomaly and Obstacle Detection in Traffic Scenes

09:45-10:30

, ETH Zurich

Leveraging Multi-Modality for Robust Scene Understanding

10:30-10:40

Coffee Break

10:40-11:00

, CARIAD

Enabling AI in Safety-Critical Automotive Products

11:00-11:20

, Aptiv

Automotive Big Data Video Analysis with AiBox

11:20-11:40

, dSPACE

Challenges in Data Selection: How to Fish in the Data Lake

11:40-12:00

, Intel

Learning Confidence Classification from Richly Annotated Synthetic Data

12:00-13:00

Poster Session

13:00-14:00

Lunch break

14:00-14:30

, University of Wuppertal

Tracking and Retrieval of Out of Distribution Objects in Video Sequences

14:30-15:00

, Bielefeld University

SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation

15:00-15:30

, University of Wuppertal

Gradient-Based Quantification of Epistemic Uncertainty for Deep Object Detectors

15:30-15:45

Coffee Break

15:45-16:15

, Ruhr University Bochum

Improving Robustness under Domain Shift of Semantic Segmentation by Depth Estimation

16:15-16:45

, University of Wuppertal

Domain Adaptation with Generative Adversarial Networks

16:45-17:05

, Rimac Automobili

Challenges in Scene Understanding for Autonomous Racing

Friday, 09-23-2022

09:00-09:45

, AIT Vienna

Robust Evaluation of Computer Vision for Autonomous Driving

09:45-10:30

, ETH Zurich

Creating Synthetic and Real Data for Semantic Scene Understanding in Adverse Conditions

10:30-10:45

Coffee Break

10:45-11:30

, EPFL

Automated Detection of Labeling Errors in Semantic Segmentation Datasets

11:30-12:15

, Czech Technical University in Prague

Road Anomaly Detection by Generative and Discriminative Road Appearance Modelling

12:15-14:00

Lunch break

14:00-14:20

, Gideon Brothers

Scene Understanding for Autonomous Mobile Robots in Warehouse Operations

14:20-14:40

, Microblink

Transformer Architectures in Production: Efficiency and Self-Supervision

14:40-15:00

, Xylon

Implementing Computer Vision Algorithms on FPGAs: Challenges & Advantages

15:00-15:20

, RoMB technologies

From the Road into the Warehouse: Semantic Segmentation for Autonomous Forklifts

15:20-15:35

Coffee Break

15:35-15:50

, University of Zagreb

Cross-Domain Learning of Dense Prediction Models

15:50-16:05

, University of Zagreb

Dense Semantic Forecasting

16:05-16:20

, University of Zagreb

Dense Anomaly Detection with Synthetic Negatives

16:20-16:35

, University of Zagreb

Revisiting One-Way Consistency for Semi-Supervised Semantic Segmentation

16:35-16:50

, University of Zagreb

Detection of Road-Safety Attributes in Video

Chewie

The workshop has been supported by the University of Wuppertal and the Croatian Science Foundation project ADEPT.

09:00-09:45		, EPFL
		How Synthetic Training Powers Anomaly and Obstacle Detection in Traffic Scenes Autonomous vehicles can encounter arbitrary obstacles in their paths. But popular datasets label a limited set of object types, such as cars, pedestrians and traffic signs. With our most powerful detectors based on supervised learning, how do we detect that for which we have no examples? Many attempt to generate semi-synthetic training sets aiming for generalization. I will present an overview of methods aiming to detect anomalous objects in traffic scenes, with the focus on their training process.
09:45-10:30		, ETH Zurich
		Leveraging Multi-Modality for Robust Scene Understanding We explore how consistency between different sensor modalities or predictive tasks can be leveraged to ensure reliability and uncertainty estimation in autonomous systems. For this, we need to be able to estimate uncertainty across a range of different predictive tasks, each of which also exhibits different failure patterns. We then show how consistency between tasks can be leveraged to identify wrong predictions. Finally, we provide an outlook how multi-modality can also be used as a training input, in order to improve systems autonomously during their deployment.
10:30-10:40		Coffee Break
10:40-11:00		, CARIAD
		Enabling AI in Safety-Critical Automotive Products The application of AI is a key enabler for innovative automotive functions such as highly automated driving. This is especially true for the perception of the environment in complex urban traffic situations. Assuring the safety of functions that make use of AI-based algorithms is crucial. Existing and established safety processes cannot directly be transferred to machine learning methods. This talk will present challenges and AI-specific requirements for the development process and introduce concepts and exemplary methods to address these challenges.
11:00-11:20		, Aptiv
		Automotive Big Data Video Analysis with AiBox In the talk a description of problems and solutions to big data challenges in automotive camera projects with multiple Petabytes of data will be addressed. Topics such as logging setup in a single vehicle, data collection, fleet handing as well as uploading data to the cloud storage will be briefly addressed. In the main part a description of AiBox solution developed by Aptiv will be presented. The AiBox uses the AI algorithms to analyze the video, both in real time (in the vehicle) as well as on the storage cluster. It allows to greatly improve the data analyst work by automatically generating metadata. It also enables functions such as finding videos which are not suitable for further development or matching videos by content.
11:20-11:40		, dSPACE
		Challenges in Data Selection: How to Fish in the Data Lake
11:40-12:00		, Intel
		Learning Confidence Classification from Richly Annotated Synthetic Data
12:00-13:00		Poster Session
13:00-14:00		Lunch break
14:00-14:30		, University of Wuppertal
		Tracking and Retrieval of Out of Distribution Objects in Video Sequences In this work we present two video test data sets for the novel computer vision (CV) task of out of distribution tracking (OOD tracking). Here, OOD objects are understood as objects with a semantic class outside the semantic space of an underlying image segmentation algorithm, or an instance within the semantic space which however looks decisively different from the instances contained in the training data. OOD objects occurring on video sequences should be detected on single frames as early as possible and tracked over their time of appearance as long as possible. During the time of appearance, they should be segmented as precisely as possible. We present the SOS data set containing 20 video sequences of street scenes and more than 1000 labeled frames with up to two OOD objects. We furthermore publish the synthetic CARLA-WildLife data set that consists of 26 video sequences containing up to four OOD objects on a single frame. We propose metrics to measure the success of OOD tracking and develop a baseline algorithm that efficiently tracks the OOD objects. As an application that benefits from OOD tracking, we retrieve OOD sequences from unlabeled videos of street scenes containing OOD objects.
14:30-15:00		, Bielefeld University
		SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation The detection and localization of previously-unseen objects is of utmost importance for safety-critical applications such as perception for automated driving, especially if such unknown objects appear on the road ahead. Our benchmark addresses two tasks: Anomalous object segmentation, which considers any previously-unseen object category; and road obstacle segmentation, which focuses on any object on the road, may it be known or unknown. We provide two corresponding datasets together with a test suite, performing an in-depth method analysis.
15:00-15:30		, University of Wuppertal
		Gradient-Based Quantification of Epistemic Uncertainty for Deep Object Detectors Deep neural networks have redefined the state-of-the-art in performance object detection performance and are subject to ongoing development and innovation. However, deep neural networks tend to be probabilistically unreliable in that they tend to be over-confident which can have a negative effect on down-stream tasks. To improve confidence estimation, uncertainty quantification methods such as Monte-Carlo dropout and deep ensembles which aim at capturing epistemic uncertainty in terms of probabilistic prediction variance. We propose novel instance-wise uncertainty scores that are based on the network's self-learning gradient. Our uncertainty scores can be used to great benefit to detect false positive predictions and lead to well-calibrated confidence estimates. Moreover, we show that the combination with purely output-based methods leads to improvements over both individual models (gradient- and output-based) indicating mutual non-redundancy. The object detection obtained from our confidence estimates leads to probabilistic reliability and an overall improvement of detection performance and stability with respect to prediction confidence thresholds.
15:30-15:45		Coffee Break
15:45-16:15		, Ruhr University Bochum
		Improving Robustness under Domain Shift of Semantic Segmentation by Depth Estimation Semantic segmentation of images is an important tool for environment perception. In recent years, deep neural networks have demonstrated outstanding performance for this task. However, their performance is tied to the domain represented by the training data and thus, open world scenarios cause inaccurate predictions which is hazardous in safety relevant applications like automated driving. In this talk, we present a method which improves a semantic segmentation prediction using monocular depth estimation by reducing the occurrence of non-detected objects in presence of domain shift. To this end, a depth heatmap is inferred via a modified segmentation network that generates foreground-background masks, operating in parallel to a given semantic segmentation network. Both segmentation masks are aggregated with a focus on foreground classes (here road users) to reduce false negatives. To also decrease the occurrence of false positives, a pruning based on uncertainty estimates is applied. The effectiveness of this approach is demonstrated by fewer non-detected objects of most important classes and enhanced generalization to other domains compared to the basic semantic segmentation prediction.
16:15-16:45		, University of Wuppertal
		Domain Adaptation with Generative Adversarial Networks Domain adaptation is of huge interest as labeling is an expensive and error-prone task, especially when labels are needed on pixel-level like in semantic segmentation. Therefore, one would like to be able to train neural networks on synthetic domains, where data is abundant and labels are precise. However, these models often perform poorly on out-of-domain images. To mitigate the shift in the input, image-to-image approaches can be used. Nevertheless, standard image-to-image approaches that bridge the domain of deployment with the synthetic training domain do not focus on the downstream task but only on the visual inspection level. We therefore propose a “task aware” version of a GAN in an image-to-image domain adaptation approach. With the help of a small amount of labeled ground truth data, we guide the image-to-image translation to a more suitable input image for a semantic segmentation network trained on synthetic data (synthetic-domain expert). We demonstrate a modular semi-supervised domain adaptation method for semantic segmentation by training a downstream task aware CycleGAN while refraining from adapting the synthetic semantic segmentation expert.
16:45-17:05		, Rimac Automobili
		Challenges in Scene Understanding for Autonomous Racing

09:00-09:45		, AIT Vienna
		Robust Evaluation of Computer Vision for Autonomous Driving Robustness is a key quality when choosing the best computer vision algorithms for autonomous driving. Traditional evaluations and benchmarks focus on a single performance metric and dataset bias can easily distort the perceived robustness at benchmarks vs. the actual usage in real-life. This talk presents two projects which propose solutions to this gap: the Wilddash 2 benchmark (WD2) and the Robust Vision Challenge (RVC). WD2 uses hazard-aware testing to quantify robustness and negative test cases as a tool to evaluate robustness for unexpected situations. The RVC approach to create a meta-ranking allows to reduce the influence of individual dataset bias and promotes algorithms which focus on generalization vs. solving a single dataset.
09:45-10:30		, ETH Zurich
		Creating Synthetic and Real Data for Semantic Scene Understanding in Adverse Conditions Adverse conditions such as night, fog, rain and snow are hindering the deployment of autonomous cars, as they present a big challenge to the visual perception component of such systems due to the associated deterioration in the quality of the measured visual signals. In this talk, we will review techniques both for generating synthetic datasets and constructing real datasets pertaining to adverse conditions, which can be used for training as well as benchmarking semantic understanding algorithms on driving scenes. In particular, we will first present a class of methods for generating partially synthetic data corresponding to adverse conditions in a physically based fashion from real clear-weather counterparts by leveraging the respective underlying optical models. The synthesized data inherit the semantic annotations of their real counterparts and are thus straightforward to use for adapting semantic segmentation and object detection models to the respective adverse condition. In the second part of the talk, we will focus on the construction of real adverse-condition datasets and will introduce ACDC, the Adverse Conditions Dataset with Correspondences for semantic driving scene understanding. We will analyze the specialized two-stage annotation protocol that has been developed for creating the pixel-level semantic annotations of ACDC, which exploits privileged information and distinguishes between intra-image regions of clear and uncertain semantic content. Thus, ACDC supports both standard semantic segmentation and the new task of uncertainty-aware semantic segmentation and serves as a new benchmark for domain adaptation from normal to adverse conditions.
10:30-10:45		Coffee Break
10:45-11:30		, EPFL
		Automated Detection of Labeling Errors in Semantic Segmentation Datasets In applications like automated driving, the acquisition of semantic segmentation labels for datasets by human labor is time consuming, exhausting and therefore error prone. Labeling errors may affect the quality of deep learning models and the quality of benchmark results. In this talk, we present a method based on deep learning and uncertainty quantification, which detects labeling errors in semantic segmentation datasets. In order to enable method development in that field, we propose a benchmark based on a synthetic and a real dataset alongside with an evaluation protocol for methods that detect label errors in semantic segmentation datasets. Our method shows strong detection results on our benchmark while keeping the rate of falsely detected labeling errors under control. Furthermore, we apply our method to a number of popular semantic segmentation datasets, report results on the precision of our detection method for moderate sample sizes and present examples of labeling errors.
11:30-12:15		, Czech Technical University in Prague
		Road Anomaly Detection by Generative and Discriminative Road Appearance Modelling In this talk, I will discuss approaches to detecting unknown objects in the context of autonomous driving that leverages both generative and discriminative modeling principles. We formulate the problem of detecting the unknown objects as an anomaly detection task, assuming that the unknown stuff or anomalous object appearances cannot be learned. The main driving ideas behind the approaches are derived from the general properties of anomalous objects: (1) anomalies are not known, i.e. not from a class that we could model since once something novel is observed in training, it can be modeled and ceases to be an anomaly, and (2) anomalies are unique, i.e. "not the same" as non-anomalous objects in the image. We demonstrate the effectiveness of these approaches on several standard datasets, where it achieves state-of-the-art results.
12:15-14:00		Lunch break
14:00-14:20		, Gideon Brothers
		Scene Understanding for Autonomous Mobile Robots in Warehouse Operations
14:20-14:40		, Microblink
		Transformer Architectures in Production: Efficiency and Self-Supervision
14:40-15:00		, Xylon
		Implementing Computer Vision Algorithms on FPGAs: Challenges & Advantages
15:00-15:20		, RoMB technologies
		From the Road into the Warehouse: Semantic Segmentation for Autonomous Forklifts
15:20-15:35		Coffee Break
15:35-15:50		, University of Zagreb
		Cross-Domain Learning of Dense Prediction Models Cross-domain learning of dense prediction models Training semantic segmentation models on multiple datasets has sparked a lot of recent interest in the computer vision community. This interest is motivated by an expensive annotation process and a desire to achieve proficiency across multiple visual domains. However, established datasets have mutually incompatible labeling systems which do not promote principled inference in the wild. We develop a method for seamless learning on datasets with overlapping classes based on partial labels and probabilistic loss. We propose a principled approach for manual extraction of a universal taxonomy and a mapping function that connects our universal label space to the dataset-specific label spaces. We further propose an automatic method for constructing universal taxonomies based on the analysis of correlation matrices. Our method detects subset-superset relationships between dataset-specific labels, and supports learning of sub-class logits by treating super-classes as partial labels. Our method achieves competitive within-dataset and cross-dataset generalization, as well as the ability to learn visual concepts which are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections, MSeg dataset collection and WildDash 2 benchmark.
15:50-16:05		, University of Zagreb
		Dense Semantic Forecasting Accurate future anticipation is of vital importance for reliable and timely decision making in autonomous systems. Dense semantic forecasting anticipates future events in video by inferring pixel-level semantics of an unobserved future image. Feature-level forecasting methods are applicable to various single-frame architectures and tasks. Our feature-level approach consists of two modules. Feature-to-motion (F2M) module forecasts a dense deformation field that warps past features into their future positions. Feature-to-feature (F2F) module regresses the future features directly and is therefore able to account for emergent scenery. The compound F2MF model decouples the effects of motion from the effects of novelty in a task-agnostic manner. We aim to apply F2MF forecasting to the most subsampled and the most abstract representation of the desired single-frame model. Our design takes advantage of deformable convolutions and spatial correlation coefficients across neighboring time instants. We perform experiments on three dense prediction tasks: semantic segmentation, instance-level segmentation, and panoptic segmentation. The results reveal state-of-the-art forecasting accuracy across three dense prediction tasks.
16:05-16:20		, University of Zagreb
		Dense Anomaly Detection with Synthetic Negatives Standard machine learning is unable to accommodate inputs that do not belong to the training distribution. The resulting models often give rise to confident incorrect predictions which may lead to devastating consequences. This problem is especially demanding in the context of dense prediction since input images may be only partially anomalous. Previous work has addressed dense anomaly detection by training on mixed-content images. However, such approach may produce over-optimistic performance due to overlap between training and validation negatives. We extend this approach with synthetic negative patches generated by normalizing flow. The negative patches are simultaneously trained to achieve high inlier likelihood and uniform discriminative prediction. We also propose to detect anomalies according to a principled information-theoretic criterion which can be consistently applied through training and inference. The resulting models achieve outstanding results in spite of minimal computational overhead and refraining from auxiliary negative data.
16:20-16:35		, University of Zagreb
		Revisiting One-Way Consistency for Semi-Supervised Semantic Segmentation Semi-supervised learning is very important for practical deployments of deep models since it relaxes the dependence on labeled data. It is especially interesting in the scope of dense prediction because pixel-level annotation requires significant effort. We consider semi-supervised algorithms that enforce consistent predictions over perturbed unlabeled inputs. We study the advantages of perturbing only one of the two model instances and computing the gradient only in the perturbed instance. We also propose a competitive perturbation model as a composition of geometric warp and photometric jittering. We experiment on efficient models due to their importance for real-time and low-power applications. Our experiments show advantages of (1) one-way consistency, (2) perturbing only the student branch, and (3) the proposed perturbation model.
16:35-16:50		, University of Zagreb
		Detection of Road-Safety Attributes in Video Efficient assessment of road infrastructure safety is essential for reducing traffic-related fatalities. We address the automatic recognition of road safety attributes defined by the iRAP road safety standard. The task is formulated as a multi-class classification of each iRAP attribute in video clips that correspond to 10-meter road segments. Our solution has two parts. The first is an efficient convolutional multi-task model with shared features, which recognizes all attributes from multiple frames with a single forward pass and learns in an end-to-end fashion. The second is a sequential model that works on the outputs of the convolutional model and considers a larger context to learn the specific temporal behavior and annotation conventions of each individual attribute. We perform experiments on a real-world dataset acquired along 2,300 km of public roads in Bosnia and Herzegovina, manually annotated for all attributes by trained experts. A majority of the attributes in our dataset suffer from severe class imbalance, which we address with dynamically weighted losses based on class frequencies and recall during training, and by using per-attribute macro-F1 scores for evaluation. We adapt our approach to the Honda Scenes Dataset for traffic scene classification and achieve improvements over previous approaches across all tasks.