User:Jongmoochoi/Simultaneous learning and perception

(Draft)

Introduction
We introduce a new machine learning concept, Simultaneous Learning And Perception (SLAP), in which a perception model continuously updates its internal representation while observing the environment, without external communication. SLAP is a specific form of self-supervised learning. It focuses on autonomous systems in which the learning process and the sensor interact while the system operates. An approach qualifies as SLAP if the parameters of a neural network used for a perception task in an autonomous system are continuously updated while the input data is processed, without any external supervision.

Why SLAP? A machine learning-based perception model, such as an object detector using a camera sensor, can produce noisy and uncertain measurements. Although recent deep learning approaches have shown human-level performance in applications such as the game of Go, it remains very difficult to provide reliable and dependable perception measurements under varied conditions, partly because the training data of a deep learning-based object detector would have to contain all possible scenes and objects. It is impractical to assume that a fixed-size training dataset contains all future unseen objects. Today, many autonomous systems, including autonomous driving vehicles, have been deployed with such imperfect machine learning-based object detectors. To mitigate this, autonomous driving vehicles use, for example, more expensive but more reliable sensors (e.g., LiDAR), multiple cameras (e.g., a stereo camera system), or sensor fusion with other sensors (e.g., radar and camera). SLAP aims to address the following problem: how to continuously improve the accuracy of a perception model after it has been deployed on an autonomous system.

Related work
SLAP relates to several existing learning paradigms.

SLAP is a semi-supervised learning method. Semi-supervised learning uses both labeled and unlabeled data. On the one hand, SLAP can use a supervision method during updating, or generated labeled data (pseudo ground truth), which distinguishes it from unsupervised learning, which does not use any labeled data. On the other hand, SLAP updates the internal representation with unlabeled input data.

SLAP is also an incremental learning method because it continuously updates the internal model; its particular focus is on leveraging an unlabeled data stream.

Since an autonomous driving car can easily collect raw images (or video) while driving, one can try to use existing unsupervised or semi-supervised methods to leverage large amounts of unlabeled data to improve the accuracy of a detection model. However, to our knowledge, a successful solution has not yet been developed.

Unsupervised learning. Unsupervised learning trains a model without using labeled data [19]. A detector trained by an unsupervised learning method usually does not perform well. This can be explained partly by the fact that, in a driving scene, the intra-class variation of an object can be higher than the inter-class variation, owing to different shapes and textures, pose changes, and illumination changes. Hence, an unsupervised learning model is typically used as a pre-training step or as a feature extraction step, followed by additional classification methods trained with supervised learning (labeled data).

Pseudo labels. Semi-supervised learning with pseudo labels is proposed in [1]. In [10], a set of weak annotators is used to generate pseudo labels. It is not clear whether the aggregation provides confident labels, because the weak annotators are trained on the same data as the pretrained detector. A combination of pseudo labels and fine-tuning is presented in [11], where the method uses provided noisy labels. These pseudo-label techniques are limited because they do not add any new information: the model learns only from labels that it is already able to generate.

External information. To operate state-of-the-art autonomous driving vehicles with imperfect machine learning models, most companies and organizations keep collecting data, generate training data with human annotators, retrain deep learning models, and update the deployed models using Over-the-Air (OTA) technology. This can improve the accuracy of a detection model, but the approach has several limitations: it depends on wireless communication, which may not be available (e.g., on Mars); the update frequency is limited; and the actual driving scenes of an individual car may differ from the training data used in the offline, remote training process.

Augmented Pseudo Labels
A SLAP system takes an image as input, runs the object detector, generates pseudo labels, retrains the detector using the pseudo labels, returns the detection results from the updated detector, and repeats the whole process.
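
The loop described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the proposed implementation: the neural detector is replaced by a hypothetical nearest-prototype classifier (`PrototypeDetector`), each "image" is a list of scalar region features, and the confidence threshold `min_conf` and step size `lr` are values we chose for the example.

```python
from dataclasses import dataclass


@dataclass
class PrototypeDetector:
    """Toy stand-in for a neural detector: one scalar prototype per class.

    Hypothetical model for illustration; at least two classes are required.
    """
    prototypes: dict   # class name -> prototype feature value
    lr: float = 0.5    # step size of the on-device update (assumed)

    def detect(self, regions):
        """Return (region_index, class, confidence) for every region."""
        results = []
        for i, x in enumerate(regions):
            dists = sorted((abs(x - p), c) for c, p in self.prototypes.items())
            (d0, best), (d1, _) = dists[0], dists[1]
            # Confidence is high when the nearest prototype is much closer
            # than the runner-up.
            conf = 1.0 - d0 / (d0 + d1) if d0 + d1 > 0 else 0.5
            results.append((i, best, conf))
        return results

    def retrain(self, regions, pseudo_labels):
        """Move each class prototype toward its pseudo-labeled regions."""
        for i, cls, _ in pseudo_labels:
            p = self.prototypes[cls]
            self.prototypes[cls] = p + self.lr * (regions[i] - p)


def slap_step(detector, regions, min_conf=0.8):
    """One SLAP iteration: detect, pseudo-label, retrain, detect again."""
    first_pass = detector.detect(regions)
    pseudo_labels = [d for d in first_pass if d[2] >= min_conf]
    detector.retrain(regions, pseudo_labels)
    return detector.detect(regions)   # results from the updated detector


def slap_loop(detector, stream, min_conf=0.8):
    """Repeat the SLAP step for every frame of an unlabeled stream."""
    for regions in stream:
        yield slap_step(detector, regions, min_conf)
```

For example, starting from `PrototypeDetector({"background": 0.0, "car": 1.0})` and feeding frames such as `[0.1, 1.2]`, the "car" prototype drifts toward the values actually observed in the stream, with no labels and no external communication involved.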

We present Augmented Pseudo Label (APL), a method for training an object detection model that has already been deployed on a system, using only unlabeled raw image data. After collecting one or more unlabeled images, APL uses the pretrained (or current) detection model to localize all target objects in the input images and automatically generates annotations, as in the prior Pseudo Label technique. It then synthesizes domain-specific augmented images that reflect the motion context, such as motion blur, lighting changes, and different weather conditions, and assembles a new labeled dataset. Finally, it retrains the current model with the APL data, optionally together with previously generated annotated data. APL can add new information to unlabeled data by combining the inference power of the prior Pseudo Label technique with identity-preserving data augmentation. Figure 1 shows the basic concept of the proposed approach. It should be a considerable step toward self-learning for autonomous systems. We give an overview of the proposed system and then describe its main components in detail.

1) Overview

A visual object detector finds the locations of all target objects in the image space. For simplicity, we represent an object by a rectangular bounding box that marks its boundary (e.g., of a car or a person). Given a number of predefined target object classes, a visual object detector classifies every possible rectangular region in the input image into one of the object classes. One of the classes can represent the background of the scene. The detector can thus be represented as a function from the set of all regions to the estimated classes.
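
To make the "detector as a function from regions to classes" view concrete, the following Python sketch enumerates candidate regions with a sliding window and applies a per-region classifier. Everything here is an illustrative assumption: the image is a small grayscale grid, the window size and stride are arbitrary, and `brightness_classifier` is a hypothetical rule that stands in for a trained network.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[float]]              # grayscale image as rows of intensities
RegionClassifier = Callable[[Image, Box], str]   # region -> estimated class


def enumerate_regions(width: int, height: int, size: int, stride: int) -> List[Box]:
    """Slide a square window over the image to list candidate regions."""
    return [
        (x, y, x + size, y + size)
        for y in range(0, height - size + 1, stride)
        for x in range(0, width - size + 1, stride)
    ]


def brightness_classifier(image: Image, box: Box) -> str:
    """Hypothetical rule: bright regions are 'car', the rest 'background'."""
    x0, y0, x1, y1 = box
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return "car" if sum(pixels) / len(pixels) > 0.5 else "background"


def detect(image: Image, classify: RegionClassifier,
           size: int = 2, stride: int = 2) -> List[Tuple[Box, str]]:
    """Classify every candidate region and keep the non-background ones."""
    height, width = len(image), len(image[0])
    detections = []
    for box in enumerate_regions(width, height, size, stride):
        cls = classify(image, box)
        if cls != "background":
            detections.append((box, cls))
    return detections
```

A practical detector does not classify every rectangle exhaustively, but the input and output types are the same: regions in, class estimates out.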

Given an input image, an implemented detector produces a set of pairs, each consisting of a detection bounding box and its corresponding class. We can measure the performance of the detector by comparing this output with the ground truth labels. To train the detector, we need a set of labeled data, which is generally annotated by human operators. A training procedure can likewise be represented as a function from the initial detector to an updated detector.
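
One common way to compare detector output with ground truth, which the text above does not specify, is to match boxes by intersection over union (IoU) and report precision and recall. The sketch below assumes this convention, an IoU threshold of 0.5, and a simple greedy matching; all three are our assumptions for illustration.

```python
def area(box):
    """Area of an (x_min, y_min, x_max, y_max) box."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, iou_threshold=0.5):
    """Greedily match (box, class) detections to ground-truth pairs.

    A detection is a true positive if it overlaps a not-yet-matched
    ground-truth box of the same class with IoU >= iou_threshold.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for box, cls in detections:
        candidates = [g for g in unmatched if g[1] == cls]
        best = max(candidates, key=lambda g: iou(box, g[0]), default=None)
        if best is not None and iou(box, best[0]) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

In a SLAP setting such ground truth is not available on the deployed system, so a measurement like this would only be possible offline, on a held-out annotated set.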

The proposed Augmented Pseudo Labeling (APL) consists of (1) pseudo labeling with a pretrained model, (2) domain-specific augmentation, and (3) optional retraining with the generated APL data, as shown in Figure 2.
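
The three stages can be composed as in the following Python sketch. It is a minimal illustration under assumptions of our own: the detector is any callable that returns `(box, class, score)` triples, augmentations are callables from image to image, `retrain` is a caller-supplied function from a detector and a dataset to an updated detector, and the confidence threshold `min_conf` is not taken from the text.

```python
def augmented_pseudo_labeling(detector, images, augmentations,
                              retrain=None, min_conf=0.5):
    """Sketch of APL: pseudo labeling, augmentation, optional retraining.

    detector:      callable image -> list of (box, class, score)
    augmentations: callables image -> image that leave object positions intact
    retrain:       optional callable (detector, dataset) -> updated detector
    """
    dataset = []
    for image in images:
        # (1) Pseudo labeling with the pretrained (or current) model.
        labels = [(box, cls) for box, cls, score in detector(image)
                  if score >= min_conf]
        dataset.append((image, labels))
        # (2) Domain-specific augmentation: new pixels, identical labels.
        for augment in augmentations:
            dataset.append((augment(image), labels))
    # (3) Optional retraining with the generated APL data.
    if retrain is not None:
        detector = retrain(detector, dataset)
    return detector, dataset
```

Passing `retrain=None` only builds the APL dataset, which matches the optional nature of step (3); previously generated annotated data could be concatenated with `dataset` before retraining.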

The key idea is that APL uses a pretrained detector to localize all target objects in each video frame and automatically generates annotations, as in the prior Pseudo Label (PL) technique [1]. It then synthesizes augmented images that reflect the action context, such as motion blur, lighting changes, and different weather conditions, and retrains the network with these challenging images together with the identical (previously generated) annotations. In this way, APL can add new information to unlabeled data by combining the inference power of the prior Pseudo Label technique with identity-preserving data augmentation.
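
As an illustration of identity-preserving augmentation, the sketch below applies a horizontal motion blur and a lighting change to a grayscale image and reuses the pseudo labels unchanged. The kernel length, the gain values, and the function names are assumptions made for this example; weather effects are omitted. Because these transformations are photometric and use a short blur kernel, the objects stay approximately in place, which is why the same bounding boxes can be kept.

```python
def motion_blur(image, length=3):
    """Horizontal motion blur: average each pixel with its left neighbors.

    The leftmost pixel is replicated at the image border.
    """
    blurred = []
    for row in image:
        new_row = []
        for x in range(len(row)):
            window = [row[max(0, x - k)] for k in range(length)]
            new_row.append(sum(window) / length)
        blurred.append(new_row)
    return blurred


def change_lighting(image, gain=1.0, bias=0.0):
    """Scale and shift intensities, clipping the result to [0, 1]."""
    return [[min(1.0, max(0.0, gain * p + bias)) for p in row] for row in image]


def augment_sample(image, labels):
    """Return augmented copies of one pseudo-labeled image.

    Every variant keeps the identical (box, class) annotations.
    """
    variants = [
        motion_blur(image),
        change_lighting(image, gain=0.5),   # darker scene
        change_lighting(image, gain=1.5),   # brighter scene
    ]
    return [(variant, list(labels)) for variant in variants]
```

Each returned pair can be added to the retraining set next to the original pseudo-labeled frame.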

Authors
Jongmoo Choi (jongmoochoi@gmail.com)