[2104.11747] Learnable Online Graph Representations for 3D Multi-Object Tracking

Significance

Adding track nodes to the detection graph

Keypoints

• Propose a GNN-based method for online 3D multi-object tracking
• Demonstrate performance of the proposed method by experiments

Review

Background

Online multi-object tracking (MOT) in 3D is a fundamental problem in autonomous systems. Although learning-based methods for 2D MOT have been successful with the development of graph neural networks (GNNs), their extension to 3D MOT in an online setting has received less attention. This paper proposes a GNN-based method for the online 3D MOT task.

Keypoints

Propose a GNN-based method for online 3D multi-object tracking
Graph definition

One of the most important points of the proposed method is that the graph contains two types of nodes and edges: those for tracks and those for detections. Conventional methods that cast object tracking as a graph representation learning problem usually define detected objects as vertices and the tracks between frames as edges. Here, track nodes are introduced into the detection graph as separate nodes, and the model jointly learns to track the detection nodes during the message passing steps of the GNN.

Figure: Scheme of track nodes and detection nodes

The graph is then defined as the tuple $G=(V_{D},V_{T},E_{DD},E_{TD})$, whose elements are the detection nodes, track nodes, detection edges, and track edges, respectively.

Figure: Graph definition at timepoints $t-2$, $t-1$, $t$

The figure describes the graph definition. Although the detection nodes are drawn as a complete graph, with detection edges between all possible pairs of nodes, the graph does not have to be complete.
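To make the tuple concrete, the following is a minimal Python sketch of a container for $G$ and a builder that connects detections sparsely instead of completely. The class name `TrackingGraph`, the function `build_graph`, the input format, the `max_gap` parameter, and the rule that every track is connected to every detection are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TrackingGraph:
    """Illustrative container for G = (V_D, V_T, E_DD, E_TD)."""
    det_nodes: dict = field(default_factory=dict)   # V_D: detection id -> frame index
    track_nodes: set = field(default_factory=set)   # V_T: track ids
    det_edges: set = field(default_factory=set)     # E_DD: (earlier det id, later det id)
    track_edges: set = field(default_factory=set)   # E_TD: (track id, det id)


def build_graph(frames, track_ids, max_gap=1):
    """frames maps frame index -> list of detection ids (hypothetical input format)."""
    g = TrackingGraph()
    for t, det_ids in frames.items():
        for d in det_ids:
            g.det_nodes[d] = t
    # Detection edges: only between detections whose frames differ by 1..max_gap,
    # so the detection graph is sparse rather than complete.
    for a, ta in g.det_nodes.items():
        for b, tb in g.det_nodes.items():
            if 0 < tb - ta <= max_gap:
                g.det_edges.add((a, b))
    # Track edges: every track is a candidate owner of every detection (assumption).
    g.track_nodes.update(track_ids)
    for k in track_ids:
        for d in g.det_nodes:
            g.track_edges.add((k, d))
    return g


graph = build_graph({0: ["a0", "b0"], 1: ["a1"], 2: ["a2", "b2"]}, track_ids=["T1"])
print(len(graph.det_edges), len(graph.track_edges))  # 4 5
```

With `max_gap=1`, only detections in adjacent frames are linked, which gives 4 detection edges for the 5 detections above instead of the 8 cross-frame pairs a fully connected version would have.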

Problem formulation

The three tasks that should be solved in the 3D MOT problem are:

1. Assignment of detections to existing tracks.
2. Linking of detections across timesteps.
3. Classification of false positive detections.

The authors reformulate these tasks as one joint classification problem on the tracking graph $G=(V_{D},V_{T},E_{DD},E_{TD})$: each element of the tuple except the track nodes $V_{T}$ is classified as active or inactive. If the active graph elements are found correctly, the objects (detection nodes) and their tracks across timepoints (detection edges and track edges) provide accurate tracking results.
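As a rough illustration of what the joint classification targets could look like, the sketch below derives binary labels for detection nodes, detection edges, and track edges from ground-truth object identities. The function `make_labels`, its arguments, and the labeling rules (an edge is active when both endpoints belong to the same real object) are assumptions for this example; the paper's exact label definition may differ.

```python
def make_labels(det_identity, det_edges, track_identity, track_edges):
    """Build 0/1 targets for V_D, E_DD and E_TD.

    det_identity:   detection id -> object id, or None for a false positive
    track_identity: track id -> object id
    """
    # Detection node is active if it corresponds to a real object.
    node_labels = {d: int(obj is not None) for d, obj in det_identity.items()}
    # Detection edge is active if both detections are the same real object.
    edge_labels = {
        (a, b): int(det_identity[a] is not None and det_identity[a] == det_identity[b])
        for a, b in det_edges
    }
    # Track edge is active if the detection belongs to the object the track follows.
    track_labels = {
        (k, d): int(det_identity[d] is not None and track_identity[k] == det_identity[d])
        for k, d in track_edges
    }
    return node_labels, edge_labels, track_labels


det_identity = {"a0": "car1", "a1": "car1", "b1": None}  # b1 is a false positive
det_edges = {("a0", "a1"), ("a0", "b1")}
track_identity = {"T1": "car1"}
track_edges = {("T1", "a1"), ("T1", "b1")}

node_labels, edge_labels, track_labels = make_labels(
    det_identity, det_edges, track_identity, track_edges
)
print(node_labels)
```

All three tasks from the list above then reduce to the same kind of binary target, which is what allows a single network to be trained on them jointly.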

Neural message passing for object tracking

Neural message passing with a GNN is performed on the tracking graph $G$. Each iteration updates the features in two steps: (i) generating messages (updating the representation of each edge) and (ii) aggregating the generated messages (updating the representation of each node). To classify the nodes and edges, multi-layer perceptrons are applied to the generated messages on $E_{DD}$ and $E_{TD}$, and to the aggregated messages at $V_{D}$.
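The two-step update can be sketched generically with NumPy. This is a plain message passing layer, not the paper's architecture: it treats detection and track nodes uniformly, uses one-layer networks with random weights as stand-ins for trained MLPs, and aggregates by summation. The names `mlp`, `message_passing_step`, and the feature size `D` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def mlp(in_dim, out_dim):
    """One-layer ReLU network with random weights (stand-in for a trained MLP)."""
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda x: np.maximum(x @ w + b, 0.0)


def message_passing_step(h, e, edges, edge_fn, node_fn):
    """h: (N, D) node features, e: (M, D) edge features, edges: (M, 2) node index pairs."""
    src, dst = edges[:, 0], edges[:, 1]
    # (i) Generate messages: update each edge from its two endpoints and itself.
    e_new = edge_fn(np.concatenate([h[src], h[dst], e], axis=1))
    # (ii) Aggregate messages: sum the messages of all edges incident to each node.
    agg = np.zeros_like(h)
    np.add.at(agg, src, e_new)
    np.add.at(agg, dst, e_new)
    h_new = node_fn(np.concatenate([h, agg], axis=1))
    return h_new, e_new


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


D = 8
h = rng.normal(size=(4, D))                   # 4 nodes
edges = np.array([[0, 2], [1, 2], [2, 3]])    # 3 edges
e = rng.normal(size=(len(edges), D))
edge_fn, node_fn = mlp(3 * D, D), mlp(2 * D, D)

for _ in range(3):                            # a few message passing iterations
    h, e = message_passing_step(h, e, edges, edge_fn, node_fn)

# Classification heads: a linear layer plus sigmoid on edge and node features.
w_edge, w_node = rng.normal(size=D), rng.normal(size=D)
edge_scores = sigmoid(e @ w_edge)             # one "active" probability per edge
node_scores = sigmoid(h @ w_node)             # one "active" probability per node
print(edge_scores.shape, node_scores.shape)
```

`np.add.at` is used instead of `agg[src] += e_new` because it accumulates correctly when a node index appears more than once.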

Once the nodes and edges are classified as active or inactive, the tracks can be updated from the classification result.

Figure: Examples of update scenarios

Example (a) in the figure shows an ideal case where the inferred active nodes and edges match a single definite solution. Example (b) shows a case where the detection node at the first timestep is not linked by a detection edge, but is instead linked by the track edge at the first timestep. Example (c) shows a case where multiple detection edges are active between the second and third timesteps, but the global solution can still be found from the active track edge. For a more formal definition of the neural message passing step and the track update, see the original paper.
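The idea that track edges and detection edges can compensate for each other can be illustrated with a simple greedy rule. The sketch below is a simplification written for this note, not the update procedure of the paper: `update_tracks`, the threshold `thr`, and the priority order (track edge first, detection edge as fallback, otherwise a new track) are all assumptions.

```python
def update_tracks(new_dets, node_score, track_edge_score, det_edge_score,
                  det_to_track, next_track_id, thr=0.5):
    """Greedily assign each active new detection to a track id."""
    assigned = dict(det_to_track)
    used = set()  # tracks already given a detection in the current frame
    for d in new_dets:
        if node_score[d] < thr:
            continue  # detection node classified as inactive (false positive)
        # 1) Prefer an active track edge pointing at this detection.
        cands = [(s, k) for (k, dd), s in track_edge_score.items()
                 if dd == d and s >= thr and k not in used]
        # 2) Otherwise fall back to an active detection edge from an earlier,
        #    already tracked detection.
        if not cands:
            cands = [(s, assigned[p]) for (p, dd), s in det_edge_score.items()
                     if dd == d and s >= thr and p in assigned
                     and assigned[p] not in used]
        if cands:
            track = max(cands, key=lambda c: c[0])[1]
        else:
            track = next_track_id  # 3) No active link: start a new track.
            next_track_id += 1
        assigned[d] = track
        used.add(track)
    return assigned, next_track_id


det_to_track = {"a0": 0, "b0": 1}
node_score = {"a1": 0.9, "b1": 0.8, "c1": 0.2}
track_edge_score = {(0, "a1"): 0.9, (1, "a1"): 0.1, (0, "b1"): 0.2, (1, "b1"): 0.3}
det_edge_score = {("b0", "b1"): 0.7, ("a0", "b1"): 0.1}

assigned, next_id = update_tracks(["a1", "b1", "c1"], node_score,
                                  track_edge_score, det_edge_score,
                                  det_to_track, next_track_id=2)
print(assigned, next_id)
```

Here `a1` is matched through its track edge, `b1` has no active track edge and is recovered through the detection edge from `b0`, and `c1` is dropped as a false positive.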

Model training

The authors propose to employ data augmentation techniques that introduce stochasticity during training. After training the model with strong data augmentation on the offline dataset, they use the trained model to generate track samples for the same detections in the dataset. This two-stage training improves performance in the experiments.
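This note does not list which perturbations are applied, so the following sketch only illustrates the general idea of stochastic augmentation on detections. It assumes two common choices, randomly dropping detections and adding Gaussian noise to their positions; the function `augment_detections`, the dictionary layout, and the parameter values are hypothetical and not taken from the paper.

```python
import random


def augment_detections(detections, drop_prob=0.1, pos_sigma=0.2, seed=None):
    """Return a perturbed copy of a list of detections.

    Each detection is a dict with a "center" entry (x, y, z); this layout is
    an assumption made for the example.
    """
    rng = random.Random(seed)
    augmented = []
    for det in detections:
        # Randomly drop detections to simulate missed objects.
        if rng.random() < drop_prob:
            continue
        # Jitter the horizontal position to simulate localization noise.
        x, y, z = det["center"]
        noisy_center = (x + rng.gauss(0.0, pos_sigma), y + rng.gauss(0.0, pos_sigma), z)
        augmented.append({**det, "center": noisy_center})
    return augmented


dets = [{"id": i, "center": (float(i), 0.0, 1.0)} for i in range(5)]
print(len(augment_detections(dets, drop_prob=0.3, seed=0)))
```

The input list is left unchanged, so the same detections can be augmented differently in every epoch.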

Demonstrate performance of the proposed method by experiments

The experiments are performed on the nuScenes dataset with LiDAR detections only. CenterPoint and MEGVII are used as the baseline detectors for the proposed method. Comparison with AB3DMOT, StanfordIPRL, GNN3DMOT, and CenterPoint shows that the proposed method slightly outperforms previous methods in AMOTA score.

Performance of the proposed method on the nuScenes test set

The ablation study demonstrates the importance of each component of the proposed method.

Ablation study result. ‘Online’ refers to two-stage training.