# [CVPR2020/PaperSummary]Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud

*Please note that this post is for my future self to look back and review the materials on this paper without reading it all over again….*

Graph neural network is currently the hot cake topic and due to its robustness, it makes sense to apply on point cloud data. This paper using GNN to perform 3D Point cloud detection on the standard available dataset.

In this paper the author has used Graph Neural networks for 3D object detection on point cloud along with auto-registration mechanism to reduce a translation variance with box merging and scoring operation to combine detections from multiple vertices of points. The proposed architecture is tested on KITTI dataset and code made available publicly.

**Introduction**

A point cloud composes a set of points in space is a widely used format for 3D sensors such as LIDAR , detection of objects accurately from a point cloud is crucial in applications such as autonomous driving.

Applying Convolution Neural Network to detect objects has yield efficient results in the past which requires regular grids as input to the network, a regular grid generates uneven points in the grid cell, and by applying CNN on them leads to a loss in information and more computation.

Recent advances by algorithms like PointNet and Deepsets allow an unordered point as input however they have to sample and group points iteratively to create a point set representation which is computationally costly. The author has proposed a Graph Neural Network Point-GNN which takes the point graph as its input and outputs the category with bounding boxes. In short, the contributions of the paper are

1.A new object detection approach using GNN on point cloud i.e Point-GNN which is a single-stage detector

2.Point-GNN with auto-registration mechanism that detects multiple objects in a single shot

**2. Related Work**

The prior works can be grouped into three categories as shown in Figure1

*Point cloud in grids:* In this method, the point cloud is converted into regular grids by projection on point clouds to a 2D Bird's eye view image and then applying 2D CNN on the image or by representing 3D point clouds as voxels and then applying 3D CNN but by both methods, there is either loss of information or increase in computation cost.

*Point cloud in sets:* In this method, an neural network extracts features from an unordered set of points directly and each point is processed by a multi-layer-perceptron (MLP) to obtain a point feature vector. These methods though reduce the local irregularity of the point cloud but still, there is a mismatch between regular grids and the overall point cloud structure.

*Point cloud in graphs:* GNN is used to construct graphs for point clouds, they iteratively update their vertex features by aggregating features along the edges, the aggregation scheme is sometimes similar to sets but GNN has more complex features into consideration along its edges and vertexes. A few works have investigated the design of GNN for object detection where prediction of object shape is required.

**3. Point-GNN for 3D Object Detection in a Point Cloud**

The proposed architecture by the author contains three components as shown in Figure2.

- Graph construction
- GNN of T iterations
- Bounding box merging and scoring

*3.1. Graph Construction*

The author has formally defined the point cloud of N points as a set of P = {p1,p2…pN} where pi=(xi,si) is a point with both 3D coordinates xi ∈ R3 and the state value si ∈ Rk a k length vector that represents the point property. The state value si can be the reflected laser intensity or the features which encode the surrounding objects. Given a point cloud P, we construct a graph G = (P, E) by using P as the vertices and connecting a point to its neighbors within a fixed radius r, i.e.

The construction of such a graph is the well-known fixed radius near-neighbors search problem. By using a cell list to find the point pairs in the neighborhood we can solve the problem with a runtime complexity of O(cN) where c is the max number of neighbors with the radius. Since in practice the point clouds comprise of ten to thousands and constructing a graph is computationally expensive so the author has used voxels to downsample the point cloud density and then construct a graph for it.

The point clouds are encoded in the initial state value si of the vertex to preserve the original information and we search the raw points within a radius of r0 of each vertex and use the neural network on the set to extract their features. The author embeds the lidar reflection intensity and the relative coordinates using an MLP and then aggregates them by the Max function. The author used the resulting features as its initial state value of the vertex. The constructed graph is shown in the Figure 2 a

*3.2. Graph Neural Network with Auto-Registration*

A typical graph neural network refines the vertex features by aggregating features along the edges. In the (t+1)th iteration, it updates each vertex feature in the form as given by Eq2:

where et and vt are the edge and vertex features from the tth iteration. A function f t(.) computes the edge feature between two vertices. ρ(.) is a set function which aggregates the edge features for each vertex. gt(.) takes the aggregated edge features to update the vertex features. The graph neural network then outputs the vertex features or repeats the process in the next iteration. The author has included information about the object where the vertex belongs and refined the Eq2 to obtain Eq3.

The relative coordinates of the neighbors are used as an input to ft(.) for edge feature extraction, they induce translation invariance against the global shift of the point cloud but it is still sensitive to translation in the neighborhood area. To induce the translation variance the author proposes aligning the neighbors coordinates by the structural features instead of the center vertex coordinates which already have information about the structure from the previous iteration so that we can predict an alignment offset and propose an auto registration method given by Eq4.

∆xt i is the coordination offset for the vertices to register their coordinates. ht(.) calculates the offset using the center vertex state value from the previous iteration. By setting ht(.) to output zero, the GNN can disable the offset if necessary.

The author has modeled f t(.), gt(.), and ht(.) using multi-layer perceptrons (MLP) and add a residual connection in gt(.) as shown in Figure2b, for ρ(.) is Max for its robustness[3]. A single iteration in the proposed graph network is then given by Eq5.

where [, ] represents the concatenation operation. For every iteration, t uses a different set of MLPt which is not shared but after T iterations of the graph neural network the vertex state value to predict both categories and bounding boxes of the object where the vertex belongs.

A classification branch MLPcls computes a multi-class probability and localization branch MLPloc computes a bounding box for each class.

*3.3. Loss*

The classification branch computes a multiclass probability distribution {pc1,pc2…pcM} for each vertex . M is the total number of object classes including the background class, if the vertex is within the bounding box of an object the author assigned object class to the vertex and if its outside then it is considered as background class, and average cross-entropy is used for classification loss calculation given in Eq6.

where pi c and yci are the predicted probability and the one-hot class label for the i-th vertex respectively.

For the object bounding box, the author predicts the 7 degree-of-freedom format b = (x, y, z, l, h, w, θ), where (x, y, z) represent the center position of the bounding box,(l, h, w) represents the box length, height, and width respectively, and θ is the yaw angle. We encode the bounding box

with the vertex coordinates (xv, yv, zv) is given by Eq7

where lm, hm, wm, θ0, θm are constant scale factors

The localization branch predicts the encoded bounding box δb = (δx, δy, δz, δl, δh, δw, δθ) for each class. If a vertex is within a bounding box, the author computes the Huber loss between the ground truth and our prediction and outside any bounding boxes or it belongs to a class then its localization loss is zero. The localization loss is given by Eq8

To prevent overfitting the L1 regularization is added to each MLP

where α, β and γ are constant weights to balance each loss

*3.4. Box Merging and Scoring*

Since multiple vertices can be on the same object, the neural network can output multiple bounding boxes of the same object. Non Maximum Suppression (NMS) is used to obtain the bounding box of an object which has a maximum confidence score, to improve the localization accuracy the author has proposed to calculate the merged bounding box cluster considering the median position and size of the overlapped bounding box. The confidence score is obtained by the sum of classification scores weighted by IOU and an occlusion factor.

Given a box bi, let li, wi,hi be its length, width, and height, and let vil, viw, vih be the unit vectors that indicate their directions respectively. xj are the coordinates of point pj. The occlusion factor oi is given by Eq10:

The modified version of the standard algorithm of NMS is given below

**4. Experiments**

*4.1. Dataset*

The author has evaluated the architecture on KITTI dataset. The KITTI dataset contains 7481 training samples and 7518 testing samples, the dataset only annotates objects that are visible within the image, we process the point cloud only within the field of view of the image. The KITTI benchmark evaluates the average precision (AP) of three types of objects: Car, Pedestrian, and Cyclist.

*4.2. Implementation Details*

Number of iterations T =3

Max number of input edges to vertex(training) = 256

Auto registration = all GNN layers using 2 layer MLPh units(64,3)

Classification score size = MLPcls (64,#number of classes)

Localization size = MLPloc(64,64,7)

Class Car : (lm, hm, wm) = (3.88m, 1.5m, 1.63m)

Side view of car = θ ∈ [-π/4, π/4]

Front-view car =θ ∈ [π/4, 3π/4] as two different classes

Therefore θ0 = 0 and θ0 = π/2 respectively. The scale θm is set as π/2. Together with the Background class and DoNotCare class

Graph Construction :

graph radius with r = 4m and r0 = 1m.

Pˆ as a downsampled point cloud by a voxel size of 0.8 meters in training and 0.4 meters in inference.

M LPf and M LPg, are both of sizes (300, 300)

For the initial vertex state, M LP is of (32, 64, 128, 300) for embedding raw points and another M LP of (300, 300) after the M ax aggregation.

NMS threshold = 0.01 in NMS.

Class Pedestrian and Cyclist: (lm, hm, wm) = (0.88m, 1.77m, 0.65m) & (1.76m, 1.75m, 0.6m)

Side view of car = θ ∈ [-π/4, π/4]

Front-view car =θ ∈ [π/4, 3π/4] as two different classes

Therefore θ0 = 0 and θ0 = π/2 respectively. The scale θm is set as π/2. Together with the Background class and DoNotCare class

Graph Construction :

- graph radius with r = 1.6m.
- Pˆ as a downsampled point cloud by a voxel size of 0.4 meters in training and 0.2 meters in inference.
- M LPf and M LPg, are both of sizes (256, 256).
- For the initial vertex state, M LP is of (32, 64, 128,256,512) for embedding raw points and another M LP of (256, 256) after the M ax aggregation.
- NMS threshold = 0.2 in NMS

Training details

- Batchsize = 4
- The loss weights are α = 0.1, β = 10, γ = 5e — 7.
- stochastic gradient descent (SGD) with a stair-case learning-rate decay.
- For Car, we use an initial learning rate of 0.125 and a decay rate of 0.1 every 400K steps.
- For Pedestrian and Cyclist, we use an learning rate of 0.32 and a decay rate of 0.25 every 400K steps.

*4.3. Data Augmentation*

- Global rotation, global flipping, box translation and vertex jitter.
- During training, random rotation of the point cloud by yaw ∆θ ∼ N (0, π/8) and then flip the x-axis by a probability of 0.5. After that, each box and points within 110% size of the box randomly shift by (∆x ∼ N (0, 3), ∆y = 0, ∆z ∼ N (0, 3)).
- During the translation, we check and avoid collisions among boxes, or between background points and boxes.
- During graph construction, we use a random voxel downsample to induce vertex jitter

*4.3.1 Results*

The model is evaluated on the KITTI dataset .The KITTI dataset

evaluates the Average Precision (AP) on three difficulty levels: Easy, Moderate, and Hard and the results are as mentioned below where Table1 gives the average precision on 3D object detection and Table2 gives the average precision on BEV

4.4. Ablation Study

The ablation study is done for the car class by keeping all the training params constant for all the experiments and the results are shown in Table3

*Box merging and scoring: *For the test without box merging, the author has modified line 11 in Algorithm Fig1. Instead of taking the median bounding box, we directly take the bounding box with the highest classification score as in standard NMS.

As shown in Row 3 and Row 4 of Table 3, both box merging and box scoring outperform the baseline when combined as shown in Row 6 of the table, they further outperform the individual accuracy in every category. Similarly, when not using auto-registration, box merging and box scoring (Row 5) also achieve higher accuracy than standard NMS (Row1).

*Auto-registration mechanism:* As shown in Row 2, by using auto-registration alone, the author surpasses the baseline without auto-registration (Row 1)

on every category of 3D detection and the moderate and hard categories of BEV detection

Combining the auto-registration mechanism with box merging and scoring (Row 6), they achieve higher accuracy than using the auto-registration alone (Row 2).

The combination of all three modules (Row 6) does not outperform box merging and score (Row 5). The hypothesis is that the regularization likely needs to be tuned after adding the auto-registration branch.

The author extracts ∆x from different GNN iterations and adds them to the vertex position of Equation 4. Figure 3shows the vertices that output detection results and their positions with added offsets. Figure 3shows the vertices that output detection results and their positions with added offsets.

When the GNN gets deeper, the relative coordinates of the neighbor vertices depend less on the center vertex position but more on the property of the point cloud. The offset ∆x cancels the translation of the center vertex, and thus reduces the sensitivity to the vertex translation.

*Point-GNN iterations:* The author studies the impact of the number of iterations on the detection accuracy. They train Point-GNNs with T = 1, T = 2, and compare them with T = 3. Additionally, the detector is trained using the initial vertex state directly without any Point-GNN iteration.

The initial vertex state alone achieves the lowest accuracy since it only has a small receptive field around the vertex. the local information cannot flow along the graph edges, and therefore its receptive field cannot expand. Even with a single Point-GNN iteration T = 1, the accuracy improves significantly. T = 2 has higher accuracy than T = 3, which is likely due to the training difficulty when the neural network goes deeper.

*Running-time analysis:* The author implementation is written in Python

and uses Tensorflow for GPU computation. The inference is measured in time on a desktop with Xeon E5–1630 CPU and GTX 1070 GPU. The average processing time for one sample in the validation split is 643ms. Reading the dataset and running the calibration takes 11.0% time (70ms). Creating the graph representation consumes 18.9% time (121ms). The inference of the GNN takes 56.4% time (363ms). Box merging and scoring take 13.1% time (84ms).

*Robustness on LiDAR sparsity:*The KITTI dataset collects point cloud data using a 64-scanning-line LiDAR.To mimic a LiDAR system with fewer scanning lines, we downsample the scanning lines in the KITTI validation dataset. Because KITTI gives the point cloud without the scanning line information,

we use k-means to cluster the elevation angles of points into 64 clusters, where each cluster represents a LiDAR scanning line. We then downsample the point cloud to 32, 16, 8 scanning lines by skipping scanning lines in between.

As shown in Table 5. The accuracy for the moderate and hard levels drops fast with downsampled data, while the detection for the easy level data maintains a reasonable accuracy until it is downsampled to 8 scanning lines.

**5. Conclusion**

The author has presented a graph neural network, named PointGNN, to detect 3D objects from a graph representation of the point cloud with a proposed auto-registration mechanism that reduces transition variance, and the box merging and scoring operation which improves the detection accuracy.

**Blogger’s Conclusion**

*Pros*

- A novel work of 3D object detection using GNN
- Auto registration and refinement module implemented

*Cons*

- Accuracy drops considerably for Large Range and sparse point cloud .
- High latency time ie low fps.

*Reference*

If any errors found please mail me at abhigoku10@gmail.com…*\(^o^)/*