[IEEE2019/PaperSummary] LDLS: 3D Object Segmentation through Label
Diffusion from 2D Images

abhigoku10
12 min read · Apr 15, 2020


Please note that this post is for my future self to look back and review the material in this paper without reading it all over again….

3D point cloud segmentation captures both the global geometric structure and the fine-grained details of a scene, and has numerous use cases across many applications. There are different ways of performing segmentation on point clouds, such as algorithms that operate directly on 3D points, 3D projection based algorithms, and 2D-3D projection based methods. Lidar point cloud segmentation can be categorized as follows:

1. Scene level: semantic segmentation

2. Object level: instance segmentation

3. Part level: part segmentation

In the current paper the authors propose a 2D-3D projection based method for object-level segmentation, i.e., instance/semantic segmentation.

Abstract

Annotation of 3D point clouds is a mundane and cumbersome task, and training an algorithm to perform 3D point cloud segmentation requires a huge dataset. To mitigate this, the authors propose a method that performs 2D instance/semantic segmentation on images, for which large open-source datasets are available, projects the segmented masks onto the 3D point cloud, and then builds a graph connecting neighboring points for a semi-supervised label diffusion process. The method is termed LDLS, and evaluation is performed on the KITTI benchmark dataset.

Introduction

Robots in a variety of applications require the ability to recognize objects of interest in their environment and distinguish them from the background using 3D sensors.

Examples range from autonomous cars detecting nearby pedestrians to an industrial robot identifying an object to be assembled.

In the current paper the authors propose a novel approach to 3D point cloud segmentation that leverages the success of CNNs on 2D segmentation and frames labeling as a semi-supervised learning problem on a graph. The steps are as follows:

  1. Applying an off-the-shelf object segmentation algorithm (Mask-RCNN [1]) to the 2D image in order to detect object classes and instances at the pixel-by-pixel level.
  2. Constructing a graph by connecting 2D pixels to 3D lidar points according to their 2D projected locations, as well as connecting lidar points that neighbor one another in 3D space.
  3. Using a label diffusion method [2] to propagate 2D segmentation labels through this graph, thereby labeling the 3D lidar points.

Related Work

In this section we look at some of the methods the authors reference.

  1. Deep Learning on Point Clouds: PointNet [3] defines a network architecture that operates directly on unstructured point clouds and extracts features that are invariant to point re-ordering, capturing both local and global point cloud information. Other methods extend convolutional neural networks to point clouds; since 3D points lack the grid structure of images, (a) one approach arranges the points into a 3D voxel grid and performs 3D convolution, (b) another uses panoramic projection, and (c) another a bird's-eye view.
  2. Graphical Models and 2D-3D Fusion: Wang et al. [4] propose a semantic segmentation method for image-aligned 3D point clouds that retrieves labeled reference images of similar appearance and then propagates their labels to the 3D points using a graphical model. Zhang et al. [5] train a neural network for 2D semantic segmentation and then project the results onto dense 3D data from a long-range laser scanner.

APPROACH

Fig. 1. The full segmentation pipeline, from the input point cloud and image to the final lidar point cloud segmentation.

In this section we explain the LDLS method in detail, step by step, as shown in Fig. 1.

(A) Problem Formulation:

  1. Object instance segmentation in 3D point clouds is formulated by considering all the input lidar points x_i, for i = 1 to N_points.
  2. Similarly, we consider all the input pixels in an image as p_i, for i = 1 to N_pixels.
  3. Semi-supervised learning assumes that a set of data points is available, of which a subset is labeled; graph-based semi-supervised learning annotates the remaining points by defining connections between data points and then diffusing labels along these connections.
  4. To frame lidar point cloud segmentation this way, the authors construct a graph by drawing connections from 2D pixels to 3D lidar points, as well as among the 3D points. The 2D pixels are labeled according to the results of 2D object segmentation of the RGB image, and the graph is then used to diffuse labels onto the 3D points, which are all initially unlabeled.

(B) Graph Construction:

The graph G used in the LDLS method consists of :

  1. Two types of nodes: 2D image pixels and 3D lidar points
  2. Two types of connections between nodes: from a 2D pixel to a 3D point, and between two 3D points

a) Initial Graph Node Labeling:

2D pixels are labeled using an image segmentation algorithm, which assigns every image pixel an instance label y (where y = 0 corresponds to the background) and associates each instance with a class label c.

The output will therefore be several distinct instance masks obtained from Mask-RCNN [1] (any other model can be used), each containing many pixels, as seen in Fig. 1(iii).

This instance-class association is deterministic for each image, simplifying the task of assigning instance labels to the lidar points within the camera's field of view.
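
As a rough illustration of this labeling step (not the authors' code), here is a minimal NumPy sketch that flattens a stack of Mask-RCNN-style instance masks into a per-pixel instance label map; the array name `masks` is an assumption for illustration.

```python
import numpy as np

# Minimal sketch (an illustration, not the authors' code): flatten
# Mask-RCNN-style instance masks into a per-pixel instance label map.
# `masks` is assumed to be an (M, H, W) boolean array, one mask per instance.
def label_pixels(masks):
    M, H, W = masks.shape
    labels = np.zeros((H, W), dtype=int)  # y = 0 is the background
    for m in range(M):
        labels[masks[m]] = m + 1          # instances are numbered 1..M
    return labels
```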

b) 2D-to-3D Connections:

Lidar points can be labeled by assigning each point the label of the 2D instance mask into which it projects, as shown in Fig. 1(iv). On its own this is prone to significant labeling errors around object boundaries, both because of calibration error between the sensors and because the 2D segmentation masks are unaware of depth. The pipeline should therefore combine 2D and 3D information and leverage both sources to produce the final 3D segmentation.

The authors combine 2D and 3D information into a graph for semi-supervised label diffusion by constructing a subgraph G_{2D→3D} connecting 2D pixels to 3D lidar points, represented by an (N_points × N_pixels) matrix defined in Eq. 1.

Eq. 1: Graph matrix for 2D-to-3D connections (reconstructed from the definitions in the text):

$$[G_{2D \rightarrow 3D}]_{ij} = \begin{cases} \lambda & \text{if } p_j = p(x_i) \\ 0 & \text{otherwise} \end{cases}$$

where p(x_i) is the pixel onto which lidar point x_i projects, and λ controls the amount of information that can flow from a pixel to a connected lidar point; the authors use the constant value λ = 0.001.
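
As a rough illustration (not the authors' code), here is a minimal SciPy sketch of Eq. 1; it assumes a precomputed array `pixel_of_point` holding the flattened index of the pixel each lidar point projects onto, with -1 marking points outside the camera's field of view.

```python
import numpy as np
from scipy import sparse

# Minimal sketch of Eq. 1 (an illustration, not the authors' code): each lidar
# point is connected with weight lambda to the single pixel it projects onto.
def build_g_2d_3d(pixel_of_point, n_pixels, lam=0.001):
    pixel_of_point = np.asarray(pixel_of_point)
    n_points = len(pixel_of_point)
    rows = np.flatnonzero(pixel_of_point >= 0)   # points inside the image
    cols = pixel_of_point[rows]                  # pixel index p(x_i)
    vals = np.full(len(rows), lam)
    return sparse.csr_matrix((vals, (rows, cols)),
                             shape=(n_points, n_pixels))
```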

c) 3D-to-3D Connections:

The authors construct connections between 3D points using an exponentially weighted K-nearest-neighbor (KNN) graph over the points, reflecting the underlying 3D geometry. This subgraph is denoted G_{3D→3D} and is represented by an (N_points × N_points) matrix defined in Eq. 2.

Eq. 2: Graph matrix for 3D-to-3D connections (reconstructed from the definitions in the text):

$$[G_{3D \rightarrow 3D}]_{ij} = \begin{cases} \exp(-\lVert x_i - x_j \rVert^2 / \sigma^2) & \text{if } x_j \in \mathrm{KNN}(x_i) \\ 0 & \text{otherwise} \end{cases}$$

The nonzero entries capture the similarity between points x_i and x_j. For a small value of K this subgraph is sparse, enabling fast computation during the diffusion step. The authors set K = 10 and σ = 1.
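
A minimal sketch of Eq. 2, using scikit-learn's neighbor search as one possible implementation (an assumption; the authors' code may build the graph differently):

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import NearestNeighbors

# Minimal sketch of Eq. 2 (an illustration, not the authors' code): an
# exponentially weighted KNN graph over the lidar points, K = 10, sigma = 1.
def build_g_3d_3d(points, k=10, sigma=1.0):
    dist, idx = NearestNeighbors(n_neighbors=k).fit(points).kneighbors(points)
    n = len(points)
    rows = np.repeat(np.arange(n), k)          # row i repeated for its k neighbors
    vals = np.exp(-dist.ravel() ** 2 / sigma ** 2)
    return sparse.csr_matrix((vals, (rows, idx.ravel())), shape=(n, n))
```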

d) Full Label Diffusion Graph: The full graph for label diffusion, combining the 2D-to-3D connections as well as the 3D-to-3D connections, is defined in Eq. 3.

Eq. 3: Full label diffusion graph matrix (reconstructed from the definitions in the text):

$$G = \begin{bmatrix} G_{3D \rightarrow 3D} & G_{2D \rightarrow 3D} \\ 0 & I \end{bmatrix}$$

where I is the (N_pixels × N_pixels) identity matrix and N = N_points + N_pixels, so G is (N × N). The authors then normalize each row of the matrix to sum to 1.
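
A minimal sketch of assembling Eq. 3 with SciPy's block-matrix helper, followed by the row normalization (again an illustration under the same assumptions as the sketches above):

```python
import numpy as np
from scipy import sparse

# Minimal sketch of Eq. 3 (an illustration, not the authors' code): stack the
# subgraphs into the full (N x N) matrix, with pixels acting as fixed source
# nodes via the identity block, then normalize each row to sum to 1.
def build_full_graph(g_3d_3d, g_2d_3d):
    n_points, n_pixels = g_2d_3d.shape
    zero = sparse.csr_matrix((n_pixels, n_points))
    g = sparse.bmat([[g_3d_3d, g_2d_3d],
                     [zero, sparse.identity(n_pixels, format="csr")]],
                    format="csr")
    row_sums = np.asarray(g.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0                # guard against isolated nodes
    return sparse.diags(1.0 / row_sums) @ g
```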

(C) Label Diffusion

The authors' intuition behind the diffusion process is that the 2D pixels act as source nodes that continuously push label information out to the 3D points, so that labels diffuse throughout the point cloud according to the connections between points.

For label diffusion, we consider a total of M + 1 object instances (including the background instance) detected by the 2D segmentation method. The N-dimensional label vector for each instance m is denoted z^(m).

Eq. 4: Instance label vector (reconstructed from the definitions in the text):

$$z^{(m)} = \begin{bmatrix} 0_{N_{points}} \\ \mu^{(m)} \end{bmatrix}, \qquad \mu_j^{(m)} = \begin{cases} 1 & \text{if pixel } p_j \text{ is labeled with instance } m \\ 0 & \text{otherwise} \end{cases}$$

Eq. 4 defines the entries of z^(m): those corresponding to 3D points are initialized to zero, while those corresponding to 2D pixels are set from the segmentation masks, represented by μ^(m).
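
A minimal sketch of the initialization in Eq. 4, stacking all M + 1 label vectors as columns of one matrix; `pixel_labels` is assumed to be the flattened per-pixel instance map from the 2D segmentation step.

```python
import numpy as np

# Minimal sketch of Eq. 4 (an illustration, not the authors' code): lidar-point
# entries start at zero; pixel entries are set from the 2D instance labels (mu).
def init_label_vectors(n_points, pixel_labels, n_instances):
    n = n_points + len(pixel_labels)
    z = np.zeros((n, n_instances))                 # column m holds z^(m)
    for m in range(n_instances):
        z[n_points:, m] = (pixel_labels == m)      # mu^(m), one-hot per pixel
    return z
```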

Eq. 5: Iterative label diffusion (reconstructed from the definitions in the text):

$$z^{(m)} \leftarrow G z^{(m)}$$

Eq. 5 represents the iterative computation that diffuses labels through the graph nodes for all M + 1 instances.

If point x_i is unlabeled but is connected to at least one pixel p_j labeled with instance m (i.e., the corresponding entry of G_{2D→3D} is greater than 0), then the iterative computation yields z_i^(m) > 0, indicating an increased likelihood that x_i will be labeled with instance m as a result of label diffusion from p_j.

Eq. 5 is applied iteratively until all z^(m) converge, or until a maximum number of iterations is reached (200, chosen by experimentation). Finally, the likelihood values are converted to lidar point labels according to Eq. 6.

Eq. 6: Point labels after convergence (reconstructed from the definitions in the text):

$$y_i = \arg\max_m z_i^{(m)}$$
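
A minimal sketch of the diffusion loop of Eq. 5 with the final argmax of Eq. 6; diffusing all M + 1 instances at once as columns of one matrix is an implementation convenience here, not necessarily the authors' exact scheme.

```python
import numpy as np

# Minimal sketch of Eqs. 5 and 6 (an illustration, not the authors' code):
# repeatedly multiply the label matrix by the row-normalized graph until the
# values stop changing or 200 iterations are reached, then take the argmax.
def diffuse_labels(g, z, n_points, max_iters=200, tol=1e-6):
    for _ in range(max_iters):
        z_next = g @ z
        converged = np.abs(z_next - z).max() < tol
        z = z_next
        if converged:
            break
    return np.argmax(z[:n_points], axis=1)   # Eq. 6: instance label per point
```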

Label diffusion results can contain disjoint sections of object segmentations; most often this occurs when projection or mask boundary errors cause a large number of contiguous background lidar points to be projected inside a 2D segmentation mask.

The authors introduce an outlier removal step based on connected components to clear errors caused by label diffusion. Define G^(m) as the subgraph of G_{3D→3D} consisting of the lidar points labeled as object m, i.e., x_i is a node of G^(m) if and only if y_i = m. Then let C(G^(m)) be the largest connected component of G^(m), treated as an undirected graph. The lidar point labels are updated using Eq. 7, and the final output consists of all points labeled either as background or as an object instance.

Eq. 7: Outlier removal (reconstructed from the definitions in the text): for each point x_i with y_i = m,

$$y_i \leftarrow \begin{cases} m & \text{if } x_i \in C(G^{(m)}) \\ 0 & \text{otherwise} \end{cases}$$
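
A minimal sketch of Eq. 7 using SciPy's connected-components routine on the 3D neighbor graph from Eq. 2 (an illustration, not the authors' code):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# Minimal sketch of Eq. 7 (an illustration, not the authors' code): for each
# object instance, keep only the largest connected component of its points in
# G_3D->3D and relabel the remaining points as background (0).
def remove_outliers(labels, g_3d_3d, n_instances):
    cleaned = labels.copy()
    for m in range(1, n_instances):            # m = 0 is the background
        idx = np.flatnonzero(labels == m)
        if idx.size == 0:
            continue
        sub = g_3d_3d[idx][:, idx]             # subgraph G^(m)
        _, comp = connected_components(sub, directed=False)
        largest = np.argmax(np.bincount(comp))
        cleaned[idx[comp != largest]] = 0      # outliers become background
    return cleaned
```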

The overview of the LDLS algorithm is shown in Fig. 2.

Fig2: LDLS algorithm

RESULTS AND EVALUATION

A. Quantitative Evaluation on the KITTI Data Set:

The LDLS algorithm is evaluated on the KITTI dataset for the car and pedestrian classes, which are the most plentiful. For 2D segmentation, the authors use a pre-trained Mask-RCNN [1] model. Table 1 shows the results in comparison with other architectures, namely SqueezeSeg, SqueezeSegV2, and PointSeg, after those were trained on an 8,057-point-cloud training set.

Table 1: COMPARISON OF SEMANTIC SEGMENTATION ACCURACY

The Mask-RCNN [1] model used in LDLS does not distinguish between the two classes pedestrian and person sitting, so both are treated as pedestrian for evaluation.

On the manually labeled data, the difference is far more pronounced: the LDLS method achieves a 27.0% increase in IoU for car segmentation and a 116.2% increase for pedestrian segmentation compared to the next-best method.

One reason the authors give for the performance difference between LDLS and SqueezeSeg/PointSeg is that the latter methods were trained on data that includes annotation errors; accordingly, the gap grows wider when evaluating on error-free annotations.

Instance Segmentation Evaluation:

The authors present an instance segmentation evaluation by applying LDLS to manually annotated KITTI ground truth data, and additionally study the effect of object range on accuracy; the results are shown in Table 2.

Table 2: INSTANCE SEGMENTATION PERFORMANCE ON MANUAL ANNOTATIONS.

Fig. 3: Effect of range on semantic and instance segmentation precision and recall. Instance segmentation metrics use IoU = 0.70.

According to the authors, label diffusion assumes that neighboring points are more likely to share a class label, but this assumption weakens at greater distances from the sensor, where lidar data becomes sparser.

The authors hypothesize that LDLS should be more reliable for objects that are closer to the sensor and visible with a higher density of points. This hypothesis is tested by performing evaluations at different ranges; the results are plotted in Fig. 3.

As range from the sensor increases, instance segmentation performance decreases significantly.

Fig. 4: Scatter plot showing object range versus segmentation IoU. Each point is a pedestrian or car instance. Zero-IoU points indicate false negatives.

Fig. 4 plots each object instance in the test set on a scatter plot of distance to the object's centroid against instance segmentation IoU. As objects become more distant, a wider range of IoU results appears.

The authors' experiments indicate that LDLS generally segments object instances more reliably at closer distances, with performance falling off as range increases.

This suggests that an all-purpose robotic perception system may be best served by using a vision-based bounding box object detector at far ranges, with LDLS applied at close ranges to allow a robot to precisely sense and interact with its immediate surroundings.

Ablation Study:

Table 3: Ablation study

To demonstrate the benefits of the different components of the LDLS pipeline, the authors perform an ablation study, removing different components and comparing the results:

  1. Direct projection labeling: lidar points are naively labeled, without graph diffusion, based on whether they project to within a 2D segmentation mask in the image.
  2. Diffusion without outlier removal: the full pipeline is executed, except for the final outlier removal step.

Table 3 shows that both the diffusion and outlier removal steps improve
overall performance, with the former contributing a major performance gain. This finding confirms the value of label diffusion in fusing 2D and 3D information.

B. Qualitative Evaluation

The quantitative evaluation above covers large-scale annotated ground truth, but is limited to two object classes and data from a high-resolution lidar sensor. To qualitatively evaluate LDLS across different environments, classes, sensors, and data collection platforms, the authors consider two additional sets of results:

a) residential and urban sequences from KITTI [6], and

b) a sequence captured on the Cornell University campus using a ClearpathTM Jackal mobile ground robot with a Velodyne VLP-16 lidar sensor and RGB camera.

Fig. 5. Qualitative results from running LDLS on the KITTI Drive 91 sequences, as well as on data collected on the Cornell campus using a mobile robot

As Fig. 5 shows, in the KITTI data, new object classes are generally segmented with qualitatively comparable accuracy to cars and pedestrians, although narrower objects such as bicycles present a challenge.

In comparison, the campus data collection on the Jackal robot exhibits more segmentation errors, and performance breaks down more significantly for
objects at farther distances.

The authors hypothesize the following reasons for these differences:

  1. The VLP-16 sensor outputs only 16 laser scan lines, as opposed to the 64-scan lidar used in KITTI, making the lidar point clouds sparser and more difficult to segment, especially at further ranges.
  2. Errors from sensor calibration and time synchronization were higher on the Jackal compared to the KITTI dataset.

CONCLUSION by WRITER

Pros:

  1. LDLS is a simple projection based method for annotating large point cloud datasets.
  2. The Python implementation averages approximately 0.38 seconds per frame on an Nvidia GTX 1080 Ti, excluding the computation of Mask-RCNN [1] results.
  3. The Mask-RCNN model can be replaced with higher-accuracy 2D segmentation models.
  4. The LDLS method scales to different object classes.

Cons :

  1. High latency, due to the two modules of (a) 2D segmentation using a pre-trained model and (b) semi-supervised label diffusion, makes it unsuitable for real-time scenarios such as self-driving cars.
  2. Accuracy depends heavily on the density of lidar points in the captured point cloud.
  3. LDLS segmentation accuracy decreases as range from the sensor increases.
  4. The semi-supervised graph method scales linearly with the number of points belonging to the object class.

If you find any errors, please mail me at abhigoku10@gmail.com…*\(^o^)/*

GitHub repo: https://github.com/brian-h-wang/LDLS

References

[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017.

[2] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," Tech. Rep., 2002.

[3] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in CVPR, 2017.

[4] Y. Wang, R. Ji, and S.-F. Chang, "Label propagation from ImageNet to 3D point clouds," in CVPR, 2013.

[5] R. Zhang, G. Li, M. Li, and L. Wang, "Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning," ISPRS Journal of Photogrammetry and Remote Sensing, 2018.

[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research (IJRR), 2013.

[7] K. Lertniphonphan, S. Komorita, K. Tasaka, and H. Yanagihara, "2D to 3D label propagation for object detection in point cloud," in ICME Workshops, 2018.

[8] W. Wang, R. Yu, Q. Huang, and U. Neumann, "SGPN: Similarity group proposal network for 3D point cloud instance segmentation," in CVPR, 2018.
