[CVPR2020/PaperSummary]RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds
Please note that this post is for my future self to look back and review the materials on this paper without reading it all over again….
3D Point cloud segmentation is a challenging task due to segmentation of objects which represent global structural property. More difficult task is to have a deep learning model which is computationally efficient and having low memory footprint to perform the segmentation. There are different methods of performing segmentation like 3D point based method , 3D projection based algorithm , 2D-3D projection based methods . Lidar point cloud based segmentation can be categorized into different categories as mentioned below :
1.Scene Level -Semantic Segmentation
2.Object Level-Instance Semantic Segmentation
3.Part Level- Part Segmentation
In this paper author has proposed point based method for scene level segmentation which has low memory footprint and highly computationaly efficient.
Abstract
In this paper , the author has proposed a lightweight neural architecture which can process large scale point clouds using RandLA-Net architecture which is 200x times faster to the SOTA architecture, since most of the existing architecture are using expensive sampling techniques and computationally expensive post and pre processing methods . RandLA-Net is benchmarked on semantic KITTI and semantic3D[6] dataset.
1.Introduction
An efficient semantic segmentation of large-scale 3D point clouds is a fundamental and essential capability for realtime intelligent systems, such as autonomous driving and augmented reality. The key challenge is the point cloud data are denser or sparse based on the type of objects and are highly unstructured its becomes quite a problem to appply CNN based solutions directly.
By the emergent of the poineer work done by PointNet[1] which directly process the 3D point cloud for learning features using shared multi layer perceptron . This method is computationally effictient but fails to capture context information of each point , many additional works were introduced to improve the results but failed . The reasons for the limtations of the network are mentioned below :
- The commonly used point-sampling methods of these networks are either computationally expensive or memory inefficient. For example, the widely employed farthest-point sampling [2] takes over 200 seconds to sample 10% of 1 million points.
- Most existing local feature learners usually rely on computationally expensive kernelisation or graph construction, thereby being unable to process massive number of points.
- For a large-scale point cloud, which usually consists of hundreds of objects, the existing local feature learners are either incapable of capturing complex structures, or do so inefficiently, due to their limited size of receptive fields.
In this paper, the author has desinged a memory and computationally efficient neural architecture, which is able to directly process large-scale 3D point clouds in a single pass, without requiring any pre/post-processing steps
such as voxelization, block partitioning or graph construction using RandLA net algorithm , the key contributions are listed below:
• Analysing and compare existing sampling approaches,identifying random sampling as the most suitable component for efficient learning on large-scale point clouds.
• Proposing an effective local feature aggregation module to automatically preserve complex local structures by progressively increasing the receptive field for each point.
• Demonstrating a significant memory and computational gains over baselines, and surpass the state-of-the-art semantic segmentation methods on multiple large-scale benchmarks.
Recent works have started to tackle the task of directly processing large-scale point clouds. SPG [3] preprocesses the large point clouds as super graphs before applying neural networks to learn per super-point semantics. Both FCPN [5] and PCT [4] combine voxelization and point-level networks to process massive point clouds.
The above mentioned methods have good accuracy but fail on real time applications since there pre and post processing times are computationally expensisve.
2. Related Work
The author has made a comparision with recent learning approaches which include projectionbased, voxel-based and point-based schemes.
(1)Projection and Voxel Based Networks: Due you to the success of 2D CNNs, many works project/flatten 3D point clouds onto 2D images to address the task of object detection. However, many geometric details are lost during the projection.Point clouds can be voxelized into 3D grids and then powerful 3D CNNs are applied .
The above methods give good results on semantic segmentation and object detection, their primary limitation is the heavy computation cost, especially when processing large-scale point clouds.
(2) Point Based Networks : Inspired by PointNet/PointNet++ [1, 2], many recent works introduced sophisticated neural modules to learn per-point local features.These modules can be generally classified as
- neighbouring feature pooling
- graph message passing
- kernel-based convolution
- attention-based aggregation
(3) Learning for Large-scale Point Clouds : SPG [3] preprocesses the large point clouds as super graphs to learn per super-point semantics. The recent FCPN [5] and PCT [4] apply both voxel-base and point-based networks to process the massive point cloud.The graph partitioning and voxelisation are still computationally expensive.
3. RandLA-Net Architecture
The RandLA-Net is distinguished in three ways:
1) It only relies on random sampling (RS) within the network, thereby requiring much less memory and computation.
2) The proposed local feature aggregator(LFA/LA) can obtain successively larger receptive fields by explicitly considering the local spatial relationship and point features, thus being more effective and robust for learning complex local patterns.
3) The entire network only consists of shared MLPs without relying on any expensive operations such as graph construction and kernelisation, therefore being superbly efficient for large-scale point clouds.
3.1. Overview
In RandLA-Net network the author has proposed to use the simple and fast approach of random sampling to greatly decrease point density, by applying a carefully designed local feature aggregator to retain prominent features. This allows the entire network to achieve an excellent trade-off between efficiency and effectiveness.
3.2. The quest for efficient sampling
The sampling approaches can be classified into heuristic and learning based approaches.
(1) Heuristic Sampling
- Farthest Point Sampling (FPS): In this method each time the sample is selected, the point farthest from the k-1 points obtained by the previous sampling is selected. FPS can better ensure that the sampled points have better coverage, so it is widely used in the field of point cloud segmentation (eg, PointNet ++, PointCNN, PointConv, PointWeb). FPS is is computationally complexity of O(N²) and takes upto 200sec to process on single GPU which show that its not suitable for large scale point clouds.
- Inverse Density Importance Sampling (IDIS): In this method reordering of each point according to its density, keeping the points with lower density as much as possible. The computational complexity of IDIS is approximately of O(N) and takes upto 10sec to process large scale point clouds, IDIS is also more sensitive to outliers .
- Random Sampling (RS): In this method , downsampling uniformly selects K points from the input N points, each point has the same probability of being selected. The computational complexity of RS is O(1) which is agnostic to the total number of input points, i.e., it is constant-time and hence inherently scalable.It only takes 0.004s to process 10⁶ points.
(2) Learning-based Sampling
- Generator-based Sampling (GS): GS method approximates the original point cloud by learning to generate a subset. GS is a task-oriented, downsampling learnable method of data-driven, but the problem is that a subset of inference phase requires the generated match the original point cloud, this step depends on the FPS matching , and thus the introduction of More additional calculations. It takes up to 1200 seconds to downsample 10% of 10⁶ points
- Continuous Relaxation based Sampling (CRS): CRS uses reparameterization trick to relax non-differentiable downsampling operations to the continuous domain to make end-to-end training possible. Each sampling point obtained after CRS sampling is actually a weighted sum of the entire point cloud. CRS downsamples a large scene point cloud with millions of points to 10% of its original size requires up to 300GB of GPU memory.
- Policy Gradient based Sampling (PGS): PGS represents the downsampling operation as a Markov decision process, which aims to learn an effective downsampling strategy. The method sequentially learns a probability for each point to decide whether to retain it. However, when the input is a large scene point cloud, the entire network has a huge search space (exploration space). Through further experiments, it is found that when PGS is applied to a large point cloud, the network is very difficult to converge.
The author concludes the below points for the above sampling methods
- FPS, IDIS and GS are too computationally expensive to be applied for large-scale point clouds.
- CRS approaches have an excessive memory footprint and PGS is
hard to learn. - RS has the following two advantages (a) it is remarkably computational
efficient as it is agnostic to the total number of input points (b) it does not require extra memory for computation. - RS results in many useful point features being dropped. To overcome it, the author proposed a powerful local feature aggregation module.
3.3. Local Feature Aggregation
The local feature aggregation module is applied to each 3D point in parallel and it consists of three neural units as shown in Fig2:
- Local Spatial Encoding (LocSE):
In this module given a point cloud P with all the features are explictly used to encode the three-dimensional coordinate information , so all neighbouring points and coresponding point features are always aware of their relative spatial locations as shown in Fig3.
This allows the LocSE unit to explicitly observe the local geometric patterns, thus eventually benefiting the entire network to effectively learn complex local structures. The geometry of the space is better learned from the information. The learning is specifically, it is divided into the following steps:
- Finding Neighbouring Points : For , the ith point, its neighbouring points are firstly gathered by the simple Knearest neighbours (KNN) algorithm for efficiency.KNN is based on the point-wise Euclidean distances.
- Relative Point Position Encoding : or each of the nearest K points {p1 i · · · pk i · · · pK i} of the center point pi, explicitly encode the relative point position as given by Eq1. This tends to aid the network to learn local features and obtains good performance in practice.
where pi and pk i are the x-y-z positions of points
- Point Feature Augmentation : For each neighbouring point p^k i , the encoded relative point positions rk i are concatenated with its corresponding point features fi k, obtaining an augmented feature vector ^ fi k. The output of LocSE unit is a new set of neighbouring features F ^i = f^ fi1 · · ·^ fi k · · ·^ fi Kg,which explicitly encodes the local geometric structures for the center point pi.LocSE explicitly encodes the relative positions to augment the neighbouring point features.
2. Attentive pooling:
This neural unit is used to aggregate the neighborhood point feature sets output by the above unit. Most of current algorithms use heuristic max / mean / sum pooling to hard integrate the neighborhood point feature set, which may cause loss of useful information as show in Fig 4. In the current implementation following steps are used
Computing Attention Scores: For a given set of local features a shared function is designed g() to learn a unique attention score and aggregate useful information in the neighborhood point feature set. The shared MLP is followed by softmax function defined in Eq2.
where W is the learnable weights of a shared MLP
Weighted Summation: The learnt attention scores are summed using Eq3.
For a given input point cloud P , for the ith point pi, our LocSE and Attentive Pooling units learn to aggregate the geometric patterns and features of its K
nearest points, and finally generate an informative feature vector ~ fi.
3 Dilated Residual Block
Since RS is going to continously down sample the input point cloud, it is necessary to increase the receptive field of each point significantly such that geometric details are reserved .
The author considers the idea from Resnet architecture and connect multiple LocSE, Attentive Pooling, and skip connections to form a Dilated Residual Block as show in Fig 2.
The detail working of dilated block is shown in Fig5 and Fig 6 where it can be seen: the red dot active receptive field after the first LocSE / Attentive Pooling operation is adjacent to the neighbouring adjacent points, and then in the second polymerization after most receptive field can be extended to two points in the neighborhood.
This is a cheap way of dilating the receptive field and expanding the effective neighbourhood
On the whole , local feature aggregation module is designed to effectively preserve complex local structures via explicitly considering neighbouring geometries and significantly increasing receptive fields.
3.4 Implementation
RS and LFA modules are combined to form RandLA -Net based on the standard encoder-decoder structure. The structure of the network is shown in Fig7 .
From the above structure the input point cloud is continuously down-sampled in RandLA-Net to save computing resources and memory overhead.
in the upsampling stage of the decoder a more efficient nearest neighbor interpolation is chosen to further improve the efficiency of the algorithm.
Training Details
Optimizer : Adam
Learning Rate : 0.001 and decrease by 5% after each epoch
4.Experiments
4.1. Efficiency of Random Sampling
I n this section, we empirically evaluate the efficiency of existing sampling approaches including FPS, IDIS, RS, GS, CRS, and PGS in terms of computing time and GPU memory consumption. The author downsamples the point cloud, a total of five downsampling, each sampling only retains 25% of the original point cloud points.
The experiments are conducted in terms of groups of 4 which are given below :
- Group 1 : For a small-scale point cloud ~ 10 ^ 3, the difference between the above sampling method in computing time and memory consumption is not obvious, and overall it is acceptable as show in Fig8 (a).
- Group 2 /3/4(a): For a large-scale point cloud ~ 10 ^ 6, the calculation time required by FPS / IDIS / GS increases significantly, and CRS requires a large amount of GPU memory shown in Fig8(b) by dotted line.
- Group2/3/4 (b):In contrast, RS has significant advantages in terms of computing time and memory consumption, so it is very suitable for processing large-scale point clouds shown in Fig8(b).
4.2. Efficiency of RandLA-Net
In this section, the author systematically evaluate the overall efficiency of our RandLA-Net on real-world large-scale point they chose verification set as SemanticKITTI [7] data set (Sequence 8: 4071 frames in total) for comparative testing.
Evaluation is done for the following three indicators: total time, model parameters and the maximum number of points the network can handle on the same machine with an AMD 3700X @3.6GHz CPU and an NVIDIA
RTX2080Ti GPU.
Analysis: Table 1 quantitatively shows the results below are the summary points.
- SPG [3] has the least model parameters, but it takes the longest time. The main reason is that the computational cost of steps such as geometrical partitioning and super-graph construction is relatively high;
- PointNet ++ and PointCNN also take a long time, the main reason is that FPS is more time-consuming when dealing with large scene point clouds
- PointNet and KPConv cannot handle very large-scale point clouds at one time(10⁶) mainly because there is no downsampling operation (PointNet) or the model is more complicated.
- Thanks to simple random sampling and the efficient local feature aggregation module based on MLP, RandLA-Net takes the least time (~ 23 frames per second) and can process up to 10⁶point clouds at a time.
4.3. Semantic Segmentation on Benchmarks
In this section, we evaluate the semantic segmentation of our RandLA-Net on three large-scale public datasets:(1)Semantic3D [6](2) SemanticKITTI [7] (3) S3DIS [8].
(1) Evaluation on Semantic3D : This dataset consists of 15 point clouds for training and 15 for online testing. Each point cloud has up to 108 points, covering up to 160×240×30 meters in real-world 3D space. The raw 3D points belong to 8 classes and contain 3D coordinates, RGB information, and intensity. We only use the 3D coordinates and color information to train and test our RandLANet. Table 2 shows that RandlaNet perform better results compared to other architecture for overall mIOU.
(2) Evaluation on SemanticKITTI: This dataset consists of 43552 densely annotated LIDAR scans belonging to 21 sequences. Each scan is a large-scale point cloud with ∼ 105 points and spanning up to 160×160×20 meters in 3D space. Officially, the sequences 00∼07 and 09∼10 (19130 scans) are used for training, the sequence 08 (4071 scans) for validation, and the sequences 11∼21 (20351 scans) for online testing. The raw 3D points only have 3D coordinates without color information. The mIoU score over 19 categories is used as the standard metric.
From Table3 It can be seen that our RandLA-Net surpasses all point based approaches by a large margin and outperform all projection based methods , but not significantly, primarily because DarkNet achieves much better results on the small object category such as trafficsign.
(3) Evaluation on S3DIS: This dataset [8] consists of 271 rooms belonging to 6 large areas. Each point cloud is a medium-sized single room (∼ 20×15×5 meters) with dense 3D points. To evaluate the semantic segmentation of our RandLA-Net, we use the standard 6-fold crossvalidation in our experiments.
RandLA-Net achieves on-par or better performance than state-of-the-art methods. Most of the current archtiecture tend to use sophisticated but expensive operations or samplings to optimize the networks on small blocks (e.g., 1×1 meter) of point clouds, and the relatively small rooms act in
their favours to be divided into tiny blocks. By contrast, RandLA-Net takes the entire rooms as input and is able to efficiently infer per-point semantics in a single pass.
4.4. Ablation Study
The author performs the abalation study with different modules and by obtaining the metrics mIOU on the sequence 08 of SemanticKITTI [7]dataset.
(a) (1) Removing local spatial encoding (LocSE). This unit enables each 3D point to explicitly observe its local geometry.
(b)(2∼4) Replacing attentive pooling by max/mean/sum pooling. The attentive pooling unit learns to automatically combine all local point features.
(c.)(5) Simplifying the dilated residual block:The dilated residual block stacks multiple LocSE units and attentive poolings, substantially dilating the receptive field for each 3Dpoint.
Abalation Summary :
- The greatest impact is caused by the removal of the chained spatial embedding and attentive pooling blocks.
- The removal of the local spatial encoding unit shows the next greatest impact on performance, demonstrating that this module is necessary to effectively learn local and relative geometry context.
- Removing the attention module diminishes performance by not being able to effectively retain useful features.
5. Conclusion
- For the semantic segmentation task of 3D point cloud , it is possible to efficiently and effectively segment large-scale point clouds by using a lightweight network architecture .
- Random sampling in our framework to significantly reduce the memory footprint and computational cost.
- A local feature aggregation module is also introduced to effectively preserve useful features from a wide neighbourhood.
- It would be interesting to extend our framework for the end-to-end 3D instance segmentation on large-scale point clouds by drawing
on the recent work and also for the real-time dynamic point cloud processing
Blogger’s Conclusion
Pros
- Low memory foot print and highly computationally efficient for real time applications .
- Efficient way to process large scale point clouds using RS
- Local Aggregation module technique is highly important to maintain geometric structure of objects
- On Semantic Kitti dataset it is reported to have 23 fps for 360 deg view
Cons
- Accuracy drops considerably for Small Scale point cloud .
- Random Sampling module is less efficient for small scale point cloud .
- We can obtain high fps on small scale point cloud but with a trade with accuracy.
If any errors found please mail me at abhigoku10@gmail.com…*\(^o^)/*
Reference
[1] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. CVPR, 2017.
[2] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet ++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 2017
[3] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, 2018.
[4] Siheng Chen, Sufeng Niu, Tian Lan, and Baoan Liu. PCT: Large-scale 3D point cloud representations via graph inception networks with applications to autonomous driving. In ICIP, 2019.
[5] Dario Rethage, Johanna Wald, Jurgen Sturm, Nassir Navab, and Federico Tombari. Fully-convolutional point networks for large-scale point clouds. In ECCV, 2018. 2, 3
[6]Timo Hackel, N. Savinov, L. Ladicky, Jan D. Wegner, K.Schindler, and M. Pollefeys. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2017.
[7]Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Juergen Gall. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In ICCV, 2019.
[8]Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017