[CVPR 2020 / Paper Summary] RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds

  1. The commonly used point-sampling methods of these networks are either computationally expensive or memory inefficient. For example, the widely employed farthest-point sampling [2] takes over 200 seconds to sample 10% of 1 million points.
  2. Most existing local feature learners rely on computationally expensive kernelisation or graph construction, and are therefore unable to process a massive number of points.
  3. For a large-scale point cloud, which usually consists of hundreds of objects, existing local feature learners are either incapable of capturing complex structures, or do so inefficiently, due to their limited receptive field sizes.
  1. neighbouring feature pooling
  2. graph message passing
  3. kernel-based convolution
  4. attention-based aggregation
Fig1 : In each layer of RandLA-Net, the large-scale point cloud is significantly downsampled, yet is capable of retaining features necessary for accurate segmentation
  • Farthest Point Sampling (FPS): each new sample is the point farthest from the k−1 points already selected. FPS gives good coverage of the sampled points, so it is widely used in point cloud segmentation (e.g., PointNet++, PointCNN, PointConv, PointWeb). However, its computational complexity is O(N²), and it takes up to 200 s on a single GPU to sample 10% of 10⁶ points, which shows that it is not suitable for large-scale point clouds.
  • Inverse Density Importance Sampling (IDIS): points are reordered according to their density, and points with lower density are retained as far as possible. Its computational complexity is approximately O(N) and it takes around 10 s to process a large-scale point cloud, but IDIS is also more sensitive to outliers.
  • Random Sampling (RS): K points are selected uniformly from the N input points, so every point has the same probability of being selected. The computational complexity of RS is O(1), which is agnostic to the total number of input points, i.e., it is constant-time and hence inherently scalable. It takes only 0.004 s to process 10⁶ points (a minimal sketch contrasting RS with FPS follows this list).
  • Generator-based Sampling (GS): GS learns to generate a small subset that approximates the original point cloud. It is a task-oriented, data-driven, learnable downsampling method, but at inference time the generated subset has to be matched back to the original point cloud, a step that relies on FPS and therefore introduces additional computation. It takes up to 1200 s to downsample 10% of 10⁶ points.
  • Continuous Relaxation based Sampling (CRS): CRS uses the reparameterization trick to relax the non-differentiable downsampling operation to the continuous domain, making end-to-end training possible. Each sampled point is effectively a weighted sum over the entire point cloud. Downsampling a large scene with millions of points to 10% of its original size with CRS requires up to 300 GB of GPU memory.
  • Policy Gradient based Sampling (PGS): PGS formulates downsampling as a Markov decision process and aims to learn an effective sampling policy, sequentially predicting a probability for each point that decides whether it is retained. However, for a large scene point cloud the network has a huge exploration (search) space, and further experiments show that PGS is very difficult to converge when applied to large point clouds.
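To make the complexity gap concrete, below is a minimal NumPy sketch (not the authors' code) contrasting uniform random sampling with a straightforward farthest point sampling loop; the function names and toy point counts are illustrative assumptions.

```python
# Minimal NumPy sketch contrasting random sampling (constant cost per sample)
# with farthest point sampling (O(N^2) overall). Illustrative only, not the
# paper's implementation.
import numpy as np

def random_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Uniformly pick k of the N input points."""
    idx = np.random.choice(points.shape[0], size=k, replace=False)
    return points[idx]

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Iteratively pick the point farthest from the already-selected set."""
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    dist = np.full(n, np.inf)   # distance from each point to the selected set
    selected[0] = 0
    for i in range(1, k):
        # Update distances with the most recently selected point, then take the farthest.
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        selected[i] = np.argmax(dist)
    return points[selected]

if __name__ == "__main__":
    pts = np.random.rand(10_000, 3)                # toy cloud; the paper targets ~10^6 points
    sub_rs = random_sampling(pts, 1_000)           # near-instant even for very large clouds
    sub_fps = farthest_point_sampling(pts, 1_000)  # quadratic cost quickly dominates
    print(sub_rs.shape, sub_fps.shape)
```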
  1. FPS, IDIS and GS are too computationally expensive to be applied for large-scale point clouds.
  2. CRS approaches have an excessive memory footprint and PGS is hard to learn.
  3. RS has two advantages: (a) it is remarkably computationally efficient as it is agnostic to the total number of input points, and (b) it does not require extra memory for computation.
  4. However, RS results in many useful point features being dropped. To overcome this, the authors propose a powerful local feature aggregation module.
Fig2 : The proposed local feature aggregation module
  1. Local Spatial Encoding (LocSE):
Fig3: LocSE module
  • Finding Neighbouring Points: for the i-th point p_i, its neighbouring points are first gathered by a simple K-nearest-neighbours (KNN) search, chosen for efficiency; KNN is based on point-wise Euclidean distances.
  • Relative Point Position Encoding: for each of the K nearest points {p_i^1, …, p_i^k, …, p_i^K} of the center point p_i, the relative point position is explicitly encoded as given by Eq. 1 (the concatenation of p_i, p_i^k, p_i − p_i^k and the distance ||p_i − p_i^k||, fed to a shared MLP). This tends to help the network learn local features and works well in practice.
  • Point Feature Augmentation: for each neighbouring point p_i^k, the encoded relative point position r_i^k is concatenated with its corresponding point feature f_i^k, giving an augmented feature vector f̂_i^k. The output of the LocSE unit is a new set of neighbouring features F̂_i = {f̂_i^1, …, f̂_i^k, …, f̂_i^K}, which explicitly encodes the local geometric structure for the center point p_i. In short, LocSE explicitly encodes relative positions to augment the neighbouring point features (a minimal sketch of the unit follows below).
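The following is a minimal NumPy sketch of the LocSE unit, assuming a single random linear map plus ReLU stands in for the learned shared MLP of Eq. 1; K, the feature dimensions and the weights are illustrative, not the paper's implementation.

```python
# Minimal NumPy sketch of Local Spatial Encoding (LocSE). A random linear map
# replaces the learned shared MLP; all shapes are illustrative assumptions.
import numpy as np

def knn(points: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbours (Euclidean) for every point."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N)
    return np.argsort(d, axis=1)[:, :k]                                   # (N, K)

def locse(points: np.ndarray, feats: np.ndarray, k: int, w: np.ndarray) -> np.ndarray:
    """Return augmented neighbouring features F_hat of shape (N, K, d_out + d_f)."""
    idx = knn(points, k)                       # (N, K)
    neigh_p = points[idx]                      # (N, K, 3) neighbour coordinates
    neigh_f = feats[idx]                       # (N, K, d_f) neighbour features
    center = points[:, None, :]                # (N, 1, 3) center point p_i
    rel = center - neigh_p                     # relative positions p_i - p_i^k
    dist = np.linalg.norm(rel, axis=-1, keepdims=True)
    # Eq. 1: concatenate [p_i, p_i^k, p_i - p_i^k, ||p_i - p_i^k||], then shared MLP.
    enc_in = np.concatenate([np.broadcast_to(center, neigh_p.shape),
                             neigh_p, rel, dist], axis=-1)  # (N, K, 10)
    r = np.maximum(enc_in @ w, 0.0)            # stand-in for MLP + ReLU -> (N, K, d_out)
    return np.concatenate([r, neigh_f], axis=-1)

if __name__ == "__main__":
    N, K, d_f, d_out = 512, 16, 8, 8
    pts = np.random.rand(N, 3)
    f = np.random.rand(N, d_f)
    W = np.random.randn(10, d_out) * 0.1       # 10 = 3 + 3 + 3 + 1 input channels
    print(locse(pts, f, K, W).shape)           # (512, 16, 16)
```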
Fig4: Attentive Pooling module
Eq2: Attention scores computed by a shared MLP followed by softmax
Eq3: Weighted summation of the neighbouring features using the learned attention scores
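Below is a minimal sketch of attentive pooling (Eq. 2 and Eq. 3), again with a random weight matrix standing in for the learned scoring function g(·, W); shapes are illustrative assumptions.

```python
# Minimal NumPy sketch of Attentive Pooling: score each neighbouring feature
# with a shared function g(., W) (Eq. 2), softmax-normalise across the K
# neighbours, and take the weighted sum (Eq. 3). Illustrative only.
import numpy as np

def attentive_pooling(f_hat: np.ndarray, w: np.ndarray) -> np.ndarray:
    """f_hat: (N, K, d) augmented neighbouring features -> (N, d) aggregated features."""
    scores = f_hat @ w                                   # Eq. 2: g(f_hat, W), per-channel scores
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    scores = scores / scores.sum(axis=1, keepdims=True)  # softmax over the K neighbours
    return (f_hat * scores).sum(axis=1)                  # Eq. 3: weighted sum over neighbours

if __name__ == "__main__":
    N, K, d = 512, 16, 16
    f_hat = np.random.rand(N, K, d)
    W = np.random.randn(d, d) * 0.1                      # shared weights of g(., W)
    print(attentive_pooling(f_hat, W).shape)             # (512, 16)
```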
Fig5: Dilated Residual Block
Fig6:Illustration of the dilated residual block which significantly increases the receptive field (dotted circle) of each point, colored points represent the aggregated features. L: Local spatial encoding, A: Attentive pooling.
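As Fig. 6 suggests, the dilated residual block chains two LocSE + attentive pooling units and adds a skip connection, so each point effectively sees up to K² neighbours. The sketch below assumes the locse and attentive_pooling functions from the previous sketches are in scope; weights and dimensions are illustrative, not the paper's exact block (which also uses extra shared MLPs and a leaky ReLU).

```python
# Sketch of the dilated residual block: two chained LocSE + attentive pooling
# units plus a skip connection. Assumes locse(...) and attentive_pooling(...)
# from the sketches above are in scope; all shapes/weights are illustrative.
import numpy as np

def dilated_residual_block(points, feats, k, w_loc1, w_att1, w_loc2, w_att2, w_skip):
    x = attentive_pooling(locse(points, feats, k, w_loc1), w_att1)  # 1st LocSE + pooling
    x = attentive_pooling(locse(points, x, k, w_loc2), w_att2)      # 2nd LocSE + pooling
    return np.maximum(x + feats @ w_skip, 0.0)                      # residual connection + ReLU

if __name__ == "__main__":
    N, K, d = 512, 16, 8
    pts, f = np.random.rand(N, 3), np.random.rand(N, d)
    w_loc1 = np.random.randn(10, d) * 0.1         # LocSE encoder weights (10 = 3+3+3+1 inputs)
    w_att1 = np.random.randn(2 * d, 2 * d) * 0.1  # pooling over [r, f] -> 2d channels
    w_loc2 = np.random.randn(10, 2 * d) * 0.1
    w_att2 = np.random.randn(4 * d, 4 * d) * 0.1
    w_skip = np.random.randn(d, 4 * d) * 0.1      # project input features to the output width
    print(dilated_residual_block(pts, f, K, w_loc1, w_att1, w_loc2, w_att2, w_skip).shape)
```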
Fig7:The network structure of RandLA-Net. (N, D) represent the number of points and feature dimensions, respectively. FC: fully connected layer, LFA: local feature aggregation, RS: random sampling, MLP: shared multilayer perceptron, US: upsampling, DP: Dropout
Fig8: Time and memory consumption of different sampling approaches. The dashed lines represent estimated values due to the limited GPU memory
  • Group 1: for a small-scale point cloud (~10³ points), the differences among the above sampling methods in computation time and memory consumption are not obvious, and all are acceptable overall, as shown in Fig8(a).
  • Group 2/3/4 (a): for a large-scale point cloud (~10⁶ points), the computation time required by FPS/IDIS/GS increases significantly, and CRS requires a huge amount of GPU memory, shown by the dashed lines in Fig8(b).
  • Group 2/3/4 (b): in contrast, RS has significant advantages in both computation time and memory consumption, so it is very suitable for processing large-scale point clouds, as shown in Fig8(b).
Table 1 : The computation time, network parameters and maximum number of input points of different approaches for semantic segmentation on Sequence 08 of SemanticKITTI
  • SPG [3] has the fewest model parameters but takes the longest time, mainly because steps such as geometrical partitioning and super-graph construction are computationally expensive.
  • PointNet++ and PointCNN also take a long time, mainly because FPS is time-consuming on large scene point clouds.
  • PointNet and KPConv cannot process very large point clouds (10⁶ points) in a single pass, mainly because of the absence of a downsampling operation (PointNet) or because the model is relatively complex (KPConv).
  • Thanks to simple random sampling and the efficient MLP-based local feature aggregation module, RandLA-Net takes the least time (~23 frames per second) and can process up to 10⁶ points at a time.
Table2 : Quantitative results of different approaches on Semantic3D (reduced-8)
Table 3. Quantitative results of different approaches on SemanticKITTI [7]
Table 4. Quantitative results of different approaches on S3DIS dataset [8] (6-fold cross validation)
Table 5. The mean IoU scores of all ablated networks based on our full RandLA-Net
  1. The greatest impact is caused by the removal of the chained spatial embedding and attentive pooling blocks.
  2. The removal of the local spatial encoding unit shows the next greatest impact on performance, demonstrating that this module is necessary to effectively learn local and relative geometry context.
  3. Removing the attention module diminishes performance because useful features can no longer be effectively retained.
  • For the semantic segmentation task of 3D point cloud , it is possible to efficiently and effectively segment large-scale point clouds by using a lightweight network architecture .
  • Random sampling is used in the framework to significantly reduce the memory footprint and computational cost.
  • A local feature aggregation module is also introduced to effectively preserve useful features from a wide neighbourhood.
  • It would be interesting to extend the framework to end-to-end 3D instance segmentation on large-scale point clouds by drawing on recent work, and to real-time dynamic point cloud processing.
  • Low memory footprint and high computational efficiency make it suitable for real-time applications.
  • RS offers an efficient way to process large-scale point clouds.
  • The local feature aggregation module is key to preserving the geometric structure of objects.
  • On the SemanticKITTI dataset, it is reported to run at ~23 fps for a 360° view.
  • Accuracy drops considerably for small-scale point clouds.
  • The random sampling module is less effective for small-scale point clouds.
  • High fps can be obtained on small-scale point clouds, but at the cost of accuracy.
