[CV2019/PaperSummary] YOLACT: Real-time Instance Segmentation

YOLACT: Bounding Boxes with Instance Segmentation
Fig 1: Speed-performance trade-off for various instance segmentation methods on COCO.
  1. generating a set of prototype masks
  2. predicting per-instance mask coefficients
  1. Introduction
  1. Two-stage detectors have high accuracy but low speed
  2. They depend on feature localisation to generate/produce masks of the objects
  1. Some prototypes spatially partition the image
  2. Some localise the instances
  3. Some detect instance contours
  4. Some encode position-sensitive directional maps
  5. Most do a combination of the above
  1. Lightweight assembly process thanks to the parallel structure
  2. Marginal computational overhead on top of a one-stage detector (ResNet-101 [5] backbone)
  3. Mask quality is high
  4. The generic idea of generating prototypes and mask coefficients could be added to almost any modern object detector
  1. Instance Segmentation: Mask R-CNN [3] is a representative two-stage instance segmentation approach that first generates candidate regions of interest (ROIs) and then classifies and segments those ROIs in a second stage. Follow-up work focused on improving the FPN features or on addressing the incompatibility between a mask’s confidence score and its localisation accuracy. One-stage instance segmentation methods instead generate position-sensitive maps that are assembled into final masks with position-sensitive pooling, or combine semantic segmentation logits and direction prediction logits.
  2. Real-time Instance Segmentation: Few works perform real-time instance segmentation. Straight to Shapes performs instance segmentation with learned encodings of shapes at 30 fps, but its accuracy is far from that of modern baselines. Box2Pix relies on an extremely lightweight backbone detector (GoogLeNet v1 and SSD [6]) combined with a hand-engineered algorithm to obtain 10.9 fps on Cityscapes and 35 fps on KITTI. However, the results show a large drop in relative performance going from a semantically simple dataset (KITTI) to a more complex one (Cityscapes), so an even more difficult dataset (COCO) would pose a challenge.
  3. Prototypes: The concept of prototypes has been used extensively in the vision community, mainly for obtaining features. Here, the authors instead use prototypes to assemble masks for instance segmentation, and the prototypes are specific to each image rather than being global prototypes shared across the entire dataset.
Fig 2: YOLACT Architecture. Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and k = 4 in this example.
  1. The first branch uses an FCN to produce a set of image-sized “prototype masks” that do not depend on any one instance.
  2. The second branch adds an extra head to the object detection branch to predict a vector of “mask coefficients” for each anchor that encodes an instance’s representation in the prototype space.
  3. Linearly combining the outputs of the two branches then produces the masks of the instances that survive NMS.
  1. The masks from prototypes are spatially coherent, i.e., pixels close to each other are likely to belong to the same instance. This coherence is naturally exploited by convolutional layers but not by fully connected layers.
  2. This is a problem for one-stage detectors, since they produce class and box coefficients for each anchor as the output of an fc layer; two-stage detectors get around it with their localisation step.
  3. So YOLACT uses fc layers, which are good at producing semantic vectors, for the “mask coefficients”, and conv layers, which are good at producing spatially coherent masks, for the “prototype masks”.
  4. The prototypes and mask coefficients can be computed independently; the computational overhead over the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication.
  1. Protonet is taken from deeper backbone features, which produce robust, high-quality masks; so FPN level P3 is used, with k channels in its last layer, as shown in Fig 3. The output is then upsampled to one fourth the dimensions of the input image to increase performance on small objects.
  2. Individual prototype losses are not considered explicitly; only the final mask loss after assembly is.
  3. A ReLU (or no nonlinearity) follows the protonet output to keep it unbounded, as this allows the network to produce large, overpowering activations on prototypes it is very confident about, e.g., obvious background.
Fig 3: Protonet Architecture. The labels denote feature size and channels for an image size of 550 × 550. Arrows indicate 3 × 3 conv layers, except for the final conv which is 1 × 1. The increase in size is an upsample followed by a conv.
  1. One head to predict c class confidences
  2. The other to predict the 4 bounding box regressors
    YOLACT simply adds a third branch in parallel to predict the k mask coefficients.
  1. The prototype branch and mask coefficient branch are combined via a linear combination of the former, with the latter as coefficients.
  2. A sigmoid nonlinearity is then applied to produce the final masks.
  3. The combination is done using a single matrix multiplication and sigmoid:
Eq 1: M = σ(P Cᵀ), where P is an h × w × k matrix of prototype masks and C is an n × k matrix of mask coefficients for the n instances surviving NMS and score thresholding.
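The assembly step can be sketched in NumPy (a minimal illustration, not the authors' code; the shapes h, w, k, n below are example values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def assemble_masks(prototypes, coefficients):
    # prototypes: (h, w, k) prototype masks from protonet
    # coefficients: (n, k) mask coefficients for n surviving detections
    # The linear combination is one matrix multiplication, then a sigmoid.
    h, w, k = prototypes.shape
    masks = prototypes.reshape(h * w, k) @ coefficients.T   # (h*w, n)
    return sigmoid(masks).reshape(h, w, -1)                 # (h, w, n)

P = np.random.randn(138, 138, 32)   # example prototype tensor, k = 32
C = np.random.randn(5, 32)          # coefficients for 5 detections
M = assemble_masks(P, C)
print(M.shape)  # (138, 138, 5)
```

In practice the cropping by the predicted box and thresholding follow this step.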
  1. FCNs are translation invariant, so methods like FCIS and Mask R-CNN explicitly add translation variance, whether by directional maps and position-sensitive repooling, or by putting the mask branch in the second stage.
  2. In YOLACT, the only translation variance added is cropping the final mask with the predicted bounding box; YOLACT learns how to localise instances on its own through different activations in its prototypes.
Fig 4: Prototype Behavior. The activations of the same six prototypes across different images. Prototypes 1, 4, and 5 are partition maps with boundaries clearly defined in image a, prototype 2 is a bottom-left directional map, prototype 3 segments out the background and provides instance contours, and prototype 6 segments out the ground.
  1. In image a of Fig 4, note that the prototype activations for the solid red image would not be possible in an FCN without padding: since a convolution outputs to a single pixel, if its input is the same everywhere in the image, the result will be the same everywhere in the conv output.
  2. Conceptually, one way to obtain such activations is to have multiple layers in sequence spread the padded 0’s out from the edge toward the center (e.g., with a kernel like [1, 0]).
  3. The consistent rim of padding in modern FCNs like ResNet gives the network the ability to tell how far away from the image’s edge a pixel is, so the network inherently exhibits translation variance.
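A toy 1-D sketch of this argument (the kernel and shift direction are illustrative assumptions): repeatedly applying a [1, 0]-style convolution under zero padding lets the padded zeros creep inward one pixel per layer, so a pixel's value after layer t encodes whether it is at least t pixels from the edge.

```python
import numpy as np

def shift_conv(x):
    # "Same"-size 1-D convolution with a [1, 0] kernel under zero padding:
    # each output element copies its left neighbour, so the padding zero
    # enters from the left edge and spreads inward one pixel per layer.
    out = np.zeros_like(x)
    out[1:] = x[:-1]
    return out

x = np.ones(6)          # a constant "solid red" 1-D image
for _ in range(3):      # three conv layers
    x = shift_conv(x)
print(x)  # [0. 0. 0. 1. 1. 1.]
```

Even though the input was constant, the output now varies with position, which is exactly the translation variance the padding provides.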
  4. For prototypes 1, 4, 5, and 6 in Fig 4, we observe that each prototype activates only on certain “partitions”: objects that are on one side of an implicitly learned boundary.
  5. By combining these partition maps, the network can distinguish between different (even overlapping) instances of the same semantic class; for instance, in image d of Fig 4, the green umbrella can be separated out from prototype 4.
  6. Prototypes are compressible: if protonet combines the functionality of multiple prototypes into one, the mask coefficient branch can learn which situations call for which functionality.
  7. For instance, prototype 2 encodes the bottom-left side of objects, but also fires more strongly on instances in a vertical strip down the middle of the image.
  8. Prototype 4 is a partitioning prototype but also fires most strongly on instances in the bottom-left corner.
  9. Prototype 5 is similar but for instances on the right.
  1. ResNet-101 [5] with FPN is the default feature backbone, with a base image size of 550 × 550.
  2. Like RetinaNet, FPN is modified by not producing P2 and by producing P6 and P7 as successive 3 × 3 stride-2 conv layers starting from P5, with 3 anchors of aspect ratios [1, 1/2, 2] placed on each level.
  3. The anchors of P3 have areas of 24 pixels squared, and every subsequent layer doubles the scale of the previous (resulting in the scales [24, 48, 96, 192, 384]).
  4. In the prediction head attached to each Pi, one 3 × 3 conv is shared by all three branches, and then each branch gets its own 3 × 3 conv in parallel.
  5. Smooth-L1 loss is applied to train the box regressors (encoding box regression coordinates in the same way as SSD [6]), and softmax cross entropy over c positive labels and 1 background label trains the class prediction.
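The two detection losses named above can be sketched in NumPy (a minimal sketch; the beta parameter and label layout are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    # Smooth-L1 loss over the 4 box regression coordinates:
    # quadratic for small errors, linear for large ones.
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).sum()

def softmax_cross_entropy(logits, label):
    # logits: (c + 1,) scores for c classes plus 1 background class.
    z = logits - logits.max()                 # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

box_loss = smooth_l1(np.array([0.1, 0.2, 0.3, 0.4]),
                     np.array([0.0, 0.0, 0.0, 0.0]))
cls_loss = softmax_cross_entropy(np.array([2.0, 0.5, -1.0]), label=0)
print(box_loss)  # 0.15
```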
  1. Compute a c × n × n pairwise IoU matrix X for the top n detections
  2. Sort detections in descending order by score for each of the c classes (the sorting can be batched).
  3. The IoU computation can be easily vectorised. Then, find which detections to remove by checking whether any higher-scoring detection has a corresponding IoU greater than some threshold t.
  1. First set the lower triangle and diagonal of X to 0, which can be performed in one batched triu call.
Eq 2: X_kij = 0, ∀ k, j, i ≥ j (lower triangle and diagonal set to zero via a batched triu call)
Eq 3: K_kj = max_i (X_kij) ∀ k, j; detections with K ≤ t are kept for each class
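A single-class NumPy sketch of Fast NMS (the paper batches this over c classes on the GPU; the helper names here are illustrative):

```python
import numpy as np

def pairwise_iou(a, b):
    # a: (n, 4), b: (m, 4) boxes as (x1, y1, x2, y2); returns (n, m) IoU.
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(br - tl, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def fast_nms(boxes, scores, t=0.5):
    order = np.argsort(-scores)                  # sort by descending score
    iou = pairwise_iou(boxes[order], boxes[order])
    iou = np.triu(iou, k=1)                      # zero lower triangle + diagonal
    # A detection is removed if any HIGHER-scoring detection overlaps it by > t,
    # i.e. if the column-wise max of the upper-triangular IoU matrix exceeds t.
    keep = iou.max(axis=0, initial=0.0) <= t
    return order[keep]

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(fast_nms(boxes, scores))  # keeps boxes 0 and 2
```

Unlike standard sequential NMS, an already-suppressed detection can still suppress others here, which is why Fast NMS is slightly more aggressive (and slightly less accurate) than the standard algorithm.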
Table 1: Mask Performance. We compare our approach to other state-of-the-art methods for mask mAP and speed on COCO test-dev.
  1. YOLACT-550 offers competitive instance segmentation performance at 3.8× the speed of the previous fastest instance segmentation method on COCO.
  2. The gap between YOLACT-550 and Mask R-CNN is 9.5 AP at the 50% overlap threshold, but only 6.6 AP at the 75% IoU threshold.
  3. At the highest (95%) IoU threshold, YOLACT even outperforms Mask R-CNN, with 1.6 AP vs. 1.3 AP.
  4. Alternate configurations of the model are also reported in Table 1. In addition to the base 550 × 550 image size, 400 × 400 (YOLACT-400) and 700 × 700 (YOLACT-700) models are trained by adjusting the anchor scales accordingly (s_x = s_550 / 550 · x).
  5. Lowering the image size results in a large decrease in performance, demonstrating that instance segmentation naturally demands larger images.
  6. Increasing the image size decreases speed significantly but also increases performance, as expected.
  7. In addition to the base ResNet-101 [5] backbone, ResNet-50 [5] and DarkNet-53 [8] are also tested to obtain even faster results.
  8. If higher speeds are preferable, use ResNet-50 or DarkNet-53 instead of lowering the image size: those configurations perform much better than YOLACT-400 while being only slightly slower.
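The anchor rescaling rule s_x = s_550 / 550 · x from point 4 is simple enough to sketch directly (the helper name is hypothetical):

```python
# Base anchor scales at the 550 x 550 input size (FPN levels P3..P7).
BASE_SCALES_550 = [24, 48, 96, 192, 384]

def rescale_anchors(image_size, base_size=550, base_scales=BASE_SCALES_550):
    # s_x = s_550 / 550 * x, applied per FPN level.
    return [s * image_size / base_size for s in base_scales]

print(rescale_anchors(550))  # [24.0, 48.0, 96.0, 192.0, 384.0]
print(rescale_anchors(700))  # scales used for YOLACT-700
```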
Table 2: Fast NMS. Fast NMS performs only slightly worse than standard NMS.
Table 3: Prototypes. Choices for k in our method.
Fig 5: YOLACT localization failure
Fig 6: YOLACT leakage

Final Words ….


