[CV2019/PaperSummary] YOLACT: Real-time Instance Segmentation

Dec 27, 2019
YOLACT: Bounding Box with Instance Segmentation

Moving from object detection to object segmentation is a giant leap: predicting a mask for each object is much harder than predicting its bounding box. In this paper the authors attempt to obtain a segmentation mask for every object, i.e. instance segmentation, in real time.

“Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.” – Joseph Redmon, YOLOv3

Please note that this post is for my future self to look back on and review the material in this paper without reading it all over again.

Fig 1: Speed-performance trade-off for various instance segmentation methods on COCO.


In the world of segmentation, obtaining accurate masks at real-time speed is a challenge. The authors present a simple fully convolutional model for real-time instance segmentation that achieves 29.8 mAP at 33.5 fps on the MS COCO test set, as shown in Fig 1. This does not beat the accuracy of the current SOTA, but it is significantly faster than any previous competitive approach. The authors achieve real-time instance segmentation by performing the two subtasks below in parallel.

  1. generating a set of prototype masks
  2. predicting per-instance mask coefficients

The authors also propose Fast NMS, a drop-in replacement for standard NMS that is faster and costs only a marginal amount of accuracy.

1. Introduction

Over the years the vision community has made great progress in instance segmentation by extending object detector architectures: two-stage detectors such as Faster R-CNN [1] and R-FCN [2] were extended to predict masks, giving Mask R-CNN [3] and FCIS [4]. The drawbacks of these architectures are:

  1. Two-stage detectors have high accuracy but low speed
  2. They depend on feature localisation to produce the masks of the objects

To address these issues the authors propose YOLACT, an extension of a one-stage detector that performs instance segmentation by breaking it into subtasks and explicitly forgoing the localisation step.

The network learns to localise masks on its own, with visually, spatially and semantically similar instances appearing differently in the prototypes. The number of prototype masks in YOLACT is independent of the number of categories, which leads to a distributed representation in the prototype space. This representation shows the following behaviours:

  1. Some prototypes spatially partition the image
  2. Some localize the instances
  3. Some detect instance contours
  4. Some encode position-sensitive directional maps
  5. Some do a combination of the above

This approach has several practical advantages, listed below:

  1. A lightweight assembly process, thanks to the parallel structure
  2. Only a marginal amount of computational overhead on top of the one-stage backbone detector, so 30 fps is reachable even with ResNet-101 [5]
  3. High mask quality
  4. The idea of generating prototypes and mask coefficients is generic and could be added to almost any modern object detector

The authors relate the approach to human vision: the linear coefficients and the corresponding detection branch can be thought of as recognising the individual instances ("what"), while the prototype masks can be thought of as localising instances in space ("where"). They suggest this may be closer to human vision than the two-stage "localise-then-segment" approach.

2. Related Work

In this section we look at some of the methods the authors build on.

  1. Instance Segmentation: Mask R-CNN [3] is a representative two-stage instance segmentation approach that first generates candidate regions of interest (ROIs) and then classifies and segments those ROIs in the second stage. Follow-up work improved the FPN features or addressed the incompatibility between a mask's confidence score and its localisation accuracy. Other work on one-stage instance segmentation generates position-sensitive maps that are assembled into final masks with position-sensitive pooling, or combines semantic segmentation logits with direction prediction logits.
  2. Real-time Instance Segmentation: Few works perform instance segmentation in real time. Straight to Shapes performs instance segmentation with learned encodings of shapes at 30 fps, but its accuracy is far from that of modern baselines. Box2Pix relies on an extremely light-weight backbone detector (GoogLeNet v1 and SSD [6]) combined with a hand-engineered algorithm to obtain 10.9 fps on Cityscapes and 35 fps on KITTI. However, its relative performance drops sharply when going from a semantically simple dataset (KITTI) to a more complex one (Cityscapes), so an even more difficult dataset (COCO) would pose a challenge.
  3. Prototypes: Prototypes have been used extensively in the vision community, mainly for obtaining features. Here the authors instead use them to assemble masks for instance segmentation, and the prototypes are specific to each image rather than global prototypes shared across the entire dataset.


Fig 2: YOLACT Architecture. Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and k = 4 in this example.

The authors' idea is to add a mask branch to a one-stage detector without an explicit localisation step. Fig 2 shows the YOLACT architecture with its different modules. To do this, the authors divide the complex task of instance segmentation into two simpler, parallel tasks whose outputs can be assembled to form the final masks.

  1. The first branch uses an FCN to produce a set of image-sized "prototype masks" that do not depend on any one instance.
  2. The second branch adds an extra head to the object detection branch to predict a vector of "mask coefficients" for each anchor, which encodes an instance's representation in the prototype space.
  3. Finally, for each instance that survives NMS, a mask is generated by linearly combining the outputs of the first and second branches.

The authors' rationale for doing it this way is given below.

  1. Masks are spatially coherent, i.e. pixels close to each other are likely to belong to the same instance. Convolutional layers naturally take advantage of this coherence, but fully connected layers do not.
  2. This is a problem for one-stage detectors, since they produce the class and box coefficients for each anchor as the output of an fc layer. Two-stage detectors get around it by using a localisation step.
  3. So YOLACT uses fc layers, which are good at producing semantic vectors, for the "mask coefficients" and conv layers, which are good at producing spatially coherent masks, for the "prototype masks".
  4. The prototypes and mask coefficients can be computed independently, so the computational overhead over that of the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication.

3.1. Prototype Generation

The prototype generation branch (protonet) predicts a set of k prototype masks for the entire image.

The authors made some important design choices, listed below:

  1. Protonet is attached to deeper backbone features, which produce more robust and higher quality masks. It is therefore fed from FPN level P3, and its last layer has k channels as shown in Fig 3. The output is upsampled to one fourth of the dimensions of the input image to increase performance on small objects.
  2. No loss is applied to the individual prototypes explicitly; supervision comes only from the final mask loss after assembly.
  3. The protonet's output is kept unbounded, as this allows the network to produce large, overpowering activations on prototypes it is very confident about, e.g. the background. This leaves the choice of following protonet with a ReLU or with no nonlinearity, and the authors choose ReLU.

Fig 3: Protonet Architecture. The labels denote feature size and channels for an image size of 550 × 550. Arrows indicate 3 × 3 conv layers, except for the final conv which is 1 × 1. The increase in size is an upsample followed by a conv.
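To make the sizes in Fig 3 concrete for my future self, here is a minimal sketch of the shape arithmetic. It assumes that P3 sits at stride 8 (as in a standard FPN), that feature sizes round up, and that protonet applies a single 2x upsample; the helper name `feature_size` is mine, not the paper's.

```python
import math

def feature_size(image_size, stride):
    """Spatial size of a feature map at a given stride, assuming sizes round up."""
    return math.ceil(image_size / stride)

image_size = 550                              # base input resolution
k = 32                                        # number of prototypes (default choice)
p3_size = feature_size(image_size, stride=8)  # assumption: P3 is at stride 8
proto_size = p3_size * 2                      # assumption: one 2x upsample in protonet

print("P3 feature map:", (p3_size, p3_size))
print("prototype tensor (h, w, k):", (proto_size, proto_size, k))
```

The resulting 138 × 138 is roughly one fourth of 550, which matches the mask size quoted later in the results.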

3.2. Mask Coefficients

Anchor-based object detectors typically have two branches in their prediction head:

  1. One to predict c class confidences
  2. The other to predict 4 bounding box regressors

To obtain the mask coefficients, the authors simply add a third branch in parallel that predicts k mask coefficients, one for each prototype. Thus instead of producing 4 + c coefficients per anchor, the head produces 4 + c + k. The authors apply tanh to the k mask coefficients, which produces more stable outputs than no nonlinearity and, importantly, lets coefficients be negative so that prototypes can be subtracted from the final mask. The relevance of this design choice is shown in Fig 2, as neither mask would be constructable without allowing for subtraction.
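A small sketch of the bookkeeping above, assuming c = 80 (the COCO categories) and k = 32; the raw values fed to tanh are made up purely for illustration.

```python
import math

num_classes = 80     # c: assumed here to be the 80 COCO categories
num_prototypes = 32  # k
box_outputs = 4      # bounding box regressors

outputs_per_anchor = box_outputs + num_classes + num_prototypes
print("outputs per anchor:", outputs_per_anchor)

# Hypothetical raw outputs of the mask coefficient branch for one anchor.
raw = [-3.0, -0.5, 0.0, 0.5, 3.0]
coefficients = [math.tanh(v) for v in raw]

# tanh keeps every coefficient in (-1, 1) and preserves the sign,
# so a prototype can be added to or subtracted from the mask.
print([round(c, 3) for c in coefficients])
```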

3.3. Mask Assembly

Assembly Steps: The steps to produce the instance masks are given below

  1. Combine the prototype branch and the mask coefficient branch, using a linear combination of the former with the latter as coefficients.
  2. Apply a sigmoid nonlinearity to produce the final masks.
  3. Both steps can be done with a single matrix multiplication followed by a sigmoid:

Eq 1: $M = \sigma(P C^{T})$

where P is an h × w × k matrix of prototype masks and C is an n × k matrix of mask coefficients for the n instances surviving NMS and score thresholding.
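The assembly step can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: random values stand in for real network outputs, and the function name `assemble_masks` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def assemble_masks(prototypes, coefficients):
    """Linear combination of prototypes followed by a sigmoid.

    prototypes:   (h, w, k) prototype masks P
    coefficients: (n, k) mask coefficients C, one row per surviving instance
    returns:      (h, w, n) masks, one channel per instance
    """
    h, w, k = prototypes.shape
    flat = prototypes.reshape(h * w, k)      # treat P as an (h*w, k) matrix
    masks = sigmoid(flat @ coefficients.T)   # single matrix multiplication
    return masks.reshape(h, w, -1)

rng = np.random.default_rng(0)
P = np.maximum(rng.normal(size=(138, 138, 32)), 0.0)  # stand-in for ReLU'd prototypes
C = np.tanh(rng.normal(size=(5, 32)))                 # stand-in for tanh'd coefficients
M = assemble_masks(P, C)
print(M.shape)
```

Because everything is one matrix product, the cost of assembly grows with the number of surviving instances n rather than with a per-instance network pass.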

Losses: The authors use three losses to train the model: classification loss L_cls, box regression loss L_box and mask loss L_mask. To compute the mask loss, they simply take the pixel-wise binary cross entropy between the assembled masks M and the ground truth masks M_gt: L_mask = BCE(M, M_gt).

Cropping Masks: During evaluation the authors crop the final masks with the predicted bounding box. During training they instead crop with the ground truth bounding box and divide L_mask by the ground truth bounding box area, which preserves small objects in the prototypes.
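Here is a rough NumPy sketch of cropping and of the area-normalised mask loss. It is my own simplification, not the repository's code: boxes are integer pixel coordinates (x1, y1, x2, y2), and the loss is summed only inside the ground truth box, which assumes the cropped-out pixels contribute nothing.

```python
import math
import numpy as np

def crop_mask(mask, box):
    """Zero out everything outside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cropped = np.zeros_like(mask)
    cropped[y1:y2, x1:x2] = mask[y1:y2, x1:x2]
    return cropped

def mask_loss(pred, target, gt_box, eps=1e-7):
    """Pixel-wise BCE inside the ground truth box, divided by the box area."""
    x1, y1, x2, y2 = gt_box
    p = np.clip(pred[y1:y2, x1:x2], eps, 1.0 - eps)  # avoid log(0)
    t = target[y1:y2, x1:x2]
    bce = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    area = (x2 - x1) * (y2 - y1)
    return float(bce.sum() / area)

# Toy example: an undecided prediction (0.5 everywhere) against a 4x4 object.
pred = np.full((8, 8), 0.5)
target = np.zeros((8, 8))
gt_box = (2, 2, 6, 6)
target[2:6, 2:6] = 1.0

cropped = crop_mask(pred, gt_box)
loss = mask_loss(pred, target, gt_box)
print(cropped.sum(), loss)
```

Dividing by the box area means a small object and a large object with the same per-pixel error produce the same loss, so small objects are not drowned out.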

3.4. Emergent Behaviour

In this section the authors explain the behaviour of the prototype masks and mask coefficients.

  1. FCNs are translation invariant, so methods like FCIS and Mask R-CNN explicitly add translation variance, either through directional maps and position-sensitive repooling or by putting the mask branch in the second stage.
  2. In YOLACT the only translation variance added is the cropping of the final mask with the predicted bounding box. Even so, YOLACT learns how to localise instances on its own through different activations in its prototypes.

The prototype activations in YOLACT are explained in the following points and in Fig 4.

Fig 4: Prototype Behavior. The activations of the same six prototypes across different images. Prototypes 1, 4, and 5 are partition maps with boundaries clearly defined in image a, prototype 2 is a bottom-left directional map, prototype 3 segments out the background and provides instance contours, and prototype 6 segments out the ground.

  1. In image a of Fig 4, note that the prototype activations for the solid red image would not be possible in an FCN without padding: a convolution outputs to a single pixel, so if its input is the same everywhere in the image, its output will be the same everywhere too.
  2. Conceptually, one way to obtain these activations is to have multiple layers in sequence spread the padded 0's out from the edge toward the center (e.g., with a kernel like [1, 0]).
  3. The consistent rim of padding in modern FCNs like ResNet gives the network the ability to tell how far away from the image's edge a pixel is, which means the network is inherently translation variant.
  4. In Fig 4, prototypes 1, 4, 5, and 6 activate only on certain "partitions" of the image, i.e. only on objects that are on one side of an implicitly learned boundary.
  5. By combining these partition maps, the network can distinguish between different (even overlapping) instances of the same semantic class. For instance, in image d in Fig 4, the green umbrella can be separated out with the help of prototype 4.
  6. Prototypes are compressible: if protonet combines the functionality of multiple prototypes into one, the mask coefficient branch can learn which situations call for which functionality.
  7. For instance, prototype 2 encodes the bottom-left side of objects, but also fires more strongly on instances in a vertical strip down the middle of the image.
  8. Prototype 4 is a partitioning prototype but also fires most strongly on instances in the bottom-left corner.
  9. Prototype 5 is similar but for instances on the right.
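The padding argument in points 1 to 3 is easy to check numerically. Below is a tiny 1D sketch using `numpy.convolve`, where `mode="valid"` stands in for a convolution without padding and `mode="same"` for one with zero padding; the all-ones signal and kernel are my own toy choice.

```python
import numpy as np

signal = np.ones(6)  # a "solid colour" row: the same value everywhere
kernel = np.ones(3)

# Without padding every output pixel sees the same input, so all outputs are equal.
no_padding = np.convolve(signal, kernel, mode="valid")

# With zero padding the border pixels see the padded zeros and differ from the rest.
zero_padded = np.convolve(signal, kernel, mode="same")

# Stacking a second padded layer spreads the edge information further inward.
two_layers = np.convolve(zero_padded, kernel, mode="same")

print(no_padding.tolist())
print(zero_padded.tolist())
print(two_layers.tolist())
```

After two layers the value at each position already depends on its distance from the border, which is the kind of position signal the prototypes can build on.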

4. Backbone Detector

For the backbone detector the authors wanted both rich features and speed, so they followed the design of RetinaNet [7] with an emphasis on speed.

YOLACT Detector Design Features:

  1. ResNet-101 [5] with FPN is the default feature backbone, with a base image size of 550 × 550.
  2. Like RetinaNet, the FPN is modified by not producing P2 and by producing P6 and P7 as successive 3 × 3 stride 2 conv layers starting from P5. Three anchors with aspect ratios [1, 1/2, 2] are placed on each level.
  3. The anchors of P3 have areas of 24 pixels squared, and every subsequent layer has double the scale of the previous one (resulting in the scales [24, 48, 96, 192, 384]).
  4. In the prediction head attached to each P_i, one 3 × 3 conv is shared by all three branches, and then each branch gets its own 3 × 3 conv in parallel.
  5. Smooth-L1 loss is applied to train the box regressors, with box regression coordinates encoded in the same way as SSD [6]. Class prediction is trained with softmax cross entropy over c positive labels and 1 background label.
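The anchor layout in points 2 and 3 can be written out as a short sketch. The convention that an aspect ratio r means width / height with the area held at scale squared is my assumption (it is a common one), and `anchor_shapes` is a hypothetical helper name.

```python
import math

levels = ["P3", "P4", "P5", "P6", "P7"]
scales = [24 * 2 ** i for i in range(len(levels))]  # doubles at every level
aspect_ratios = [1, 1 / 2, 2]

def anchor_shapes(scale, ratios):
    """(width, height) pairs with area scale**2; ratio is taken as width / height."""
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in ratios]

for level, scale in zip(levels, scales):
    shapes = anchor_shapes(scale, aspect_ratios)
    print(level, scale, [(round(w, 1), round(h, 1)) for w, h in shapes])
```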

5. Other Improvements

In this section we discuss Fast NMS, which speeds up the detector with little accuracy penalty, and the semantic segmentation loss, which improves accuracy with no speed penalty.

Standard NMS: Most object detectors use NMS to suppress duplicate detections. The operation is performed sequentially: for each of the c classes in the dataset, sort the detected boxes in descending order of confidence, and then for each detection remove all those with lower confidence that have an IoU overlap greater than some threshold. This sequential approach is fast enough at low frame rates, but it becomes a large barrier to reaching 30 fps.

Fast NMS: To remove the sequential nature of traditional NMS, the authors introduce Fast NMS, in which every instance can be kept or discarded in parallel. To do this, already-removed detections are allowed to suppress other detections, which is not possible in traditional NMS.

Steps of Fast NMS: The following steps are followed

  1. For each of the c classes, sort the top n detections in descending order of score (batched sorting).
  2. Compute a c × n × n pairwise IoU matrix X for these sorted detections; the IoU computation is easily vectorized.
  3. Remove every detection for which there is a higher-scoring detection with a corresponding IoU greater than some threshold t.

Implementation of Fast NMS:

  1. First set the lower triangle and diagonal of X to 0, which can be performed in one batched triu call.

Eq 2: $X_{kij} = 0, \quad \forall k, j, i \geq j$

  2. Take the column-wise max to compute a matrix K of maximum IoU values for each detection.

Eq 3: $K_{kj} = \max_{i} X_{kij}, \quad \forall k, j$

  3. Thresholding this matrix with t (K < t) indicates which detections to keep for each class.
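The three implementation steps above map almost one-to-one onto NumPy. The sketch below handles a single class for brevity (the c × n × n version only adds a leading batch axis); the function names, the (x1, y1, x2, y2) box format and the toy boxes are my own assumptions.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU between every pair of boxes; boxes is (n, 4) as x1, y1, x2, y2."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def fast_nms(boxes, scores, t=0.5):
    """Return indices of kept detections, highest score first."""
    order = np.argsort(-scores)        # sort descending by score
    x = pairwise_iou(boxes[order])     # n x n IoU matrix X
    x = np.triu(x, k=1)                # zero the lower triangle and diagonal
    k_max = x.max(axis=0)              # column-wise max: K
    return order[k_max < t]            # keep detections with K < t

boxes = np.array([
    [0, 0, 10, 10],    # A
    [1, 0, 11, 10],    # B: overlaps A heavily
    [2, 0, 12, 10],    # C: overlaps B heavily, A a bit less
    [50, 50, 60, 60],  # D: far away from everything
], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.6])

kept = fast_nms(boxes, scores, t=0.7)
print(kept.tolist())
```

In this toy case C is removed because of B, even though B itself is removed by A. Sequential NMS would have kept C, which illustrates why Fast NMS removes slightly more boxes than standard NMS.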

Semantic Segmentation Loss: To maintain accuracy even when speed is increased, the authors apply extra losses to the model during training using modules that are not executed at test time. This increases feature richness with no speed penalty.

The authors apply a semantic segmentation loss on the feature space using layers that are only evaluated during training. To create predictions during training, they simply attach a 1 × 1 conv layer with c output channels directly to the largest feature map (P3) in the backbone. Because each pixel can be assigned to more than one class, sigmoid and c channels are used instead of softmax and c + 1.

6. Results and Ablation study

The results below are reported on MS COCO's instance segmentation task using the standard metrics for the task. Training is on train2017 and evaluation is on val2017 and test-dev.

6.1. Instance Segmentation Results

All speeds are measured on a single Titan Xp. Table 1 below compares YOLACT with other methods on both accuracy and speed.

Table 1: Mask Performance. We compare our approach to other state-of-the-art methods for mask mAP and speed on COCO test-dev.

  1. YOLACT-550 offers competitive instance segmentation performance while running at 3.8x the speed of the previous fastest instance segmentation method on COCO.
  2. The gap between YOLACT-550 and Mask R-CNN at the 50% overlap threshold is 9.5 AP, while at the 75% IoU threshold it is 6.6.
  3. At the highest (95%) IoU threshold, YOLACT outperforms Mask R-CNN with 1.6 AP vs. 1.3 AP.
  4. Alternate configurations of the model are also reported in Table 1. In addition to the base 550 × 550 image size model, 400 × 400 (YOLACT-400) and 700 × 700 (YOLACT-700) models are trained, with the anchor scales adjusted accordingly (s_x = s_550 / 550 × x).
  5. Lowering the image size results in a large decrease in performance, demonstrating that instance segmentation naturally demands larger images.
  6. Increasing the image size decreases speed significantly but also increases performance, as expected.
  7. In addition to the base backbone of ResNet-101 [5], ResNet-50 [5] and DarkNet-53 [8] are also tested to obtain even faster results.
  8. If higher speeds are preferable, use ResNet-50 or DarkNet-53 instead of lowering the image size, as these configurations perform much better than YOLACT-400 while being only slightly slower.
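The anchor rescaling rule from point 4, s_x = s_550 / 550 × x, is simple enough to write down as a worked example. The base scales come from the backbone section; `rescale_anchors` is just a hypothetical helper name.

```python
base_size = 550
base_scales = [24, 48, 96, 192, 384]  # anchor scales of the 550 x 550 model

def rescale_anchors(scales, image_size, base=base_size):
    """Apply s_x = s_550 / 550 * x to every anchor scale."""
    return [s / base * image_size for s in scales]

scales_400 = rescale_anchors(base_scales, 400)  # YOLACT-400
scales_700 = rescale_anchors(base_scales, 700)  # YOLACT-700

print([round(s, 1) for s in scales_400])
print([round(s, 1) for s in scales_700])
```

The anchors keep the same size relative to the image, so each FPN level covers the same fraction of the picture at every resolution.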

6.2. Mask Quality

The final mask is of size 138 × 138 and is created directly from the original features (with no repooling step to transform and potentially misalign the features), so the masks for large objects are of noticeably higher quality than those of Mask R-CNN. At the 95% IoU threshold the base model achieves 1.6 AP while Mask R-CNN obtains 1.3, as mentioned in Table 1. This indicates that repooling does result in a quantifiable decrease in mask quality.

6.3. Temporal Stability

Although the model is trained only on static images and applies no temporal smoothing, it produces more temporally stable masks on videos than Mask R-CNN, whose masks jitter across frames even when objects are completely stationary. Masks produced by two-stage methods are highly dependent on the region proposals from the first stage. YOLACT is a one-stage detector, so even if the model predicts different boxes across frames, the prototypes are not affected, yielding much more temporally stable masks.

6.4. Implementation Details

Batch size: 8 on one GPU, using ImageNet pretrained weights.

Optimiser: SGD

Iterations: 800k

Learning Rate: Initial learning rate of 10^-3, divided by 10 at iterations 280k, 600k, 700k, and 750k

Weight Decay: 5 × 10^-4

Momentum: 0.9

Augmentations: Methods used in SSD [6].
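For quick reference, the settings above can be collected into a small sketch together with the step schedule. The dictionary keys and the function name are my own, and I am assuming each drop takes effect exactly at the listed iteration.

```python
train_config = {
    "batch_size": 8,
    "optimizer": "SGD",
    "max_iterations": 800_000,
    "base_lr": 1e-3,
    "lr_steps": (280_000, 600_000, 700_000, 750_000),
    "weight_decay": 5e-4,
    "momentum": 0.9,
}

def learning_rate(iteration, config=train_config, gamma=0.1):
    """Step schedule: divide the learning rate by 10 at every step passed."""
    drops = sum(iteration >= step for step in config["lr_steps"])
    return config["base_lr"] * gamma ** drops

for it in (0, 280_000, 600_000, 700_000, 750_000):
    print(it, learning_rate(it))
```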

Fast NMS: Fast NMS was also implemented in other methods, and the comparison of both speed and accuracy is shown in Table 2.

Table 2: Fast NMS. Fast NMS performs only slightly worse than standard NMS.

Prototypes: The authors tried different numbers of prototypes k, evaluated speed and accuracy for each, and settled on k = 32, as shown in Table 3.

Table 3: Prototypes. Choices for k in our method.

7. Discussion

The authors mention a few areas where YOLACT falls short of the SOTA methods. They mainly highlight the two below.

Localization Failure: If there are too many objects in one spot in a scene, the network can fail to localize each object in its own prototype. In these cases, the network will output something closer to a foreground mask than an instance segmentation for some objects in the group. An example of this can be seen in the first image in Fig 5, where the blue truck under the red airplane is not properly localized.

Fig 5: YOLACT localization failure

Leakage: The network leverages the fact that masks are cropped after assembly, and makes no attempt to suppress noise outside of the cropped region. This works fine when the bounding box is accurate, but when it is not, that noise can creep into the instance mask, creating some "leakage" from outside the cropped region. This can also happen when two instances are far away from each other, because the network has learned that it doesn't need to localize far away instances; the cropping will take care of it. For instance, Fig 6 exhibits this leakage because the mask branch deems the three skiers to be far enough away to not have to separate them.

Fig 6: YOLACT leakage

Final Words

As far as my survey goes, this is the best currently available architecture for balancing the trade-off between accuracy and speed. There are drawbacks in a few practical areas, like person detection in crowded scenes and mask quality for larger classes of objects, but the authors have made a great effort to create a baseline one-stage detector architecture that produces masks and bounding boxes in real time.

If you find any errors, please mail me at abhigoku10@gmail.com.

GitHub repo: https://github.com/dbolya/yolact


[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

[2] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NeurIPS, 2016.

[3] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.

[4] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

[7] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In CVPR, 2017.

[8] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.