[CV2019/PaperSummary] EfficientDet: Scalable and Efficient Object Detection

Dec 18, 2019

In object detection, balancing the trade-off between accuracy and efficiency is a major challenge. We want highly accurate detectors that run in real time and need little computation, so that they can be deployed easily on hardware platforms. EfficientDet attempts to minimize this trade-off and deliver a detector that is strong in both accuracy and efficiency.

Please note that this post is for my future self to look back and review the materials on this paper without reading it all over again


In this paper the authors study different SOTA architectures and propose two key optimizations for object detectors:

  1. Bi Directional Feature Pyramid Network (BiFPN)
  2. A compound scaling method that uniformly scales resolution, depth, and width for the backbone, feature network, and box/class prediction networks at the same time.

EfficientDet-D7 achieves 51.0 mAP on the COCO dataset with 52M parameters and 326B FLOPS, while being 4x smaller than the previous best detector.


Advances in object detector architectures have always faced a trade-off between accuracy and efficiency. Earlier work has made progress through one-stage detectors (rather than two-stage), anchor-free detectors, and model compression, but these approaches tend to gain efficiency at the cost of accuracy.

This paper therefore asks whether a detector can be both highly accurate and efficient, so that it can be deployed on smaller hardware with limited compute. The authors adopt the one-stage detector paradigm and identify two main challenges in the architecture design:

Challenge 1: Efficient multi-scale feature fusion. Previous work fuses features of different resolutions from different layers by simply summing them without distinction, even though these inputs usually contribute unequally to the fused output.

Challenge 2: Model scaling. Instead of relying on a bigger backbone or a larger input size, which scales only one factor, it is better to scale resolution, depth, and width jointly for the backbone, feature network, and box/class prediction networks.

To address these challenges, the authors' contributions can be summarized as follows:

  1. BiFPN for easy and fast multi-scale feature fusion
  2. Compound scaling to jointly scale the backbone, feature network, box/class networks, and resolution
  3. EfficientDet = BiFPN + compound scaling, for better accuracy and efficiency across a wide variety of resource constraints

2. Related Work

This section covers the prior work that the paper builds on.

Object Detectors: Two-stage detectors tend to be more flexible and accurate, but one-stage detectors are simpler and more efficient because they leverage predefined anchors.

Multi-Scale Feature Representations: Most detectors use a Feature Pyramid Network (FPN), which adds a top-down pathway to combine multi-scale features. Building on this idea, PANet adds an extra bottom-up path aggregation network on top of FPN, STDL proposes a scale-transfer module to exploit cross-scale features, and NAS-FPN uses neural architecture search to design the feature network topology automatically.

Model Scaling: To obtain better accuracy, it is common to scale up the backbone network, increase the input image size, increase the channel size, or repeat the feature network. EfficientNet jointly scales up network width, depth, and resolution, which gives high model efficiency for image classification.

3. BiFPN

BiFPN builds on FPN: it keeps FPN's strengths and addresses its weaknesses. BiFPN is the main contribution of the paper, and it is developed in the following sections.

  1. Multi-scale Feature Fusion
  2. Efficient bi-directional Cross Scale connections
  3. Weighted Feature Fusion
Fig1: Feature network design. (a) FPN introduces a top-down pathway to fuse multi-scale features from level 3 to 7 (P3 to P7); (b) PANet adds an additional bottom-up pathway on top of FPN; (c) NAS-FPN uses neural architecture search to find an irregular feature network topology; (d)-(f) are three alternatives studied in the paper. (d) adds expensive connections from all input features to output features; (e) simplifies PANet by removing nodes if they only have one input edge; (f) is the BiFPN with better accuracy and efficiency trade-offs.

3.1 Multi-scale Feature Fusion: The goal is to aggregate features at different resolutions (Eq-a). Fig1.(a) shows the conventional top-down FPN, which takes features from levels 3–7.

E.g. if the input resolution is 640x640, feature level 3 has resolution 80x80 (640/2³ = 80), and similarly feature level 7 has resolution 5x5 (640/2⁷ = 5).
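The arithmetic above is easy to check for every pyramid level. Here is a minimal sketch, assuming a square input; the function name `feature_map_size` is my own and not something from the paper.

```python
def feature_map_size(input_size: int, level: int) -> int:
    """Side length of the feature map at a pyramid level: input_size / 2**level."""
    return input_size // (2 ** level)


# Side lengths for levels P3..P7 with a 640x640 input.
sizes = {level: feature_map_size(640, level) for level in range(3, 8)}
for level, size in sizes.items():
    print(f"P{level}: {size}x{size}")
```

Each level halves the resolution of the previous one, which is why the fusion step needs a Resize op before features from neighbouring levels can be combined.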

Eq-a: Multi-scale fusion formula

The conventional FPN aggregates multi-scale features in a top-down manner, where Resize is usually an upsampling or downsampling op for resolution matching and Conv is usually a convolutional op for feature processing. Information flows in one direction only. PANet (Fig1.(b)) and NAS-FPN (Fig1.(c)) improve on this in different ways, but both are computationally more expensive.
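For reference, the conventional top-down FPN fusion described above can be written out as follows. This is my own transcription of the formulation, so treat the exact notation as an assumption: the superscripts in and out mark the input and output feature at each level.

```latex
\begin{aligned}
P_7^{out} &= \mathrm{Conv}\left(P_7^{in}\right) \\
P_6^{out} &= \mathrm{Conv}\left(P_6^{in} + \mathrm{Resize}\left(P_7^{out}\right)\right) \\
&\;\;\vdots \\
P_3^{out} &= \mathrm{Conv}\left(P_3^{in} + \mathrm{Resize}\left(P_4^{out}\right)\right)
\end{aligned}
```

Every output only depends on its own input and the output one level above, which is the one-way information flow that BiFPN tries to fix.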

3.2 Efficient Bidirectional Cross-Scale Connections: The authors studied the accuracy and efficiency of three feature networks: FPN, PANet, and NAS-FPN. PANet achieved better accuracy than FPN and NAS-FPN, but at the cost of more parameters and computation, so the authors propose several optimizations for cross-scale connections:

  1. Remove nodes that have only one input edge. The intuition is that a node with a single input edge performs no feature fusion, so it contributes less to a feature network whose purpose is to fuse different features.
  2. Add an extra edge from the original input to the output node when they are at the same level, in order to fuse more features without adding much cost.
  3. Treat each bidirectional (top-down and bottom-up) path as one feature network layer and repeat that layer multiple times to enable more high-level feature fusion. The compound scaling method determines the number of layers for different resource constraints (Fig1.(f)).
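To make the three rules concrete, here is a rough sketch that lists which inputs feed each node in one BiFPN layer, based on my reading of Fig1.(f). The function `bifpn_layer_edges` and the node names (`P6_in`, `P6_td`, `P6_out`) are hypothetical labels I chose for illustration, not identifiers from the paper or any library.

```python
def bifpn_layer_edges(min_level: int = 3, max_level: int = 7) -> dict:
    """Map each node of one BiFPN layer to the list of nodes feeding into it."""
    edges = {}

    # Top-down pass. Only the middle levels get an intermediate node: at the
    # top and bottom levels such a node would have a single input (rule 1).
    for level in range(max_level - 1, min_level, -1):
        above = level + 1
        upper = f"P{above}_td" if above < max_level else f"P{above}_in"
        edges[f"P{level}_td"] = [f"P{level}_in", upper]

    # Bottom-up pass. Middle levels also receive their original input
    # through the extra same-level edge (rule 2).
    edges[f"P{min_level}_out"] = [f"P{min_level}_in", f"P{min_level + 1}_td"]
    for level in range(min_level + 1, max_level):
        edges[f"P{level}_out"] = [f"P{level}_in", f"P{level}_td", f"P{level - 1}_out"]
    edges[f"P{max_level}_out"] = [f"P{max_level}_in", f"P{max_level - 1}_out"]
    return edges


for node, inputs in bifpn_layer_edges().items():
    print(f"{node} <- {', '.join(inputs)}")
```

Rule 3 then amounts to stacking this layer several times, feeding each layer's `_out` features in as the next layer's `_in` features.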

3.3 Weighted Feature Fusion: As mentioned, the common approach to fusing multiple features is to resize them to the same resolution and then sum them up. This treats all inputs equally, even though features at different resolutions usually contribute unequally to the output. The authors therefore add a weight for each input during feature fusion and let the network learn the importance of each input feature. There are three ways to implement the idea:

  1. Unbounded Fusion: O = Σᵢ wᵢ · Iᵢ

wᵢ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). The authors found that a scalar can achieve accuracy comparable to the other options at minimal computational cost. However, since the scalar weight is unbounded, it can cause training instability, so the authors resort to weight normalization to bound the value range of each weight.

  2. Softmax-based Fusion: O = Σᵢ (exp(wᵢ) / Σⱼ exp(wⱼ)) · Iᵢ

An intuitive idea is to apply softmax to the weights, so that all weights are normalized to probabilities in the range 0–1. However, the authors' ablation study shows that the extra softmax is computationally expensive and increases the latency of the network.

  3. Fast Normalized Fusion: O = Σᵢ (wᵢ / (ε + Σⱼ wⱼ)) · Iᵢ

Here wᵢ ≥ 0 is ensured by applying a ReLU after each wᵢ, and ε is a small value that avoids numerical instability. Each normalized weight again falls in the range 0–1, but since there is no softmax operation, this is much more efficient. The authors' ablation study shows that fast normalized fusion has very similar learning behavior and accuracy to softmax-based fusion, but runs up to 30% faster on GPUs.
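The three fusion variants differ only in how the raw weights are normalized before the weighted sum. Below is a small pure-Python sketch of that idea, using scalar (per-feature) weights and flat lists of floats as stand-ins for feature maps. The function names are mine, and `eps=1e-4` is an assumed default rather than a value stated in this post.

```python
import math


def softmax_weights(w):
    """Softmax-based fusion: weights become probabilities that sum to 1."""
    m = max(w)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]


def fast_normalized_weights(w, eps=1e-4):
    """Fast normalized fusion: ReLU keeps weights >= 0, then divide by their sum."""
    relu = [max(0.0, x) for x in w]
    total = eps + sum(relu)
    return [r / total for r in relu]


def fuse(features, weights):
    """Weighted sum of same-length features (already resized to one resolution)."""
    length = len(features[0])
    return [sum(wt * f[k] for wt, f in zip(weights, features)) for k in range(length)]


raw = [1.0, 3.0, -2.0]  # learnable scalars, one per input feature
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print("unbounded:", fuse(feats, raw))
print("softmax:", fuse(feats, softmax_weights(raw)))
print("fast normalized:", fuse(feats, fast_normalized_weights(raw)))
```

Note how the unbounded variant uses the raw weights directly, while the fast normalized variant avoids the exponentials that softmax needs.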

The final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion. As an example, consider the two fused features at level 6:

where P6^td is the intermediate feature at level 6 on the top-down pathway, and P6^out is the output feature at level 6 on the bottom-up pathway. To further improve efficiency, the authors use depthwise separable convolutions for feature fusion and add batch normalization and activation after each convolution.
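Written out, the two level-6 features combine fast normalized fusion with the bidirectional connections. The block below is my reconstruction of the formulas, so treat the notation as an assumption: unprimed weights belong to the top-down node and primed weights to the output node.

```latex
\begin{aligned}
P_6^{td} &= \mathrm{Conv}\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}\left(P_7^{in}\right)}{w_1 + w_2 + \epsilon}\right) \\
P_6^{out} &= \mathrm{Conv}\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}\left(P_5^{out}\right)}{w_1' + w_2' + w_3' + \epsilon}\right)
\end{aligned}
```

The intermediate node fuses two inputs, while the output node fuses three: the original input, the intermediate feature, and the resized output from the level below.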

4. EfficientDet

In this section, we discuss the network architecture and the new compound scaling method behind the EfficientDet family of detectors.

Figure 2: EfficientDet architecture. It employs EfficientNet [5] as the backbone network, BiFPN as the feature network, and a shared class/box prediction network. Both BiFPN layers and class/box net layers are repeated multiple times based on different resource constraints, as shown in Table 1.

4.1. EfficientDet Architecture: Figure 2 shows the overall architecture of EfficientDet, which largely follows the one-stage detector paradigm. The authors employ ImageNet-pretrained EfficientNets as the backbone network. The proposed BiFPN serves as the feature network: it takes level 3–7 features {P3, P4, P5, P6, P7} from the backbone and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class network and a box network to produce object class and bounding box predictions, respectively.

4.2. Compound Scaling: EfficientNet showed remarkable performance on image classification by jointly scaling up all dimensions of network width, depth, and input resolution. Inspired by this, the authors propose a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.

Object detectors have many more scaling dimensions than image classification models, so a grid search over all dimensions is prohibitively expensive. The authors therefore use a heuristic-based scaling approach that still follows the main idea of jointly scaling up all dimensions.

  1. Backbone network: The authors reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6 [5], so that the ImageNet-pretrained checkpoints can be reused.
  2. BiFPN network: The authors grow the BiFPN width (#channels) exponentially, as done in EfficientNets, but increase the depth (#layers) linearly, since depth needs to be rounded to small integers.

Eq1: Width and depth equations of the BiFPN network

  3. Box/class prediction network: The width is kept the same as the BiFPN's, but the depth (#layers) is increased linearly.

Eq2: Depth equation of the box/class prediction network

  4. Input image resolution: Since feature levels 3–7 are used in BiFPN, the input resolution must be divisible by 2⁷ = 128, so the resolution is increased linearly using the equation:

Eq3: Input resolution equation

Using equations (1), (2), and (3) with different values of ϕ, we go from EfficientDet-D0 (ϕ = 0) to EfficientDet-D6 (ϕ = 6). Models scaled up with ϕ ≥ 7 could not fit in memory without changing the batch size or other settings. The authors therefore expanded D6 to D7 by only enlarging the input size while keeping all other dimensions the same, so that the same training settings can be used for all models. Table 1 summarizes all these configurations:

Table 1: Scaling configs for EfficientDet D0-D7. φ is the compound coefficient that controls all other scaling dimensions; BiFPN, box/class net, and input size are scaled up using equations 1, 2, and 3 respectively. D7 has the same settings as D6 except for a larger input size.
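Since the equation images are not reproduced here, the sketch below shows how the three scaling rules could be evaluated for each ϕ. The constants (base width 64 with growth factor 1.35, BiFPN depth 2 + ϕ, head depth 3 + ⌊ϕ/3⌋, resolution 512 + 128·ϕ) are what I recall from the first arXiv version of the paper; please treat them as assumptions and check them against the paper, since later revisions adjust some values. The function name `efficientdet_scaling` is my own.

```python
def efficientdet_scaling(phi: int) -> dict:
    """Raw compound-scaling values for a given coefficient phi (assumed constants)."""
    return {
        "bifpn_width": 64 * (1.35 ** phi),      # Eq1: exponential width growth
        "bifpn_depth": 2 + phi,                 # Eq1: linear depth growth
        "head_depth": 3 + phi // 3,             # Eq2: box/class net depth
        "input_resolution": 512 + 128 * phi,    # Eq3: always a multiple of 128
    }


for phi in range(7):  # D0 .. D6
    cfg = efficientdet_scaling(phi)
    print(
        f"D{phi}: width~{cfg['bifpn_width']:.1f}, "
        f"BiFPN layers={cfg['bifpn_depth']}, "
        f"head layers={cfg['head_depth']}, "
        f"resolution={cfg['input_resolution']}"
    )
```

These are the raw formula values. The published configurations round the widths to convenient channel counts, so the numbers printed here will not match Table 1 exactly.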

5. Experiments

Training Details :

Dataset: COCO 2017

Preprocessing: same as RetinaNet [3]

Augmentation: auto-augmentation [6], as used by the AmoebaNet-based NAS-FPN detectors

Learning rate: linearly increased from 0 to 0.08 over the initial 5% of training steps (warm-up), then annealed down using cosine decay

Batch normalization: added after every convolution, with batch norm decay 0.997 and epsilon 1e-4

Exponential moving average: decay 0.9998

Loss: the commonly used focal loss [3] with α = 0.25 and γ = 1.5, and anchor aspect ratios {1/2, 1, 2}. The models are trained with batch size 128 on 32 TPUv3 chips.
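The learning-rate schedule in the training details (linear warm-up to 0.08 over the first 5% of steps, then cosine decay) can be sketched in a few lines. This is only an illustration under my own assumptions: the function name, the step-based formulation, and decaying all the way to zero are not stated in the paper summary above.

```python
import math


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 0.08, warmup_fraction: float = 0.05) -> float:
    """Linear warm-up to peak_lr, then cosine decay towards zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Warm-up: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in (0, 25, 50, 500, 1000):
    print(step, round(learning_rate(step, total), 5))
```

The warm-up keeps early updates small while the randomly initialized detection heads settle, and the cosine phase lowers the rate smoothly instead of in discrete drops.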

Testing Details :

Test Set: COCO 2017 test set

Performance: EfficientDet-D1 achieves similar accuracy with up to 8x fewer parameters and 25x fewer FLOPS compared to RetinaNet and Mask R-CNN.

EfficientDet-D7 achieves a new state-of-the-art 51.0 mAP for single-model single-scale, while still being 4x smaller and using 9.3x fewer FLOPS than the previous best detector.

Table2 shows the performance results

Table2: EfficientDet performance on COCO. Results are for single-model single-scale. #Params and #FLOPS denote the number of parameters and multiply-adds. LAT denotes inference latency with batch size 1. AA denotes auto-augmentation [6]. Models are grouped together if they have similar accuracy, and the ratio or speedup between EfficientDet and other detectors is compared within each group.

The authors also compare real-world latency on a Titan V GPU and a single-thread Xeon CPU. Fig3 illustrates the comparison of model size, GPU latency, and single-thread CPU latency. For a fair comparison, these figures only include results measured on the same machine.

Fig3: Model size and inference latency comparison. Latency is measured with batch size 1 on the same machine equipped with a Titan V GPU and Xeon CPU. AN denotes AmoebaNet + NAS-FPN trained with auto-augmentation [6]. The EfficientDet models are 4x to 6.6x smaller, 2.3x to 3.2x faster on GPU, and 5.2x to 8.1x faster on CPU than other detectors.

6. Ablation Study

The authors ablate the various design choices of the EfficientDet architecture.

6.1. Disentangling Backbone and BiFPN

EfficientDet uses both a powerful backbone and a new BiFPN. Table 3 compares the impact of the backbone and of BiFPN. Starting from a RetinaNet detector [3] with a ResNet-50 backbone and top-down FPN [2], the authors first replace the backbone with EfficientNet-B3, which improves accuracy by about 3 mAP with slightly fewer parameters and FLOPS. Further replacing FPN with the proposed BiFPN gives an additional gain of about 4 mAP with far fewer parameters and FLOPS.

These results suggest that EfficientNet backbones and BiFPN are both crucial for the final models.

Table3: Disentangling backbone and BiFPN

6.2. BiFPN Cross-Scale Connections

Table 4: Comparison of different feature networks

Table 4 shows the accuracy and model complexity for feature networks with different cross-scale connections. The conventional top-down FPN is inherently limited by its one-way information flow and thus has the lowest accuracy. With additional weighted feature fusion, BiFPN achieves the best accuracy with fewer parameters and FLOPS.

6.3. Softmax vs Fast Normalized Fusion:

Table 5: Comparison of different feature fusion

Table 5 compares the softmax and fast normalized fusion approaches in three detectors with different model sizes. The fast normalized fusion approach achieves similar accuracy to the softmax-based fusion, but runs 1.26x to 1.31x faster on GPUs.

6.4. Compound Scaling

Fig4: Comparison of different scaling methods

Fig4 compares compound scaling with alternative methods that scale up a single dimension of resolution, depth, or width. Although all methods start from the same baseline detector, compound scaling achieves better efficiency than the others, suggesting the benefit of jointly scaling and thereby better balancing the different architecture dimensions.

7. Conclusion

  1. The proposed weighted bidirectional feature network and the customized compound scaling method improve both the accuracy and the efficiency of object detectors.
  2. These optimizations lead to a new family of detectors, named EfficientDet.
  3. EfficientDet-D7 achieves state-of-the-art accuracy with an order-of-magnitude fewer parameters and FLOPS than the best existing detector.
  4. EfficientDet is also up to 3.2x faster on GPUs and 8.1x faster on CPUs.


[1] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention networks. BMVC, 2018.

[2] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. CVPR, 2017.

[3] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. ICCV, 2017.

[4] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. CVPR, 2018.

[5] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. ICML, 2019.

[6] Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V. Le. Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172, 2019.

Final Words ….

If you find any errors, please mail me at abhigoku10@gmail.com.