[CVPR2021/PaperSummary]YOLOX: Exceeding YOLO Series in 2021

Aug 8, 2021

Please note that this post is for my future self to look back and review the materials on this paper without reading it all over again….

Fig1: Speed-accuracy trade-off of accurate models (top) and size-accuracy curve of lite models (bottom)

Of all the YOLO versions released since YOLOv3, YOLOX is, from my perspective, the one that improves the most across the board. YOLO became popular almost instantly as an object detector that could be deployed in real-time applications, and people have since created many variants that improve accuracy and inference time. Let's take a look at what this version has to offer.

In this paper, the authors improve the detector by switching to an anchor-free design and adding advanced techniques such as a decoupled head and a leading label assignment strategy, SimOTA. They also reduce the number of parameters and the computation, and improve FPS while increasing accuracy on the standard COCO dataset. Following the current trend, YOLOX is released as a family of models (YOLOX-Nano, YOLOX-Tiny, YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X), so that one can be picked to fit the use case.


There is always a trade-off between speed and accuracy in real-time object detection. YOLOv5 holds the best trade-off so far, with 48.2% AP on COCO at 13.7 ms, but recent object detection research has focused on the following directions:

a. Anchor-free detectors

b. Label assignment strategies

c. End-to-end (NMS-free) detectors

The authors apply these advances to the YOLOv3-SPP architecture, because YOLOv4 [1] and YOLOv5 [11] are anchor-based pipelines that may be over-optimized for anchors. With these changes they improve on YOLOv3 in both accuracy and speed. They also release YOLOX-Tiny, YOLOX-Nano, and S/M/L versions, with ONNX, TensorRT, NCNN, and OpenVINO support.

The author's team has won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model.

2. Proposed Architecture

2.1. YOLOX-DarkNet53

The author has chosen Yolov3[25] with Darknet-53 as the baseline architecture. In the following sections, we shall see the step-by-step changes in the base architecture.

a. Implementation details: Training settings are kept mostly the same from the YOLOv3 baseline to the final model. The authors train for 300 epochs with a 5-epoch warmup, using SGD with a learning rate of lr × BatchSize/64 (linear scaling [8]), an initial lr = 0.01, and a cosine lr schedule. The weight decay is 0.0005, the momentum is 0.9, and the batch size is 128.

The input image size is drawn from 448 to 832 in steps of 32, and inference speed is measured in FP16 on a Tesla V100.
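To make the schedule concrete, here is a minimal sketch of the linear scaling rule, warmup, and cosine decay described above. The linear warmup ramp and the decay all the way to zero are my assumptions (the summary only says "5-epoch warmup" and "cosine lr schedule"), and the function names are hypothetical.

```python
import math

def base_lr(batch_size, lr_per_64=0.01):
    """Linear scaling rule: lr = lr_per_64 * batch_size / 64."""
    return lr_per_64 * batch_size / 64

def lr_at_epoch(epoch, total_epochs=300, warmup_epochs=5, batch_size=128):
    """Learning rate at a (possibly fractional) epoch."""
    peak = base_lr(batch_size)
    if epoch < warmup_epochs:
        # Assumption: linear ramp from 0 up to the peak during warmup.
        return peak * epoch / warmup_epochs
    # Assumption: cosine decay from the peak down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Multi-scale training sizes: 448 to 832 in steps of 32.
input_sizes = list(range(448, 832 + 1, 32))
```

With a batch size of 128 the peak learning rate works out to 0.01 × 128 / 64 = 0.02, and there are 13 candidate input sizes.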

b. YOLOv3 baseline: The baseline uses the Darknet-53 backbone with an SPP layer added. Compared with the original implementation, the authors make a few changes to the training strategy:

  1. EMA weight updating
  2. Cosine LR schedule
  3. IoU loss with an IoU-aware branch
  4. BCE loss for training the cls and obj branches
  5. IoU loss for training the reg branch
  6. Only RandomHorizontalFlip, ColorJitter, and multi-scale training are used for data augmentation.
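Two of the items in this list are easy to show in a few lines. The sketch below is illustrative only: the EMA decay value and the plain `1 - IoU` form of the loss are my assumptions, since the summary does not state either.

```python
def ema_update(ema_weights, model_weights, decay=0.9998):
    """One EMA step: ema = decay * ema + (1 - decay) * model.

    The default decay is a hypothetical value, not taken from the paper.
    """
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, model_weights)]

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred_box, gt_box):
    """Assumed form of the regression loss: 1 - IoU."""
    return 1.0 - iou(pred_box, gt_box)
```

For example, two 2×2 boxes offset by one pixel in each direction overlap in a 1×1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.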

With these enhancements, the baseline achieves 38.5% AP on COCO val, as shown in Tab. 1.

Tab 1: Roadmap of YOLOX-Darknet53 in terms of AP (%) on COCO val.

c. Decoupled head: Two analytical experiments by the authors indicate that the coupled detection head may harm performance.

1). Replacing YOLO’s head with a decoupled one greatly improves the convergence speed, as shown in Fig. 2.

Fig2: Training curves for detectors with YOLOv3 head or decoupled head.

2). The decoupled head is essential to the end-to-end version of YOLO. Tab. 2 shows that adding the end-to-end property costs 4.2% AP with the coupled head, while the drop shrinks to 0.8% AP with a decoupled head.

Tab 2: The effect of decoupled head for end-to-end YOLO in terms of AP (%) on COCO

The authors replace the YOLO detection head with a lite decoupled head, as in Fig. 3. Concretely, it contains a 1 × 1 conv layer to reduce the channel dimension, followed by two parallel branches with two 3 × 3 conv layers each. Inference time is reported with batch=1 on a V100 in Tab. 1; the lite decoupled head adds 1.1 ms (11.6 ms vs. 10.5 ms).

Fig3: Illustration of the difference between YOLOv3 head and the proposed decoupled head
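As a rough illustration of the difference, the sketch below lists the conv layers of a lite decoupled head next to the output width of a coupled head. The hidden width of 256, the layer names, and the three separate 1 × 1 prediction layers are my assumptions for illustration; only the "1 × 1 conv, then two branches of two 3 × 3 convs" layout comes from the text above.

```python
def coupled_head_channels(num_classes, num_anchors=3):
    """A coupled head predicts box (4), objectness (1), and classes in one conv."""
    return num_anchors * (4 + 1 + num_classes)

def decoupled_head_layers(in_channels, num_classes, hidden=256, num_anchors=1):
    """List (name, kernel_size, in_ch, out_ch) for each conv in the head."""
    # 1x1 conv that reduces the channel dimension.
    layers = [("stem", 1, in_channels, hidden)]
    # Two parallel branches, each with two 3x3 convs.
    for branch in ("cls", "reg"):
        layers.append((branch + "_conv1", 3, hidden, hidden))
        layers.append((branch + "_conv2", 3, hidden, hidden))
    # Assumption: separate 1x1 prediction layers for class, box, and objectness.
    layers.append(("cls_pred", 1, hidden, num_anchors * num_classes))
    layers.append(("reg_pred", 1, hidden, num_anchors * 4))
    layers.append(("obj_pred", 1, hidden, num_anchors * 1))
    return layers
```

For COCO's 80 classes, the coupled head emits 3 × (4 + 1 + 80) = 255 channels from a single conv, while the decoupled head keeps classification and regression features separate until the final prediction layers.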

d. Strong data augmentation: The authors add Mosaic and MixUp to the augmentation strategy to boost YOLOX’s performance.

Mosaic is an efficient augmentation strategy proposed by ultralytics-YOLOv3. MixUp [12] was originally designed for image classification and was then adapted in BoF [8] for object detection training.

MixUp [12] and Mosaic are used during training and switched off for the last 15 epochs, achieving 42.0% AP in Tab. 1. With strong augmentation in place, all models are trained from scratch.
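A minimal sketch of the two ideas in this subsection: a detection-style MixUp that blends two images and keeps the boxes of both, and a helper that decides whether strong augmentation is active in a given epoch. Both functions are hypothetical illustrations, not the authors' implementation, and the fixed blending weight is an assumption.

```python
def strong_aug_enabled(epoch, total_epochs=300, no_aug_epochs=15):
    """True while Mosaic/MixUp are on; False for the final no_aug_epochs epochs."""
    return epoch < total_epochs - no_aug_epochs

def mixup(img_a, boxes_a, img_b, boxes_b, lam=0.5):
    """Blend two same-sized grayscale images (nested lists) and keep all boxes.

    lam is the blending weight of img_a; 0.5 is an assumed default.
    """
    blended = [[lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
               for row_a, row_b in zip(img_a, img_b)]
    return blended, boxes_a + boxes_b
```

With 300 epochs (indexed from 0), epochs 0 to 284 use strong augmentation and epochs 285 to 299 do not.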

e. Anchor-free: The anchor mechanism has several known problems.

First, to achieve optimal detection performance, one needs to run a clustering analysis before training to determine a set of optimal anchors, and the clustered anchors are domain-specific.

Second, the anchor mechanism increases the complexity of the detection heads, as well as the number of predictions for each image. On edge devices, moving that many predictions between devices (e.g., from NPU to CPU) can become the major latency bottleneck.

The anchor-free mechanism significantly reduces the number of design parameters that need heuristic tuning and the tricks involved (e.g., Anchor Clustering, Grid Sensitive) for good performance, making the detector, especially its training and decoding phases, considerably simpler [10].

Switching YOLO to an anchor-free manner is quite simple:

1) Reduce the predictions for each location from 3 to 1 and make them directly predict four values, i.e., two offsets relative to the top-left corner of the grid cell, and the height and width of the predicted box.

2) Assign the center location of each object as the positive sample and pre-define a scale range, as done in [10], to designate the FPN level for each object. This reduces the parameters and GFLOPs of the detector while still improving accuracy (42.9% AP in Tab. 1).
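The two steps above can be sketched as follows. The exponential parameterization of width and height and the strides of 8, 16, and 32 are assumptions borrowed from common YOLO practice; the summary only says that four values are predicted per location.

```python
import math

def decode_anchor_free(tx, ty, tw, th, grid_x, grid_y, stride):
    """Turn one raw prediction into a box (cx, cy, w, h) in image pixels."""
    # Offsets are measured from the top-left corner of the grid cell.
    cx = (grid_x + tx) * stride
    cy = (grid_y + ty) * stride
    # Assumption: width and height are predicted in log space, scaled by stride.
    w = math.exp(tw) * stride
    h = math.exp(th) * stride
    return cx, cy, w, h

def num_predictions(img_size, strides=(8, 16, 32), per_location=1):
    """Total predictions for a square image over all FPN levels."""
    return sum((img_size // s) ** 2 for s in strides) * per_location
```

For a 640 × 640 input, going from 3 predictions per location to 1 cuts the output from 25200 to 8400 boxes under these assumed strides, which is the reduction in prediction count that the anchor-free design is after.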

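f. Multi positives: The anchor-free version above selects only one positive sample (the center location) per object and ignores other high-quality predictions nearby. The authors therefore assign the center 3×3 area as positives, also named “center sampling” in FCOS [10]. The detector improves to 45.0% AP as in Tab. 1, already surpassing the current best practice of ultralytics-YOLOv3 (44.3% AP).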
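A small sketch of center sampling on one FPN level: given the object's center in pixels, it returns the grid cells in the 3×3 window around the center cell. The function name and its clipping at the feature-map border are my own choices for illustration.

```python
def center_sampling(cx, cy, stride, grid_w, grid_h, radius=1):
    """Grid cells (gx, gy) in the (2*radius+1)^2 window around the center cell."""
    gx0, gy0 = int(cx // stride), int(cy // stride)
    cells = []
    for gy in range(gy0 - radius, gy0 + radius + 1):
        for gx in range(gx0 - radius, gx0 + radius + 1):
            # Drop cells that fall outside the feature map.
            if 0 <= gx < grid_w and 0 <= gy < grid_h:
                cells.append((gx, gy))
    return cells
```

An object centered at (100, 60) on a stride-8 level falls in cell (12, 7), so it gets 9 positive cells instead of 1; an object near the image corner gets fewer because the window is clipped.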

g. SimOTA: The authors studied OTA [2] and drew four key insights for advanced label assignment: 1). loss/quality awareness, 2). center prior, 3). a dynamic number of positive anchors for each ground truth (abbreviated as dynamic top-k), 4). global view.

OTA meets all four rules above, so it is chosen as the candidate label assignment strategy. Moreover, it considers the assignment from a global perspective and formulates it as an Optimal Transport (OT) problem.

In practice, the authors found that solving the OT problem via the Sinkhorn-Knopp algorithm adds 25% extra training time, which is quite expensive when training for 300 epochs. They therefore simplify it to a dynamic top-k strategy, named SimOTA, to get an approximate solution.

SimOTA first calculates a pair-wise matching degree, represented by cost [4, 5, 12, 2] or quality, for each prediction-gt pair. For example, in SimOTA, the cost between gt $g_i$ and prediction $p_j$ is calculated as in Eq1:

Eq1: SimOTA cost function

$$c_{ij} = L_{ij}^{cls} + \lambda L_{ij}^{reg}$$

where λ is a balancing coefficient, and $L_{ij}^{cls}$ and $L_{ij}^{reg}$ are the classification loss and regression loss between gt $g_i$ and prediction $p_j$. Then, for gt $g_i$, the top k predictions with the least cost within a fixed center region are selected as its positive samples.

SimOTA not only reduces the training time but also avoids the additional solver hyperparameters of the Sinkhorn-Knopp algorithm. As shown in Tab. 1, SimOTA raises the detector from 45.0% AP to 47.3% AP, higher than the SOTA ultralytics-YOLOv3 by 3.0% AP, showing the power of the advanced assignment strategy.
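To make the dynamic top-k idea more tangible, here is a simplified sketch of a SimOTA-style assignment in plain Python. Several details are assumptions that this summary does not state: the regression loss is taken as `1 - IoU`, λ defaults to 3.0, k for each gt is estimated as the floor of the sum of its top-10 IoUs (at least 1), and a prediction claimed by several gts goes to the one with the lowest cost. Treat it as an illustration of the technique rather than the authors' code.

```python
def simota_assign(cls_loss, ious, in_center, lam=3.0, n_candidate=10):
    """Assign predictions to ground truths with a dynamic top-k rule.

    cls_loss, ious, in_center are nested lists indexed as [gt][prediction].
    Returns a dict mapping prediction index -> gt index for positives.
    """
    num_gt = len(ious)
    num_pred = len(ious[0]) if num_gt else 0
    big = 1e9  # hypothetical penalty for predictions outside the center region
    # cost_ij = L_cls + lam * L_reg, with L_reg assumed to be 1 - IoU.
    cost = [[cls_loss[i][j] + lam * (1.0 - ious[i][j])
             + (0.0 if in_center[i][j] else big)
             for j in range(num_pred)] for i in range(num_gt)]
    assigned = {}
    for i in range(num_gt):
        # Dynamic k: roughly how many predictions already overlap this gt well.
        top_ious = sorted(ious[i], reverse=True)[:n_candidate]
        k = max(1, int(sum(top_ious)))
        ranked = sorted(range(num_pred), key=lambda j: cost[i][j])[:k]
        for j in ranked:
            if not in_center[i][j]:
                continue
            # Resolve conflicts: keep the gt with the lower cost.
            if j not in assigned or cost[i][j] < cost[assigned[j]][j]:
                assigned[j] = i
    return assigned
```

Because k is computed per ground truth, an object that many predictions already fit well receives more positives than one with poor overlaps, without running an optimal transport solver.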

h. End-to-end YOLO: Following [3], the authors add two additional conv layers, one-to-one label assignment, and stop gradient to the architecture. These let the detector run end-to-end, but slightly decrease both the performance and the inference speed, as listed in Tab. 1.

2.2. Other Backbones

Besides Darknet-53, the authors test YOLOX on other backbones of different sizes, where YOLOX consistently improves on the corresponding counterparts.

a. Modified CSPNet in YOLOv5: For a fair comparison, the authors adopt the exact YOLOv5 [11] backbone, including the modified CSPNet [6], SiLU activation, and the PAN [7] head, and follow its scaling rule to produce the different model sizes. Comparison results are shown in Tab. 3.

Table 3: Comparison of YOLOX and YOLOv5

b. Tiny and Nano detectors: For mobile devices, the authors adopt depthwise convolution to construct a YOLOX-Nano model, which has only 0.91M parameters and 1.08G FLOPs.

The comparison of YOLOX-Tiny with YOLOv4-Tiny [1] is shown in Tab. 4 below.

Table 4: Comparison of YOLOX-Tiny and YOLOX-Nano
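To see why depthwise convolution helps keep the Nano model so small, the sketch below compares the weight count of a standard conv with a depthwise conv followed by a 1 × 1 pointwise conv. Pairing the depthwise conv with a pointwise conv is my assumption (it is the usual construction), and biases and normalization layers are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise
```

For a 3 × 3 layer with 256 input and 256 output channels, the standard conv needs 589824 weights while the depthwise separable version needs 67840, roughly 8.7 times fewer.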

c. Model size and data augmentation: The authors keep almost the same training parameters for all models, but find that the suitable augmentation strategy varies with model size, as shown in Tab. 5.

While applying MixUp to YOLOX-L improves AP by 0.9%, it is better to weaken the augmentation for small models. Specifically, the authors remove the MixUp augmentation and weaken the Mosaic (reducing the scale range from [0.1, 2.0] to [0.5, 1.5]) when training small models, i.e., YOLOX-S, YOLOX-Tiny, and YOLOX-Nano. This modification improves YOLOX-Nano’s AP from 24.0% to 25.3%.

Larger models benefit from stronger augmentation such as MixUp [12] with scale jittering. The authors compare it against Copypaste [13]: as shown in Tab. 5, the two methods achieve competitive performance, indicating that MixUp with scale jittering is a qualified replacement for Copypaste when no instance mask annotation is available.

Table 5: Effects of data augmentation under different model sizes.
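The size-dependent augmentation rule described in this subsection can be written down as a small lookup. The function name and dictionary keys are hypothetical; the model names and scale ranges are the ones quoted above.

```python
# Models that get the weakened augmentation, as listed in the text above.
SMALL_MODELS = {"YOLOX-S", "YOLOX-Tiny", "YOLOX-Nano"}

def augmentation_for(model_name):
    """Return a hypothetical augmentation config for a YOLOX model."""
    if model_name in SMALL_MODELS:
        # Small models: no MixUp, narrower Mosaic scale range.
        return {"mixup": False, "mosaic_scale": (0.5, 1.5)}
    # Larger models: MixUp on, full Mosaic scale range.
    return {"mixup": True, "mosaic_scale": (0.1, 2.0)}
```

For instance, `augmentation_for("YOLOX-Nano")` turns MixUp off and narrows the Mosaic scale range, while `augmentation_for("YOLOX-L")` keeps both at full strength.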

3. Comparison with the SOTA

Tab. 6 gives the customary SOTA comparison table, while Fig. 1 plots a somewhat controlled speed/accuracy curve, measured on the same hardware and code base. There are also some high-performance YOLO models with larger sizes, such as Scaled-YOLOv4 [5] and YOLOv5-P6 [11], and current Transformer-based detectors [9] push the accuracy SOTA to ∼60 AP.

Table 6: Comparison of the speed and accuracy of different object detectors on COCO 2017 test-dev

4. Conclusion

The authors apply several recent techniques, i.e., a decoupled head, an anchor-free design, and an advanced label assignment strategy, to the YOLOv3 architecture, which is still one of the most widely used detectors in industry thanks to its broad compatibility. The resulting architecture, YOLOX, achieves a better trade-off between speed and accuracy than its counterparts across all model sizes.

Writer’s Conclusion


  • Advanced techniques are implemented that boost performance.
  • A good trade-off between accuracy and inference speed, suitable for real-time applications.
  • The architecture is the same across the whole series, i.e., for both large and small models.


  • Incorporating attention-based methods might have boosted performance further.
  • Accuracy on small-object detection datasets and on smaller objects still needs to improve.
  • Fails in night scenes and in heavily occluded scenarios.


[1] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

[2] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In CVPR, 2021.

[3] Qiang Zhou, Chaohui Yu, Chunhua Shen, Zhibin Wang, and Hao Li. Object detection made simpler by eliminating heuristic nms. arXiv preprint arXiv:2101.11782, 2021.

[4] Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. In CVPR, 2020.

[5] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Scaled-yolov4: Scaling cross stage partial network. arXiv preprint arXiv:2011.08036, 2020.

[6] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. Cspnet: A new backbone that can enhance learning capability of cnn. In CVPR workshops, 2020.

[7] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.

[8] Zhi Zhang, Tong He, Hang Zhang, Zhongyuan Zhang, Junyuan Xie, and Mu Li. Bag of freebies for training object detection neural networks. arXiv preprint arXiv:1902.04103, 2019.

[9] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.

[10] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.

[11] Glenn Jocher et al. yolov5. https://github.com/ultralytics/yolov5, 2021.

[12] Zhang Hongyi, Cisse Moustapha, N. Dauphin Yann, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ICLR, 2018.

[13] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.

Github Link : https://github.com/Megvii-BaseDetection/YOLOX

If any errors found please mail me at abhigoku10@gmail.com…*\(^o^)/*