14 min read · Jul 25, 2022


[CVPR2021/PaperSummary] Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation

Please note that this post is for my future self to look back and review the materials on this paper without reading it all over again…

Keypoint estimation is a challenging computer vision task, and it matters because it helps us understand human behavior in fine detail. The problem is hard for several reasons: people who occupy only a few pixels have joints that are difficult to see, people appear in widely varying poses, and so on. In this paper, the author does an amazing job of performing both object detection and keypoint estimation with a single network.

In this paper, the author proposes a novel heatmap-free keypoint estimation method in which individual keypoints and sets of spatially related keypoints (i.e., poses) are modeled as objects within a dense single-stage anchor-based detection framework.

The author calls the model KAPAO (pronounced “Ka-Pow!”), for Keypoints And Poses As Objects. There are several versions of the architecture; one of them, KAPAO-L, achieves an AP of 70.6 on the COCO Keypoints validation set without test-time augmentation while being 2.5× faster than the next best single-stage model, whose accuracy is 4.0 AP lower. On the CrowdPose test set, KAPAO-L achieves new state-of-the-art accuracy for a single-stage method with an AP of 68.9.

1. Introduction

In computer vision, keypoint estimation involves localizing points of interest in images. Keypoint estimation plays an important role in several related applications, including human pose estimation, hand pose estimation, action recognition, object detection, multi-person tracking, facial and object landmark detection, and sports analytics.

The most common method for estimating keypoint locations is to generate target fields, referred to as heatmaps, by centering 2D Gaussians with small variances on the target keypoint coordinates. A deep convolutional neural network is then used to regress the target heatmaps from the input images, and keypoint predictions are made via the arguments of the maxima of the predicted heatmaps [1]. If a peak in the heatmap surpasses a predefined confidence threshold, a keypoint is detected. The drawbacks of these methods are:

  1. First, these methods suffer from quantization error: the precision of a keypoint prediction is inherently limited by the spatial resolution of the heatmap. Larger heatmaps are advantageous, but processing them at higher resolutions is computationally costly.
  2. Second, when two keypoints of the same type (i.e., class) appear in close proximity to one another, the overlapping heatmap signals may be mistaken for a single keypoint.
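To make the first drawback concrete, here is a minimal NumPy sketch of heatmap encoding and argmax decoding. The image size, heatmap stride, and keypoint position are made-up numbers for illustration, and the helper names are mine, not from the paper.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=1.0):
    """Render a 2D Gaussian centred on (cx, cy), given in heatmap coordinates."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_argmax(heatmap, stride):
    """Take the peak cell of the heatmap and scale it back to image pixels."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return col * stride, row * stride

# Hypothetical setup: a 256x256 image, a 64x64 heatmap (stride 4),
# and a ground-truth keypoint at image position (101.0, 57.0).
stride = 4
true_x, true_y = 101.0, 57.0
hm = gaussian_heatmap(64, 64, true_x / stride, true_y / stride)
pred_x, pred_y = decode_argmax(hm, stride)

# The peak can only sit on a heatmap cell, so the decoded position is
# snapped to a multiple of the stride.
quantization_error = np.hypot(pred_x - true_x, pred_y - true_y)
```

With a stride of 4, the decoded keypoint lands on (100, 56) instead of (101, 57), so the error is about 1.4 px even though the heatmap itself is perfect.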

The author introduces a new heatmap-free keypoint detection method in which keypoints are modeled as objects within a dense anchor-based detection framework, by representing each keypoint at the center of a small square keypoint bounding box, and applies it to single-stage multi-person human pose estimation. The resulting method, KAPAO (for Keypoints And Poses As Objects), builds on a recent implementation of the “You Only Look Once” (YOLO) dense detection network.

2. Related Work

  1. Heatmap-free Keypoint Detection: DeepPose regressed keypoint coordinates directly from images using a cascade of deep neural networks that iteratively refined the keypoint predictions. Heatmap regression has nevertheless remained prevalent in human pose estimation, despite the computational inefficiencies associated with generating heatmaps and the inherent issue of quantization error. Direct keypoint regression has also been attempted using Transformers [2].
  2. Single-stage Human Pose Estimation: Human pose estimation generally falls into two categories:

Single-stage methods [Bottom-Up]: These methods predict the poses of every person in an image in a single forward pass through a single network. They are less accurate than their two-stage counterparts, but they usually perform better in crowded scenes and are often preferred because of their simplicity and efficiency.

Two-stage methods [Top-down]: These methods first detect the people in an image using an off-the-shelf person detector (e.g., Faster R-CNN, YOLOv3), and then estimate a pose for each person detection.

  3. Single-stage Object Detection: Deep learning-based object detection models can be categorized into two groups based on whether they use one or two stages.

Two-stage detectors, popularized by the R-CNN family, generate a sparse set of candidate object locations in the first stage and classify each candidate into a foreground object class or background in the second stage.

Single-stage detectors unify the process by simultaneously classifying objects and regressing their locations over a dense grid. Popular single-stage object detectors include the Single Shot MultiBox Detector (SSD) and YOLO.

3. Proposed Method

In the proposed approach to multi-person human pose estimation, the author trains a dense detection network to simultaneously predict a set of keypoint objects $\{\hat{O}^k \in \hat{\mathbf{O}}^k\}$ and a set of pose objects $\{\hat{O}^p \in \hat{\mathbf{O}}^p\}$, collectively $\hat{\mathbf{O}} = \hat{\mathbf{O}}^k \cup \hat{\mathbf{O}}^p$.

A keypoint object $O^k$ is an adaptation of the conventional object representation in which the coordinates of a keypoint are represented at the center $(b_x, b_y)$ of a small bounding box $b$ with equal width $b_w$ and height $b_h$: $b = (b_x, b_y, b_w, b_h)$.

The hyperparameter $b_s$ is the keypoint bounding box size (i.e., $b_s = b_w = b_h$). There are $K$ classes of keypoint objects, one for each keypoint type in the labeled dataset.

The author considers a pose object $O^p$ to be an extension of the conventional object representation that additionally includes a set of keypoints associated with the object. Pose objects are applied to human pose estimation via the detection of human pose objects, each comprising a bounding box of class “person” and a set of keypoints $z = \{(x_k, y_k)\}_{k=1}^{K}$ that coincide with anatomical landmarks.
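The two object types can be sketched as plain data structures. This is only an illustration of the representation described above; the class and function names are hypothetical and are not taken from the KAPAO code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (bx, by, bw, bh), centre format

@dataclass
class KeypointObject:
    cls: int   # keypoint type index in [0, K)
    box: Box   # small square box centred on the keypoint

@dataclass
class PoseObject:
    box: Box                             # bounding box of class "person"
    keypoints: List[Tuple[float, float]] # z = {(x_k, y_k)}, one entry per keypoint type

def make_keypoint_object(x: float, y: float, k: int, bs: float) -> KeypointObject:
    """Place a bs-by-bs box so that its centre is the keypoint location."""
    return KeypointObject(cls=k, box=(x, y, bs, bs))

def objects_from_person(person_box: Box, keypoints, bs: float):
    """Turn one labeled person into a pose object plus one keypoint object per keypoint."""
    pose = PoseObject(box=person_box, keypoints=list(keypoints))
    kp_objects = [make_keypoint_object(x, y, k, bs) for k, (x, y) in enumerate(keypoints)]
    return pose, kp_objects
```

For example, `objects_from_person((50, 60, 40, 100), [(45, 20), (55, 20)], bs=6.0)` yields one pose object and two keypoint objects of classes 0 and 1, each with a 6×6 box.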

Both object representations possess unique advantages. Keypoint objects are specialized for the detection of individual keypoints that are characterized by strong local features. However, keypoint objects carry no information regarding a person or pose, so if they were used on their own for multi-person human pose estimation, a bottom-up grouping method would be needed to parse the detected keypoints into human poses.

Pose objects, in contrast, are better suited for localizing keypoints with weak local features, as they enable the network to learn the spatial relationships within a set of keypoints.

The author designs a network to simultaneously detect both object types with minimal computational overhead using a single, shared network head.

During inference, the more precise keypoint object detections are fused with the human pose detections using a simple tolerance-based matching algorithm that improves the accuracy of the human pose predictions without sacrificing any significant amount of inference speed.

3.1 Architectural Details

It uses a deep convolutional neural network $\mathcal{N}$ to map an RGB input image $I \in \mathbb{R}^{h \times w \times 3}$ to a set of four output grids $\hat{\mathcal{G}} = \{\hat{G}^s\}$ containing the object predictions $\hat{\mathbf{O}}$, where $s \in \{8, 16, 32, 64\}$ and $\hat{G}^s \in \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times N_a \times N_o}$:

$N_a$ is the number of anchor channels.

$N_o$ is the number of output channels for each object.

$\mathcal{N}$ is a YOLO-style feature extractor that makes extensive use of Cross-Stage-Partial (CSP) bottlenecks within a feature pyramid.

To provide flexibility for different speed requirements, the author trains three sizes of KAPAO models (i.e., S, M, L) by scaling the number of layers and channels in $\mathcal{N}$.

The features in an output grid cell $\hat{G}^s_{i,j}$ are conditioned on the image patch $I_p = I_{si:s(i+1),\, sj:s(j+1)}$. Therefore, if the center of a target object $(b_x, b_y)$ is situated in $I_p$, the output grid cell $\hat{G}^s_{i,j}$ is responsible for detecting it. The receptive field of an output grid increases with $s$, so smaller output grids are better suited for detecting larger objects.

The output grid cell $\hat{G}^s_{i,j}$ contains $N_a$ anchor channels corresponding to anchor boxes $A_s = \{(A_{w_a}, A_{h_a})\}_{a=1}^{N_a}$. A target object $O$ is assigned to an anchor channel via tolerance-based matching of the object and anchor box sizes.

This provides redundancy, such that a single grid cell $\hat{G}^s_{i,j}$ can detect multiple objects, and it lets the anchor channels handle different object sizes.

The $N_o$ output channels of $\hat{G}^s_{i,j,a}$ contain the properties of a predicted object $\hat{O}$, including the “objectness” $\hat{p}_o$ (the probability that an object exists), the intermediate bounding box $\hat{t}' = (\hat{t}'_x, \hat{t}'_y, \hat{t}'_w, \hat{t}'_h)$, the object class scores $\hat{c} = (\hat{c}_1, \dots, \hat{c}_{K+1})$, and the intermediate keypoints $\hat{v}' = \{(\hat{v}'_{x_k}, \hat{v}'_{y_k})\}_{k=1}^{K}$ for the human pose objects. Hence, $N_o = 3K + 6$.
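The channel count is just bookkeeping: 1 objectness value, 4 box values, K+1 class scores, and 2K keypoint coordinates give 3K + 6. The short sketch below computes the output grid shapes for a given input size; the default of three anchor channels per grid is my assumption (it is the usual YOLOv5 setting) and the function name is hypothetical.

```python
def output_grid_shapes(h, w, num_keypoints, num_anchors=3, strides=(8, 16, 32, 64)):
    """Return {stride: (h/s, w/s, Na, No)} for each output grid.

    num_anchors=3 is an assumed default, not a value stated in this post.
    """
    # objectness (1) + box (4) + class scores (K + 1) + keypoints (2K)
    no = 1 + 4 + (num_keypoints + 1) + 2 * num_keypoints
    assert no == 3 * num_keypoints + 6
    return {s: (h // s, w // s, num_anchors, no) for s in strides}

# COCO has K = 17 keypoints; a 1280x1280 input is used as an example size.
shapes = output_grid_shapes(1280, 1280, num_keypoints=17)
```

For K = 17 this gives 57 output channels per anchor, with grids of 160×160, 80×80, 40×40, and 20×20 cells for strides 8, 16, 32, and 64.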

An object’s bounding box $\hat{t}$ is predicted in grid coordinates, relative to the grid cell origin $(i, j)$, from the intermediate box $\hat{t}'$.

A pose object’s keypoints $\hat{v}$ are likewise predicted in grid coordinates, relative to the grid cell origin $(i, j)$, from the intermediate keypoints $\hat{v}'$.

The sigmoid function $\sigma$ is used to facilitate learning by constraining the ranges of the object properties. To learn $\hat{t}$ and $\hat{v}$, losses are applied in the grid space. Sample targets $t$ and $v$ are shown in Figure 4.
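Written out, the two mappings described above take roughly the following form. I have reconstructed this from the YOLOv5-style decoding that KAPAO builds on, so treat the constants as an assumption to check against the paper; $A_w$ and $A_h$ denote the width and height of the matched anchor box.

```latex
\begin{aligned}
\hat{t}_x &= 2\sigma(\hat{t}'_x) - 0.5, &
\hat{t}_y &= 2\sigma(\hat{t}'_y) - 0.5, \\
\hat{t}_w &= \frac{A_w}{s}\left(2\sigma(\hat{t}'_w)\right)^2, &
\hat{t}_h &= \frac{A_h}{s}\left(2\sigma(\hat{t}'_h)\right)^2, \\
\hat{v}_{x_k} &= \frac{A_w}{s}\left(4\sigma(\hat{v}'_{x_k}) - 2\right), &
\hat{v}_{y_k} &= \frac{A_h}{s}\left(4\sigma(\hat{v}'_{y_k}) - 2\right).
\end{aligned}
```

Under these ranges, the box center can sit up to half a cell outside its own grid cell ($\hat{t}_x, \hat{t}_y \in (-0.5, 1.5)$), the box size is at most four times the anchor size, and each keypoint lies within two anchor widths or heights of the cell origin.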

3.2 Loss Function

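A target set of grids $\mathcal{G}$ is constructed and a multi-task loss $\mathcal{L}(\hat{\mathcal{G}}, \mathcal{G})$ is applied to learn $\hat{p}_o$ ($\mathcal{L}_{obj}$), $\hat{t}$ ($\mathcal{L}_{box}$), $\hat{c}$ ($\mathcal{L}_{cls}$), and, if the object is a pose object, $\hat{v}$ ($\mathcal{L}_{kps}$).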

The loss components are computed for a single image. In these components, $\omega_s$ is the grid weighting, BCE is the binary cross-entropy, IoU is the complete intersection over union (CIoU) [3], and $\nu_k$ is the visibility flag of the target keypoint $v_k$.

When $G^s_{i,j,a}$ represents a target object $O$, the target objectness $p_o = 1$ is multiplied by the IoU score to promote specialization amongst the anchor channel predictions. When $G^s_{i,j,a}$ is not a target object, $p_o = 0$.

The total loss $\mathcal{L}$ is a weighted sum of the loss components, scaled by the batch size $N_b$.
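For reference, the loss terms described above can be written out as follows. This is my reconstruction from the symbol definitions in this section rather than a verbatim copy, so the normalization terms in particular are an assumption: $n(\cdot)$ counts grid cells or target objects, $\delta(\cdot)$ is 1 when its condition holds and 0 otherwise, and the $\lambda$ terms are the loss weights.

```latex
\begin{aligned}
\mathcal{L}_{obj} &= \sum_{s} \frac{\omega_s}{n(G^s)} \sum_{G^s}
    \mathrm{BCE}\left(\hat{p}_o,\; p_o \cdot \mathrm{IoU}(\hat{t}, t)\right) \\
\mathcal{L}_{box} &= \sum_{s} \frac{1}{n(O \in G^s)} \sum_{O \in G^s}
    \left(1 - \mathrm{IoU}(\hat{t}, t)\right) \\
\mathcal{L}_{cls} &= \sum_{s} \frac{1}{n(O \in G^s)} \sum_{O \in G^s}
    \mathrm{BCE}(\hat{c}, c) \\
\mathcal{L}_{kps} &= \sum_{s} \frac{1}{n(O^p \in G^s)} \sum_{O^p \in G^s}
    \sum_{k=1}^{K} \delta(\nu_k > 0)\, \lVert \hat{v}_k - v_k \rVert_2 \\
\mathcal{L} &= N_b \left(\lambda_{obj}\mathcal{L}_{obj} + \lambda_{box}\mathcal{L}_{box}
    + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{kps}\mathcal{L}_{kps}\right)
\end{aligned}
```

Note that only the keypoint term is restricted to pose objects, and within it only keypoints with a positive visibility flag contribute.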

3.3 Inference

The predicted bounding boxes $\hat{t}$ and keypoints $\hat{v}$ are then mapped from grid coordinates back to the original image coordinates.
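Since the predictions live in grid coordinates relative to the cell origin, mapping them back amounts to adding the cell origin and multiplying by the stride. The sketch below assumes that form, with the first grid index paired with x; the function name is hypothetical.

```python
import numpy as np

def to_image_coords(t_hat, v_hat, i, j, s):
    """Map a grid-space box and grid-space keypoints back to image pixels.

    t_hat: (4,) box (x, y, w, h) in grid units, relative to cell origin (i, j)
    v_hat: (K, 2) keypoints in grid units, relative to cell origin (i, j)
    s:     stride of the output grid (pixels per cell)
    Assumes the mapping is "add the cell origin, then scale by s".
    """
    t_hat = np.asarray(t_hat, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    box = s * (t_hat + np.array([i, j, 0.0, 0.0]))      # only the centre is offset
    keypoints = s * (v_hat + np.array([i, j], dtype=float))
    return box, keypoints

# Example with made-up values: cell (3, 5) on the stride-8 grid.
box, kps = to_image_coords([0.5, 0.5, 2.0, 4.0], [[-1.0, 0.5], [1.5, 2.0]], i=3, j=5, s=8)
```

In this example the box becomes (28, 44, 16, 32) in pixels and the two keypoints become (16, 44) and (36, 56).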

$\hat{G}^s_{i,j,a}$ represents a positive pose object detection $\hat{O}^p$ if its confidence $\hat{p}_o \cdot \max(\hat{c})$ is greater than a threshold $\tau_{cp}$ and $\arg\max(\hat{c}) = 1$.

$\hat{G}^s_{i,j,a}$ represents a positive keypoint object detection $\hat{O}^k$ if $\hat{p}_o \cdot \max(\hat{c}) > \tau_{ck}$ and $\arg\max(\hat{c}) \neq 1$, where the keypoint object class is $\arg\max(\hat{c}) - 1$.
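The two rules above can be sketched in a few lines of NumPy. The text uses 1-based class indices, so the "person" pose class is column 0 here; the function name and the threshold values in the example are placeholders of mine.

```python
import numpy as np

def split_detections(p_obj, class_scores, tau_cp, tau_ck):
    """Split raw predictions into pose objects and keypoint objects.

    p_obj:        (N,) objectness scores
    class_scores: (N, K + 1) class scores, column 0 being the pose class
    Returns indices of pose detections, indices of keypoint detections,
    and the 0-based keypoint type of each keypoint detection.
    """
    p_obj = np.asarray(p_obj, dtype=float)
    class_scores = np.asarray(class_scores, dtype=float)
    conf = p_obj * class_scores.max(axis=1)   # confidence = objectness * best class score
    cls = class_scores.argmax(axis=1)
    pose_idx = np.flatnonzero((conf > tau_cp) & (cls == 0))
    kp_idx = np.flatnonzero((conf > tau_ck) & (cls != 0))
    kp_type = cls[kp_idx] - 1                 # shift past the pose class
    return pose_idx, kp_idx, kp_type

# Toy example with K = 2 keypoint types and placeholder thresholds.
pose_idx, kp_idx, kp_type = split_detections(
    p_obj=[0.9, 0.8, 0.1, 0.7],
    class_scores=[[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.9, 0.0, 0.1], [0.1, 0.1, 0.9]],
    tau_cp=0.5,
    tau_ck=0.5,
)
```

Here detection 0 is kept as a pose object, detections 1 and 3 are kept as keypoint objects of types 0 and 1, and detection 2 is dropped for low confidence.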

To remove redundant detections and obtain the candidate pose objects $\hat{\mathbf{O}}^{p\prime}$ and keypoint objects $\hat{\mathbf{O}}^{k\prime}$, the positive pose object detections $\hat{\mathbf{O}}^p$ and positive keypoint object detections $\hat{\mathbf{O}}^k$ are filtered using non-maximum suppression (NMS) with the IoU thresholds $\tau_{bp}$ and $\tau_{bk}$, respectively.
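For completeness, a plain greedy NMS looks like the sketch below. It is the textbook algorithm, not KAPAO's actual implementation, and it assumes boxes in corner format (x1, y1, x2, y2), so center-format boxes would need converting first.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]
    return keep

# Two heavily overlapping boxes and one separate box; the threshold is a placeholder.
kept = nms([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], [0.9, 0.8, 0.7], iou_thresh=0.5)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed and boxes 0 and 2 remain.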

The human pose predictions $\hat{\mathbf{P}} = \{\hat{P}_i \in \mathbb{R}^{K \times 3}\}$ for $i \in \{1, \dots, n(\hat{\mathbf{O}}^{p\prime})\}$ are obtained by fusing the candidate keypoint objects with the candidate pose objects using a distance tolerance $\tau_{fd}$.

To promote correct matches of keypoint objects to poses, the keypoint objects are only fused to pose objects with $\hat{p}_o \cdot \max(\hat{c}) > \tau_{fc}$.

The keypoint object fusion function $\varphi$ is provided in Algorithm 1.
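My reading of the fusion step is sketched below: each keypoint object is matched to the nearest same-type keypoint among the sufficiently confident poses, and it overwrites that keypoint if it is within the distance tolerance and more confident. This is a sketch under that reading, not a copy of Algorithm 1; the function name and argument layout are mine.

```python
import numpy as np

def fuse_keypoints(pose_kps, pose_conf, kp_xy, kp_type, kp_conf, tau_fc, tau_fd):
    """Fuse candidate keypoint objects into candidate pose objects.

    pose_kps:  (P, K, 2) array of keypoints from the pose objects
    pose_conf: (P,) pose confidences
    kp_xy:     (M, 2) centres of the keypoint objects
    kp_type:   (M,) 0-based keypoint type of each keypoint object
    kp_conf:   (M,) keypoint object confidences
    Returns poses of shape (P, K, 3) holding x, y, and keypoint confidence.
    """
    pose_kps = np.asarray(pose_kps, dtype=float)
    num_poses, num_kps, _ = pose_kps.shape
    poses = np.zeros((num_poses, num_kps, 3))
    poses[..., :2] = pose_kps                  # pose keypoints start with confidence 0
    eligible = np.flatnonzero(np.asarray(pose_conf) > tau_fc)
    if eligible.size == 0:
        return poses
    for xy, k, conf in zip(kp_xy, kp_type, kp_conf):
        xy = np.asarray(xy, dtype=float)
        dists = np.linalg.norm(poses[eligible, k, :2] - xy, axis=1)
        nearest = eligible[np.argmin(dists)]
        # Overwrite only if close enough and more confident than what is there.
        if dists.min() < tau_fd and poses[nearest, k, 2] < conf:
            poses[nearest, k] = (xy[0], xy[1], conf)
    return poses

# Toy example: two poses with K = 2 keypoints, three keypoint objects.
fused = fuse_keypoints(
    pose_kps=[[[10.0, 10.0], [20.0, 20.0]], [[100.0, 100.0], [110.0, 110.0]]],
    pose_conf=[0.9, 0.2],
    kp_xy=[[11.0, 10.0], [101.0, 100.0], [50.0, 50.0]],
    kp_type=[0, 0, 1],
    kp_conf=[0.8, 0.7, 0.6],
    tau_fc=0.3,
    tau_fd=5.0,
)
```

In the toy example only the first keypoint object is fused: the second one is near a pose that is below the confidence threshold, and the third is too far from any pose keypoint. Keypoints that are never fused keep a confidence of 0, which is why the final keypoint confidences are sparse.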

3.4 Limitations

1. The author notes that one limitation of the method is that pose objects do not include individual keypoint confidences, so the human pose predictions typically contain a sparse set of keypoint confidences $\hat{P}_i[:, 3]$ populated by the fused keypoint objects.

2. Training requires a considerable amount of compute due to the large input size used.

4. Experiments

KAPAO is evaluated on two multi-person human pose estimation datasets: COCO Keypoints and CrowdPose.

4.1 COCO Keypoints

Microsoft COCO Keypoints is a large-scale multi-person human pose estimation dataset.

# of images = 200k

# of persons = 250k person instances

# of keypoints = 17 (K = 17)

The models are trained on the train2017 split, containing 118k images and 150k person instances, and validated on the 5k images in val2017.

The final evaluation is done on the test-dev split, which contains 20k images.

The accuracy metrics are the average precision (AP) and average recall (AR) scores based on the Object Keypoint Similarity (OKS): AP (mean AP over OKS $\in \{0.50, 0.55, \dots, 0.95\}$), AP50 (AP at OKS = 0.50), AP75, APM (medium objects), APL (large objects), and AR (mean AR over OKS $\in \{0.50, 0.55, \dots, 0.95\}$).
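OKS plays the role that IoU plays in box detection. A minimal sketch of the standard COCO definition is below; the per-keypoint constants `sigmas` normally come from the COCO evaluation code, and the values used in the example are placeholders rather than the real ones.

```python
import numpy as np

def oks(pred_xy, gt_xy, visibility, area, sigmas):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.

    pred_xy, gt_xy: (K, 2) keypoint coordinates
    visibility:     (K,) ground-truth visibility flags (0 = not labeled)
    area:           ground-truth object area, used as the squared scale
    sigmas:         (K,) per-keypoint falloff constants
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    gt_xy = np.asarray(gt_xy, dtype=float)
    labeled = np.asarray(visibility) > 0
    if not labeled.any():
        return 0.0
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)      # squared distances
    k2 = (2.0 * np.asarray(sigmas, dtype=float)) ** 2
    similarity = np.exp(-d2 / (2.0 * area * k2))     # per-keypoint similarity in (0, 1]
    return float(similarity[labeled].mean())          # average over labeled keypoints

# Placeholder example: two keypoints, object area of 1000 px^2.
gt = [[10.0, 10.0], [30.0, 40.0]]
perfect = oks(gt, gt, [2, 2], area=1000.0, sigmas=[0.025, 0.025])
shifted = oks([[12.0, 10.0], [30.0, 43.0]], gt, [2, 2], area=1000.0, sigmas=[0.025, 0.025])
```

A perfect prediction scores 1.0, and the score decays toward 0 as keypoints move away from the ground truth, faster for small objects and for keypoint types with small sigmas.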


# of epochs = 500

Optimizer = stochastic gradient descent with Nesterov momentum.

The input images were resized and padded to 1280×1280, keeping the original aspect ratio.
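The resize-and-pad step (often called letterboxing) can be sketched as follows. How the padding is split between the two sides is not stated here, so the function only returns the total padding; the function name is hypothetical.

```python
def letterbox_params(h, w, target=1280):
    """Compute the resize scale, resized shape, and total padding
    needed to fit an h-by-w image into a target-by-target square
    while keeping the aspect ratio."""
    scale = target / max(h, w)            # the longest side becomes `target`
    new_h, new_w = round(h * scale), round(w * scale)
    pad_h, pad_w = target - new_h, target - new_w
    return scale, (new_h, new_w), (pad_h, pad_w)

# Example: a 480x640 (height x width) image.
scale, new_hw, pad_hw = letterbox_params(480, 640)
```

For a 480×640 image the scale is 2.0, the resized image is 960×1280, and 320 rows of padding are needed to reach 1280×1280.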

Data augmentation used during the training included mosaic, HSV color-space perturbations, horizontal flipping, translations, and scaling.
Many of the training hyperparameters were inherited from YOLOv5, including the anchor boxes $A$ and the loss weights $\omega$, $\lambda_{obj}$, $\lambda_{box}$, and $\lambda_{cls}$.

The keypoint bounding box size $b_s$ and the keypoint loss weight $\lambda_{kps}$ were manually tuned using a small grid search.

The models were trained on four V100 GPUs with 32 GB memory each using batch sizes of 128, 72, and 48 for KAPAO-S, KAPAO-M, and KAPAO-L, respectively.

Validation was performed after every epoch, saving the model weights that provided the highest validation AP.


The inference parameters (τcp, τck, τbp, τbk, τfd, and τfc) were manually tuned on the validation set.

For test-time augmentation (TTA), the input image is scaled by factors of 0.8, 1, and 1.2, and the unscaled image is also horizontally flipped.

During post-processing, the multi-scale detections are concatenated before running NMS. When not using TTA, the network is fed rectangular images (i.e., 1280 px on the longest side), which marginally reduces the accuracy but increases the inference speed.


Table 1 compares the accuracy and latency (sum of forward pass and post-processing) of KAPAO with the state-of-the-art single-stage methods DEKR and HigherHRNet on val2017.

The latency is measured on a TITAN Xp GPU with a batch size of 1. Because KAPAO does not use heatmaps, its post-processing requires ∼1 to 2 orders of magnitude less time than that of DEKR and HigherHRNet.

When not using TTA, KAPAO-L is 2.5× faster and 4.0 AP more accurate than the next best model HigherHRNet-W48.

When using TTA, KAPAO-L provides competitive AP, state-of-the-art AR, and is roughly 5× faster than both DEKR-48 and HigherHRNet-W48.

In Table 2, the author compares the accuracy of KAPAO with single-stage and two-stage methods on test-dev.

KAPAO-L achieves state-of-the-art AR and falls within 0.2 AP of previous single-stage methods, noting that DEKR uses a model-agnostic rescoring network that adds ∼0.5 AP.

4.2 CrowdPose

It is a multi-person human pose estimation dataset comprising images of crowded scenes with heavy occlusion.

# of images = 20k

# of persons = 80k person instances

# of keypoints = 14 (K = 14)

The models are trained on the trainval split, containing 12k images, and evaluated on the test split, which contains 8k images.

The author trains for 300 epochs instead of 500 and does not perform any validation.

The dataset accuracy metrics are similar to COCO and include AP, AP50, and AP75. APE, APM, and APH are additionally considered for images with easy, medium, and hard Crowd Index.

Table 3 compares the accuracy of KAPAO against state-of-the-art methods.

KAPAO excels in the presence of occlusion, improving upon all previous single-stage methods across all metrics.

KAPAO’s proficiency in crowded scenes is clear when analyzing APE, APM, and APH: KAPAO-L and DEKR-W48 perform equally on images with easy Crowd Index.

KAPAO-L is 1.1 AP more accurate for both medium and hard Crowd Indexes (more occlusion).

4.3 Ablation Studies

The author analyzes the influence of one of KAPAO’s important hyperparameters:

The keypoint bounding box size $b_s$: Five KAPAO-S models were trained on COCO train2017 for 50 epochs using normalized keypoint bounding box sizes $b_s/w \in \{0.01, 0.025, 0.05, 0.075, 0.1\}$. The validation AP is plotted in Figure 5.

The results are consistent with the prior work of McNally et al. [34]: $b_s/w < 2.5\%$ destabilizes training, leading to poor accuracy, and the optimal $b_s/w$ is observed around 5% (used in the Section 4 experiments).

Contrary to McNally et al., the accuracy degrades quickly for $b_s/w > 5\%$. The author hypothesizes that a large $b_s$ in this application interferes with pose object learning.

As shown in Table 4, keypoint object fusion adds no less than 1.0 AP and over 3.0 AP in some cases. Moreover, keypoint object fusion is fast; the added post-processing time per image is ≤ 1.7 ms on COCO and ≤ 4.5 ms on CrowdPose.

Relative to the time required for the forward pass of the network, these are small increases.

Fusion of keypoint objects by class : Figure 6 plots the fusion rates for each keypoint type for KAPAO-S with no TTA on COCO val2017.

The fusion rate is equal to the number of fused keypoint objects divided by the number of keypoints of that type in the dataset.

Because the number of human pose predictions is generally greater than the actual number of person instances in the dataset, the fusion rate can be greater than 1.

Keypoints that are characterized by distinct local image features (e.g., the eyes, ears, and nose) have higher fusion rates as they are detected more precisely as keypoint objects than as pose objects.

5. Conclusion

KAPAO is a new heatmap-free keypoint estimation method based around modeling keypoints and poses as objects.

KAPAO can be effectively applied to the problem of single-stage multi-person human pose estimation by detecting human pose objects directly.

KAPAO is significantly faster than previous single-stage methods, which are impeded by heatmap post-processing and bottom-up keypoint grouping.

KAPAO performs well in the presence of heavy occlusion, as evidenced by new state-of-the-art accuracy for a single-stage method on CrowdPose.


[1] Jonathan J. Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NeurIPS, 2014.

[2] Ke Li, Shijie Wang, Xiang Zhang, Yifan Xu, Weijian Xu, and Zhuowen Tu. Pose recognition with cascade transformers. In CVPR, 2021.

[3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.

Writer’s Conclusion


Pros

1. A single-stage detector + keypoint estimator architecture

2. High inference speed and less computation due to the single-stage pipeline

3. Can be trained for other use cases, such as animal pose estimation and vehicle keypoint estimation datasets

Cons

1. Accuracy drops depending on the KAPAO version used (i.e., S/M/L)

2. Requires a huge dataset for training on different object classes

3. Accuracy drops as the number of keypoints per object class increases

If you find any errors, please mail me at abhigoku10@gmail.com…*\(^o^)/*