[ICCV2019/PaperSummary]Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization

Apr 30, 2021

Please note that this post is for my future self to look back and review the materials on this paper without reading it all over again….

Person Attribute Recognition

Attribute recognition is an important topic in video surveillance: detecting a pedestrian and recognizing that person's attributes adds considerable value. Localizing the recognized attributes is icing on the cake and has varied uses in surveillance systems. The most challenging part is annotation, since both attribute datasets and detection datasets are expensive to label. The author of this paper proposes a method that needs less annotation and still localizes the recognized attributes effectively.

In the paper, the author proposes a flexible Attribute Localization Module (ALM) that discovers the discriminative regions and learns the local features for each attribute at multiple levels. To localize specific attributes, the author uses a feature pyramid architecture (FPN) [1], which combines low-level details with high-level semantics. The proposed architecture is end-to-end trainable, employs multi-level deep supervision [2][3], and has been tested on the pedestrian attribute datasets PETA, RAP, and PA-100K.

1. Introduction

Pedestrian attributes such as gender, clothing color, clothing type, age, and body type have great potential in video surveillance applications such as face verification, person identification/re-identification, and person retrieval. Following the huge success of CNNs, many methods recognize attributes by learning powerful features from images.

Some methods recognize the attributes of a person by considering it as a multi-label classification problem and extracting feature representations from the whole image.

Some other methods use the attention mechanism to learn the discriminative features in the image where they generate masks for certain layers and then multiply them with the corresponding features to extract the attentive features.

The remaining methods leverage predefined rigid parts or external localization modules: they extract local features from the localized body parts and fuse these part-based features with global features. However, they fail to indicate which attribute belongs to which region.

Considering the above-mentioned limitations, the author proposes a flexible Attribute Localization Module (ALM) that can automatically locate the discriminative regions and extract region-based feature representations in an attribute-specific manner. ALM combines a channel-based attention mechanism with a spatial transformer [4] to localize the different attributes at different feature levels. Training is performed with deep supervision [2][3], and a maximum voting method produces the final predictions across all the feature levels.

The contributions of this paper are as follows:

  1. An end-to-end trainable framework that performs attribute-specific localization at multiple scales to discover the most discriminative attribute regions in a weakly supervised manner.
  2. A feature pyramid architecture that leverages both low-level details and high-level semantics to enhance multi-scale attribute localization and region-based feature learning in a mutually reinforcing manner. The multi-scale attribute predictions are further fused by an effective voting scheme.
  3. Extensive experiments on three publicly available pedestrian attribute datasets (PETA, RAP, and PA-100K) that show significant improvement over the previous state-of-the-art methods.

2. Related Work

Pedestrian Attribute Recognition: Early pedestrian attribute recognition methods rely on handcrafted features such as color and texture histograms and train classifiers separately, but the performance of these traditional methods is not satisfactory. Convolutional Neural Networks have achieved great success in pedestrian attribute recognition, and recent approaches attempt to exploit the spatial and semantic relations among attributes to further improve recognition performance. These methods can be classified into three basic categories:

(1) Relation-based: They exploit semantic relations to assist attribute recognition. Wang et al. [5] propose a CNN-RNN based framework to exploit the interdependency and correlation among attributes. However, these methods require manually defined rules, e.g. prediction order and attribute groups, which are hard to determine in real applications.

(2) Attention-based: They introduce the visual attention mechanism into attribute recognition. Liu et al. [6] propose a multi-directional attention model to learn multi-scale attentive features for pedestrian analysis. However, these methods are attribute-agnostic and fail to take attribute-specific information into consideration.

(3) Part-based: The part-based methods usually extract features from some localized body parts. Zhu et al. [7] divide the whole image into 15 rigid patches and fuse features from different patches. These methods rely either on predefined rigid parts or on sophisticated part localization mechanisms, which are less robust to pose variances and require extra computational resources. The proposed method instead localizes the most discriminative regions in an attribute-specific manner, which is not considered in most of the existing works.

Weakly Supervised Attention Localization: The idea of performing attention localization without region annotations has also been extensively investigated in other visual tasks. Jaderberg et al. [4] propose the well-known Spatial Transformer Network (STN), which can extract attentional regions with any spatial transformation in an end-to-end trainable manner. In contrast to these works, the proposed method can adaptively localize the individual informative regions for each attribute.

Feature Pyramid Architecture: The proposed feature pyramid architecture is similar to Feature Pyramid Networks (FPN) [1], which have been studied in various object detection and segmentation models. In this work, for the first time, the author attempts to employ the idea to localize attentive regions for pedestrian attribute recognition.

3. Proposed Method

An overview of the proposed method is shown in Figure 1. The proposed framework consists of the following:

a. A main network with feature pyramid structures, and

b. A group of Attribute Localization Modules (ALM, Figure 2) applied at different feature levels.

Figure 1. Overview of the proposed framework. The input pedestrian image is fed into the main network with both bottom-up and top-down pathways. Features combined from different levels are fed into multiple Attribute Localization Modules (Figure 2), which perform attribute-specific localization and region-based feature learning. Outputs from different branches are trained with deep supervision and aggregated through an element-wise maximum operation for inference. M is the total number of attributes. Best viewed in color.

The input pedestrian image is first fed into the main network without additional region annotations, and a prediction vector is obtained at the end of the bottom-up pathway. Each ALM only performs attribute localization and region-based feature learning for one attribute at a single feature level. The ALMs at different feature levels are trained in a deep supervision manner.

The input is a pedestrian image $I$ along with its corresponding attribute labels $y = [y^1, y^2, \ldots, y^M]^T$, where $M$ is the total number of attributes in the dataset and $y^m$, $m \in \{1, \ldots, M\}$, is a binary label: $y^m = 1$ indicates the presence of the m-th attribute, and $y^m = 0$ its absence.

3.1. Network Architecture

The author adopts the BN-Inception [8] architecture as the backbone network, though the backbone can be replaced with any other CNN architecture. Features in deeper CNN layers have coarser resolution: even though attributes can be localized from their semantically strong features, it is difficult to extract region-based discriminative features because some details are no longer available. In contrast, features in lower layers capture richer details but poorer contextual information. Given these complementary properties, the author uses an FPN-style design with a bottom-up pathway and a top-down pathway to enhance attribute localization and region-based feature learning in a mutually reinforcing manner.

Bottom-up pathway: It is implemented by the BN-Inception network, which consists of multiple inception blocks at different feature levels. The author conducts attribute localization with bottom-up features generated from three levels: the incep_3b, incep_4d, and incep_5b blocks, which have strides of {8, 16, 32} pixels with respect to the input image.

The selected inception blocks are all at the end of their corresponding stages, where blocks of the same stage keep the same feature map resolution.

Given an input image $I$, we denote the bottom-up features generated from the above blocks as $\phi_i(I) \in \mathbb{R}^{H_i \times W_i \times C_i}$, $i \in \{1, 2, 3\}$. For 256 × 128 RGB input images, the spatial sizes $H_i \times W_i$ are 32 × 16, 16 × 8, and 8 × 4 respectively.
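The spatial sizes follow directly from dividing the input resolution by each stride. A minimal sketch of that arithmetic (the function name `feature_map_sizes` is made up for this note):

```python
def feature_map_sizes(height, width, strides=(8, 16, 32)):
    """Return the (H_i, W_i) grid for each stride, assuming exact divisibility."""
    sizes = []
    for s in strides:
        if height % s or width % s:
            raise ValueError(f"input {height}x{width} is not divisible by stride {s}")
        sizes.append((height // s, width // s))
    return sizes


sizes = feature_map_sizes(256, 128)
print(sizes)  # [(32, 16), (16, 8), (8, 4)]
```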

Top-down pathway: This pathway contains three lateral connections and two top-down connections, as shown in Figure 1.

The lateral connections are simply used to reduce the dimensionalities of bottom-up features to d, where d = 256 is used in the current implementation.

The higher-level features are transmitted through the top-down connections and go through an upsampling operation. Afterward, features from adjacent levels are concatenated by Eq1:

Eq1: Feature concatenation

where f is a 1×1 convolutional layer for dimensionality reduction, g refers to upsampling with nearest-neighbor interpolation.

The author performs dimensionality reduction for φ3(I) using Eq2:

Eq2: Dimensionality reduction

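The channel size of $X_i$ is $3d$, $2d$, and $d$ for $i = 1, 2, 3$ respectively, since $X_3$ is only reduced to $d$ channels and each lower level concatenates $d$ more lateral channels onto the upsampled features from the level above. The combined features $X_i$ are used for attribute-specific localization.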
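To keep the channel bookkeeping straight, here is a shape-only sketch of the top-down pathway. It assumes the two equations have the form $X_3 = f(\phi_3(I))$ and $X_i = \mathrm{concat}(f(\phi_i(I)), g(X_{i+1}))$ for $i \in \{1, 2\}$, which is my reading of the definitions of f and g above rather than a quote from the paper. The function name and the bottom-up channel counts are placeholders for illustration. Under this reading $X_3$ has $d$ channels, $X_2$ has $2d$, and $X_1$ has $3d$.

```python
def top_down_shapes(bottom_up, d=256):
    """bottom_up: list of (H, W, C) from fine to coarse. Returns shapes of X_1..X_n."""
    # Coarsest level: only the lateral 1x1 conv f, so the channel count becomes d.
    top_h, top_w, _ = bottom_up[-1]
    combined = [(top_h, top_w, d)]
    # Walk from coarse to fine: concatenate d lateral channels with the
    # nearest-neighbour-upsampled features g(X_{i+1}) of the level above.
    for h, w, _ in reversed(bottom_up[:-1]):
        up_h, up_w, up_c = combined[0]
        scale = h // up_h
        assert (up_h * scale, up_w * scale) == (h, w), "levels must differ by an integer factor"
        combined.insert(0, (h, w, d + up_c))
    return combined


# The bottom-up channel counts (320, 608, 1024) are placeholder values.
shapes = top_down_shapes([(32, 16, 320), (16, 8, 608), (8, 4, 1024)])
print(shapes)  # [(32, 16, 768), (16, 8, 512), (8, 4, 256)]
```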

3.2. Attribute Localization Module

Figure 2. Details of the proposed Attribute Localization Module (ALM), which consists of a tiny channel attention sub-network and a simplified spatial transformer. The ALM takes the combined features Xi as input and produces an attribute-specific prediction. Each ALM only serves one attribute at a single level.

Attribute-specific localization is a better choice since it can disentangle the confused attention masks into several individual regions, where each region corresponds to a specific attribute, as shown in Figure 2. Moreover, the learned attribute-specific regions are more interpretable since we can observe the attribute-region correspondence intuitively.

Two mechanisms can be used to learn a discriminative region in the feature map as a bounding box:

1. RoI pooling: It requires region annotations, which are not available in pedestrian attribute datasets.

2. Spatial Transformer Network (STN) [4]: The proposed Attribute Localization Module (ALM) builds on STN to automatically discover the discriminative regions for each attribute in a weakly supervised manner.

STN is a differentiable module that is capable of applying a spatial transformation to a feature map, e.g. cropping, translation, and scaling.

In this paper, the author adopts a simplified version of STN in which the attribute region is treated as a simple bounding box, realized through the transformation in Eq3:

Eq3: Transformation matrix

where $s_x$, $s_y$ are scaling parameters and $t_x$, $t_y$ are translation parameters; the expected bounding box can be obtained from these four parameters. $(x_i^s, y_i^s)$ and $(x_i^t, y_i^t)$ are the source coordinates and target coordinates of the i-th pixel.

To accelerate convergence, the author constrains $s_x$, $s_y$ to (0, 1) and $t_x$, $t_y$ to (-1, 1) with a sigmoid and a tanh activation, respectively. The ALM also includes a tiny channel-attention sub-network, as shown in Figure 2.
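A small sketch of how the four constrained parameters define a box. It assumes the transformation has the usual STN affine form restricted to scaling and translation, $x^s = s_x x^t + t_x$ and $y^s = s_y y^t + t_y$, with coordinates normalized to [-1, 1] as in the original STN. The function names are made up for this note.

```python
import math


def constrained_theta(raw_sx, raw_sy, raw_tx, raw_ty):
    """Squash four unconstrained outputs into (s_x, s_y, t_x, t_y)."""
    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))
    # Sigmoid keeps the scales in (0, 1); tanh keeps the translations in (-1, 1).
    return sigmoid(raw_sx), sigmoid(raw_sy), math.tanh(raw_tx), math.tanh(raw_ty)


def target_to_source(xt, yt, theta):
    """Map a target (output-grid) coordinate to the source coordinate to sample from."""
    sx, sy, tx, ty = theta
    return sx * xt + tx, sy * yt + ty


def bounding_box(theta):
    """Source-space box covered by the target grid corners (-1, -1) and (1, 1)."""
    x0, y0 = target_to_source(-1.0, -1.0, theta)
    x1, y1 = target_to_source(1.0, 1.0, theta)
    return x0, y0, x1, y1


theta = constrained_theta(0.0, 0.0, 0.0, 0.0)  # s_x = s_y = 0.5, t_x = t_y = 0.0
print(bounding_box(theta))  # (-0.5, -0.5, 0.5, 0.5)
```

With all raw outputs at zero the box is the centered half-size crop. Note that the box can still reach past the [-1, 1] feature map when the translation is large relative to the scale.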

Channel-Attention Network: The author introduces a tiny channel-attention sub-network in ALM because the module takes features combined from adjacent levels as input, where finer details and strong semantics take the same proportion (both have d channels), which means they contribute equally to attribute localization.

However, more attention should be paid to details when recognizing finer attributes, so the sub-network modulates the inter-channel dependencies. Specifically, the input features Xi pass through a series of linear and nonlinear layers, producing a weight vector for feature recalibration across channels.

The reweighted features are obtained by channel-wise multiplying the weight vector with Xi, and an extra residual link is applied to preserve the complementary information.
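The recalibration step can be sketched in plain Python. The post only says "a series of linear and nonlinear layers", so the specific stack below (global average pooling, linear, ReLU, linear, sigmoid, in the style of a squeeze-and-excitation block) is an assumption, as are the function name and the toy weights.

```python
import math


def channel_attention(x, w1, w2):
    """x: C channels, each a flat list of H*W values.
    w1: (C/r) x C weights, w2: C x (C/r) weights (biases omitted for brevity)."""
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(ch) / len(ch) for ch in x]
    # Linear -> ReLU -> linear -> sigmoid produces one weight per channel.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Recalibrate channel-wise and add the residual link: x + x * weight.
    return [[v + v * wt for v in ch] for ch, wt in zip(x, weights)]


x = [[1.0, 3.0], [2.0, 2.0]]  # C = 2 channels, 2 spatial positions each
w1 = [[0.5, 0.5]]             # reduce 2 -> 1
w2 = [[0.0], [0.0]]           # expand 1 -> 2; zero weights give sigmoid(0) = 0.5
out = channel_attention(x, w1, w2)
print(out)  # [[1.5, 4.5], [3.0, 3.0]]
```

Because of the residual link, a channel weight of 0 leaves the feature unchanged instead of zeroing it out, which is how the complementary information is preserved.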

A fully connected layer is applied to estimate the transformation matrix, denoted as R, and then the region-based features sampled by bilinear interpolation are used for attribute classification in Eq4.

Eq4: Classification formulae

3.3. Deep Supervision

Deep supervision [2, 3] is used for training: the four individual predictions (one from each of the three ALM levels and one from the main network) are directly supervised by the ground-truth labels.

During inference, the multiple prediction vectors are aggregated through an effective voting scheme that takes the maximum response across the different feature levels.

The intuition behind this design is that each ALM should directly receive feedback about whether its localized region is accurate. If only the fused predictions (maximum or average) were supervised, the gradients would not be informative enough about how each level performs, so some branches would be trained insufficiently.

The maximum voting scheme is applied to choose the best predictions from different levels with the most accurate attribute region.
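The voting step itself is just an element-wise maximum over the branch outputs. A minimal sketch with made-up numbers (the function name and the decision threshold at logit 0, i.e. sigmoid 0.5, are assumptions of this note):

```python
def max_vote(branch_logits):
    """branch_logits: list of per-branch prediction vectors, each of length M."""
    return [max(scores) for scores in zip(*branch_logits)]


# Four branches (three ALM levels + main network), M = 3 attributes.
logits = [
    [-1.2, 0.3, 2.0],
    [0.4, -0.5, 1.1],
    [-0.7, 0.9, -0.2],
    [0.1, 0.2, 0.5],
]
fused = max_vote(logits)
print(fused)                   # [0.4, 0.9, 2.0]
print([s > 0 for s in fused])  # [True, True, True]
```

Since the sigmoid is monotonic, taking the maximum over logits or over probabilities selects the same branch for each attribute.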

A weighted binary cross-entropy loss is used at each stage, formulated in Eq5:

Eq5: Loss calculation

where $\gamma_m = e^{-a_m}$ is the loss weight for the m-th attribute, $a_m$ is the prior class distribution of the m-th attribute, $M$ is the number of attributes, $i \in \{1, 2, 3, 4\}$ indexes the branch, and $\sigma$ refers to the sigmoid activation.

The total training loss is calculated by summing the four individual losses, as in Eq6:

Eq6: Total Loss
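For my own reference, here is how I would sketch the two losses in code. It assumes Eq5 has the standard weighted form $L_i = -\frac{1}{M}\sum_{m} \gamma_m \left( y^m \log \sigma(\hat{y}_i^m) + (1 - y^m) \log(1 - \sigma(\hat{y}_i^m)) \right)$ and that Eq6 is the plain sum $L = \sum_{i=1}^{4} L_i$. Both forms are reconstructed from the symbol definitions above, and the function names are made up.

```python
import math


def weighted_bce(logits, labels, prior):
    """One branch: logits, labels (0/1) and prior positive ratios a_m, all of length M."""
    total = 0.0
    for z, y, a in zip(logits, labels, prior):
        p = 1.0 / (1.0 + math.exp(-z))  # sigma(z)
        gamma = math.exp(-a)            # loss weight gamma_m = e^{-a_m}
        total += gamma * -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def total_loss(branch_logits, labels, prior):
    """Sum of the per-branch losses over all supervised branches."""
    return sum(weighted_bce(z, labels, prior) for z in branch_logits)


# Toy check: with zero logits every term is log(2), scaled by gamma_m.
print(weighted_bce([0.0, 0.0], [1, 0], [0.0, 0.0]))
```

This naive version is not numerically stable for very large logits; a real implementation would use a log-sigmoid formulation.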

4. Experiments

4.1. Datasets and Evaluation Metrics

The proposed method is evaluated on three publicly available pedestrian attribute datasets:

(1) The PETA dataset consists of 19,000 images with 61 binary attributes and 4 multi-class attributes, split into 9,500 images for training, 1,900 for validation, and 7,600 for testing. The 35 attributes whose positive ratio is higher than 5% are used for evaluation.

(2) The RAP dataset contains 41,585 images collected from 26 indoor surveillance cameras, split into 33,268 training images and 8,317 test images. Only the 51 binary attributes with a positive ratio higher than 1% are selected for evaluation.

(3) The PA-100K dataset is to date the largest dataset for pedestrian attribute recognition, which contains 100,000 pedestrian images in total collected from outdoor surveillance cameras, randomly split into 80,000 training images, 10,000 validation images, and 10,000 test images.

Metrics: Two types of metrics are used for evaluation.

(1) Label-based: we calculate the mean accuracy (mA) as the mean of positive accuracy and negative accuracy for each attribute. The mA criterion can be formulated as Eq7:

Eq7: mean Accuracy per attribute

where N is the number of examples and M is the number of attributes; $P_i$ and $TP_i$ are the numbers of positive examples and correctly predicted positive examples of the i-th attribute respectively; $N_i$ and $TN_i$ are defined analogously for negative examples.
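The verbal definition above (mean of positive and negative accuracy per attribute, then averaged over attributes) can be sketched as follows. The function name and the toy labels are made up, and the sketch assumes every attribute has at least one positive and one negative example.

```python
def mean_accuracy(y_true, y_pred):
    """y_true, y_pred: lists of N examples, each a list of M binary labels."""
    num_attrs = len(y_true[0])
    per_attr = []
    for m in range(num_attrs):
        pos = sum(1 for t in y_true if t[m] == 1)
        neg = len(y_true) - pos
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[m] == 1 and p[m] == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t[m] == 0 and p[m] == 0)
        # Mean of positive accuracy (TP/P) and negative accuracy (TN/N).
        per_attr.append(0.5 * (tp / pos + tn / neg))
    return sum(per_attr) / num_attrs


y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [0, 1], [0, 0]]
print(mean_accuracy(y_true, y_pred))  # 0.75
```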

(2) Instance-based: Four well-known criteria are used: accuracy, precision, recall, and F1 score.
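The post does not spell these four out, so the sketch below uses the example-based definitions that are common in the pedestrian attribute literature: per-image intersection over union for accuracy, intersection over predicted positives for precision, intersection over true positives for recall, each averaged over images, and F1 computed from the averaged precision and recall. That the paper follows exactly these definitions is an assumption, and the function name is made up.

```python
def instance_metrics(y_true, y_pred):
    """Example-based accuracy, precision, recall and F1 over N images."""
    acc = prec = rec = 0.0
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        n_true, n_pred = sum(t), sum(p)
        # Empty denominators contribute 0; that convention is a choice of this sketch.
        acc += inter / union if union else 0.0
        prec += inter / n_pred if n_pred else 0.0
        rec += inter / n_true if n_true else 0.0
    n = len(y_true)
    acc, prec, rec = acc / n, prec / n, rec / n
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1


labels = [[1, 0, 1], [0, 1, 1]]
preds = [[1, 0, 0], [0, 1, 1]]
print(instance_metrics(labels, preds))
```

In this toy case the first image misses one attribute, so accuracy and recall average to 0.75 while precision stays at 1.0.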

4.2. Effectiveness of Critical Components

Table 1. Performance comparisons on RAP dataset

In Table 1, starting with the BN-Inception baseline, the author gradually appends each component and compares the result with several variants.

(1) Attribute Localization Module: Further embedding multiple ALMs at different feature levels (incep_3b, 4d, 5b) achieves a greater improvement (3.1% and 1.3% in mA and F1, respectively).

Considering the model complexity, we limit the number of levels to three.

(2) Top-down Guidance: The author first tries element-wise addition of the features from different levels, as in the original FPN [1], but the performance decreases.

The concatenation version achieves better results (a 1.0% improvement in mA), which shows the benefit of high-level top-down guidance.

The channel-attention sub-network further improves mA to 80.61% by modulating the inter-channel dependencies.

(3) Deep Supervision: For inference, the experimental results suggest that the element-wise maximum is a better ensemble method than averaging, since some weaker responses are ignored in averaging.

(4) Removing all ALMs while keeping everything else unchanged results in a significant drop (last row in Table 1), which further confirms the effectiveness of ALMs.

Figure 3 shows the attribute-wise mA comparison between the proposed method and the baseline model on the RAP dataset. As shown, the proposed method achieves significant improvement on a number of attributes, especially some fine-grained attributes, e.g. BaldHead (23.1%), Hat (12.4%), and Muffler (13.5%).

Figure 3. Attribute-wise mA comparison on RAP dataset between our proposed method and the baseline model.

4.3. Visualization of Attribute Localization

In this subsection, we visualize the localized attribute regions from different feature levels for qualitative analysis. The attribute regions are located within the feature maps, while the correspondence between a feature map pixel and an image pixel is not unique.

For a relatively coarse visualization, we simply map a feature-level pixel to the center of the receptive field on the input image.

As shown in Figure 4, we display several examples belonging to six different attributes, covering both abstract and concrete attributes.

Figure 4. Visualization of attribute localization results at different feature levels. Best viewed in color.

Figure 4(d): The ALMs fail to localize the expected regions at the two lower levels when recognizing BaldHead.

Figure 4(a, c, e): The ALMs successfully localize concrete attributes, e.g. Backpack, PlasticBag, and Hat, into the corresponding informative regions, despite extreme occlusions (a, c) or pose variances (e).

4.4. Different Attribute-Specific Methods

In this subsection, the author conducts experiments to demonstrate the advantages of the proposed method by comparing it with other attribute-specific localization methods, such as visual attention and predefined parts.

For the first method, the author replaces the proposed ALM with a spatial attention module while keeping everything else unchanged for a fair comparison. In detail, individual attention masks are generated for each attribute through a global cross-channel averaging layer and a 3 × 3 convolutional layer.

For the other comparison model, the author divides the whole image into three rigid parts (head, torso, and legs), extracts part-based features with an RoI pooling layer, and then manually defines the attribute-part relations, e.g. recognizing hat only from the head part.

To better understand the differences, the author visualizes these localization results in Figure 5. As we can see, the attribute regions generated by ALMs are the most accurate and discriminative ones.

Figure 5. Case studies of different attribute-specific localization methods on three different attributes: Boots (Top), Glasses (Middle), and Box (Bottom)

4.5. Comparison with State-of-the-art Methods

Table 2. Quantitative comparisons against previous methods on PETA and RAP datasets. We divide these methods into four groups: holistic methods, relation-based methods, attention-based methods, and part-based methods.

Table 3. Quantitative comparisons on PA-100K dataset.

Table 2 and Table 3 show the comparison results on three different datasets. The results suggest that our proposed method achieves superior performance compared with existing works under both label-based and instance-based metrics on all three datasets.

Compared with the previous methods relying on attribute-agnostic attention or extra part localization mechanism, the proposed method can achieve a significant improvement across all datasets.

The proposed method often achieves a lower precision but higher recall, while these two metrics are not so reliable, especially in class-imbalanced datasets.

The mA and F1 metrics are more appropriate in measuring the performance of an attribute recognition model. The proposed method consistently achieves the best results in these two metrics.

Number of parameters: theoretically, there are $\frac{C^2}{8} + 4C$ trainable parameters in each ALM: $4C$ from the STN module and $\frac{C^2}{8}$ from the channel-attention module, where $C$ is the number of input channels.
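To get a feel for the size, the count can be evaluated for concrete widths. This sketch reads the formula as $C^2/8 + 4C$ and assumes one ALM per attribute at each of the three levels, with input widths $d$, $2d$, and $3d$ for $d = 256$. It covers only the ALM weights under that formula (no biases, no classifiers), so treat the totals as rough arithmetic rather than the paper's reported model size.

```python
def alm_params(c):
    """Trainable weights in one ALM under the stated count (biases ignored)."""
    return c * c // 8 + 4 * c


d = 256
per_attribute = sum(alm_params(k * d) for k in (1, 2, 3))  # input widths d, 2d, 3d
print(per_attribute)       # 120832
print(51 * per_attribute)  # 6162432, e.g. for the 51 RAP attributes
```

A $C^2/8$ term is what two fully connected layers with a reduction ratio of 16 would give ($C \cdot C/16$ each), which fits the "tiny" channel-attention description, though the ratio itself is my inference.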

In terms of model complexity, even with 51 attributes, the proposed model is still lightweight, as only 0.17 GFLOPs are added to the backbone network. The reason is that ALM contains only FC layers (or 1×1 conv), which involve far fewer FLOPs than 3×3 conv layers.

5. Conclusion

The author has proposed an end-to-end framework for pedestrian attribute recognition, which can automatically localize the attribute-specific regions at multiple feature levels.

The extensive analysis suggests that the proposed method can successfully localize the most informative region for each attribute in a weakly supervised manner.

Writer’s Conclusion


  • Lightweight model with high FPS.
  • Distinct classification of similar attributes is poor.
  • Need to have an agnostic voting scheme of attributes.


[1] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[2] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.

[3] Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q Weinberger. Resource aware person re-identification across multiple resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8042–8051, 2018.

[4] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[5] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Attribute recognition by joint recurrent learning of context and correlation. In Proceedings of the IEEE International Conference on Computer Vision, pages 531–540, 2017.

[6] Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan, and Xiaogang Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE International Conference on Computer Vision, pages 350–359, 2017.

[7] Jianqing Zhu, Shengcai Liao, Dong Yi, Zhen Lei, and Stan Z Li. Multi-label cnn based pedestrian attribute learning for soft biometrics. In Proceedings of the International Conference on Biometrics, pages 535–540, 2015.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

Github Link