[08 Mar 2021 / Paper Summary] End-to-End Human Object Interaction Detection with HOI Transformer

Sep 3, 2021

Please note that this post is for my future self to look back and review the materials on this paper without reading it all over again….

If a researcher had told me a few years ago that transformers would change the way we build most CNN applications, I would not have believed it one bit. Nowadays, however, transformers are used in all kinds of use cases. One such use case is Human Object Interaction (HOI) detection, which plays an important role in high-level, human-centric scene understanding and has attracted growing research interest. It can be extended to tasks such as action analysis, visual question answering, and many more. Human and object detection together with interaction classification was previously solved as a two-stage problem, which leads to sub-optimal solutions with a high computation cost. The main difficulty in HOI detection is capturing the dependencies between the human and the object in image space. Previous methods either use complex, sub-optimal strategies (i.e. decoupling the task into two stages) or introduce surrogate proposals to help models capture these dependencies. With the current progress in the Transformer architecture, a model can instead be designed to capture the long-range dependencies in HOI exhaustively.

In this article, we look at how a transformer is used to solve this complex use case in a simple way.

Table of Contents

  1. Abstract
  2. Related Work
  3. Method
  4. Experiments and Results
  5. Conclusion
  6. Writer’s Conclusion
  7. References


1. Abstract

In the paper, the author proposes an HOI Transformer that streamlines HOI detection by eliminating the need for many hand-designed components. The architecture first uses a CNN backbone to extract high-level image features. The encoder is then leveraged to generate a global memory feature, which explicitly models the relationships within the image features. The global memory from the encoder and the HOI queries are sent to the decoder to generate the output embeddings. Finally, a multi-layer perceptron (MLP) predicts HOI instances from the output embeddings of the decoder. A quintuple HOI matching loss is proposed to supervise the learning of HOI instance prediction.

HOI Transformer achieves 26.61% AP on HICO-DET and 52.9% AP_role on V-COCO, surpassing previous methods with the advantage of being much simpler.

2. Related Work

Two-Stage HOI Detection:

In the first stage, a fine-tuned object detector is used to get the human and object bounding boxes and class labels. In the second stage, a multi-stream architecture is used to predict the interactions for each human-object pair.

A two-channel binary image representation was first advocated in iCAN [1] to encode the spatial relation, while FCMNet [2] proposes a fine-grained version based on human parsing to amplify the key cues.

Auxiliary models can easily be introduced into the two-stage pipeline to help improve HOI detection. However, these methods suffer from heavy complexity and low efficiency due to the sequential and separated two-stage architecture.

One-Stage HOI Detection:

UnionDet [3] regards HOI detection as a union box detection problem. Building on the popular RetinaNet, it proposes a unified one-stage HOI detection model in which an extra union branch for detecting the union box is added in parallel to the conventional object detection branch. PPDM [5] and IPNet [6] treat HOI as a point detection problem and directly detect interactions in a one-stage manner by introducing a novel definition of the interaction point.

End-to-End Object Detection:

DETR [4] uses a transformer with parallel decoding to decode N objects at once. Unlike traditional object detectors, end-to-end methods usually have an NMS-free architecture. To make this a reality, a good one-to-one matching strategy for reducing duplicates is important, and Hungarian matching seems to be the better choice so far.

3. Method

The author proposes a method that consists of two main parts:

a) an end-to-end transformer encoder-decoder architecture

b) a quintuple HOI instance matching loss.

Figure 1: Overall architecture.

Network Architecture

The proposed architecture, illustrated in Fig. 1, consists of three main parts:

(i) a backbone to extract visual features from the input image,

(ii) a transformer encoder-decoder to digest backbone feature and produce output embeddings, and

(iii) a multi-layer perceptron (MLP) to predict HOI instances


(i) Backbone

First, a color image is fed into the backbone to generate a feature map of shape (H, W, C), which contains high-level semantic concepts.

A 1 × 1 convolution layer is used to reduce the channel dimension from C to d.

A flatten operator is used to collapse the spatial dimensions into one dimension.

A feature map of shape [H × W, d] is obtained, denoted as the flattened feature in Fig. 1.

This spatial transformation is important because the following transformer encoder requires a sequence as input. The feature map can thus be interpreted as a sequence of length H × W, where the value at each time step is a vector of size d.

The author uses ResNet as the backbone and reduces the dimension of the conv-5 feature from C = 2048 to d = 256.
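To make the shape bookkeeping concrete, here is a minimal pure-Python sketch of the flatten step. The helper names and the 800 × 1216 input size are made up for illustration, and the total stride of 32 for the last ResNet stage is an assumption that the summary does not state.

```python
def backbone_output_size(height, width, stride=32):
    """Spatial size of the last ResNet stage, assuming a total stride of 32."""
    # Ceiling division: a partially covered stride still yields one location.
    return -(-height // stride), -(-width // stride)


def flatten_feature_map(feature_map):
    """Collapse an [H][W][d] nested list into a sequence of H*W vectors of size d."""
    return [pixel for row in feature_map for pixel in row]


d = 256                                    # channel size after the 1 x 1 convolution
h, w = backbone_output_size(800, 1216)     # hypothetical input image size
feature_map = [[[0.0] * d for _ in range(w)] for _ in range(h)]
sequence = flatten_feature_map(feature_map)
print(h, w, len(sequence), len(sequence[0]))
```

Each element of `sequence` plays the role of one "time step" that the encoder attends over.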


(ii) Transformer Encoder-Decoder

The encoder layer is built upon the standard transformer architecture, with a multi-head self-attention module and a feed-forward network (FFN).

To enable it to distinguish relative positions in the sequence, a position encoding is added to the input of each attention layer.

The sum of the flattened feature and the positional encoding is fed into the transformer encoder to summarize global information.

The output of the encoder is denoted as the global memory in Fig. 1.


The decoder layer is also built upon the transformer architecture, and it contains an additional multi-head cross-attention layer. The decoder transforms N learnt positional embeddings (denoted as HOI queries in Fig. 1) into N output embeddings.

These embeddings are then decoded into HOI instances by the following MLP. The decoder has three inputs: the global memory from the encoder, the HOI queries, and the positional encoding.

For the multi-head cross-attention layer, the Value comes directly from the global memory. The Key is the sum of the global memory and the input position encoding. The Query is the sum of the input position encoding and the input HOI queries.

The output of the decoder is denoted as the output embeddings in Fig. 1.
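The way Query, Key and Value are assembled for the cross-attention can be sketched as below. This is a simplified single-head version without the learned projection matrices. It follows one reading of the description, in which the learnt HOI queries act as the positional term added to the decoder input, as in DETR. All names are illustrative and not taken from the author's code.

```python
import math


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_input, hoi_queries, memory, pos_encoding):
    """Single-head cross-attention with no learned projections (assumption)."""
    d = len(memory[0])
    keys = [add(m, p) for m, p in zip(memory, pos_encoding)]  # Key = memory + position
    values = memory                                           # Value = memory only
    outputs = []
    for content, query_embed in zip(decoder_input, hoi_queries):
        query = add(content, query_embed)                     # Query = input + HOI query
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)])
    return outputs


memory = [[1.0, 0.0], [0.0, 1.0]]            # two encoder positions, d = 2
pos_encoding = [[0.0, 0.0], [0.0, 0.0]]
decoder_input = [[0.0, 0.0]]                 # one HOI query slot
hoi_queries = [[0.0, 0.0]]
attended = cross_attention(decoder_input, hoi_queries, memory, pos_encoding)
print(attended)
```

With an all-zero query every position receives the same weight, so the output is simply the mean of the memory vectors.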

(iii) MLP for HOI Prediction

The author defines each HOI instance as a quintuple of (human class, interaction class, object class, human box, object box).

Three one-layer MLP branches predict the human confidence, object confidence and interaction confidence, respectively.

Two three-layer MLP branches predict the human box and the object box. All one-layer MLP branches for predicting confidence use a softmax function.

For the human confidence branch, the output size is 2, which represents the confidence for foreground and background.

For the object confidence branch and the interaction confidence branch, the output size is C + 1, which represents the confidences for all C kinds of objects or verbs defined in the dataset plus one for the background.

For both the human and object box branches, the output size is 4, which represents the normalized center coordinates (xc, yc), height and width of the box.
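As a small illustration of the box format, the sketch below converts one normalized box output into pixel corner coordinates. The function name and the 640 × 480 image size are hypothetical. The argument order follows the description above, and the actual output ordering in the author's code may differ.

```python
def to_pixel_corners(xc, yc, height, width, image_width, image_height):
    """Convert a normalized (center, height, width) box to pixel (x0, y0, x1, y1)."""
    x0 = (xc - width / 2) * image_width
    y0 = (yc - height / 2) * image_height
    x1 = (xc + width / 2) * image_width
    y1 = (yc + height / 2) * image_height
    return x0, y0, x1, y1


# A box centred in a 640 x 480 image, half as tall and a quarter as wide as the image.
corners = to_pixel_corners(0.5, 0.5, 0.5, 0.25, image_width=640, image_height=480)
print(corners)
```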

HOI Instance Matching

The HOI instance is a quintuple (c_h, c_r, c_o, b_h, b_o), where (c_h, c_r, c_o) denotes the human, interaction and object class confidence, and (b_h, b_o) are the bounding boxes of the human and object.

Two-stage HOI detectors first predict the object proposals (c_h, b_h) and (c_o, b_o) with an object detector, then enumerate the detected (human, object) pairs to predict c_r by interaction classification.

They thereby try to approximate the following probability on a given dataset:

p(h, r, o) = p(h, o) p(r | h, o)
           ≈ p(h) p(o) p(r | h, o)

where p(h) and p(o) indicate the confidence of the human and object bounding boxes, respectively, and p(r | h, o) denotes the probability of interaction r given human box h and object box o.
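The factorized score can be illustrated with a toy two-stage scorer that enumerates every (human, object) pair. The detections and the constant interaction classifier below are invented purely to show the arithmetic and do not come from any dataset.

```python
from itertools import product


def two_stage_scores(humans, objects, interaction_prob):
    """Score every (human, object) pair as p(h) * p(o) * p(r | h, o)."""
    return {
        (h["id"], o["id"]): h["score"] * o["score"] * interaction_prob(h, o)
        for h, o in product(humans, objects)
    }


humans = [{"id": "h0", "score": 0.9}, {"id": "h1", "score": 0.6}]
objects = [{"id": "kite", "score": 0.8}, {"id": "bag", "score": 0.7}, {"id": "cup", "score": 0.5}]


def constant_classifier(human, obj):
    # Hypothetical stand-in for the second-stage interaction classifier.
    return 0.5


scores = two_stage_scores(humans, objects, constant_classifier)
print(len(scores), scores[("h0", "kite")])
```

The number of pairs grows as (number of humans) × (number of objects), which is one source of the cost of the sequential two-stage design.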

In such two-stage methods, the object detector and the interaction classifier are optimized separately. In contrast, the author treats HOI detection as a set prediction problem solved by bipartite matching between predictions and ground truth: the method directly predicts the elements of the HOI set and optimizes the proposed HOI matching loss in a unified way.

Figure 2: Illustration of the matching strategy between HOI ground truth (black) and prediction (other colors).

As shown in Fig. 2(a), suppose a ground truth (human, fly, object) is in the image, and the model predicts two HOI instances: the yellow one (human, fly, object) and the blue one (human, hold, object).

The yellow one not only predicts the interaction correctly but also localizes the human and object more accurately. To minimize the matching cost, it is more suitable to assign the black one to the yellow one, and to assign ∅ (meaning nothing) to the blue one.

If the model outputs a set of N predictions, where N is larger than the number of HOI relations in a given image, the ground truth is padded with ∅ (nothing) so that the two sets have equal length.

The matching cost function is defined by Eq. 1.

Eq1: cost function

where L_match(g_i, p_σ(i)) is the matching cost between ground truth g_i and prediction p_σ(i).

In each step of training, we should first find an optimal one-to-one matching between the ground truth set and the current prediction set.

Eq2: matching loss function

The author uses the standard softmax cross-entropy loss for the classification terms. L_box^k is the box regression loss for the human box and object box, computed as a weighted sum of the GIoU [7] loss and the L1 loss. α and β are hyper-parameters that weight the loss terms.
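The box regression term can be sketched in a few lines of pure Python. The two weights default to 1.0 here only as placeholders, since the summary does not give their values. Boxes are assumed to be non-degenerate corner-format tuples (x0, y0, x1, y1).

```python
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(a, b):
    """Generalized IoU of two (x0, y0, x1, y1) boxes with positive area."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # Smallest axis-aligned box that encloses both inputs.
    enclose = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (enclose - union) / enclose


def box_loss(pred, target, giou_weight=1.0, l1_weight=1.0):
    """Weighted sum of the GIoU loss (1 - GIoU) and the L1 loss; weights are placeholders."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return giou_weight * (1.0 - giou(pred, target)) + l1_weight * l1


print(giou((0, 0, 2, 2), (1, 1, 3, 3)))
print(box_loss((0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.5, 0.5)))
```

Unlike plain IoU, GIoU stays informative for boxes that do not overlap, because the enclosing-box term still changes as the boxes move apart.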

The Hungarian algorithm [8, 4] is used to solve the following problem and find the bipartite matching.

Eq3: Hungarian algorithm

where S_N denotes the one-to-one matching solution space.
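Since the three equations are only referenced by caption here, the block below is a reconstruction from the symbols this summary names (L_match, g_i, p_σ(i), α, β, L_box^k, S_N). Treat it as a sketch to be checked against the paper rather than a verbatim copy. In particular, the name L_cls^j for the classification terms is assumed for illustration.

```latex
\begin{align}
\mathcal{L}_{cost} &= \sum_{i=1}^{N} \mathcal{L}_{match}\big(g_i,\, p_{\sigma(i)}\big)
  && \text{(Eq. 1, cost function)} \\
\mathcal{L}_{match}\big(g_i,\, p_{\sigma(i)}\big) &=
  \beta_1 \sum_{j \in \{h, o, r\}} \alpha_j \, \mathcal{L}_{cls}^{j}
  + \beta_2 \sum_{k \in \{h, o\}} \mathcal{L}_{box}^{k}
  && \text{(Eq. 2, matching loss)} \\
\hat{\sigma} &= \operatorname*{arg\,min}_{\sigma \in S_N} \mathcal{L}_{cost}
  && \text{(Eq. 3, Hungarian matching)}
\end{align}
```

Here β1 and β2 balance classification against localization, and α_h, α_o, α_r weight the human, object and interaction classification terms.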

Once the optimal one-to-one matching is found, the network loss is calculated between the matched pairs using the same loss as Eq. 2 with different hyper-parameters.
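The padding and one-to-one assignment can be illustrated with a toy example that mirrors the yellow and blue predictions of Fig. 2. Note that this sketch uses exhaustive search over permutations instead of the Hungarian algorithm, which is only workable for tiny N. A real implementation would use a Hungarian solver such as SciPy's `linear_sum_assignment`. The cost function is a simplified stand-in with a single box and a single verb term, not the paper's full quintuple cost.

```python
from itertools import permutations


def match_cost(gt, pred, beta_cls=1.0, beta_box=1.0):
    """Toy matching cost: negative verb probability plus L1 box distance."""
    if gt is None:                       # padded "nothing" entry costs the same everywhere
        return 0.0
    cls_cost = -pred["verb_probs"].get(gt["verb"], 0.0)
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    return beta_cls * cls_cost + beta_box * box_cost


def best_matching(ground_truth, predictions):
    """Return sigma, where sigma[i] is the prediction index assigned to ground truth i."""
    # Pad the ground truth with None so both sets have the same length.
    padded = ground_truth + [None] * (len(predictions) - len(ground_truth))
    n = len(predictions)

    def total_cost(sigma):
        return sum(match_cost(padded[i], predictions[sigma[i]]) for i in range(n))

    return list(min(permutations(range(n)), key=total_cost))


ground_truth = [{"verb": "fly", "box": (0.5, 0.5, 0.2, 0.2)}]
predictions = [
    {"verb_probs": {"hold": 0.9, "fly": 0.1}, "box": (0.4, 0.5, 0.3, 0.2)},   # "blue"
    {"verb_probs": {"fly": 0.8, "hold": 0.2}, "box": (0.5, 0.5, 0.2, 0.2)},   # "yellow"
]
sigma = best_matching(ground_truth, predictions)
print(sigma)
```

The ground truth is assigned to prediction 1 (the "yellow" one), and the padded empty entry falls to prediction 0 (the "blue" one).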

4. Experiments and Results


Datasets:

HICO-DET consists of 47,776 images with more than 150K human-object pairs (38,118 images in the training set and 9,658 in the test set). It has 600 HOI categories over 117 interactions and 80 objects.

V-COCO is a subset of MS-COCO and consists of 5,400 images in the trainval set and 4,946 images in the test set. Each human is annotated with binary labels for 29 different action categories (five of them do not involve associated objects).

Evaluation Metric:

Mean average precision (mAP) is used to examine the model performance on both datasets.

An HOI detection is considered a true positive if and only if it localizes the human and object accurately (i.e. the Intersection-over-Union (IoU) ratio between the predicted box and the ground truth is greater than 0.5) and predicts the interaction correctly.
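The true-positive rule can be written down directly. The dictionary keys and the sample boxes below are made up for illustration. Boxes are corner-format tuples (x0, y0, x1, y1).

```python
def iou(a, b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Both boxes must exceed the IoU threshold and the interaction must match."""
    return (
        pred["interaction"] == gt["interaction"]
        and iou(pred["human_box"], gt["human_box"]) > threshold
        and iou(pred["object_box"], gt["object_box"]) > threshold
    )


gt = {"interaction": "fly", "human_box": (0, 0, 10, 10), "object_box": (20, 0, 30, 10)}
good = {"interaction": "fly", "human_box": (0, 0, 10, 8), "object_box": (20, 0, 30, 9)}
shifted = {"interaction": "fly", "human_box": (0, 0, 10, 8), "object_box": (25, 0, 35, 10)}
print(is_true_positive(good, gt), is_true_positive(shifted, gt))
```

The second prediction fails even though the interaction and human box are right, because its object box overlaps the ground truth by only one third.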

Implementation Details

Data Augmentation: Brightness and contrast adjustments are applied with a probability of 0.5. For both brightness and contrast, a parameter is randomly chosen from the range [0.8, 1.2].

Scale augmentation: the input image is scaled such that the shortest side is at least 480 and at most 800 pixels, while the longest side is at most 1333 pixels.
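A minimal sketch of how such a resize could be computed is shown below. It assumes a DETR-style rule in which the cap on the longest side takes priority over the target for the shortest side. The function name is illustrative, and the target short side would be sampled from [480, 800] during training.

```python
def resized_shape(width, height, target_short, max_long=1333):
    """Scale so the short side hits target_short unless the long side would exceed max_long."""
    short_side, long_side = min(width, height), max(width, height)
    scale = target_short / short_side
    if long_side * scale > max_long:
        scale = max_long / long_side     # assumption: the long-side cap wins
    return round(width * scale), round(height * scale)


print(resized_shape(640, 480, target_short=800))
print(resized_shape(2000, 500, target_short=800))
```

For very elongated images the cap on the longest side means the shortest side can end up below 480, as the second call shows.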

Random flip is applied with a probability of 0.5. Finally, random crop augmentation is applied: an image is cropped with a probability of 0.5 to a random rectangular patch.
Training Settings :

The input image to the model is first scaled to [0, 1] and then normalized by channel-wise mean and std.

ResNet-50 and ResNet-101 backbones are used.

AdamW is used, setting the transformer's learning rate to 1e-4, the backbone's to 1e-5, and the weight decay to 1e-4.

The numbers of encoder layers and decoder layers are both set to 6, and the number of HOI queries is set to 100.

A COCO pre-trained DETR [4] model is used to initialize the weights of both the backbone and the transformer encoder-decoder.

The batch size is set to 16 for ResNet-50 and 8 for ResNet-101.

Models are trained for 250 epochs with a single learning rate decay at epoch 200.
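The schedule can be summarized as a small function. The decay factor of 0.1 is an assumption, since the summary only says that the learning rate is decayed once at epoch 200. The group names are illustrative.

```python
def learning_rates(epoch, decay_epoch=200, decay_factor=0.1):
    """Per-group learning rates with a single step decay (decay_factor is assumed)."""
    factor = decay_factor if epoch >= decay_epoch else 1.0
    return {"transformer": 1e-4 * factor, "backbone": 1e-5 * factor}


for epoch in (0, 199, 200, 249):
    print(epoch, learning_rates(epoch))
```

Keeping the backbone at a ten times smaller rate than the transformer is the split stated in the training settings above.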

Comparisons with State-of-the-Art methods

The author reports the main quantitative results in terms of AP on HICO-DET in Table 1 and AP_role on V-COCO in Table 2.

For the HICO-DET dataset, the author's method achieves a 4.88% point gain over one-stage methods on Full categories, and notably a 5.37% point gain on Rare categories.

Table 1: Comparison with the state-of-the-art methods on HICO-DET test set.

For the V-COCO dataset, the author's method achieves a 1.9% point gain over the previous one-stage method.

Table 2: Comparison with the state-of-the-art methods on V-COCO test set.

Ablation Study

The author's ablation study is conducted on ResNet-50 backbone models, which are trained for 250 epochs with a single learning rate decay at epoch 200.

The numbers of encoder layers and decoder layers are both set to 6, the number of HOI queries is set to 100, and the batch size is set to 16. A COCO pre-trained DETR model is used to initialize the weights of both the backbone and the transformer encoder-decoder.

Table 3: Ablation experiments for HOI Transformer

Matching Strategy: The author conducted an ablation study to find the relative importance of classification and localization in matching. In Eq. 2, β1 and β2 control the weights of classification and localization, respectively.

As shown in Table 3b, the best result is obtained with β1 = 2.0 and β2 = 1.0, which reflects that classification plays a more important role than localization during the matching process.

Loss Ablation: Since the HICO-DET dataset has enough training images, α_h = 1 is assumed. The author's method obtains the best result when α_r = 2.0 and α_o = 1.0, indicating that the interaction tends to be more important than the object in this framework.

Data Augmentation: As Table 3c shows, data augmentation brings considerable improvements: multi-scale training attains a 4.29% point gain on Full categories and random crop achieves 5.08%.

The combination of the two gives even better results, mainly because these augmentations make it much easier for the attention layers to learn scale-invariant and shift-invariant features on a small dataset.

Qualitative Analysis

Figure 3: Visualization of attention map in decoder for predicted HOI instance

Fig. 3 shows the decoder attention map for predicted HOI instances. The interaction heatmap highlights both the human and the object area, meaning that the model reasons about the relation between human and object from a more global image context instead of focusing on the human or the object only.

Meanwhile, some local areas with relatively higher attention may indicate the localization (boundaries) of the human or object, because the visualized attention map is immediately followed by the MLP head for classification as well as regression.

5. Conclusion

The author proposes a novel HOI Transformer to directly predict HOI instances in an end-to-end manner. The core idea is to build a transformer encoder-decoder architecture that directly predicts HOI instances, together with a quintuple matching loss for HOI that enables supervision in a unified way. The model is able to dynamically attain the discerning features for different HOI queries.

6. Writer’s Conclusion


Pros:

  • End-to-end trainable single-stage network, which reduces computation cost.
  • Inference speed of 24 fps on a single 2080Ti, which is quite good.
  • The interaction classifier is simpler and more efficient.

Cons:

  • Huge annotation effort for custom data with interactions.
  • Needs a better interaction classifier that takes multiple object detections into account.
  • Accuracy drops considerably when the backbone is changed from ResNet to EfficientNet, MobileNet, etc.

7. References

[1] Chen Gao, Yuliang Zou, and Jia-Bin Huang. iCAN: Instance-centric attention network for human-object interaction detection. arXiv preprint arXiv:1808.10437, 2018.

[2] Yang Liu, Qingchao Chen, and Andrew Zisserman. Amplifying key cues for human-object-interaction detection. In European Conference on Computer Vision, pages 248–265. Springer, 2020.

[3] Kim Bumsoo, Choi Taeho, Kang Jaewoo, and J. Kim Hyunwoo. UnionDet: Union-level detector towards real-time human-object interaction detection. In European Conference on Computer Vision. Springer, 2020.

[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.

[5] Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. PPDM: Parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 482–490, 2020.

[6] Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. Learning human-object interaction detection using interaction points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4116–4125, 2020.

[7] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression.

[8] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.

Github Link: https://github.com/bbepoch/HoiTransformer

If any errors are found pls mail me at abhigoku10@gmail.com…*\(^o^)/*