[08 Mar 2021 / Paper Summary] End-to-End Human Object Interaction Detection with HOI Transformer

Table of Contents

  1. Abstract
  2. Related Work
  3. Method
  4. Experiments and Results
  5. Conclusion
  6. Writer’s Conclusion
  7. References


1. Abstract

In this paper, the authors propose an HOI Transformer that streamlines HOI detection by eliminating the need for many hand-designed components. The architecture first uses a CNN backbone to extract high-level image features; an encoder is then leveraged to generate a global memory feature, which explicitly models the relationships among image features. The global memory from the encoder and the HOI queries are fed into the decoder to generate output embeddings. Finally, a multi-layer perceptron (MLP) predicts HOI instances from the decoder's output embeddings. A quintuple HOI matching loss is proposed to supervise the learning of HOI instance prediction.

2. Related Work

Two-Stage HOI Detection:

3. Method

The authors propose a method that consists of the following main components:

Figure 1: Overall architecture.


First, a color image is fed into the backbone, which generates a feature map of shape (H, W, C) that contains high-level semantic concepts.
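As a concrete sketch, a tiny convolutional backbone (the paper uses a ResNet; the layer sizes here are illustrative assumptions) maps a color image to a down-sampled high-level feature map:

```python
import torch
import torch.nn as nn

# Minimal stand-in backbone; the real model uses a ResNet.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
)

image = torch.randn(1, 3, 224, 224)   # one color image
features = backbone(image)            # (batch, C, H, W) feature map
print(features.shape)                 # torch.Size([1, 256, 56, 56])
```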


The encoder layer is built upon the standard transformer architecture, with a multi-head self-attention module and a feed-forward network (FFN).


The decoder layer is also built upon the transformer architecture; in addition, it contains a multi-head cross-attention layer. The decoder transforms N learnt positional embeddings (denoted as HOI queries in Fig. 1) into N output embeddings.
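A minimal sketch of this encoder-decoder stage using PyTorch's stock transformer layers. The layer counts, head count, and the small 7×7 feature map are illustrative assumptions; the paper adds positional encodings and stacks more layers:

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100

# Flatten the (C, H, W) backbone features into a sequence of H*W tokens.
features = torch.randn(1, 256, 7, 7)
tokens = features.flatten(2).permute(2, 0, 1)   # (H*W, batch, d_model)

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=8), num_layers=2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=2)

memory = encoder(tokens)                        # global memory over image tokens

# N learnt positional embeddings = the HOI queries.
hoi_queries = nn.Embedding(num_queries, d_model).weight.unsqueeze(1)  # (N, 1, d_model)
out = decoder(hoi_queries, memory)              # N output embeddings, one per query
print(out.shape)
```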

(iii) MLP for HOI Prediction

The author defines each HOI instance as a quintuple of (human class, interaction class, object class, human box, object box).
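The quintuple can be read off the decoder's output embeddings with small prediction heads. A hedged sketch follows; the class counts and head shapes are assumptions for illustration (e.g. 117 HICO-DET verb classes and 80 COCO object classes plus background):

```python
import torch
import torch.nn as nn

d_model = 256
# One head per element of the quintuple
# (human class, interaction class, object class, human box, object box).
human_cls = nn.Linear(d_model, 2)          # person / no-person
interaction_cls = nn.Linear(d_model, 117)  # assumed HICO-DET verb count
object_cls = nn.Linear(d_model, 81)        # assumed COCO classes + background
box_head = nn.Sequential(                  # predicts both boxes at once
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 8))

emb = torch.randn(100, d_model)            # N = 100 decoder output embeddings
boxes = box_head(emb).sigmoid()            # normalised (cx, cy, w, h) per box
quintuple = (human_cls(emb), interaction_cls(emb), object_cls(emb),
             boxes[:, :4], boxes[:, 4:])   # human box, object box
```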

(iv) HOI Instance Matching

The HOI instance is a quintuple (c_h, c_r, c_o, b_h, b_o), where c_h, c_r, c_o denote the human, interaction, and object class confidences, and b_h, b_o are the bounding boxes of the human and the object.

Fig. 2: Illustration of the matching strategy between HOI ground truth (black) and predictions (other colors).
Eq. 1: cost function
Eq. 2: matching loss function
Eq. 3: Hungarian algorithm
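To make the matching step concrete, here is a toy example with a hand-made 3×3 cost matrix (the values are invented). The Hungarian algorithm finds the one-to-one ground-truth-to-prediction assignment with minimal total cost; for a matrix this small, brute force over all permutations finds the same minimiser:

```python
from itertools import permutations

# Invented matching costs between 3 ground-truth HOI instances (rows) and
# 3 predictions (columns); in the paper each entry combines classification
# and box costs (Eq. 1).
cost = [[0.9, 0.2, 0.8],
        [0.1, 0.7, 0.6],
        [0.5, 0.9, 0.1]]

# Brute-force optimal assignment; the Hungarian algorithm computes the same
# result in polynomial time for large N.
best = min(permutations(range(3)),
           key=lambda p: sum(cost[i][p[i]] for i in range(3)))
print(best)  # (1, 0, 2): gt0→pred1, gt1→pred0, gt2→pred2, total cost 0.4
```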

4. Experiments and Results


Table 1: Comparison with the state-of-the-art methods on the HICO-DET test set.
Table 2: Comparison with the state-of-the-art methods on the V-COCO test set.
Table 3: Ablation experiments for the HOI Transformer.

Qualitative Analysis

Figure 3: Visualization of attention map in decoder for predicted HOI instance

5. Conclusion

The authors propose a novel HOI Transformer to directly predict HOI instances in an end-to-end manner. The core idea is to build a transformer encoder-decoder architecture that directly predicts HOI instances, together with a quintuple matching loss that enables supervision in a unified way. The model can dynamically attend to the discriminative features for different HOI queries.

6. Writer’s Conclusion


  • End-to-end trainable single-stage network, which reduces computation cost.
  • Inference speed of 24 FPS on a single 2080 Ti, which is quite good.
  • The interaction classifier is simple and efficient.
  • Annotating custom data with interactions takes a huge effort.
  • A better interaction classifier is needed that takes multiple object detections into account.
  • Accuracy drops considerably when the backbone is changed from ResNet to EfficientNet, MobileNet, etc.

7. References

[1] Chen Gao, Yuliang Zou, and Jia-Bin Huang. iCAN: Instance-centric attention network for human-object interaction detection. arXiv preprint arXiv:1808.10437, 2018.


