Title: What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study

Apr 17, 2022

Gen Luo, Yiyi Zhou, Jiamu Sun, Shubin Huang, Xiaoshuai Sun, Qixiang Ye, Yongjian Wu, Rongrong Ji

Abstract: Most existing work in one-stage referring expression comprehension (REC) focuses on multi-modal fusion and reasoning, while the influence of other factors in this task lacks in-depth exploration. To fill this gap, we conduct an empirical study in this paper. Concretely, we first build a very simple REC network called SimREC and ablate 42 candidate designs/settings, which cover the entire process of one-stage REC from network design to model training. Afterwards, we conduct over 100 experimental trials on three benchmark datasets of REC. The extensive experimental results not only show the key factors that affect REC performance in addition to multi-modal fusion, e.g., multi-scale features and data augmentation, but also yield some findings that run counter to conventional understanding. For example, as a vision and language (V&L) task, REC is less impacted by language priors. In addition, with a proper combination of these findings, we can improve the performance of SimREC by a large margin, e.g., +27.12% on RefCOCO+, which outperforms all existing REC methods. The most encouraging finding, however, is that with much less training overhead and far fewer parameters, SimREC can still achieve better performance than a set of large-scale pre-trained models, e.g., UNITER and VILLA, highlighting the special role of REC in existing V&L research.
