Weixin Feng

Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection

May 09, 2022
Weixin Feng, Xingyuan Bu, Chenchen Zhang, Xubin Li

Multimodal supervision has achieved promising results in many visual-language understanding tasks, where language plays an essential role as a hint or context for recognizing and locating instances. However, owing to the defects of human-annotated language corpora, multimodal supervision remains largely unexplored in fully supervised object detection. In this paper, we take advantage of language prompts to introduce effective and unbiased linguistic supervision into object detection, and propose a new mechanism called multimodal knowledge learning (MKL), which requires the detector to learn knowledge from language supervision. Specifically, we design prompts and fill them with bounding box annotations to generate descriptions containing extensive hints and context for instance recognition and localization. The knowledge from language is then distilled into the detection model by maximizing cross-modal mutual information at both the image and object levels. Moreover, the generated descriptions are manipulated to produce hard negatives that further boost detector performance. Extensive experiments demonstrate that the proposed method yields a consistent performance gain of 1.6% to 2.1% and achieves state-of-the-art results on the MS-COCO and OpenImages datasets.
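
As a rough illustration of the mechanism the abstract describes, the sketch below fills a prompt template with a bounding box annotation and ties matched visual and textual embeddings together with an InfoNCE loss, a standard lower bound on cross-modal mutual information. The template, the helper names (fill_prompt, info_nce), and the stand-in random embeddings are assumptions for illustration; the paper's actual prompts and loss formulation may differ.

# Minimal sketch, assuming an InfoNCE-style objective for the
# cross-modal mutual information maximization described above.
import torch
import torch.nn.functional as F

def fill_prompt(class_name, box):
    """Turn one bounding box annotation into a language description."""
    x1, y1, x2, y2 = box
    return f"a photo of a {class_name} located at ({x1}, {y1}, {x2}, {y2})"

def info_nce(visual, textual, tau=0.07):
    """Contrastive loss over matched object/description embedding pairs."""
    v = F.normalize(visual, dim=-1)
    t = F.normalize(textual, dim=-1)
    logits = v @ t.T / tau                              # (N, N) similarities
    labels = torch.arange(v.size(0), device=v.device)   # matches on diagonal
    return F.cross_entropy(logits, labels)

# Two annotated objects in one image; random tensors stand in for the
# detector's object features and a text encoder's description features.
descriptions = [fill_prompt("dog", (10, 20, 120, 200)),
                fill_prompt("cat", (150, 40, 260, 180))]
v_emb, t_emb = torch.randn(2, 256), torch.randn(2, 256)
loss = info_nce(v_emb, t_emb)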

* Submitted to CVPR 2022

Temporal Knowledge Consistency for Unsupervised Visual Representation Learning

Aug 24, 2021
Weixin Feng, Yuanjiang Wang, Lihua Ma, Ye Yuan, Chi Zhang

The instance discrimination paradigm has become dominant in unsupervised learning. It typically adopts a teacher-student framework, in which the teacher provides embedded knowledge as a supervision signal for the student, and the student learns meaningful representations by enforcing instance spatial consistency with views from the teacher. However, the teacher's outputs can vary dramatically on the same instance across training stages, introducing unexpected noise and causing catastrophic forgetting due to inconsistent objectives. In this paper, we first integrate instance temporal consistency into current instance discrimination paradigms, and propose a novel and strong algorithm named Temporal Knowledge Consistency (TKC). Specifically, TKC dynamically ensembles the knowledge of temporal teachers and adaptively selects useful information according to its importance for learning instance temporal consistency. Experimental results show that TKC learns better visual representations with both ResNet and AlexNet backbones under the linear evaluation protocol and transfers well to downstream tasks. All experiments suggest the effectiveness and generalization of our method.
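
The sketch below illustrates the temporal-teacher idea from the abstract: snapshots of the encoder from earlier training stages are kept, and their outputs are ensembled with adaptive weights to form a consistent target for the student. The class name TemporalTeachers, the snapshot rotation, and the softmax-over-learnable-scalars weighting are hypothetical simplifications, not the paper's exact importance estimation.

# Minimal sketch, assuming a learnable softmax weighting over a small
# queue of teacher snapshots from earlier training stages.
import copy
import torch
import torch.nn.functional as F

class TemporalTeachers:
    def __init__(self, model, num_snapshots=3):
        self.snapshots = [copy.deepcopy(model).eval()
                          for _ in range(num_snapshots)]
        # One learnable importance score per temporal teacher.
        self.weights = torch.nn.Parameter(torch.zeros(num_snapshots))

    def update(self, model):
        """Rotate in a new snapshot, dropping the oldest."""
        self.snapshots.pop(0)
        self.snapshots.append(copy.deepcopy(model).eval())

    def target(self, x):
        """Adaptively weighted ensemble of the teachers' embeddings."""
        with torch.no_grad():
            outs = torch.stack([F.normalize(m(x), dim=-1)
                                for m in self.snapshots])   # (K, N, D)
        w = F.softmax(self.weights, dim=0)                   # teacher importance
        return torch.einsum("k,knd->nd", w, outs)

# Usage: the student matches the ensembled target, enforcing instance
# temporal consistency. A Linear layer stands in for the backbone.
encoder = torch.nn.Linear(128, 64)
teachers = TemporalTeachers(encoder)
x = torch.randn(8, 128)
consistency_loss = 1 - F.cosine_similarity(
    F.normalize(encoder(x), dim=-1), teachers.target(x)).mean()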

* ICCV 2021