Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chunhua Shen

Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Sep 27, 2022
Chengzhi Lin, Ancong Wu, Junwei Liang, Jun Zhang, Wenhang Ge, Wei-Shi Zheng, Chunhua Shen

Figure 1 for Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Figure 2 for Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Figure 3 for Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Figure 4 for Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information and the query text only describes a part of the information. Thus, a video can correspond to multiple different text descriptions and queries. We call this phenomenon the ``Video-Text Correspondence Ambiguity'' problem. Current techniques mostly concentrate on mining local or multi-level alignment between contents of a video and text (\textit{e.g.}, object to entity and action to verb). It is difficult for these methods to alleviate the video-text correspondence ambiguity by describing a video using only one single feature, which is required to be matched with multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptive aggregation of video token features. Given a query text, the similarity is determined by the most similar prototype to find correspondence in the video, which is termed text-adaptive matching. To learn diverse prototypes for representing the rich information in videos, we propose a variance loss to encourage different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.

* NIPS2022
* NIPS2022

Via

Access Paper or Ask Questions

Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image

Aug 28, 2022
Wei Yin, Jianming Zhang, Oliver Wang, Simon Nicklaus, Simon Chen, Yifan Liu, Chunhua Shen

Figure 1 for Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image

Figure 2 for Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image

Figure 3 for Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image

Figure 4 for Towards Accurate Reconstruction of 3D Scene Shape from A Single Monocular Image

Despite significant progress made in the past few years, challenges remain for depth estimation using a single monocular image. First, it is nontrivial to train a metric-depth prediction model that can generalize well to diverse scenes mainly due to limited training data. Thus, researchers have built large-scale relative depth datasets that are much easier to collect. However, existing relative depth estimation models often fail to recover accurate 3D scene shapes due to the unknown depth shift caused by training with the relative depth data. We tackle this problem here and attempt to estimate accurate scene shapes by training on large-scale relative depth data, and estimating the depth shift. To do so, we propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then exploits 3D point cloud data to predict the depth shift and the camera's focal length that allow us to recover 3D scene shapes. As the two modules are trained separately, we do not need strictly paired training data. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to improve training with relative depth annotation. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation. Code is available at: https://git.io/Depth

* 20 pages. Journal version of the conference paper "Learning to Recover 3D Scene Shape from a Single Image". arXiv admin note: substantial text overlap with arXiv:2012.09365

Via

Access Paper or Ask Questions

Towards Domain-agnostic Depth Completion

Jul 29, 2022
Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Simon Chen, Chunhua Shen

Figure 1 for Towards Domain-agnostic Depth Completion

Figure 2 for Towards Domain-agnostic Depth Completion

Figure 3 for Towards Domain-agnostic Depth Completion

Figure 4 for Towards Domain-agnostic Depth Completion

Existing depth completion methods are often targeted at a specific sparse depth type, and generalize poorly across task domains. We present a method to complete sparse/semi-dense, noisy, and potentially low-resolution depth maps obtained by various range sensors, including those in modern mobile phones, or by multi-view reconstruction algorithms. Our method leverages a data driven prior in the form of a single image depth prediction network trained on large-scale datasets, the output of which is used as an input to our model. We propose an effective training scheme where we simulate various sparsity patterns in typical task domains. In addition, we design two new benchmarks to evaluate the generalizability and the robustness of depth completion methods. Our simple method shows superior cross-domain generalization ability against state-of-the-art depth completion methods, introducing a practical solution to high quality depth capture on a mobile device. Code is available at: https://github.com/YvanYin/FillDepth.

Via

Access Paper or Ask Questions

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Jul 18, 2022
Wejia Wu, Zhuang Li, Jiahong Li, Chunhua Shen, Hong Zhou, Size Li, Zhongyuan Wang, Ping Luo

Figure 1 for Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Figure 2 for Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Figure 3 for Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Figure 4 for Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Video text spotting(VTS) is the task that requires simultaneously detecting, tracking and recognizing text in the video. Existing video text spotting methods typically develop sophisticated pipelines and multiple models, which is not friend for real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) CoText simultaneously address the three tasks (e.g., text detection, tracking, recognition) in a real-time end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learning temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing, CTC-based recognition head with Masked RoI. Extensive experiments show the superiority of our method. Especially, CoText achieves an video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015video, with 10.5% and 32.0 FPS improvement the previous best method. The code can be found at github.com/weijiawu/CoText.

* ECCV 2022

Via

Access Paper or Ask Questions

Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

Jun 24, 2022
Peixian Chen, Kekai Sheng, Mengdan Zhang, Yunhang Shen, Ke Li, Chunhua Shen

Figure 1 for Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

Figure 2 for Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

Figure 3 for Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

Figure 4 for Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models. However, existing methods are ineffective in proposal-level vision-language alignment. Meanwhile, the models usually suffer from confidence bias toward base categories and perform worse on novel ones. To overcome the challenges, we present MEDet, a novel and effective OVD framework with proposal mining and prediction equalization. First, we design an online proposal mining to refine the inherited vision-semantic knowledge from coarse to fine, allowing for proposal-level detection-oriented feature alignment. Second, based on causal inference theory, we introduce a class-wise backdoor adjustment to reinforce the predictions on novel categories to improve the overall OVD performance. Extensive experiments on COCO and LVIS benchmarks verify the superiority of MEDet over the competing approaches in detecting objects of novel categories, e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS.

Via

Access Paper or Ask Questions

Efficient Decoder-free Object Detection with Transformers

Jun 17, 2022
Peixian Chen, Mengdan Zhang, Yunhang Shen, Kekai Sheng, Yuting Gao, Xing Sun, Ke Li, Chunhua Shen

Figure 1 for Efficient Decoder-free Object Detection with Transformers

Figure 2 for Efficient Decoder-free Object Detection with Transformers

Figure 3 for Efficient Decoder-free Object Detection with Transformers

Figure 4 for Efficient Decoder-free Object Detection with Transformers

Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective, with the price of bringing considerable computation burden for inference. More subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder demanding an extra-long time to converge. As a result, transformer-based object detection can not prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, achieving high efficiency in both training and inference stages, for the first time. We simplify objection detection into an encoder-only single-level anchor-based dense prediction problem by centering around two entry points: 1) Eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) Explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with 28% computation cost reduction and more than $10$x fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting down 70% computation cost.

* Update metadata, 10 pages

Via

Access Paper or Ask Questions

Point-Teaching: Weakly Semi-Supervised Object Detection with Point Annotations

Jun 01, 2022
Yongtao Ge, Qiang Zhou, Xinlong Wang, Chunhua Shen, Zhibin Wang, Hao Li

Figure 1 for Point-Teaching: Weakly Semi-Supervised Object Detection with Point Annotations

Figure 2 for Point-Teaching: Weakly Semi-Supervised Object Detection with Point Annotations

Figure 3 for Point-Teaching: Weakly Semi-Supervised Object Detection with Point Annotations

Figure 4 for Point-Teaching: Weakly Semi-Supervised Object Detection with Point Annotations

Point annotations are considerably more time-efficient than bounding box annotations. However, how to use cheap point annotations to boost the performance of semi-supervised object detection remains largely unsolved. In this work, we present Point-Teaching, a weakly semi-supervised object detection framework to fully exploit the point annotations. Specifically, we propose a Hungarian-based point matching method to generate pseudo labels for point annotated images. We further propose multiple instance learning (MIL) approaches at the level of images and points to supervise the object detector with point annotations. Finally, we propose a simple-yet-effective data augmentation, termed point-guided copy-paste, to reduce the impact of the unmatched points. Experiments demonstrate the effectiveness of our method on a few datasets and various data regimes.

Via

Access Paper or Ask Questions

Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images

May 27, 2022
Zhi Tian, Xiangxiang Chu, Xiaoming Wang, Xiaolin Wei, Chunhua Shen

Figure 1 for Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images

Figure 2 for Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images

Figure 3 for Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images

Figure 4 for Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images

We present a simple yet effective fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes, termed FCOS-LiDAR. Unlike the dominant methods that use the bird-eye view (BEV), our proposed detector detects objects from the range view (RV, a.k.a. range image) of the LiDAR points. Due to the range view's compactness and compatibility with the LiDAR sensors' sampling process on self-driving cars, the range view-based object detector can be realized by solely exploiting the vanilla 2D convolutions, departing from the BEV-based methods which often involve complicated voxelization operations and sparse convolutions. For the first time, we show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors while being significantly faster and simpler. More importantly, almost all previous range view-based detectors only focus on single-frame point clouds, since it is challenging to fuse multi-frame point clouds into a single range view. In this work, we tackle this challenging issue with a novel range view projection mechanism, and for the first time demonstrate the benefits of fusing multi-frame point clouds for a range-view based detector. Extensive experiments on nuScenes show the superiority of our proposed method and we believe that our work can be strong evidence that an RV-based 3D detector can compare favourably with the current mainstream BEV-based detectors.

* 12 pages

Via

Access Paper or Ask Questions

Super Vision Transformer

May 26, 2022
Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Ke Li, Yunhang Shen, Chunhua Shen, Rongrong Ji

We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratically in the token number. We present a novel training paradigm that trains only one ViT model at a time, but is capable of providing improved image recognition performance with various computational costs. Here, the trained ViT model, termed super vision transformer (SuperViT), is empowered with the versatile ability to solve incoming patches of multiple sizes as well as preserve informative tokens with multiple keeping rates (the ratio of keeping tokens) to achieve good hardware efficiency for inference, given that the available hardware resources often change from time to time. Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models with even performance increase. For example, we reduce 2x FLOPs of DeiT-S while increasing the Top-1 accuracy by 0.2% and 0.7% for 1.5x reduction. Also, our SuperViT significantly outperforms existing studies on efficient vision transformers. For example, when consuming the same amount of FLOPs, our SuperViT surpasses the recent state-of-the-art (SoTA) EViT by 1.1% when using DeiT-S as their backbones. The project of this work is made publicly available at https://github.com/lmbxmu/SuperViT.

Via

Access Paper or Ask Questions

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Apr 29, 2022
Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Chunhua Shen

Figure 1 for PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Figure 2 for PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Figure 3 for PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Figure 4 for PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods highly rely on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. However, in real scenarios, this assumption can be difficult to hold: the text description, obtained by crawling the affiliated metadata of the image, often suffer from semantic mismatch and mutual compatibility. To address these issues, here we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels, and aligns visual elements and linguistic elements in the form of hierarchy via intra-level semantics alignment and cross-level relation alignment. Furthermore, we adjust the objective function by softening the loss of negative samples (unpaired samples) so as to weaken the strict constraint during the pre-training stage, thus mitigating the risk of the model being over-confident. Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP. In particular, with the same amount of pre-training data of 15 millions image-text pairs, PyramidCLIP exceeds CLIP by 19.2%/18.5%/19.6% respectively, with the image encoder being ResNet-50/ViT-B32/ViT-B16 on ImageNet zero-shot classification top-1 accuracy. When scaling to larger datasets, the results of PyramidCLIP only trained for 8 epochs using 128M image-text pairs are very close to that of CLIP trained for 32 epochs using 400M training data.

Via

Access Paper or Ask Questions