Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Weiming Hu

Learning Target-aware Representation for Visual Tracking via Informative Interactions

Jan 07, 2022
Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing, Yilin Lyu, Bing Li, Weiming Hu

Figure 1 for Learning Target-aware Representation for Visual Tracking via Informative Interactions

Figure 2 for Learning Target-aware Representation for Visual Tracking via Informative Interactions

Figure 3 for Learning Target-aware Representation for Visual Tracking via Informative Interactions

Figure 4 for Learning Target-aware Representation for Visual Tracking via Informative Interactions

We introduce a novel backbone architecture to improve target-perception ability of feature representation for tracking. Specifically, having observed that de facto frameworks perform feature matching simply using the outputs from backbone for target localization, there is no direct feedback from the matching module to the backbone network, especially the shallow layers. More concretely, only the matching module can directly access the target information (in the reference frame), while the representation learning of candidate frame is blind to the reference target. As a consequence, the accumulation effect of target-irrelevant interference in the shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem from a different angle by conducting multiple branch-wise interactions inside the Siamese-like backbone networks (InBN). At the core of InBN is a general interaction modeler (GIM) that injects the prior knowledge of reference image to different stages of the backbone network, leading to better target-perception and robust distractor-resistance of candidate feature representation with negligible computation cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types including CNN and Transformer for improvements, as evidenced by our extensive experiments on multiple benchmarks. In particular, the CNN version (based on SiamCAR) improves the baseline with 3.2/6.9 absolute gains of SUC on LaSOT/TNL2K, respectively. The Transformer version obtains SUC scores of 65.7/52.0 on LaSOT/TNL2K, which are on par with recent state of the arts. Code and models will be released.

* 9 pages, 6 figures

Via

Access Paper or Ask Questions

Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking

Dec 28, 2021
Jin Gao, Yan Lu, Xiaojuan Qi, Yutong Kou, Bing Li, Liang Li, Shan Yu, Weiming Hu

Figure 1 for Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking

Figure 2 for Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking

Figure 3 for Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking

Figure 4 for Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking

Tracking visual objects from a single initial exemplar in the testing phase has been broadly cast as a one-/few-shot problem, i.e., one-shot learning for initial adaptation and few-shot learning for online adaptation. The recent few-shot online adaptation methods incorporate the prior knowledge from large amounts of annotated training data via complex meta-learning optimization in the offline phase. This helps the online deep trackers to achieve fast adaptation and reduce overfitting risk in tracking. In this paper, we propose a simple yet effective recursive least-squares estimator-aided online learning approach for few-shot online adaptation without requiring offline training. It allows an in-built memory retention mechanism for the model to remember the knowledge about the object seen before, and thus the seen data can be safely removed from training. This also bears certain similarities to the emerging continual learning field in preventing catastrophic forgetting. This mechanism enables us to unveil the power of modern online deep trackers without incurring too much extra computational cost. We evaluate our approach based on two networks in the online learning families for tracking, i.e., multi-layer perceptrons in RT-MDNet and convolutional neural networks in DiMP. The consistent improvements on several challenging tracking benchmarks demonstrate its effectiveness and efficiency.

* Submitted to TPAMI (Major Revision). Extended version of the RLS-RTMDNet tracker (CVPR2020)

Via

Access Paper or Ask Questions

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Nov 17, 2021
Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, Zheng-Jun Zha

Figure 1 for EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Figure 2 for EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Figure 3 for EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Figure 4 for EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching

Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions. However, they have some insuperable drawbacks, e.g., they cannot handle videos without references, and they may result in biased evaluation due to the one-to-many nature of video-to-text and the neglect of visual relevance. From the human evaluator's viewpoint, a high-quality caption should be consistent with the provided video, but not necessarily be similar to the reference in literal or semantics. Inspired by human evaluation, we propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning, which directly measures similarity between video and candidate captions. Benefit from the recent development of large-scale pre-training models, we exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore. Specifically, EMScore combines matching scores of both coarse-grained (video and caption) and fine-grained (frames and words) levels, which takes the overall understanding and detailed characteristics of the video into account. Furthermore, considering the potential information gain, EMScore can be flexibly extended to the conditions where human-labeled references are available. Last but not least, we collect VATEX-EVAL and ActivityNet-FOIl datasets to systematically evaluate the existing metrics. VATEX-EVAL experiments demonstrate that EMScore has higher human correlation and lower reference dependency. ActivityNet-FOIL experiment verifies that EMScore can effectively identify "hallucinating" captions. The datasets will be released to facilitate the development of video captioning metrics. The code is available at: https://github.com/ShiYaya/emscore.

Via

Access Paper or Ask Questions

Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

Nov 05, 2021
Minglang Qiao, Yufan Liu, Mai Xu, Xin Deng, Bing Li, Weiming Hu, Ali Borji

Figure 1 for Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

Figure 2 for Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

Figure 3 for Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

Figure 4 for Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

Visual and audio events simultaneously occur and both attract attention. However, most existing saliency prediction works ignore the influence of audio and only consider vision modality. In this paper, we propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video by leveraging visual, audio and face information. Specifically, we first introduce a large-scale database of multi-face video in visual-audio condition (MVVA), containing eye-tracking data and sound source annotations. Using this database, we find that sound influences human attention, and conversly attention offers a cue to determine sound source on multi-face video. Guided by these findings, a visual-audio multi-task network (VAM-Net) is introduced to predict saliency and locate sound source. VAM-Net consists of three branches corresponding to visual, audio and face modalities. Visual branch has a two-stream architecture to capture spatial and temporal information. Face and audio branches encode audio signals and faces, respectively. Finally, a spatio-temporal multi-modal graph (STMG) is constructed to model the interaction among multiple faces. With joint optimization of these branches, the intrinsic correlation of the tasks of saliency prediction and sound source localization is utilized and their performance is boosted by each other. Experiments show that the proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.

* 21 pages, 15 figures

Via

Access Paper or Ask Questions

SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction

Sep 18, 2021
Zekun Li, Yufan Liu, Bing Li, Weiming Hu, Kebin Wu, Pei Wang

Figure 1 for SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction

Figure 2 for SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction

Figure 3 for SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction

Figure 4 for SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction

Although transformer has achieved great progress on computer vision tasks, the scale variation in dense image prediction is still the key challenge. Few effective multi-scale techniques are applied in transformer and there are two main limitations in the current methods. On one hand, self-attention module in vanilla transformer fails to sufficiently exploit the diversity of semantic information because of its rigid mechanism. On the other hand, it is hard to build attention and interaction among different levels due to the heavy computational burden. To alleviate this problem, we first revisit multi-scale problem in dense prediction, verifying the significance of diverse semantic representation and multi-scale interaction, and exploring the adaptation of transformer to pyramidal structure. Inspired by these findings, we propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI) and Attention Refinement Function (ARF). ISP explores the semantic diversity in different receptive space. CDI builds the global attention and interaction among different levels in decoupled space which also solves the problem of heavy computation. Besides, ARF is further added to refine the attention in transformer. Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state-of-the-art by a significant margin in dense image prediction tasks. Furthermore, the proposed components are all plug-and-play, which can be embedded in other methods.

Via

Access Paper or Ask Questions

Differentiable Convolution Search for Point Cloud Processing

Aug 29, 2021
Xing Nie, Yongcheng Liu, Shaohong Chen, Jianlong Chang, Chunlei Huo, Gaofeng Meng, Qi Tian, Weiming Hu, Chunhong Pan

Figure 1 for Differentiable Convolution Search for Point Cloud Processing

Figure 2 for Differentiable Convolution Search for Point Cloud Processing

Figure 3 for Differentiable Convolution Search for Point Cloud Processing

Figure 4 for Differentiable Convolution Search for Point Cloud Processing

Exploiting convolutional neural networks for point cloud processing is quite challenging, due to the inherent irregular distribution and discrete shape representation of point clouds. To address these problems, many handcrafted convolution variants have sprung up in recent years. Though with elaborate design, these variants could be far from optimal in sufficiently capturing diverse shapes formed by discrete points. In this paper, we propose PointSeaConv, i.e., a novel differential convolution search paradigm on point clouds. It can work in a purely data-driven manner and thus is capable of auto-creating a group of suitable convolutions for geometric shape modeling. We also propose a joint optimization framework for simultaneous search of internal convolution and external architecture, and introduce epsilon-greedy algorithm to alleviate the effect of discretization error. As a result, PointSeaNet, a deep network that is sufficient to capture geometric shapes at both convolution level and architecture level, can be searched out for point cloud processing. Extensive experiments strongly evidence that our proposed PointSeaNet surpasses current handcrafted deep models on challenging benchmarks across multiple tasks with remarkable margins.

Via

Access Paper or Ask Questions

Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Aug 23, 2021
Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, Weiming Hu

Figure 1 for Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Figure 2 for Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Figure 3 for Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Figure 4 for Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition. In GCNs, graph topology dominates feature aggregation and therefore is the key to extracting representative features. In this work, we propose a novel Channel-wise Topology Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies and effectively aggregate joint features in different channels for skeleton-based action recognition. The proposed CTR-GC models channel-wise topologies through learning a shared topology as a generic prior for all channels and refining it with channel-specific correlations for each channel. Our refinement method introduces few extra parameters and significantly reduces the difficulty of modeling channel-wise topologies. Furthermore, via reformulating graph convolutions into a unified form, we find that CTR-GC relaxes strict constraints of graph convolutions, leading to stronger representation capability. Combining CTR-GC with temporal modeling modules, we develop a powerful graph convolutional network named CTR-GCN which notably outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.

* Accepted to ICCV2021. Camera-ready version with supplementary materials. Code is available at https://github.com/Uason-Chen/CTR-GCN

Via

Access Paper or Ask Questions

Learn to Match: Automatic Matching Network Design for Visual Tracking

Aug 02, 2021
Zhipeng Zhang, Yihao Liu, Xiao Wang, Bing Li, Weiming Hu

Figure 1 for Learn to Match: Automatic Matching Network Design for Visual Tracking

Figure 2 for Learn to Match: Automatic Matching Network Design for Visual Tracking

Figure 3 for Learn to Match: Automatic Matching Network Design for Visual Tracking

Figure 4 for Learn to Match: Automatic Matching Network Design for Visual Tracking

Siamese tracking has achieved groundbreaking performance in recent years, where the essence is the efficient matching operator cross-correlation and its variants. Besides the remarkable success, it is important to note that the heuristic matching network design relies heavily on expert experience. Moreover, we experimentally find that one sole matching operator is difficult to guarantee stable tracking in all challenging environments. Thus, in this work, we introduce six novel matching operators from the perspective of feature fusion instead of explicit similarity learning, namely Concatenation, Pointwise-Addition, Pairwise-Relation, FiLM, Simple-Transformer and Transductive-Guidance, to explore more feasibility on matching operator selection. The analyses reveal these operators' selective adaptability on different environment degradation types, which inspires us to combine them to explore complementary features. To this end, we propose binary channel manipulation (BCM) to search for the optimal combination of these operators. BCM determines to retrain or discard one operator by learning its contribution to other tracking steps. By inserting the learned matching networks to a strong baseline tracker Ocean, our model achieves favorable gains by $67.2 \rightarrow 71.4$, $52.6 \rightarrow 58.3$, $70.3 \rightarrow 76.0$ success on OTB100, LaSOT, and TrackingNet, respectively. Notably, Our tracker, dubbed AutoMatch, uses less than half of training data/time than the baseline tracker, and runs at 50 FPS using PyTorch. Code and model will be released at https://github.com/JudasDie/SOTS.

* accepted by ICCV2021

Via

Access Paper or Ask Questions

A New Gastric Histopathology Subsize Image Database (GasHisSDB) for Classification Algorithm Test: from Linear Regression to Visual Transformer

Jun 18, 2021
Weiming Hu, Chen Li, Xiaoyan Li, Md Mamunur Rahaman, Jiquan Ma, Haoyuan Chen, Wanli Liu, Changhao Sun, Yudong Yao, Marcin Grzegorzek

Figure 1 for A New Gastric Histopathology Subsize Image Database (GasHisSDB) for Classification Algorithm Test: from Linear Regression to Visual Transformer

Figure 2 for A New Gastric Histopathology Subsize Image Database (GasHisSDB) for Classification Algorithm Test: from Linear Regression to Visual Transformer

Figure 3 for A New Gastric Histopathology Subsize Image Database (GasHisSDB) for Classification Algorithm Test: from Linear Regression to Visual Transformer

Figure 4 for A New Gastric Histopathology Subsize Image Database (GasHisSDB) for Classification Algorithm Test: from Linear Regression to Visual Transformer

GasHisSDB is a New Gastric Histopathology Subsize Image Database with a total of 245196 images. GasHisSDB is divided into 160*160 pixels sub-database, 120*120 pixels sub-database and 80*80 pixels sub-database. GasHisSDB is made to realize the function of valuating image classification. In order to prove that the methods of different periods in the field of image classification have discrepancies on GasHisSDB, we select a variety of classifiers for evaluation. Seven classical machine learning classifiers, three CNN classifiers and a novel transformer-based classifier are selected for testing on image classification tasks. GasHisSDB is available at the URL:https://github.com/NEUhwm/GasHisSDB.git.

Via

Access Paper or Ask Questions