Feng Wu

ConE: Cone Embeddings for Multi-Hop Reasoning over Knowledge Graphs

Oct 26, 2021
Zhanqiu Zhang, Jie Wang, Jiajun Chen, Shuiwang Ji, Feng Wu

Query embedding (QE), which aims to embed entities and first-order logical (FOL) queries in low-dimensional spaces, has shown great power in multi-hop reasoning over knowledge graphs. Recently, embedding entities and queries with geometric shapes has become a promising direction, as geometric shapes can naturally represent answer sets of queries and the logical relationships among them. However, existing geometry-based models have difficulty modeling queries with negation, which significantly limits their applicability. To address this challenge, we propose a novel query embedding model, namely Cone Embeddings (ConE), which is the first geometry-based QE model that can handle all the FOL operations, including conjunction, disjunction, and negation. Specifically, ConE represents entities and queries as Cartesian products of two-dimensional cones, where the intersection and union of cones naturally model the conjunction and disjunction operations. By further noticing that the closure of the complement of a cone remains a cone, we design geometric complement operators in the embedding space for the negation operations. Experiments demonstrate that ConE significantly outperforms existing state-of-the-art methods on benchmark datasets.

* Accepted to NeurIPS 2021
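
The geometric fact behind the negation operator can be illustrated with a minimal sketch. It treats a single two-dimensional cone as a sector given by an axis angle and an aperture; the class name `Cone2D` and its methods are hypothetical and are not taken from the authors' code, and the full model would apply the same idea to each factor of the Cartesian product.

```python
import math
from dataclasses import dataclass

TWO_PI = 2.0 * math.pi


@dataclass(frozen=True)
class Cone2D:
    """Hypothetical sector-shaped cone: axis angle in [-pi, pi), aperture in [0, 2*pi]."""
    axis: float
    aperture: float

    def contains(self, angle: float) -> bool:
        # Smallest absolute angular distance between `angle` and the axis.
        diff = abs((angle - self.axis + math.pi) % TWO_PI - math.pi)
        # Small tolerance so that boundary rays count as inside (closed cone).
        return diff <= self.aperture / 2.0 + 1e-12

    def complement(self) -> "Cone2D":
        # The closure of the complement is the sector pointing in the
        # opposite direction that covers the remaining aperture.
        flipped_axis = (self.axis + TWO_PI) % TWO_PI - math.pi
        return Cone2D(flipped_axis, TWO_PI - self.aperture)


query = Cone2D(axis=0.0, aperture=math.pi / 2)
negated = query.complement()
print(negated)                    # axis flipped by pi, aperture 3*pi/2
print(query.contains(0.1))        # inside the original cone
print(negated.contains(math.pi))  # inside the complement cone
```

Applying `complement()` twice returns the original cone (up to floating-point error), which matches the intuition that double negation leaves a query unchanged.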

End-to-End Image Compression with Probabilistic Decoding

Sep 30, 2021
Haichuan Ma, Dong Liu, Cunhui Dong, Li Li, Feng Wu

Lossy image compression is a many-to-one process, so one bitstream corresponds to multiple possible original images, especially at low bit rates. However, this property has seldom been considered in previous studies on image compression, which usually choose one possible image as the reconstruction, e.g., the one with the maximum a posteriori probability. We propose a learned image compression framework that natively supports probabilistic decoding. The compressed bitstream is decoded into a series of parameters that instantiate a pre-chosen distribution; the decoder then samples from this distribution to reconstruct images. The decoder may adopt different sampling strategies and produce diverse reconstructions, some of which have higher signal fidelity and others better visual quality. The proposed framework depends on an invertible neural network-based transform that converts pixels into coefficients that obey the pre-chosen distribution as closely as possible. Our code and models will be made publicly available.
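
The decoding pipeline described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the pre-chosen distribution is taken to be Gaussian, the sampling strategy is a single `temperature` knob, and the learned transform is replaced by a toy pairwise average/difference transform that happens to be exactly invertible. None of the function names come from the authors' code.

```python
import random


def forward_transform(pixels):
    """Toy invertible transform (assumed): pairwise average and half-difference.

    Expects an even number of pixel values.
    """
    coeffs = []
    for a, b in zip(pixels[0::2], pixels[1::2]):
        coeffs.extend([(a + b) / 2.0, (a - b) / 2.0])
    return coeffs


def inverse_transform(coeffs):
    """Exact inverse of forward_transform."""
    pixels = []
    for s, d in zip(coeffs[0::2], coeffs[1::2]):
        pixels.extend([s + d, s - d])
    return pixels


def probabilistic_decode(params, temperature, rng):
    """Sample coefficients from per-coefficient Gaussians, then invert.

    params: list of (mean, std) pairs, standing in for what the bitstream
    would be decoded into. temperature=0 returns the mean reconstruction;
    larger values give more diverse reconstructions.
    """
    coeffs = [mean + temperature * std * rng.gauss(0.0, 1.0)
              for mean, std in params]
    return inverse_transform(coeffs)


params = [(10.0, 1.0), (2.0, 0.5)]  # hypothetical decoded parameters
print(probabilistic_decode(params, 0.0, random.Random(0)))  # deterministic
print(probabilistic_decode(params, 1.0, random.Random(0)))  # one random sample
```

With `temperature=0` the decoder behaves like a conventional single-reconstruction codec; raising it trades signal fidelity for diversity among the sampled reconstructions.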

VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows

Aug 11, 2021
Xiao Wang, Jianing Li, Lin Zhu, Zhipeng Zhang, Zhe Chen, Xin Li, Yaowei Wang, Yonghong Tian, Feng Wu

Unlike visible cameras, which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous and sparse events with much lower latency. In practice, visible cameras better perceive texture details and slow motion, while event cameras are free from motion blur and have a larger dynamic range, which enables them to work well under fast motion and low illumination. Therefore, the two sensors can cooperate to achieve more reliable object tracking. In this work, we propose a large-scale Visible-Event benchmark (termed VisEvent) to address the lack of a realistic, large-scale dataset for this task. Our dataset consists of 820 video pairs captured under low illumination, high speed, and background clutter scenarios, and it is divided into training and testing subsets containing 500 and 320 videos, respectively. Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods by extending current single-modality trackers into dual-modality versions. More importantly, we build a simple but effective tracking algorithm by proposing a cross-modality transformer that achieves more effective feature fusion between visible and event data. Extensive experiments on the proposed VisEvent dataset and two simulated datasets (i.e., OTB-DVS and VOT-DVS) validate the effectiveness of our model. The dataset and source code will be available at our project page: https://sites.google.com/view/viseventtrack/.

* Work in Progress
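
One common way to turn an event flow into an event image is to accumulate signed event counts per pixel over a time window. The sketch below assumes that scheme and an `(x, y, t, polarity)` event layout; the abstract does not say which conversion the authors use, and the function name is hypothetical.

```python
def events_to_image(events, width, height, t_start, t_end):
    """Accumulate events inside [t_start, t_end) into a signed count image.

    events: iterable of (x, y, t, polarity) tuples, polarity > 0 for a
    brightness increase and <= 0 for a decrease (assumed layout).
    Returns a height x width nested list of integers.
    """
    image = [[0 for _ in range(width)] for _ in range(height)]
    for x, y, t, polarity in events:
        in_window = t_start <= t < t_end
        in_bounds = 0 <= x < width and 0 <= y < height
        if in_window and in_bounds:
            image[y][x] += 1 if polarity > 0 else -1
    return image


# Hypothetical events: two positive at (0, 0), one negative at (1, 0),
# and one at (1, 1) that falls outside the 10 ms window.
events = [(0, 0, 0.001, 1), (0, 0, 0.002, 1), (1, 0, 0.003, -1), (1, 1, 0.050, 1)]
print(events_to_image(events, width=2, height=2, t_start=0.0, t_end=0.010))
# [[2, -1], [0, 0]]
```

Choosing the window to match the frame interval of the visible camera yields one event image per video frame, which is what lets a single-modality tracker be extended to a dual-modality one.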

Disentangle Your Dense Object Detector

Jul 27, 2021
Zehui Chen, Chenhongyi Yang, Qiaofei Li, Feng Zhao, Zheng-Jun Zha, Feng Wu

Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding. However, the current training pipeline for dense detectors is compromised by many conjunctions that may not hold. In this paper, we investigate three such important conjunctions: 1) only samples assigned as positive in the classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. We first carry out a series of pilot experiments to show that disentangling such conjunctions can lead to consistent performance improvements. Then, based on these findings, we propose the Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art dense object detectors. Extensive experiments on the MS COCO benchmark show that our approach can lead to 2.0 mAP, 2.4 mAP and 2.2 mAP absolute improvements on RetinaNet, FCOS, and ATSS baselines with negligible extra overhead. Notably, our best model reaches 55.0 mAP on the COCO test-dev set and 93.5 AP on the hard subset of WIDER FACE, achieving new state-of-the-art performance on these two competitive benchmarks. Code is available at https://github.com/zehuichen123/DDOD.

* ACM MM 2021

MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking

Jul 22, 2021
Xiao Wang, Xiujun Shu, Shiliang Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, Feng Wu

Many RGB-T trackers attempt to attain robust feature representation by utilizing an adaptive weighting scheme (or attention mechanism). Unlike these works, we propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data by adaptively adjusting the convolutional kernels for various input images in practical tracking. Given image pairs as input, we first encode their features with the backbone network. Then, we concatenate these feature maps and generate dynamic modality-aware filters with two independent networks. The visible and thermal filters are used to conduct a dynamic convolutional operation on their corresponding input feature maps. Inspired by residual connections, the generated visible and thermal feature maps are summed with the input feature maps. The augmented feature maps are fed into the RoI align module to generate instance-level features for subsequent classification. To address issues caused by heavy occlusion, fast motion, and out-of-view targets, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism. A spatial and temporal recurrent neural network is used to capture the direction-aware context for accurate global attention prediction. Extensive experiments on three large-scale RGB-T tracking benchmark datasets validate the effectiveness of our proposed algorithm. The project page of this paper is available at https://sites.google.com/view/mfgrgbttrack/.

* In Peer Review

Tracking by Joint Local and Global Search: A Target-aware Attention based Approach

Jun 09, 2021
Xiao Wang, Jin Tang, Bin Luo, Yaowei Wang, Yonghong Tian, Feng Wu

Tracking-by-detection is a very popular framework for single object tracking that searches for the target object within a local search window in each frame. Although such a local search mechanism works well on simple videos, it makes the trackers sensitive to extremely challenging scenarios, such as heavy occlusion and fast motion. In this paper, we propose a novel and general target-aware attention mechanism (termed TANet) and integrate it with the tracking-by-detection framework to conduct joint local and global search for robust tracking. Specifically, we extract the features of the target object patch and continuous video frames, then concatenate them and feed them into a decoder network to generate target-aware global attention maps. More importantly, we resort to adversarial training for better attention prediction. The appearance and motion discriminator networks are designed to ensure that the attention is consistent in spatial and temporal views. In the tracking procedure, we integrate the target-aware attention with multiple trackers by exploring candidate search regions for robust tracking. Extensive experiments on both short-term and long-term tracking benchmark datasets validate the effectiveness of our algorithm. The project page of this paper can be found at https://sites.google.com/view/globalattentiontracking/home/extend.

* Accepted by IEEE TNNLS 2021

MViT: Mask Vision Transformer for Facial Expression Recognition in the wild

Jun 08, 2021
Hanting Li, Mingzhe Sui, Feng Zhao, Zhengjun Zha, Feng Wu

Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision due to varied backgrounds, low-quality facial images, and the subjectiveness of annotators. These uncertainties make it difficult for neural networks to learn robust features on limited-scale datasets. Moreover, the networks can be easily distracted by the above factors and make incorrect decisions. Recently, the vision transformer (ViT) and data-efficient image transformers (DeiT) have shown strong performance in traditional classification tasks. The self-attention mechanism gives transformers a global receptive field in the first layer, which dramatically enhances the feature extraction capability. In this work, we first propose a novel pure transformer-based mask vision transformer (MViT) for FER in the wild, which consists of two modules: a transformer-based mask generation network (MGN) to generate a mask that can filter out complex backgrounds and occlusion of face images, and a dynamic relabeling module to rectify incorrect labels in FER datasets in the wild. Extensive experimental results demonstrate that our MViT outperforms state-of-the-art methods on RAF-DB with 88.62%, FERPlus with 89.22%, and AffectNet-7 with 64.57%, and achieves a comparable result on AffectNet-8 with 61.40%.

* 11 pages, 6 figures, conference, 5 tables

Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer

Jun 08, 2021
Yulin Li, Jianfeng He, Tianzhu Zhang, Xiang Liu, Yongdong Zhang, Feng Wu

Occluded person re-identification (Re-ID) is a challenging task as persons are frequently occluded by various obstacles or other persons, especially in crowded scenes. To address these issues, we propose a novel end-to-end Part-Aware Transformer (PAT) for occluded person Re-ID through diverse part discovery via a transformer encoder-decoder architecture, including a pixel context based transformer encoder and a part prototype based transformer decoder. The proposed PAT model enjoys several merits. First, to the best of our knowledge, this is the first work to exploit the transformer encoder-decoder architecture for occluded person Re-ID in a unified deep model. Second, to learn part prototypes well with only identity labels, we design two effective mechanisms, part diversity and part discriminability. Consequently, we can achieve diverse part discovery for occluded person Re-ID in a weakly supervised manner. Extensive experimental results on six challenging benchmarks for three tasks (occluded, partial and holistic Re-ID) demonstrate that our proposed PAT performs favorably against state-of-the-art methods.

* Accepted by CVPR 2021

Action Unit Memory Network for Weakly Supervised Temporal Action Localization

Apr 29, 2021
Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu, Yongdong Zhang

Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos with only video-level labels during training. However, without frame-level annotations, it is challenging to achieve localization completeness and relieve background interference. In this paper, we present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization, which can mitigate the above two challenges by learning an action unit memory bank. In the proposed AUMN, two attention modules are designed to update the memory bank adaptively and learn action unit-specific classifiers. Furthermore, three effective mechanisms (diversity, homogeneity and sparsity) are designed to guide the updating of the memory network. To the best of our knowledge, this is the first work to explicitly model the action units with a memory network. Extensive experimental results on two standard benchmarks (THUMOS14 and ActivityNet) demonstrate that our AUMN performs favorably against state-of-the-art methods. Specifically, the average mAP over IoU thresholds from 0.1 to 0.5 on the THUMOS14 dataset is significantly improved from 47.0% to 52.1%.

* Accepted by CVPR 2021
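
The IoU thresholds mentioned in the abstract refer to the temporal overlap between a predicted action segment and a ground-truth segment. A minimal sketch of that overlap measure follows; the threshold step of 0.1 and the helper names are assumptions for illustration, and the sketch does not compute mAP itself.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Assumed threshold grid: 0.1 to 0.5 in steps of 0.1.
THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5]


def thresholds_matched(pred, gt, thresholds=THRESHOLDS):
    """Thresholds at which `pred` would count as a correct localization of `gt`."""
    iou = temporal_iou(pred, gt)
    return [t for t in thresholds if iou >= t]


# Hypothetical segments: 5 s of overlap out of a 15 s union, so IoU = 1/3.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
print(thresholds_matched((0.0, 10.0), (5.0, 15.0)))  # [0.1, 0.2, 0.3]
```

A prediction that covers only part of an action passes the loose thresholds but fails the strict ones, which is why averaging over several thresholds rewards localization completeness.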
