Dacheng Tao

Cross-Modal Contrastive Learning for Robust Reasoning in VQA

Nov 21, 2022
Qi Zheng, Chaoyue Wang, Daqing Liu, Dadong Wang, Dacheng Tao

Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently. However, most reasoning models rely heavily on shortcuts learned from training data, which prevents their use in challenging real-world scenarios. In this paper, we propose a simple but effective cross-modal contrastive learning strategy to get rid of the shortcut reasoning caused by imbalanced annotations and improve the overall performance. Different from existing contrastive learning with complex negative categories at the coarse (Image, Question, Answer) triplet level, we leverage the correspondences between the language and image modalities to perform finer-grained cross-modal contrastive learning. We treat each Question-Answer (QA) pair as a whole, and differentiate between images that conform with it and those against it. To alleviate the issue of sampling bias, we further build connected graphs among images. For each positive pair, we regard the images from different graphs as negative samples and derive a multi-positive version of contrastive learning. To the best of our knowledge, this is the first paper to show that a general contrastive learning strategy without delicate hand-crafted rules can contribute to robust VQA reasoning. Experiments on several mainstream VQA datasets demonstrate our superiority over the state of the art. Code is available at https://github.com/qizhust/cmcl_vqa_pl.
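As a rough illustration of what a multi-positive contrastive objective can look like, the sketch below computes a generic multi-positive InfoNCE loss for one QA-pair embedding against a set of image embeddings. It is not the paper's formulation: the function name `multi_positive_nce`, the cosine similarity, the temperature value, and the averaging over positives are all assumptions made for this example, and the graph-based selection of negatives is reduced to a plain boolean mask.

```python
import math


def multi_positive_nce(anchor, candidates, is_positive, temperature=0.1):
    """Generic multi-positive InfoNCE loss for a single anchor (hypothetical helper).

    anchor:      embedding of one QA pair (list of floats)
    candidates:  image embeddings (list of lists of floats)
    is_positive: booleans, True where the image conforms with the QA pair
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Temperature-scaled similarities between the anchor and every image.
    logits = [cosine(anchor, c) / temperature for c in candidates]

    # Numerically stable log of the softmax denominator (log-sum-exp).
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))

    positive_logits = [l for l, pos in zip(logits, is_positive) if pos]
    if not positive_logits:
        raise ValueError("at least one positive image is required")

    # Average the negative log-probability over all positives.
    return -sum(l - log_denominator for l in positive_logits) / len(positive_logits)


qa_embedding = [1.0, 0.0]
image_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = multi_positive_nce(qa_embedding, image_embeddings, [True, True, False, False])
loss_misaligned = multi_positive_nce(qa_embedding, image_embeddings, [False, False, True, True])
```

In this toy setup the loss is lower when the images marked positive are the ones closest to the QA embedding, which is the behaviour a contrastive objective of this kind is meant to reward.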

Adaptive Edge-to-Edge Interaction Learning for Point Cloud Analysis

Nov 20, 2022
Shanshan Zhao, Mingming Gong, Xi Li, Dacheng Tao

Recent years have witnessed the great success of deep learning on various point cloud analysis tasks, e.g., classification and semantic segmentation. Since point cloud data is sparse and irregularly distributed, one key issue for point cloud data processing is extracting useful information from local regions. To achieve this, previous works mainly extract the points' features from local regions by learning the relation between each pair of adjacent points. However, these works ignore the relation between edges in local regions, which encodes the local shape information. Associating the neighbouring edges could potentially make the point-to-point relation more aware of the local structure and more robust. To explore the role of the relation between edges, this paper proposes a novel Adaptive Edge-to-Edge Interaction Learning module, which aims to enhance the point-to-point relation through modelling the edge-to-edge interaction in the local region adaptively. We further extend the module to a symmetric version to capture the local structure more thoroughly. Taking advantage of the proposed modules, we develop two networks for segmentation and shape classification tasks, respectively. Various experiments on several public point cloud datasets demonstrate the effectiveness of our method for point cloud analysis.

* Technical Report

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Nov 19, 2022
Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations and thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel, solving the sub-tasks in text spotting in a unified framework. Besides, we also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations, which require much less annotation cost than polygons. The code will be released.

Unifying Flow, Stereo and Depth Estimation

Nov 10, 2022
Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, Andreas Geiger

We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
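To make the idea of matching by directly comparing feature similarities concrete, here is a minimal one-dimensional sketch. It assumes dot-product similarity followed by a softmax and an expected-position readout, which is one common way to turn a similarity table into a displacement; the abstract does not specify these details, and the helper names `soft_match_1d` and `displacement_1d` are hypothetical.

```python
import math


def soft_match_1d(features_a, features_b, temperature=1.0):
    """For each position in A, return the softmax-weighted matching position in B."""
    matches = []
    for feat_a in features_a:
        # Dot-product similarity between this feature and every feature in B.
        scores = [sum(x * y for x, y in zip(feat_a, feat_b)) / temperature
                  for feat_b in features_b]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Expected index in B under the softmax matching distribution.
        matches.append(sum(j * w for j, w in enumerate(weights)) / total)
    return matches


def displacement_1d(features_a, features_b, temperature=1.0):
    """Displacement (a 1-D stand-in for flow or disparity) per position in A."""
    matched = soft_match_1d(features_a, features_b, temperature)
    return [m - i for i, m in enumerate(matched)]


# Toy data: B is A shifted right by one position, with an unrelated feature in front.
features_a = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
features_b = [[0, 0, 0, 0, 1]] + features_a
shift = displacement_1d(features_a, features_b, temperature=0.02)
```

With distinctive features and a low temperature, every entry of `shift` is very close to 1, the true offset. The same readout applies to 2-D flow or to disparity along a scanline once the candidate set is defined accordingly.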

* Project Page: https://haofeixu.github.io/unimatch, Code: https://github.com/autonomousvision/unimatch

Cherry Hypothesis: Identifying the Cherry on the Cake for Dynamic Networks

Nov 10, 2022
Shwai He, Liang Ding, Daize Dong, Boan Liu, Fuqiang Yu, Dacheng Tao

Dynamic networks have been extensively explored as they can considerably improve the model's representation power with acceptable computational cost. The common practice in implementing dynamic networks is to convert given static layers into fully dynamic ones where all parameters are dynamic and vary with the input. Recent studies empirically show a trend that more dynamic layers lead to ever-increasing performance. However, such a fully dynamic setting 1) may cause redundant parameters and high deployment costs, limiting the applicability of dynamic networks to a broader range of tasks and models, and more importantly, 2) contradicts a previous finding about the human brain: when it processes an attention-demanding task, only some of the neurons in the task-specific areas are activated by the input, while the remaining neurons stay in a baseline state. Critically, there has been no effort to understand and resolve this contradiction, leaving the primary question, whether to make the computational parameters fully dynamic or not, unanswered. The main contributions of our work are challenging the basic common sense in dynamic networks, and proposing and validating the Cherry Hypothesis: a fully dynamic network contains a subset of dynamic parameters such that, when the other dynamic parameters are transformed into static ones, it can maintain or even exceed the performance of the original network. Technically, we propose a brain-inspired partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones. We further design Iterative Mode Partition to partition the dynamic and static subnets, which alleviates the redundancy in traditional fully dynamic networks. Our hypothesis and method are comprehensively supported by large-scale experiments with typical advanced dynamic methods.
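The difference between a fully dynamic and a partially dynamic layer can be sketched with a toy example. Everything here is an assumption made for illustration: the sigmoid gate, the `delta_w` offsets, and the hand-written boolean mask are placeholders, and the code does not reproduce PAD-Net or Iterative Mode Partition. It only shows how a mask can keep some parameters input-dependent while the rest stay static.

```python
import math


def dynamic_weights(static_w, delta_w, x):
    """Toy fully dynamic weights: every weight is shifted by an input-dependent gate."""
    gate = 1.0 / (1.0 + math.exp(-sum(x) / len(x)))  # sigmoid of the mean input
    return [s + gate * d for s, d in zip(static_w, delta_w)]


def partially_dynamic_weights(static_w, delta_w, x, dynamic_mask):
    """Keep the dynamic value where the mask is True, the static value elsewhere."""
    fully_dynamic = dynamic_weights(static_w, delta_w, x)
    return [dyn if keep else stat
            for stat, dyn, keep in zip(static_w, fully_dynamic, dynamic_mask)]


static_w = [1.0, 2.0]
delta_w = [2.0, 4.0]
x = [0.0, 0.0]  # mean is 0, so the gate is exactly 0.5

full = dynamic_weights(static_w, delta_w, x)
partial = partially_dynamic_weights(static_w, delta_w, x, [True, False])
```

Here `full` changes both weights with the input, whereas `partial` lets only the first weight vary and leaves the second at its static value.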

Rethinking Hierarchies in Pre-trained Plain Vision Transformer

Nov 08, 2022
Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective. However, customized algorithms should be carefully designed for the hierarchical ViTs, e.g., GreenMIM, instead of using the vanilla and simple MAE for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of the plain ViTs, the requirement of pre-training them leads to a massive amount of computational cost, thereby incurring both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 sequentially. Despite its simplicity, it outperforms the plain ViT baseline in classification, detection, and segmentation tasks on ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks, respectively. We hope this preliminary study could draw more attention from the community on developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging the off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.
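The resolution arithmetic in the abstract can be checked with a short sketch. It assumes three pooling steps that each halve the feature map, which is one way to go from 1/4 to 1/32 of the input size; the abstract does not state the number of pooling layers, and the function names below are hypothetical.

```python
def stage_resolutions(image_size, patch_stride=4, num_pools=3, pool_stride=2):
    """Return (cumulative stride, side length, token count) for each stage."""
    stages = []
    stride = patch_stride
    for _ in range(num_pools + 1):
        side = image_size // stride
        stages.append((stride, side, side * side))
        stride *= pool_stride  # each pooling layer halves the resolution (assumed)
    return stages


def avg_pool_2x2(grid):
    """Simple 2x2 average pooling with stride 2 over a 2-D list with even dimensions."""
    rows, cols = len(grid), len(grid[0])
    return [
        [(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


hierarchical = stage_resolutions(224)
# [(4, 56, 3136), (8, 28, 784), (16, 14, 196), (32, 7, 49)]
plain_vit = stage_resolutions(224, patch_stride=16, num_pools=0)
# [(16, 14, 196)]
```

For a 224x224 input, the plain ViT with stride 16 works on a single 14x14 grid, while the stride-4 variant starts at 56x56 and reaches 7x7 after the assumed three pooling steps.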

* Tech report, work in progress

Rethinking Hierarchies in Pre-trained Plain Vision Transformer

Nov 03, 2022
Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

* Tech report, work in progress

Adversarial Auto-Augment with Label Preservation: A Representation Learning Principle Guided Approach

Nov 02, 2022
Kaiwen Yang, Yanchao Sun, Jiahao Su, Fengxiang He, Xinmei Tian, Furong Huang, Tianyi Zhou, Dacheng Tao

Data augmentation is a critical contributing factor to the success of deep learning but heavily relies on prior domain knowledge, which is not always available. Recent works on automatic data augmentation learn a policy to form a sequence of augmentation operations, which are still pre-defined and restricted to limited options. In this paper, we show that a prior-free autonomous data augmentation's objective can be derived from a representation learning principle that aims to preserve the minimum sufficient information of the labels. Given an example, the objective aims at creating a distant "hard positive example" as the augmentation, while still preserving the original label. We then propose a practical surrogate to the objective that can be optimized efficiently and integrated seamlessly into existing methods for a broad class of machine learning tasks, e.g., supervised, semi-supervised, and noisy-label learning. Unlike previous works, our method does not require training an extra generative model but instead leverages the intermediate layer representations of the end-task model for generating data augmentations. In experiments, we show that our method consistently brings non-trivial improvements to the three aforementioned learning tasks in both efficiency and final performance, whether or not it is combined with strong pre-defined augmentations, e.g., on medical images when domain knowledge is unavailable and the existing augmentation techniques perform poorly. Code is available at: https://github.com/kai-wen-yang/LPA3.

* 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

TASA: Deceiving Question Answering Models by Twin Answer Sentences Attack

Oct 27, 2022
Yu Cao, Dianqi Li, Meng Fang, Tianyi Zhou, Jun Gao, Yibing Zhan, Dacheng Tao

We present Twin Answer Sentences Attack (TASA), an adversarial attack method for question answering (QA) models that produces fluent and grammatical adversarial contexts while maintaining gold answers. Despite phenomenal progress on general adversarial attacks, few works have investigated vulnerabilities and attacks specific to QA models. In this work, we first explore the biases in the existing models and discover that they mainly rely on keyword matching between the question and context, and ignore the relevant contextual relations for answer prediction. Based on the two biases above, TASA attacks the target model in two ways: (1) lowering the model's confidence in the gold answer with a perturbed answer sentence; (2) misguiding the model towards a wrong answer with a distracting answer sentence. Equipped with purpose-built beam search and filtering methods, TASA can generate more effective attacks than existing textual attack methods while sustaining the quality of contexts, as shown in extensive experiments on five QA datasets and in human evaluations.

* Accepted by EMNLP 2022 (long), 9 pages main + 2 pages references + 7 pages appendix

Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations

Oct 12, 2022
Yuanyuan Wang, Wei Huang, Mingming Gong, Xi Geng, Tongliang Liu, Kun Zhang, Dacheng Tao

Ordinary Differential Equations (ODEs) have recently gained a lot of attention in machine learning. However, the theoretical aspects, e.g., identifiability and asymptotic properties of statistical estimation, are still obscure. This paper derives a sufficient condition for the identifiability of homogeneous linear ODE systems from a sequence of equally spaced error-free observations sampled from a single trajectory. When observations are disturbed by measurement noise, we prove that under mild conditions, the parameter estimator based on the Nonlinear Least Squares (NLS) method is consistent and asymptotically normal with an $n^{-1/2}$ convergence rate. Based on the asymptotic normality property, we construct confidence sets for the unknown system parameters and propose a new method to infer the causal structure of the ODE system, i.e., inferring whether there is a causal link between system variables. Furthermore, we extend the results to degraded observations, including aggregated and time-scaled ones. To the best of our knowledge, our work is the first systematic study of the identifiability and asymptotic properties in learning linear ODE systems. We also construct simulations with various system dimensions to illustrate the established theoretical results.
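For readers unfamiliar with the setup, the quantities named in the abstract can be written out as follows. The notation is assumed for illustration and may differ from the paper's: $A$ is the system matrix, $x_0$ the initial condition, $t_k$ the equally spaced observation times, $\varepsilon_k$ additive measurement noise, $\theta = (A, x_0)$ the unknown parameters with true value $\theta^{*}$, and $\Sigma$ an unspecified asymptotic covariance.

```latex
\begin{aligned}
\dot{x}(t) &= A\,x(t), \quad x(0) = x_0
  \quad\Longrightarrow\quad x(t) = e^{At}x_0, \\
y_k &= e^{A t_k}x_0 + \varepsilon_k, \qquad k = 1, \dots, n, \\
\hat{\theta}_n &= \arg\min_{\theta = (A,\,x_0)} \sum_{k=1}^{n}
  \bigl\lVert y_k - e^{A t_k}x_0 \bigr\rVert_2^{2}, \\
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta^{*}\bigr) &\xrightarrow{\;d\;} \mathcal{N}(0, \Sigma).
\end{aligned}
```

The last line is the generic form of asymptotic normality at the $n^{-1/2}$ rate; confidence sets for the entries of $A$ follow from it once $\Sigma$ is estimated.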
