Yiran Zhong

Neural Architecture Search on Efficient Transformers and Beyond

Jul 28, 2022

Zexiang Liu, Dong Li, Kaiyue Lu, Zhen Qin, Weixuan Sun, Jiacheng Xu, Yiran Zhong

Abstract: Recently, numerous efficient Transformers have been proposed to reduce the quadratic computational complexity of standard Transformers caused by the Softmax attention. However, most of them simply swap Softmax with an efficient attention mechanism without considering architectures customized for the efficient attention. In this paper, we argue that the handcrafted vanilla Transformer architectures for Softmax attention may not be suitable for efficient Transformers. To address this issue, we propose a new framework to find optimal architectures for efficient Transformers with the neural architecture search (NAS) technique. The proposed method is validated on popular machine translation and image classification tasks. We observe that the optimal architecture of the efficient Transformer requires less computation than that of the standard Transformer, but its overall accuracy is lower. This indicates that Softmax attention and efficient attention each have their own strengths, but neither can balance accuracy and efficiency well on its own. This motivates us to mix the two types of attention to reduce the performance imbalance. Besides the search spaces that are commonly used in existing NAS Transformer approaches, we propose a new search space that allows the NAS algorithm to automatically search the attention variants along with architectures. Extensive experiments on WMT'14 En-De and CIFAR-10 demonstrate that our searched architecture maintains comparable accuracy to the standard Transformer with notably improved computational efficiency.

AMF: Adaptable Weighting Fusion with Multiple Fine-tuning for Image Classification

Jul 26, 2022

Xuyang Shen, Jo Plested, Sabrina Caldwell, Yiran Zhong, Tom Gedeon

Abstract: Fine-tuning is widely applied in image classification tasks as a transfer learning approach. It re-uses the knowledge from a source task to achieve high performance on target tasks. Fine-tuning can alleviate the challenges of insufficient training data and expensive labelling of new data. However, standard fine-tuning has limited performance on complex data distributions. To address this issue, we propose the Adaptable Multi-tuning method, which adaptively determines each data sample's fine-tuning strategy. In this framework, multiple fine-tuning settings and one policy network are defined. The policy network in Adaptable Multi-tuning can dynamically adjust to an optimal weighting to feed different samples into models that are trained using different fine-tuning strategies. Our method outperforms the standard fine-tuning approach by 1.69% and 2.79% on the FGVC-Aircraft and Describable Texture datasets, respectively, while yielding comparable performance on the Stanford Cars, CIFAR-10, and Fashion-MNIST datasets.

* 9 pages
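
The per-sample weighting described in the AMF abstract can be illustrated with a minimal sketch. It assumes the policy network emits one raw score per fine-tuned model and that the final prediction is a softmax-weighted average of the models' class probabilities. The function names and the fusion rule are hypothetical, chosen for illustration only; the abstract does not specify how the weights are applied.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw policy scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(policy_scores: List[float],
                     model_probs: List[List[float]]) -> List[float]:
    """Weighted average of class probabilities from several models.

    policy_scores: one hypothetical score per fine-tuned model, for one sample.
    model_probs:   model_probs[m][c] is model m's probability for class c.
    """
    weights = softmax(policy_scores)
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs))
        for c in range(n_classes)
    ]


# Two models that disagree; equal policy scores give a plain average.
fused_equal = fuse_predictions([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# A higher score for the first model shifts the fused prediction towards it.
fused_biased = fuse_predictions([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the weights depend on the sample, two inputs can lean on different fine-tuning strategies even though the underlying models are fixed.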

Deep Laparoscopic Stereo Matching with Transformers

Jul 25, 2022

Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Tom Drummond, Zhiyong Wang, Zongyuan Ge

Abstract: The self-attention mechanism, successfully employed with the transformer structure, has shown promise in many computer vision tasks, including image recognition and object detection. Despite this surge, the use of the transformer for the problem of stereo matching remains relatively unexplored. In this paper, we comprehensively investigate the use of the transformer for the problem of stereo matching, especially for laparoscopic videos, and propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of the CNN and the transformer in a unified design. To be specific, we investigate several ways to introduce transformers to volumetric stereo matching pipelines by analyzing the loss landscape of the designs and in-domain/cross-domain accuracy. Our analysis suggests that employing transformers for feature representation learning, while using CNNs for cost aggregation, will lead to faster convergence, higher accuracy and better generalization than other options. Our extensive experiments on Sceneflow, SCARED2019 and dVPN datasets demonstrate the superior performance of our HybridStereoNet.

* Accepted to MICCAI 2022; Xuelian Cheng and Yiran Zhong made equal contributions. Code: https://github.com/XuelianCheng/HybridStereoNet-main.git

Audio-Visual Segmentation

Jul 11, 2022

Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

Abstract: We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.

* Accepted to ECCV 2022; Jinxing Zhou and Jianyuan Wang contributed equally; Meng Wang and Yiran Zhong are corresponding authors; Code is available at https://github.com/OpenNLPLab/AVSBench

Vicinity Vision Transformer

Jun 21, 2022

Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, Yiran Zhong

Abstract: Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prohibits vision transformers from scaling up to high-resolution images, because both its computational complexity and its memory footprint are quadratic. Although linear attention was introduced in natural language processing (NLP) tasks to mitigate a similar issue, directly applying existing linear attention to vision transformers may not lead to satisfactory results. We investigate this problem and find that computer vision tasks focus more on local information compared with NLP tasks. Based on this observation, we present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity. Specifically, for each image patch, we adjust its attention weight based on its 2D Manhattan distance from its neighbouring patches. In this case, the neighbouring patches will receive stronger attention than far-away patches. Moreover, since our Vicinity Attention requires the token length to be much larger than the feature dimension to show its efficiency advantages, we further propose a new Vicinity Vision Transformer (VVT) structure to reduce the feature dimension without degrading the accuracy. We perform extensive experiments on the CIFAR100, ImageNet1K, and ADE20K datasets to validate the effectiveness of our method. Our method has a slower growth rate of GFLOPs than previous transformer-based and convolution-based networks when the input resolution increases. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
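
To make the locality bias in the Vicinity Attention abstract concrete, here is a small sketch that computes a weight for every patch on a grid from its 2D Manhattan distance to a query patch. The linear decay and the function name are hypothetical choices for illustration; the abstract does not state the actual weighting function, and this sketch does not reproduce the linear-complexity formulation.

```python
from typing import List


def manhattan_locality_weights(height: int, width: int,
                               row: int, col: int) -> List[List[float]]:
    """Weight each patch by closeness to the query patch at (row, col).

    Uses an illustrative linear decay: 1.0 at the query patch, falling to
    0.0 at the largest Manhattan distance possible on the grid.
    """
    max_dist = max(1, (height - 1) + (width - 1))  # avoid division by zero on 1x1
    return [
        [1.0 - (abs(r - row) + abs(c - col)) / max_dist for c in range(width)]
        for r in range(height)
    ]


# Query patch at the centre of a 3x3 grid of patches.
weights = manhattan_locality_weights(3, 3, 1, 1)
centre = weights[1][1]      # distance 0
neighbour = weights[1][0]   # distance 1
corner = weights[0][0]      # distance 2
```

Direct neighbours end up with larger weights than diagonal corners, which is the ordering the abstract describes.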

Deep Non-rigid Structure-from-Motion: A Sequence-to-Sequence Translation Perspective

Apr 10, 2022

Hui Deng, Tong Zhang, Yuchao Dai, Jiawei Shi, Yiran Zhong, Hongdong Li

Abstract: Directly regressing the non-rigid shape and camera pose from the individual 2D frame is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. This frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the whole 3D sequence from the input 2D sequence. In this paper, we propose to model deep NRSfM from a sequence-to-sequence translation perspective, where the input 2D frame sequence is taken as a whole to reconstruct the deforming 3D non-rigid shape sequence. First, we apply a shape-motion predictor to estimate the initial non-rigid shape and camera motion from a single frame. Then we propose a context modeling module to model camera motions and complex non-rigid shapes. To tackle the difficulty in enforcing the global structure constraint within the deep framework, we propose to impose the union-of-subspace structure by replacing the self-expressiveness layer with multi-head attention and delayed regularizers, which enables end-to-end batch-wise training. Experimental results across different datasets such as Human3.6M, CMU Mocap and InterHand prove the superiority of our framework. The code will be made publicly available.

Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition

Mar 29, 2022

Jingyu Sun, Guiping Zhong, Dinghao Zhou, Baoxiang Li, Yiran Zhong

Abstract: Conformer has shown great success in automatic speech recognition (ASR) on many public benchmarks. One of its crucial drawbacks is the quadratic time-space complexity with respect to the input sequence length, which prevents the model from scaling up and from processing longer input audio sequences. To solve this issue, numerous linear attention methods have been proposed. However, these methods often have limited performance on ASR as they treat tokens equally in modeling, neglecting the fact that neighbouring tokens are often more closely connected than distant tokens. In this paper, we take this fact into account and propose a new locality-biased linear attention for Conformer. It not only achieves higher accuracy than the vanilla Conformer, but also enjoys linear space-time computational complexity. To be specific, we replace the softmax attention with a locality-biased linear attention (LBLA) mechanism in Conformer blocks. The LBLA contains a kernel function to ensure linear complexity and a cosine reweighting matrix to impose more weight on neighbouring tokens. Extensive experiments on the LibriSpeech corpus show that by introducing this locality bias to the Conformer, our method achieves a lower word error rate with more than 22% faster inference.

* 5 pages, 2 figures, submitted to Interspeech 2022

Implicit Motion Handling for Video Camouflaged Object Detection

Mar 15, 2022

Xuelian Cheng, Huan Xiong, Deng-Ping Fan, Yiran Zhong, Mehrtash Harandi, Tom Drummond, Zongyuan Ge

Abstract: We propose a new video camouflaged object detection (VCOD) framework that can exploit both short-term dynamics and long-term temporal consistency to detect camouflaged objects from video frames. An essential property of camouflaged objects is that they usually exhibit patterns similar to the background, which makes them hard to identify from still images. Therefore, effectively handling temporal dynamics in videos becomes the key for the VCOD task, as the camouflaged objects will be noticeable when they move. However, current VCOD methods often leverage homography or optical flow to represent motion, where the detection error may accumulate from both the motion estimation error and the segmentation error. In contrast, our method unifies motion estimation and object segmentation within a single optimization framework. Specifically, we build a dense correlation volume to implicitly capture motion between neighbouring frames and utilize the final segmentation supervision to optimize the implicit motion estimation and segmentation jointly. Furthermore, to enforce temporal consistency within a video sequence, we utilize a spatio-temporal transformer to refine the short-term predictions. Extensive experiments on VCOD benchmarks demonstrate the architectural effectiveness of our approach. We also provide a large-scale VCOD dataset named MoCA-Mask with pixel-level handcrafted ground-truth masks and construct a comprehensive VCOD benchmark with previous methods to facilitate research in this direction. Dataset Link: https://xueliancheng.github.io/SLT-Net-project.

* Accepted to CVPR 2022; Xuelian Cheng and Huan Xiong made equal contributions; Corresponding author: Deng-Ping Fan (dengpfan@gmail.com). Dataset: https://xueliancheng.github.io/SLT-Net-project

cosFormer: Rethinking Softmax in Attention

Feb 17, 2022

Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong

Abstract: Transformer has shown great success in natural language processing, computer vision, and audio processing. As one of its core components, the softmax attention helps to capture long-range dependencies yet prohibits its scale-up due to the quadratic space and time complexity with respect to the sequence length. Kernel methods are often adopted to reduce the complexity by approximating the softmax operator. Nevertheless, due to the approximation errors, their performance varies across tasks/corpora and drops substantially when compared with the vanilla softmax attention. In this paper, we propose a linear transformer called cosFormer that can achieve comparable or better accuracy than the vanilla transformer in both causal and cross attention. cosFormer is based on two key properties of softmax attention: (i) non-negativeness of the attention matrix; (ii) a non-linear re-weighting scheme that can concentrate the distribution of the attention matrix. As its linear substitute, cosFormer fulfills these properties with a linear operator and a cosine-based distance re-weighting mechanism. Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark. The source code is available at https://github.com/OpenNLPLab/cosFormer.

* Accepted to ICLR 2022. Yiran Zhong is the corresponding author. Zhen Qin, Weixuan Sun, and Hui Deng contributed equally to this work.
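
The two properties named in the cosFormer abstract, a non-negative attention matrix and a cosine-based distance re-weighting, can be sketched in plain Python. The sketch assumes ReLU as the non-negative feature map and a weight of cos(π/2 · (i − j) / n) between positions i and j, where n is the sequence length; neither detail is stated in the abstract, so treat both as assumptions and consult the linked repository for the authors' implementation. The point is only that the identity cos(x − y) = cos x cos y + sin x sin y lets the re-weighted attention be computed without forming the n × n matrix.

```python
import math
from typing import List

Matrix = List[List[float]]


def relu(x: Matrix) -> Matrix:
    """Element-wise ReLU (assumed feature map); keeps similarities non-negative."""
    return [[max(0.0, value) for value in row] for row in x]


def quadratic_cos_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference form: builds every pairwise weight explicitly, O(n^2)."""
    n = len(q)
    qp, kp = relu(q), relu(k)
    out = []
    for i in range(n):
        weights = [
            sum(a * b for a, b in zip(qp[i], kp[j]))
            * math.cos(math.pi / 2 * (i - j) / n)
            for j in range(n)
        ]
        total = sum(weights)  # assumes at least one positive weight per row
        out.append([sum(weights[j] * v[j][c] for j in range(n)) / total
                    for c in range(len(v[0]))])
    return out


def linear_cos_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Same output, linear in n: key/value summaries are accumulated once."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    qp, kp = relu(q), relu(k)
    kv_cos = [[0.0] * dv for _ in range(d)]
    kv_sin = [[0.0] * dv for _ in range(d)]
    k_cos, k_sin = [0.0] * d, [0.0] * d
    for j in range(n):
        angle = math.pi / 2 * j / n
        cj, sj = math.cos(angle), math.sin(angle)
        for a in range(d):
            k_cos[a] += cj * kp[j][a]
            k_sin[a] += sj * kp[j][a]
            for c in range(dv):
                kv_cos[a][c] += cj * kp[j][a] * v[j][c]
                kv_sin[a][c] += sj * kp[j][a] * v[j][c]
    out = []
    for i in range(n):
        angle = math.pi / 2 * i / n
        ci, si = math.cos(angle), math.sin(angle)
        den = sum(qp[i][a] * (ci * k_cos[a] + si * k_sin[a]) for a in range(d))
        out.append([
            sum(qp[i][a] * (ci * kv_cos[a][c] + si * kv_sin[a][c])
                for a in range(d)) / den
            for c in range(dv)
        ])
    return out


# Tiny example: 3 tokens, feature dimension 2.
q = [[1.0, -0.5], [0.3, 0.8], [0.2, 0.1]]
k = [[0.5, 0.2], [-0.1, 0.4], [0.7, 0.3]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
slow = quadratic_cos_attention(q, k, v)
fast = linear_cos_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the second avoids the quadratic cost in sequence length. This sketch covers the non-causal case only.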

Transcribing Natural Languages for The Deaf via Neural Editing Programs

Dec 17, 2021

Dongxu Li, Chenchen Xu, Liu Liu, Yiran Zhong, Rong Wang, Lars Petersson, Hongdong Li

Abstract: This work studies the task of glossification, whose aim is to transcribe natural spoken language sentences for the Deaf (hard-of-hearing) community into ordered sign language glosses. Previous sequence-to-sequence language models trained with paired sentence-gloss data often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions. We observe that despite different grammars, glosses effectively simplify sentences for the ease of deaf communication, while sharing a large portion of vocabulary with sentences. This has motivated us to implement glossification by executing a collection of editing actions, e.g. word addition, deletion, and copying, called editing programs, on their natural spoken language counterparts. Specifically, we design a new neural agent that learns to synthesize and execute editing programs, conditioned on sentence contexts and partial editing results. The agent is trained to imitate minimal editing programs, while exploring the program space more widely via policy gradients to optimize sequence-wise transcription quality. Results show that our approach outperforms previous glossification models by a large margin.
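
As a rough illustration of what executing an editing program could look like, the sketch below applies addition, deletion, and copying actions left to right over the tokens of a sentence. The action names, the tuple encoding, and the example sentence are hypothetical and chosen for illustration; the abstract does not give the paper's actual program syntax.

```python
from typing import List, Tuple

# Hypothetical action encoding: ("COPY",), ("DEL",), or ("ADD", token).
Action = Tuple[str, ...]


def execute_program(sentence: List[str], program: List[Action]) -> List[str]:
    """Run the editing actions in order and return the resulting gloss tokens."""
    output: List[str] = []
    pos = 0  # index of the next unread source token
    for action in program:
        op = action[0]
        if op == "COPY":    # keep the current source token
            output.append(sentence[pos])
            pos += 1
        elif op == "DEL":   # skip the current source token
            pos += 1
        elif op == "ADD":   # emit a new token without consuming the source
            output.append(action[1])
        else:
            raise ValueError(f"unknown action: {op}")
    return output


# Illustrative example: drop function words, copy the rest.
sentence = "do you like to watch films".split()
program = [("DEL",), ("COPY",), ("COPY",), ("DEL",), ("COPY",), ("COPY",)]
gloss = execute_program(sentence, program)
```

Since most gloss tokens are copied from the sentence, such a program is typically much shorter to describe than generating the gloss from scratch, which matches the vocabulary-sharing observation in the abstract.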
