Changlin Li

Refer to the report for detailed contributions

Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers

Oct 16, 2022

Tao Tang, Changlin Li, Guangrun Wang, Kaicheng Yu, Xiaojun Chang, Xiaodan Liang

Abstract: Automatic data augmentation (AutoAugment) strategies are indispensable in supervised data-efficient training protocols of vision transformers, and have led to state-of-the-art results in supervised learning. Despite this success, their development and application on self-supervised vision transformers have been hindered by several barriers, including the high search cost, the lack of supervision, and the unsuitable search space. In this work, we propose AutoView, a self-regularized adversarial AutoAugment method, to learn views for self-supervised vision transformers by addressing the above barriers. First, we reduce the search cost of AutoView to nearly zero by learning views and network parameters simultaneously in a single forward-backward step, minimizing and maximizing the mutual information among different augmented views, respectively. Then, to avoid information collapse caused by the lack of label supervision, we propose a self-regularized loss term to guarantee the information propagation. Additionally, we present a curated augmentation policy search space for self-supervised learning, obtained by modifying the commonly used search space designed for supervised learning. On ImageNet, AutoView achieves a remarkable improvement over the RandAug baseline (+10.2% k-NN accuracy), and consistently outperforms the state-of-the-art manually tuned view policy by a clear margin (up to +1.3% k-NN accuracy). Extensive experiments show that AutoView pretraining also benefits downstream tasks (+1.2% mAcc on ADE20K Semantic Segmentation and +2.8% mAP on the revisited Oxford Image Retrieval benchmark) and improves model robustness (+2.3% Top-1 Acc on ImageNet-A and +1.0% AUPR on ImageNet-O). Code and models will be available at https://github.com/Trent-tangtao/AutoView.

DeViT: Deformed Vision Transformers in Video Inpainting

Sep 28, 2022

Jiayin Cai, Changlin Li, Xin Tao, Chun Yuan, Yu-Wing Tai

Abstract: This paper proposes a novel video inpainting method. We make three main contributions. First, we extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH), which improves patch-level feature alignment without additional supervision and benefits challenging scenes with various deformations. Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wise feature matching by pruning out less essential features and using a saliency map. MPPA enhances matching accuracy between warped tokens with invalid pixels. Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens under the guidance of the Deformation Factor learned from DePtH, especially for videos with agile motions. Experimental results demonstrate that our method outperforms recent methods qualitatively and quantitatively and achieves a new state-of-the-art.

* ACMMM'22, October 10-14, 2022, Lisboa, Portugal
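
The abstract above describes attention that has to cope with tokens containing invalid pixels. As a rough illustration only, the sketch below shows plain scaled dot-product attention in which keys flagged as invalid are excluded from the softmax. This is not the paper's MPPA module: the function name `masked_attention`, the boolean `valid` list, and the toy vectors are all assumptions made for this example.

```python
import math

def masked_attention(query, keys, values, valid):
    """Scaled dot-product attention over the keys whose `valid` flag is True.

    Hypothetical helper for illustration; invalid keys get zero weight.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    kept = [i for i, ok in enumerate(valid) if ok]
    if not kept:
        raise ValueError("at least one key must be valid")
    # Softmax restricted to the kept keys (max-subtraction for stability).
    top = max(scores[i] for i in kept)
    weights = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(weights.values())
    dim = len(values[0])
    return [sum(weights[i] / total * values[i][d] for i in kept) for d in range(dim)]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[10.0, 0.0], [0.0, 10.0], [99.0, 99.0]]
# The third key is treated as covering invalid pixels and is ignored.
print(masked_attention(query, keys, values, valid=[True, True, False]))
```

Excluding a key before normalisation, rather than zeroing its value afterwards, keeps the remaining weights summing to one.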

Generalizable Memory-driven Transformer for Multivariate Long Sequence Time-series Forecasting

Jul 16, 2022

Mingjie Li, Xiaoyun Zhao, Rui Liu, Changlin Li, Xiaohan Wang, Xiaojun Chang

Abstract: Multivariate long sequence time-series forecasting (M-LSTF) is a practical but challenging problem. Unlike traditional time-series forecasting tasks, M-LSTF tasks are more challenging in two respects: 1) M-LSTF models need to learn time-series patterns both within and between multiple time features; 2) under the rolling forecasting setting, the similarity between two consecutive training samples increases with the prediction length, which makes models more prone to overfitting. In this paper, we propose a generalizable memory-driven Transformer to target M-LSTF problems. Specifically, we first propose a global-level memory component to drive the forecasting procedure by integrating multiple time-series features. In addition, we train our model in a progressive fashion to increase its generalizability, gradually introducing Bernoulli noise to training samples. Extensive experiments have been performed on five different datasets across multiple fields. Experimental results demonstrate that our approach can be seamlessly plugged into varying Transformer-based models to improve their performance by up to roughly 30%. To the best of our knowledge, this is the first work to focus specifically on M-LSTF tasks.

* Tech report
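
The progressive training idea in the abstract above (gradually introducing Bernoulli noise) can be pictured with a small sketch. The abstract gives neither the schedule nor the exact form of the noise, so the linear ramp, the `max_prob` value, and the choice to zero out masked entries are assumptions made purely for illustration.

```python
import random

def noise_probability(epoch, total_epochs, max_prob=0.3):
    """Assumed schedule: ramp the masking probability linearly from 0 to max_prob."""
    if total_epochs <= 1:
        return max_prob
    return max_prob * epoch / (total_epochs - 1)

def apply_bernoulli_noise(sample, prob, rng):
    """Zero each value independently with probability `prob` (assumed noise form)."""
    return [0.0 if rng.random() < prob else x for x in sample]

rng = random.Random(0)  # fixed seed so runs are repeatable
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
for epoch in range(4):
    p = noise_probability(epoch, total_epochs=4)
    noisy = apply_bernoulli_noise(series, p, rng)
    print(f"epoch {epoch}: p={p:.2f} -> {noisy}")
```

Early epochs see nearly clean samples, and later epochs see progressively more corrupted ones, which is one plausible reading of "progressive" here.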

Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism

May 06, 2022

Binbin Yang, Xinchi Deng, Han Shi, Changlin Li, Gengwei Zhang, Hang Xu, Shen Zhao, Liang Lin, Xiaodan Liang

Abstract: Continual learning is a challenging real-world problem for constructing a mature AI system when data are provided in a streaming fashion. Despite recent progress in continual classification, research on continual object detection is impeded by the diverse sizes and numbers of objects in each image. Unlike previous works that tune the whole network for all tasks, in this work we present a simple and flexible framework for continual object detection via pRotOtypical taSk corrElaTion guided gaTing mechAnism (ROSETTA). Concretely, a unified framework is shared by all tasks while task-aware gates are introduced to automatically select sub-models for specific tasks. In this way, various knowledge can be successively memorized by storing the corresponding sub-model weights in this system. To make ROSETTA automatically determine which experience is available and useful, a prototypical task correlation guided Gating Diversity Controller (GDC) is introduced to adaptively adjust the diversity of gates for the new task based on class-specific prototypes. The GDC module computes a class-to-class correlation matrix to depict the cross-task correlation, and thereby activates more exclusive gates for the new task if a significant domain gap is observed. Comprehensive experiments on COCO-VOC, KITTI-Kitchen, class-incremental detection on VOC, and sequential learning of four tasks show that ROSETTA yields state-of-the-art performance on both task-based and class-based continual object detection.
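
To make the class-to-class correlation matrix in the abstract above more concrete, here is a minimal sketch that compares class prototypes of an old and a new task. The abstract does not say how the correlation is measured, so cosine similarity, the `domain_gap` score, and the toy prototype vectors are assumptions for illustration, not ROSETTA's actual GDC module.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def correlation_matrix(old_prototypes, new_prototypes):
    """Rows index new-task classes, columns index old-task classes."""
    return [[cosine(new, old) for old in old_prototypes] for new in new_prototypes]

def domain_gap(corr):
    """Hypothetical gap score: 1 minus the mean best match of each new class."""
    best = [max(row) for row in corr]
    return 1.0 - sum(best) / len(best)

old = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
new = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
corr = correlation_matrix(old, new)
print(corr)
print("gap:", domain_gap(corr))
```

Under this reading, a larger gap would argue for giving the new task more exclusive gates, while a small gap would favour reusing gates learned for earlier tasks.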

Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling

Apr 14, 2022

Takashi Isobe, Xu Jia, Xin Tao, Changlin Li, Ruihuang Li, Yongjie Shi, Jing Mu, Huchuan Lu, Yu-Wing Tai

Abstract: Temporal modeling is crucial for video super-resolution. Most video super-resolution methods adopt optical flow or deformable convolution for explicit motion compensation. However, such temporal modeling techniques increase the model complexity and might fail in cases of occlusion or complex motion, resulting in serious distortion and artifacts. In this paper, we propose to explore the role of explicit temporal difference modeling in both LR and HR space. Instead of directly feeding consecutive frames into a VSR model, we propose to compute the temporal difference between frames and divide the pixels into two subsets according to the level of difference. The two subsets are processed separately by two branches with different receptive fields in order to better extract complementary information. To further enhance the super-resolution result, not only are spatial residual features extracted, but the difference between consecutive frames in the high-frequency domain is also computed. This allows the model to exploit intermediate SR results from both the future and the past to refine the current SR output. The differences at different time steps can be cached so that information from more distant time steps can be propagated to the current frame for refinement. Experiments on several video super-resolution benchmark datasets demonstrate the effectiveness of the proposed method and its favorable performance against state-of-the-art methods.

* CVPR 2022
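
The first step described in the abstract above (compute the temporal difference between frames, then divide the pixels into two subsets by the level of difference) can be sketched as follows. The absolute difference, the single fixed `threshold`, and the tiny grayscale frames are assumptions for illustration; the paper may define the split differently.

```python
def temporal_difference(prev_frame, next_frame):
    """Per-pixel absolute difference between two equally sized grayscale frames."""
    return [
        [abs(b - a) for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]

def split_by_difference(diff, threshold):
    """Return boolean masks for low-difference and high-difference pixels."""
    low = [[d <= threshold for d in row] for row in diff]
    high = [[d > threshold for d in row] for row in diff]
    return low, high

frame_t0 = [[10, 10, 10], [10, 50, 10]]
frame_t1 = [[10, 12, 10], [10, 90, 11]]
diff = temporal_difference(frame_t0, frame_t1)
low_mask, high_mask = split_by_difference(diff, threshold=5)
print(diff)
print(low_mask)
print(high_mask)
```

The two masks are complementary, so each pixel would be routed to exactly one of the two branches mentioned in the abstract.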

Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search

Apr 12, 2022

Minbin Huang, Zhijian Huang, Changlin Li, Xin Chen, Hang Xu, Zhenguo Li, Xiaodan Liang

Abstract: Neural Architecture Search (NAS) aims to find efficient models for multiple tasks. Beyond seeking solutions for a single task, there is surging interest in transferring network design knowledge across multiple tasks. In this line of research, effectively modeling task correlations is vital yet highly neglected. Therefore, we propose Arch-Graph, a transferable NAS method that predicts task-specific optimal architectures with respect to given task embeddings. It leverages correlations across multiple tasks by using their embeddings as a part of the predictor's input for fast adaptation. We also formulate NAS as an architecture relation graph prediction problem, with the relational graph constructed by treating candidate architectures as nodes and their pairwise relations as edges. To enforce basic properties such as acyclicity in the relational graph, we add constraints to the optimization process, converting NAS into the problem of finding a Maximal Weighted Acyclic Subgraph (MWAS). Our algorithm then strives to eliminate cycles and only establishes edges in the graph if the rank results can be trusted. Through MWAS, Arch-Graph can effectively rank candidate models for each task with only a small budget to finetune the predictor. With extensive experiments on TransNAS-Bench-101, we show Arch-Graph's transferability and high sample efficiency across numerous tasks, beating many NAS methods designed for both single-task and multi-task search. It is able to find top 0.16% and 0.29% architectures on average on two search spaces under a budget of only 50 models.

* Accepted by CVPR 2022
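
The acyclicity constraint in the abstract above can be illustrated with a simple greedy heuristic: consider predicted "A beats B" relations from most to least confident, and keep a relation only if it does not close a cycle. This is a sketch of the general idea, not the paper's MWAS procedure, and it is not guaranteed to find the maximum-weight acyclic subgraph. The edge format `(better, worse, confidence)` and the architecture names are assumptions.

```python
def reachable(adjacency, source, target):
    """Depth-first search: is `target` reachable from `source`?"""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency.get(node, ()))
    return False

def greedy_acyclic_subgraph(edges):
    """Keep edges in descending confidence order, skipping any that closes a cycle."""
    adjacency, kept = {}, []
    for better, worse, confidence in sorted(edges, key=lambda e: -e[2]):
        # Adding better -> worse closes a cycle if worse already reaches better.
        if better == worse or reachable(adjacency, worse, better):
            continue
        adjacency.setdefault(better, []).append(worse)
        kept.append((better, worse, confidence))
    return kept

# Hypothetical pairwise predictions; the last one contradicts the first two.
predictions = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "A", 0.3)]
print(greedy_acyclic_subgraph(predictions))
```

Because the kept edges form a directed acyclic graph, a topological sort of them yields a consistent ranking of the candidate architectures.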

Beyond Fixation: Dynamic Window Visual Transformer

Apr 08, 2022

Pengzhen Ren, Changlin Li, Guangrun Wang, Yun Xiao, Qing Du, Xiaodan Liang, Xiaojun Chang

Abstract: Recently, there has been a surge of interest in reducing the computational cost of visual transformers by limiting the calculation of self-attention to a local window. Most current work uses a fixed single-scale window for modeling by default, ignoring the impact of window size on model performance. However, this may limit the modeling potential of these window-based models for multi-scale information. In this paper, we propose a novel method, named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT goes beyond models that employ a fixed single window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention. Then, the information is dynamically fused by assigning different weights to the multi-scale window branches. We conducted a detailed performance evaluation on three datasets: ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformers, DW-ViT achieves consistent and substantial improvements on all three datasets with similar parameters and computational costs. In addition, DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformer.

* CVPR2022
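
The two steps named in the abstract above (process tokens in windows of several sizes, then fuse the branches with weights) can be pictured with a deliberately simplified one-dimensional sketch. Uniform averaging inside each window stands in for window self-attention, and the branch weights come from fixed logits rather than being predicted from the input, so this is an illustration under those assumptions and not DW-ViT itself.

```python
import math

def window_average(tokens, window):
    """Stand-in for window attention: replace each token by its window's mean."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        mean = sum(chunk) / len(chunk)
        out.extend([mean] * len(chunk))
    return out

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_multi_scale(tokens, windows, branch_logits):
    """Weighted sum of one branch per window size (weights = softmax of logits)."""
    weights = softmax(branch_logits)
    branches = [window_average(tokens, w) for w in windows]
    return [
        sum(wt * branch[i] for wt, branch in zip(weights, branches))
        for i in range(len(tokens))
    ]

tokens = [1.0, 3.0, 5.0, 7.0]
print(fuse_multi_scale(tokens, windows=(2, 4), branch_logits=(0.0, 0.0)))
```

Raising one branch's logit shifts the fused output toward that window size, which is the knob a dynamic fusion module would adjust.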

Automated Progressive Learning for Efficient Training of Vision Transformers

Mar 28, 2022

Changlin Li, Bohan Zhuang, Guangrun Wang, Xiaodan Liang, Xiaojun Chang, Yi Yang

Abstract: Recent advances in vision Transformers (ViTs) have come with a voracious appetite for computing power, highlighting the urgent need to develop efficient training methods for ViTs. Progressive learning, a training scheme where the model capacity grows progressively during training, has started to show its ability in efficient training. In this paper, we take a practical step towards efficient training of ViTs by customizing and automating progressive learning. First, we develop a strong manual baseline for progressive learning of ViTs by introducing momentum growth (MoGrow) to bridge the gap brought by model growth. Then, we propose automated progressive learning (AutoProg), an efficient training scheme that aims to achieve lossless acceleration by automatically increasing the training overload on-the-fly; this is achieved by adaptively deciding whether, where, and how much the model should grow during progressive learning. Specifically, we first relax the optimization of the growth schedule to a sub-network architecture optimization problem, then propose one-shot estimation of the sub-network performance via an elastic supernet. The search overhead is minimized by recycling the parameters of the supernet. Extensive experiments on efficient training on ImageNet with two representative ViT models, DeiT and VOLO, demonstrate that AutoProg can accelerate ViT training by up to 85.1% with no performance drop. Code: https://github.com/changlin31/AutoProg

* Accepted to CVPR 2022
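
Progressive learning as described in the abstract above grows model capacity during training. The sketch below shows a fixed, hand-written growth schedule of the kind a manual baseline might use, plus a crude cost estimate. The stage-wise linear growth in depth and the assumption that per-epoch cost is proportional to the number of active layers are illustrative simplifications; AutoProg's learned schedule is not reproduced here.

```python
def growth_schedule(total_epochs, num_stages, min_layers, max_layers):
    """Number of active layers per epoch, growing linearly from stage to stage."""
    schedule = []
    for epoch in range(total_epochs):
        stage = epoch * num_stages // total_epochs  # 0 .. num_stages - 1
        if num_stages == 1:
            layers = max_layers
        else:
            layers = min_layers + (max_layers - min_layers) * stage // (num_stages - 1)
        schedule.append(layers)
    return schedule

def relative_cost(schedule, max_layers):
    """Training cost relative to using max_layers in every epoch (assumed linear in depth)."""
    return sum(schedule) / (max_layers * len(schedule))

schedule = growth_schedule(total_epochs=8, num_stages=4, min_layers=6, max_layers=12)
print(schedule)
print(f"relative cost: {relative_cost(schedule, 12):.2f}")
```

The final stage always trains the full-depth model, so any savings come from the earlier, shallower stages.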

Dynamic Slimmable Denoising Network

Oct 17, 2021

Zutao Jiang, Changlin Li, Xiaojun Chang, Jihua Zhu, Yi Yang

Abstract: Recently, numerous human-designed and automatically searched neural networks have been applied to image denoising. However, previous works handle all noisy images with a pre-defined static network architecture, which inevitably leads to high computational complexity for good denoising quality. Here, we present the dynamic slimmable denoising network (DDS-Net), a general method to achieve good denoising quality with less computational complexity by dynamically adjusting the channel configurations of networks at test time with respect to different noisy images. Our DDS-Net is empowered with the ability of dynamic inference by a dynamic gate, which can predictively adjust the channel configuration of networks with negligible extra computation cost. To ensure the performance of each candidate sub-network and the fairness of the dynamic gate, we propose a three-stage optimization scheme. In the first stage, we train a weight-shared slimmable super network. In the second stage, we evaluate the trained slimmable super network in an iterative way and progressively tailor the channel numbers of each layer with minimal denoising quality drop. In a single pass, we can obtain several sub-networks with good performance under different channel configurations. In the last stage, we identify easy and hard samples in an online way and train a dynamic gate to predictively select the corresponding sub-network with respect to different noisy images. Extensive experiments demonstrate that our DDS-Net consistently outperforms the state-of-the-art individually trained static denoising networks.

* 11 pages

DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers

Sep 21, 2021

Changlin Li, Guangrun Wang, Bing Wang, Xiaodan Liang, Zhihui Li, Xiaojun Chang

Abstract: Dynamic networks have shown promising capability in reducing theoretical computation complexity by adapting their architectures to the input during inference. However, their practical runtime usually lags behind the theoretical acceleration due to inefficient sparsity. Here, we explore a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels, while keeping parameters stored statically and contiguously in hardware to prevent the extra burden of sparse computation. Based on this scheme, we present the dynamic slimmable network (DS-Net) and the dynamic slice-able network (DS-Net++), which input-dependently adjust filter numbers of CNNs and multiple dimensions in both CNNs and transformers, respectively. To ensure sub-network generality and routing fairness, we propose a disentangled two-stage optimization scheme with training techniques such as in-place bootstrapping (IB), multi-view consistency (MvCo), and sandwich gate sparsification (SGS) to train the supernet and the gate separately. Extensive experiments on 4 datasets and 3 different network architectures demonstrate that our method consistently outperforms state-of-the-art static and dynamic model compression methods by a large margin (up to 6.6%). Typically, DS-Net++ achieves 2-4x computation reduction and 1.62x real-world acceleration over MobileNet, ResNet-50, and Vision Transformer, with minimal accuracy drops (0.1-0.3%) on ImageNet. Code release: https://github.com/changlin31/DS-Net

* Extension of the CVPR 2021 oral paper (https://openaccess.thecvf.com/content/CVPR2021/html/Li_Dynamic_Slimmable_Network_CVPR_2021_paper.html)
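
Dynamic weight slicing, as described in the abstract above, keeps the full weights stored contiguously and uses only a leading slice of them for easier inputs. The sketch below shows the idea on a single linear layer. The `gate` rule that maps a difficulty score to a slicing ratio, the candidate ratios, and the toy weights are assumptions for illustration; the repository linked above is the place to look for the authors' implementation.

```python
import math

def slice_weights(weight, ratio):
    """Keep a contiguous prefix of the output rows; no sparse indexing is needed."""
    kept_rows = max(1, math.ceil(ratio * len(weight)))
    return weight[:kept_rows]

def gate(difficulty, ratios=(0.25, 0.5, 1.0)):
    """Hypothetical gate: map a difficulty score in [0, 1] to a slicing ratio."""
    index = min(int(difficulty * len(ratios)), len(ratios) - 1)
    return ratios[index]

def linear(weight, x):
    """Matrix-vector product with the (possibly sliced) weight rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

full_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
x = [3.0, 4.0]
for difficulty in (0.1, 0.5, 0.9):
    ratio = gate(difficulty)
    output = linear(slice_weights(full_weight, ratio), x)
    print(f"difficulty={difficulty}: ratio={ratio}, outputs={output}")
```

Because every sub-network is a prefix of the same stored matrix, switching between widths costs no extra memory and avoids gather operations.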
