Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dacheng Tao

and Other Contributors

A Survey on Transformer Compression

Feb 05, 2024

Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

Figure 1 for A Survey on Transformer Compression

Figure 2 for A Survey on Transformer Compression

Figure 3 for A Survey on Transformer Compression

Figure 4 for A Survey on Transformer Compression

Abstract:Large models based on the Transformer architecture play increasingly vital roles in artificial intelligence, particularly within the realms of natural language processing (NLP) and computer vision (CV). Model compression methods reduce their memory and computational cost, which is a necessary step to implement the transformer models on practical devices. Given the unique architecture of transformer, featuring alternative attention and Feedforward Neural Network (FFN) modules, specific compression techniques are required. The efficiency of these compression methods is also paramount, as it is usually impractical to retrain large models on the entire training dataset.This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to transformer models. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design. In each category, we discuss compression methods for both CV and NLP tasks, highlighting common underlying principles. At last, we delve into the relation between various compression methods, and discuss the further directions in this domain.

Via

Access Paper or Ask Questions

Fine-Grained Zero-Shot Learning: Advances, Challenges, and Prospects

Feb 04, 2024

Jingcai Guo, Zhijie Rao, Zhi Chen, Jingren Zhou, Dacheng Tao

Figure 1 for Fine-Grained Zero-Shot Learning: Advances, Challenges, and Prospects

Figure 2 for Fine-Grained Zero-Shot Learning: Advances, Challenges, and Prospects

Figure 3 for Fine-Grained Zero-Shot Learning: Advances, Challenges, and Prospects

Figure 4 for Fine-Grained Zero-Shot Learning: Advances, Challenges, and Prospects

Abstract:Recent zero-shot learning (ZSL) approaches have integrated fine-grained analysis, i.e., fine-grained ZSL, to mitigate the commonly known seen/unseen domain bias and misaligned visual-semantics mapping problems, and have made profound progress. Notably, this paradigm differs from existing close-set fine-grained methods and, therefore, can pose unique and nontrivial challenges. However, to the best of our knowledge, there remains a lack of systematic summaries of this topic. To enrich the literature of this domain and provide a sound basis for its future development, in this paper, we present a broad review of recent advances for fine-grained analysis in ZSL. Concretely, we first provide a taxonomy of existing methods and techniques with a thorough analysis of each category. Then, we summarize the benchmark, covering publicly available datasets, models, implementations, and some more details as a library. Last, we sketch out some related applications. In addition, we discuss vital challenges and suggest potential future directions.

* 9 pages, 1 figure, 4 tables

Via

Access Paper or Ask Questions

FreDF: Learning to Forecast in Frequency Domain

Feb 04, 2024

Hao Wang, Licheng Pan, Zhichao Chen, Degui Yang, Sen Zhang, Yifei Yang, Xinggao Liu, Haoxuan Li, Dacheng Tao

Abstract:Time series modeling is uniquely challenged by the presence of autocorrelation in both historical and label sequences. Current research predominantly focuses on handling autocorrelation within the historical sequence but often neglects its presence in the label sequence. Specifically, emerging forecast models mainly conform to the direct forecast (DF) paradigm, generating multi-step forecasts under the assumption of conditional independence within the label sequence. This assumption disregards the inherent autocorrelation in the label sequence, thereby limiting the performance of DF-based models. In response to this gap, we introduce the Frequency-enhanced Direct Forecast (FreDF), which bypasses the complexity of label autocorrelation by learning to forecast in the frequency domain. Our experiments demonstrate that FreDF substantially outperforms existing state-of-the-art methods including iTransformer and is compatible with a variety of forecast models.

Via

Access Paper or Ask Questions

Merging Multi-Task Models via Weight-Ensembling Mixture of Experts

Feb 01, 2024

Anke Tang, Li Shen, Yong Luo, Nan Yin, Lefei Zhang, Dacheng Tao

Abstract:Merging various task-specific Transformer-based models trained on different tasks into a single unified model can execute all the tasks concurrently. Previous methods, exemplified by task arithmetic, have been proven to be both effective and scalable. Existing methods have primarily focused on seeking a static optimal solution within the original model parameter space. A notable challenge is mitigating the interference between parameters of different models, which can substantially deteriorate performance. In this paper, we propose to merge most of the parameters while upscaling the MLP of the Transformer layers to a weight-ensembling mixture of experts (MoE) module, which can dynamically integrate shared and task-specific knowledge based on the input, thereby providing a more flexible solution that can adapt to the specific needs of each instance. Our key insight is that by identifying and separating shared knowledge and task-specific knowledge, and then dynamically integrating them, we can mitigate the parameter interference problem to a great extent. We conduct the conventional multi-task model merging experiments and evaluate the generalization and robustness of our method. The results demonstrate the effectiveness of our method and provide a comprehensive understanding of our method. The code is available at https://anonymous.4open.science/r/weight-ensembling_MoE-67C9/

Via

Access Paper or Ask Questions

Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Jan 31, 2024

Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao

Figure 1 for Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Figure 2 for Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Figure 3 for Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Figure 4 for Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Abstract:The Segment Anything Model (SAM), a profound vision foundation model pre-trained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in text segmentation across four hierarchies, including stroke, word, text-line, and paragraph, while realizing layout analysis as well. Specifically, we first turn SAM into a high-quality text stroke segmentation (TSS) model through a parameter-efficient fine-tuning approach. We use this TSS model to iteratively generate the text stroke labels in a semi-automatical manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we launch the end-to-end trainable Hi-SAM based on the TSS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both automatic mask generation (AMG) mode and promptable segmentation mode. In terms of the AMG mode, Hi-SAM segments text stroke foreground masks initially, then samples foreground points for hierarchical text mask generation and achieves layout analysis in passing. As for the promptable mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TSS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for text stroke segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, requiring 20x fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.

* GitHub repository: https://github.com/ymy-k/Hi-SAM

Via

Access Paper or Ask Questions

Topology-aware Embedding Memory for Learning on Expanding Graphs

Jan 24, 2024

Xikun Zhang, Dongjin Song, Yixin Chen, Dacheng Tao

Figure 1 for Topology-aware Embedding Memory for Learning on Expanding Graphs

Figure 2 for Topology-aware Embedding Memory for Learning on Expanding Graphs

Figure 3 for Topology-aware Embedding Memory for Learning on Expanding Graphs

Figure 4 for Topology-aware Embedding Memory for Learning on Expanding Graphs

Abstract:Memory replay based techniques have shown great success for continual learning with incrementally accumulated Euclidean data. Directly applying them to continually expanding graphs, however, leads to the potential memory explosion problem due to the need to buffer representative nodes and their associated topological neighborhood structures. To this end, we systematically analyze the key challenges in the memory explosion problem, and present a general framework, i.e., Parameter Decoupled Graph Neural Networks (PDGNNs) with Topology-aware Embedding Memory (TEM), to tackle this issue. The proposed framework not only reduces the memory space complexity from $\mathcal{O}(nd^L)$ to $\mathcal{O}(n)$~\footnote{$n$: memory budget, $d$: average node degree, $L$: the radius of the GNN receptive field}, but also fully utilizes the topological information for memory replay. Specifically, PDGNNs decouple trainable parameters from the computation ego-subgraph via \textit{Topology-aware Embeddings} (TEs), which compress ego-subgraphs into compact vectors (i.e., TEs) to reduce the memory consumption. Based on this framework, we discover a unique \textit{pseudo-training effect} in continual learning on expanding graphs and this effect motivates us to develop a novel \textit{coverage maximization sampling} strategy that can enhance the performance with a tight memory budget. Thorough empirical studies demonstrate that, by tackling the memory explosion problem and incorporating topological information into memory replay, PDGNNs with TEM significantly outperform state-of-the-art techniques, especially in the challenging class-incremental setting.

Via

Access Paper or Ask Questions

TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation

Jan 23, 2024

Xin Lin, Chong Shi, Yibing Zhan, Zuopeng Yang, Yaqi Wu, Dacheng Tao

Figure 1 for TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation

Figure 2 for TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation

Figure 3 for TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation

Figure 4 for TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation

Abstract:Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) Contextual noise, as some frames might contain occluded and blurred objects. 2) Label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones. Additionally, the distribution of relationships exhibits a long-tailed pattern. To address the above problems, in this paper, we introduce a network named TD$^2$-Net that aims at denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representation with robust contextual information. This is achieved by designing a differentiable Top-K object selector that utilizes the gumbel-softmax sampling strategy to select the relevant neighborhood for each object. Second, we introduce an asymmetrical reweighting loss to relieve the issue of label bias. This loss function integrates asymmetry focusing factors and the volume of samples to adjust the weights assigned to individual samples. Systematic experimental results demonstrate the superiority of our proposed TD$^2$-Net over existing state-of-the-art approaches on Action Genome databases. In more detail, TD$^2$-Net outperforms the second-best competitors by 12.7 \% on mean-Recall@10 for predicate classification.

* Accepted by AAAI 2024

Via

Access Paper or Ask Questions

Revisiting Demonstration Selection Strategies in In-Context Learning

Jan 22, 2024

Keqin Peng, Liang Ding, Yancheng Yuan, Xuebo Liu, Min Zhang, Yuanxin Ouyang, Dacheng Tao

Figure 1 for Revisiting Demonstration Selection Strategies in In-Context Learning

Figure 2 for Revisiting Demonstration Selection Strategies in In-Context Learning

Figure 3 for Revisiting Demonstration Selection Strategies in In-Context Learning

Figure 4 for Revisiting Demonstration Selection Strategies in In-Context Learning

Abstract:Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL), where a few examples are used to describe a task to the model. However, the performance of ICL varies significantly with the choice of demonstrations, and it is still unclear why this happens or what factors will influence its choice. In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent. We further proposed a data- and model-dependent demonstration selection method, \textbf{TopK + ConE}, based on the assumption that \textit{the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples}, resulting in a simple and effective recipe for ICL. Empirically, our method yields consistent improvements in both language understanding and generation tasks with different model scales. Further analyses confirm that, besides the generality and stability under different circumstances, our method provides a unified explanation for the effectiveness of previous methods. Code will be released.

Via

Access Paper or Ask Questions

Solving Continual Offline Reinforcement Learning with Decision Transformer

Jan 16, 2024

Kaixin Huang, Li Shen, Chen Zhao, Chun Yuan, Dacheng Tao

Figure 1 for Solving Continual Offline Reinforcement Learning with Decision Transformer

Figure 2 for Solving Continual Offline Reinforcement Learning with Decision Transformer

Figure 3 for Solving Continual Offline Reinforcement Learning with Decision Transformer

Figure 4 for Solving Continual Offline Reinforcement Learning with Decision Transformer

Abstract:Continuous offline reinforcement learning (CORL) combines continuous and offline reinforcement learning, enabling agents to learn multiple tasks from static datasets without forgetting prior tasks. However, CORL faces challenges in balancing stability and plasticity. Existing methods, employing Actor-Critic structures and experience replay (ER), suffer from distribution shifts, low efficiency, and weak knowledge-sharing. We aim to investigate whether Decision Transformer (DT), another offline RL paradigm, can serve as a more suitable offline continuous learner to address these issues. We first compare AC-based offline algorithms with DT in the CORL framework. DT offers advantages in learning efficiency, distribution shift mitigation, and zero-shot generalization but exacerbates the forgetting problem during supervised parameter updates. We introduce multi-head DT (MH-DT) and low-rank adaptation DT (LoRA-DT) to mitigate DT's forgetting problem. MH-DT stores task-specific knowledge using multiple heads, facilitating knowledge sharing with common components. It employs distillation and selective rehearsal to enhance current task learning when a replay buffer is available. In buffer-unavailable scenarios, LoRA-DT merges less influential weights and fine-tunes DT's decisive MLP layer to adapt to the current task. Extensive experiments on MoJuCo and Meta-World benchmarks demonstrate that our methods outperform SOTA CORL baselines and showcase enhanced learning capabilities and superior memory efficiency.

* 11 pages, 6 figures

Via

Access Paper or Ask Questions

GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching

Jan 13, 2024

Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, Dacheng Tao

Abstract:Beyond the text detection and recognition tasks in image text spotting, video text spotting presents an augmented challenge with the inclusion of tracking. While advanced end-to-end trainable methods have shown commendable performance, the pursuit of multi-task optimization may pose the risk of producing sub-optimal outcomes for individual tasks. In this paper, we highlight a main bottleneck in the state-of-the-art video text spotter: the limited recognition capability. In response to this issue, we propose to efficiently turn an off-the-shelf query-based image text spotter into a specialist on video and present a simple baseline termed GoMatching, which focuses the training efforts on tracking while maintaining strong recognition performance. To adapt the image text spotter to video datasets, we add a rescoring head to rescore each detected instance's confidence via efficient tuning, leading to a better tracking candidate pool. Additionally, we design a long-short term matching module, termed LST-Matcher, to enhance the spotter's tracking capability by integrating both long- and short-term matching results via Transformer. Based on the above simple designs, GoMatching achieves impressive performance on two public benchmarks, e.g., setting a new record on the ICDAR15-video dataset, and one novel test set with arbitrary-shaped text, while saving considerable training budgets. The code will be released at https://github.com/Hxyz-123/GoMatching.

Via

Access Paper or Ask Questions