Aston Zhang

Automated Few-shot Classification with Instruction-Finetuned Language Models

May 21, 2023

Rami Aly, Xingjian Shi, Kaixiang Lin, Aston Zhang, Andrew Gordon Wilson

Abstract: A particularly successful class of approaches for few-shot learning combines language models with prompts: hand-crafted task descriptions that complement data samples. However, designing prompts by hand for each task commonly requires domain knowledge and substantial guesswork. We observe, in the context of classification tasks, that instruction-finetuned language models exhibit remarkable prompt robustness, and we subsequently propose a simple method to eliminate the need for hand-crafted prompts, named AuT-Few. This approach consists of (i) a prompt retrieval module that selects suitable task instructions from the instruction-tuning knowledge base, and (ii) the generation of two distinct, semantically meaningful class descriptions and a selection mechanism via cross-validation. Over 12 datasets, spanning 8 classification tasks, we show that AuT-Few outperforms current state-of-the-art few-shot learning methods. Moreover, AuT-Few is the best-ranking method across datasets on the RAFT few-shot benchmark. Notably, these results are achieved without task-specific hand-crafted prompts on unseen tasks.
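
The selection step in (ii) can be pictured with a small sketch. This is not the AuT-Few implementation: the function names, the toy scorer `keyword_score`, and the sample data are all hypothetical, and the sketch only illustrates the general idea of picking between two candidate sets of class descriptions by k-fold cross-validation on the few labeled samples.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def select_by_cross_validation(candidates, samples, labels, fit_and_score, k=2):
    """Return the candidate with the highest mean validation score, plus all means."""
    folds = k_fold_indices(len(samples), k)
    mean_scores = {}
    for name, description in candidates.items():
        scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train = [(samples[i], labels[i]) for i in range(len(samples)) if i not in held_out]
            val = [(samples[i], labels[i]) for i in val_idx]
            scores.append(fit_and_score(description, train, val))
        mean_scores[name] = mean(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Toy stand-in scorer (hypothetical): it ignores the training split and measures
# how often the class description of the true label appears in the text.
def keyword_score(description, train, val):
    hits = sum(1 for text, label in val if description[label] in text)
    return hits / len(val)

samples = ["good movie", "bad plot", "good cast", "bad acting"]
labels = [1, 0, 1, 0]
candidates = {
    "original_labels": {0: "negative", 1: "positive"},
    "generated_descriptions": {0: "bad", 1: "good"},
}
best, mean_scores = select_by_cross_validation(candidates, samples, labels, keyword_score, k=2)
print(best)  # generated_descriptions
```

In a real setting `fit_and_score` would fine-tune and evaluate a language model on each fold; the toy scorer only keeps the example runnable.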

Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens

May 07, 2023

Zhanpeng Zeng, Cole Hawkins, Mingyi Hong, Aston Zhang, Nikolaos Pappas, Vikas Singh, Shuai Zheng

Abstract: Transformer models are foundational to natural language processing (NLP) and computer vision. Despite various recent works devoted to reducing the quadratic cost of such models (as a function of the sequence length $n$), dealing with ultra-long sequences efficiently (e.g., with more than 16K tokens) remains challenging. Applications such as answering questions based on an entire book or summarizing a scientific article are inefficient or infeasible. In this paper, we propose to significantly reduce the dependency of a Transformer model's complexity on $n$, by compressing the input into a representation whose size $r$ is independent of $n$ at each layer. Specifically, by exploiting the fact that in many tasks, only a small subset of special tokens (which we call VIP-tokens) are most relevant to the final prediction, we propose a VIP-token centric compression (Vcc) scheme which selectively compresses the input sequence based on the impact of its tokens on approximating the representation of these VIP-tokens. Compared with competitive baselines, the proposed algorithm is not only efficient (achieving more than $3\times$ efficiency improvement compared to baselines on 4K and 16K lengths), but also achieves competitive or better performance on a large number of tasks. Further, we show that our algorithm can be scaled to 128K tokens (or more) while consistently offering accuracy improvement.

* 10 pages main text, 11 pages appendix, preprint
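
To make the quadratic cost mentioned in the abstract concrete, the short calculation below counts the pairwise attention scores that standard self-attention computes per head and per layer. It says nothing about the cost of Vcc itself, and reading "4K", "16K" and "128K" as 4,096, 16,384 and 131,072 tokens is an assumption.

```python
def attention_score_count(n):
    """Number of query-key scores in full self-attention over n tokens."""
    return n * n

for n in (4096, 16384, 131072):
    print(n, attention_score_count(n))
# 4096 16777216
# 16384 268435456
# 131072 17179869184

# Going from 16K to 128K tokens multiplies the length by 8 and the score count by 64.
ratio = attention_score_count(131072) // attention_score_count(16384)
print(ratio)  # 64
```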

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Apr 10, 2023

Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun

Abstract: This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).

* Code is available at https://github.com/amazon-science/prompt-pretraining

A Cheaper and Better Diffusion Language Model with Soft-Masked Noise

Apr 10, 2023

Jiaao Chen, Aston Zhang, Mu Li, Alex Smola, Diyi Yang

Abstract: Diffusion models that are based on iterative denoising have been recently proposed and leveraged in various generation tasks like image generation. However, because they are inherently built for continuous data, existing diffusion models still have some limitations in modeling discrete data, e.g., languages. For example, the generally used Gaussian noise cannot handle discrete corruption well, and the objectives in continuous spaces fail to be stable for textual data in the diffusion process, especially when the dimension is high. To alleviate these issues, we introduce a novel diffusion model for language modeling, Masked-Diffuse LM, with lower training cost and better performance, inspired by linguistic features in languages. Specifically, we design a linguistically informed forward process which adds corruptions to the text through strategic soft-masking to better noise the textual data. Also, we directly predict the categorical distribution with a cross-entropy loss function in every diffusion step to connect the continuous space and discrete space in a more efficient and straightforward way. Through experiments on 5 controlled generation tasks, we demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.

* Code is available at https://github.com/amazon-science/masked-diffusion-lm

Multimodal Chain-of-Thought Reasoning in Language Models

Feb 17, 2023

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17% -> 91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available at https://github.com/amazon-science/mm-cot.
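
The two-stage separation described in the abstract can be sketched as a simple pipeline. This is an illustration only: `two_stage_answer`, `toy_rationale` and `toy_answer` are hypothetical stand-ins, not functions from the mm-cot repository, and real stages would be fine-tuned models that consume text and image features.

```python
def two_stage_answer(question, image_features, generate_rationale, infer_answer):
    """Stage 1 generates a rationale; stage 2 infers the answer using that rationale."""
    rationale = generate_rationale(question, image_features)
    answer = infer_answer(question, rationale, image_features)
    return rationale, answer

# Hypothetical stand-ins for the two stages.
def toy_rationale(question, image_features):
    return f"The image shows {image_features['object']}."

def toy_answer(question, rationale, image_features):
    return "yes" if "magnet" in rationale else "no"

rationale, answer = two_stage_answer(
    "Will these objects attract?", {"object": "a magnet"}, toy_rationale, toy_answer
)
print(rationale)  # The image shows a magnet.
print(answer)     # yes

# The accuracies quoted in the abstract differ by about 16.5 percentage points.
gain = round(91.68 - 75.17, 2)
```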

Is ChatGPT a General-Purpose Natural Language Processing Task Solver?

Feb 15, 2023

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, Diyi Yang

Abstract: Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot, i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community due to the fact that it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning) while it still faces challenges when solving specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.

AIM: Adapting Image Models for Efficient Video Action Recognition

Feb 06, 2023

Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, Mu Li

Abstract: Recent vision transformer based video models mostly follow the "image pre-training then finetuning" paradigm and have achieved great success on multiple video benchmarks. However, fully finetuning such a video model could be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM can achieve competitive or even better performance than prior art with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks to its simplicity, our method is also generally applicable to different image pre-trained models, which has the potential to leverage more powerful image foundation models in the future. The project webpage is https://adapt-image-models.github.io/.

* Accepted to ICLR 2023. Project webpage is at https://adapt-image-models.github.io/
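
For readers unfamiliar with adapters, the sketch below shows a generic bottleneck adapter with a residual connection, written in plain Python. It is not the AIM code: the ReLU nonlinearity, the shapes, and the weights are assumptions chosen for illustration. The one property it demonstrates is that with a zero-initialized up-projection the adapter leaves the frozen model's features unchanged.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def adapter(x, w_down, w_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, add residual."""
    hidden = relu(matvec(w_down, x))
    update = matvec(w_up, hidden)
    return [xi + ui for xi, ui in zip(x, update)]

features = [1.0, -2.0, 3.0, 0.5]                 # output of a frozen layer (toy values)
w_down = [[1.0, 0.0, 0.0, 0.0],                  # 4 -> 2 bottleneck (trainable)
          [0.0, 0.0, 1.0, 0.0]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]       # 2 -> 4, zero-initialized (trainable)

# With a zero up-projection the adapter is an identity mapping.
print(adapter(features, w_down, w_up_zero))  # [1.0, -2.0, 3.0, 0.5]

# After training, non-zero weights add a small learned update to the frozen features.
w_up_trained = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.1], [0.0, 0.0]]
adapted = adapter(features, w_down, w_up_trained)
```

Only `w_down` and `w_up` would be trained in such a setup, which is why the number of tunable parameters stays small relative to the frozen backbone.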

Parameter-Efficient Fine-Tuning Design Spaces

Jan 04, 2023

Jiaao Chen, Aston Zhang, Xingjian Shi, Mu Li, Alex Smola, Diyi Yang

Abstract: Parameter-efficient fine-tuning aims to achieve performance comparable to fine-tuning, using fewer trainable parameters. Several strategies (e.g., Adapters, prefix tuning, BitFit, and LoRA) have been proposed. However, their designs are hand-crafted separately, and it remains unclear whether certain design patterns exist for parameter-efficient fine-tuning. Thus, we present a parameter-efficient fine-tuning design paradigm and discover design patterns that are applicable to different experimental settings. Instead of focusing on designing another individual tuning strategy, we introduce parameter-efficient fine-tuning design spaces that parameterize tuning structures and tuning strategies. Specifically, any design space is characterized by four components: layer grouping, trainable parameter allocation, tunable groups, and strategy assignment. Starting from an initial design space, we progressively refine the space based on the model quality of each design choice and make a greedy selection at each stage over these four components. We discover the following design patterns: (i) group layers in a spindle pattern; (ii) allocate the number of trainable parameters to layers uniformly; (iii) tune all the groups; (iv) assign proper tuning strategies to different groups. These design patterns result in new parameter-efficient fine-tuning methods. We show experimentally that these methods consistently and significantly outperform investigated parameter-efficient fine-tuning strategies across different backbone models and different tasks in natural language processing.

* Code is available at https://github.com/amazon-science/peft-design-spaces
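
The four components named in the abstract can be written down as a small data structure. This sketch is not taken from the peft-design-spaces repository: the class `PeftDesign`, the 12-layer model, the parameter budget, the group sizes (2, 4, 4, 2) used as a guess at a spindle shape, and the per-group strategy choices are all illustrative assumptions. Only the strategy names come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PeftDesign:
    layer_groups: List[List[int]]      # layer grouping
    params_per_layer: Dict[int, int]   # trainable parameter allocation
    tunable_groups: List[int]          # which groups are tuned
    strategies: Dict[int, List[str]]   # strategy assignment per group

def group_layers(group_sizes):
    """Assign consecutive layer indices to groups of the given sizes."""
    groups, start = [], 0
    for size in group_sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def uniform_allocation(num_layers, budget):
    """Spread a trainable-parameter budget evenly over the layers."""
    base, extra = divmod(budget, num_layers)
    return {layer: base + (1 if layer < extra else 0) for layer in range(num_layers)}

design = PeftDesign(
    layer_groups=group_layers([2, 4, 4, 2]),    # assumed spindle-like sizes
    params_per_layer=uniform_allocation(12, 1200),
    tunable_groups=[0, 1, 2, 3],                # pattern (iii): tune all groups
    strategies={                                # hypothetical assignment
        0: ["Adapters"],
        1: ["prefix tuning", "LoRA"],
        2: ["Adapters", "BitFit"],
        3: ["LoRA"],
    },
)
print(design.layer_groups[1])       # [2, 3, 4, 5]
print(design.params_per_layer[0])   # 100
```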

Learning Multimodal Data Augmentation in Feature Space

Dec 29, 2022

Zichang Liu, Zhiqiang Tang, Xingjian Shi, Aston Zhang, Mu Li, Anshumali Shrivastava, Andrew Gordon Wilson

Abstract: The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.

SPT: Semi-Parametric Prompt Tuning for Multitask Prompted Learning

Dec 21, 2022

M Saiful Bari, Aston Zhang, Shuai Zheng, Xingjian Shi, Yi Zhu, Shafiq Joty, Mu Li

Abstract: Pre-trained large language models can efficiently interpolate human-written prompts in a natural way. Multitask prompted learning can help generalization through a diverse set of tasks at once, thus enhancing the potential for more effective downstream fine-tuning. To perform efficient multitask inference in the same batch, parameter-efficient fine-tuning methods such as prompt tuning have been proposed. However, the existing prompt tuning methods may lack generalization. We propose SPT, a semi-parametric prompt tuning method for multitask prompted learning. The novel component of SPT is a memory bank from which memory prompts are retrieved based on discrete prompts. Extensive experiments, such as (i) fine-tuning a full language model with SPT on 31 different tasks from 8 different domains and evaluating zero-shot generalization on 9 held-out datasets under 5 NLP task categories and (ii) pretraining SPT on the GLUE datasets and evaluating fine-tuning on the SuperGLUE datasets, demonstrate the effectiveness of SPT.
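
The memory-bank idea can be illustrated with a toy retrieval routine. The abstract only states that memory prompts are retrieved based on discrete prompts; ranking by cosine similarity, the two-dimensional embeddings, and all names below are assumptions made for this sketch, not details of SPT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_memory_prompts(query_embedding, memory_bank, top_k=1):
    """Return the top_k memory prompts whose keys are most similar to the query."""
    ranked = sorted(
        memory_bank,
        key=lambda entry: cosine(query_embedding, entry["key"]),
        reverse=True,
    )
    return [entry["prompt"] for entry in ranked[:top_k]]

# Hypothetical memory bank: each entry pairs a key vector with a memory prompt id.
memory_bank = [
    {"key": [1.0, 0.0], "prompt": "memory_prompt_sentiment"},
    {"key": [0.0, 1.0], "prompt": "memory_prompt_nli"},
]

# Hypothetical embedding of a discrete (human-written) prompt.
discrete_prompt_embedding = [0.9, 0.1]
retrieved = retrieve_memory_prompts(discrete_prompt_embedding, memory_bank, top_k=1)
print(retrieved)  # ['memory_prompt_sentiment']
```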
