Alexander I. Rudnicky

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation
Nov 15, 2023
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky

Advancing Regular Language Reasoning in Linear Recurrent Neural Networks
Sep 14, 2023
Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky

Structured Dialogue Discourse Parsing
Jun 26, 2023
Ta-Chung Chi, Alexander I. Rudnicky

Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings
May 23, 2023
Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander I. Rudnicky, Peter J. Ramadge

Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation
May 05, 2023
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

Receptive Field Alignment Enables Transformer Length Extrapolation
Dec 20, 2022
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky

Training Discrete Deep Generative Models via Gapped Straight-Through Estimator
Jun 15, 2022
Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky, Peter J. Ramadge

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
May 20, 2022
Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky

Zero-Shot Dialogue Disentanglement by Self-Supervised Entangled Response Selection
Oct 25, 2021
Ta-Chung Chi, Alexander I. Rudnicky