Jiaoda Li

Probing via Prompting

Jul 04, 2022
Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan

Figures 1–4 for Probing via Prompting

Probing is a popular method to discern what linguistic information is contained in the representations of pre-trained language models. However, the mechanism of selecting the probe model has recently been subject to intense debate, as it is not clear if the probes are merely extracting information or modeling the linguistic property themselves. To address this challenge, this paper introduces a novel model-free approach to probing by formulating probing as a prompting task. We conduct experiments on five probing tasks and show that our approach is comparable to or better than diagnostic probes at extracting information while learning much less on its own. We further combine the probing via prompting approach with attention head pruning to analyze where the model stores the linguistic information in its architecture. We then examine the usefulness of a specific linguistic property for pre-training by removing the heads that are essential to that property and evaluating the resulting model's performance on language modeling.
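The reformulation of probing as prompting can be illustrated with a toy cloze-style template (a hypothetical sketch for a part-of-speech probe; the actual prompt formats used in the paper may differ): a labeled probing example is rewritten as a prompt so that the frozen language model itself, rather than a separately trained probe, produces the label.

```python
def to_prompt(sentence, word, label=None):
    """Rewrite a POS-tagging probe example as a cloze-style prompt.

    This template is a hypothetical illustration, not the paper's
    exact prompt format.
    """
    prompt = f'{sentence} Question: what is the part of speech of "{word}"? Answer:'
    if label is not None:
        # Training example: the gold label is appended as the answer.
        prompt += f" {label}"
    return prompt

# A labeled example becomes a complete prompt; at test time the answer
# slot is left empty for the language model to fill in.
train_prompt = to_prompt("The cat sat.", "sat", label="verb")
test_prompt = to_prompt("Dogs bark loudly.", "bark")
```

Because no probe parameters are trained on top of the representations, any information extracted this way is harder to attribute to the probe itself.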

* NAACL 2022 

Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Sep 08, 2021
Jiaoda Li, Duygu Ataman, Rico Sennrich

Figures 1–2 for Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise, which suggests that the visual context might not be exploited by the model at all. We hypothesize that this might be caused by the nature of the commonly used evaluation benchmark, Multi30K, where the translations of image captions were prepared without actually showing the images to human translators. In this paper, we present a qualitative study that examines the role of datasets in encouraging models to exploit the visual modality, and we propose methods to highlight the importance of visual signals in the datasets, which demonstrably increase the models' reliance on the source images. Our findings suggest that research on effective MMT architectures is currently impaired by the lack of suitable datasets, and that careful consideration must be given to the creation of future MMT datasets, for which we also provide useful insights.

* EMNLP 2021 

Differentiable Subset Pruning of Transformer Heads

Aug 22, 2021
Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan

Figures 1–4 for Differentiable Subset Pruning of Transformer Heads

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous works while offering precise control of the sparsity level.
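The selection step described above can be sketched as hard top-k masking over learned per-head importance scores (a minimal NumPy illustration of the forward pass only; the paper's method additionally makes this selection differentiable so the importance variables can be trained end-to-end with SGD, which this sketch omits):

```python
import numpy as np

def head_mask(importance, k):
    """Binary mask keeping exactly the k highest-importance heads.

    `importance` holds one learned score per attention head. In the
    actual method these scores are trained through a differentiable
    relaxation of top-k selection; this sketch shows only the hard
    selection applied at inference time.
    """
    mask = np.zeros_like(importance)
    mask[np.argsort(importance)[-k:]] = 1.0  # keep the k largest scores
    return mask

scores = np.array([0.1, 2.3, -0.5, 1.7, 0.9, 0.2])  # 6 heads, one layer
mask = head_mask(scores, k=3)  # user-specified budget: exactly 3 heads survive
# Each head's attention output would then be scaled by mask[head],
# zeroing out the pruned heads.
```

The hard constraint is what distinguishes this from soft, penalty-based pruning: the number of surviving heads is exactly the user-specified budget rather than an indirect consequence of a regularization weight.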

* TACL 2021 