Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Percy Liang

LinkBERT: Pretraining Language Models with Document Links

Mar 29, 2022
Michihiro Yasunaga, Jure Leskovec, Percy Liang

Figure 1 for LinkBERT: Pretraining Language Models with Document Links

Figure 2 for LinkBERT: Pretraining Language Models with Document Links

Figure 3 for LinkBERT: Pretraining Language Models with Document Links

Figure 4 for LinkBERT: Pretraining Language Models with Document Links

Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data at https://github.com/michiyasunaga/LinkBERT.

* Published at ACL 2022. Code, data, and pretrained models are available at https://github.com/michiyasunaga/LinkBERT

Via

Access Paper or Ask Questions

Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Feb 21, 2022
Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang

Figure 1 for Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Figure 2 for Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Figure 3 for Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Figure 4 for Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).

* ICLR (Oral) 2022

Via

Access Paper or Ask Questions

CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities

Jan 25, 2022
Mina Lee, Percy Liang, Qian Yang

Figure 1 for CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities

Figure 2 for CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities

Figure 3 for CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities

Figure 4 for CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities

Large language models (LMs) offer unprecedented language generation capabilities and exciting opportunities for interaction design. However, their highly context-dependent capabilities are difficult to grasp and are often subjectively interpreted. In this paper, we argue that by curating and analyzing large interaction datasets, the HCI community can foster more incisive examinations of LMs' generative capabilities. Exemplifying this approach, we present CoAuthor, a dataset designed for revealing GPT-3's capabilities in assisting creative and argumentative writing. CoAuthor captures rich interactions between 63 writers and four instances of GPT-3 across 1445 writing sessions. We demonstrate that CoAuthor can address questions about GPT-3's language, ideation, and collaboration capabilities, and reveal its contribution as a writing "collaborator" under various definitions of good collaboration. Finally, we discuss how this work may facilitate a more principled discussion around LMs' promises and pitfalls in relation to interaction design. The dataset and an interface for replaying the writing sessions are publicly available at https://coauthor.stanford.edu.

* Published as a conference paper at CHI 2022

Via

Access Paper or Ask Questions

GreaseLM: Graph REASoning Enhanced Language Models for Question Answering

Jan 21, 2022
Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, Jure Leskovec

Figure 1 for GreaseLM: Graph REASoning Enhanced Language Models for Question Answering

Figure 2 for GreaseLM: Graph REASoning Enhanced Language Models for Question Answering

Figure 3 for GreaseLM: Graph REASoning Enhanced Language Models for Question Answering

Figure 4 for GreaseLM: Graph REASoning Enhanced Language Models for Question Answering

Answering complex questions about textual narratives requires reasoning over both stated context and the world knowledge that underlies it. However, pretrained language models (LM), the foundation of most modern QA systems, do not robustly represent latent relationships between concepts, which is necessary for reasoning. While knowledge graphs (KG) are often used to augment LMs with structured representations of world knowledge, it remains an open question how to effectively fuse and reason over the KG representations and the language context, which provides situational constraints and nuances. In this work, we propose GreaseLM, a new model that fuses encoded representations from pretrained LMs and graph neural networks over multiple layers of modality interaction operations. Information from both modalities propagates to the other, allowing language context representations to be grounded by structured world knowledge, and allowing linguistic nuances (e.g., negation, hedging) in the context to inform the graph representations of knowledge. Our results on three benchmarks in the commonsense reasoning (i.e., CommonsenseQA, OpenbookQA) and medical question answering (i.e., MedQA-USMLE) domains demonstrate that GreaseLM can more reliably answer questions that require reasoning over both situational constraints and structured knowledge, even outperforming models 8x larger.

* Published at ICLR 2022. All code, data, and pretrained models are available at https://github.com/snap-stanford/GreaseLM

Via

Access Paper or Ask Questions

Extending the WILDS Benchmark for Unsupervised Adaptation

Dec 09, 2021
Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Levine, Chelsea Finn, Percy Liang

Figure 1 for Extending the WILDS Benchmark for Unsupervised Adaptation

Figure 2 for Extending the WILDS Benchmark for Unsupervised Adaptation

Figure 3 for Extending the WILDS Benchmark for Unsupervised Adaptation

Figure 4 for Extending the WILDS Benchmark for Unsupervised Adaptation

Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data. However, existing distribution shift benchmarks for unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. To maintain consistency, the labeled training, validation, and test sets, as well as the evaluation metrics, are exactly the same as in the original WILDS benchmark. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS 2.0 is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.

Via

Access Paper or Ask Questions

An Explanation of In-context Learning as Implicit Bayesian Inference

Nov 14, 2021
Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma

Figure 1 for An Explanation of In-context Learning as Implicit Bayesian Inference

Figure 2 for An Explanation of In-context Learning as Implicit Bayesian Inference

Figure 3 for An Explanation of In-context Learning as Implicit Bayesian Inference

Figure 4 for An Explanation of In-context Learning as Implicit Bayesian Inference

Large pretrained language models such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. Without being explicitly pretrained to do so, the language model learns from these examples during its forward pass without parameter updates on "out-of-distribution" prompts. Thus, it is unclear what mechanism enables in-context learning. In this paper, we study the role of the pretraining distribution on the emergence of in-context learning under a mathematical setting where the pretraining texts have long-range coherence. Here, language model pretraining requires inferring a latent document-level concept from the conditioning text to generate coherent next tokens. At test time, this mechanism enables in-context learning by inferring the shared latent concept between prompt examples and applying it to make a prediction on the test example. Concretely, we prove that in-context learning occurs implicitly via Bayesian inference of the latent concept when the pretraining distribution is a mixture of HMMs. This can occur despite the distribution mismatch between prompts and pretraining data. In contrast to messy large-scale pretraining datasets for in-context learning in natural language, we generate a family of small-scale synthetic datasets (GINC) where Transformer and LSTM language models both exhibit in-context learning. Beyond the theory which focuses on the effect of the pretraining distribution, we empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.

Via

Access Paper or Ask Questions

LILA: Language-Informed Latent Actions

Nov 05, 2021
Siddharth Karamcheti, Megha Srivastava, Percy Liang, Dorsa Sadigh

Figure 1 for LILA: Language-Informed Latent Actions

Figure 2 for LILA: Language-Informed Latent Actions

Figure 3 for LILA: Language-Informed Latent Actions

Figure 4 for LILA: Language-Informed Latent Actions

We introduce Language-Informed Latent Actions (LILA), a framework for learning natural language interfaces in the context of human-robot collaboration. LILA falls under the shared autonomy paradigm: in addition to providing discrete language inputs, humans are given a low-dimensional controller $-$ e.g., a 2 degree-of-freedom (DoF) joystick that can move left/right and up/down $-$ for operating the robot. LILA learns to use language to modulate this controller, providing users with a language-informed control space: given an instruction like "place the cereal bowl on the tray," LILA may learn a 2-DoF space where one dimension controls the distance from the robot's end-effector to the bowl, and the other dimension controls the robot's end-effector pose relative to the grasp point on the bowl. We evaluate LILA with real-world user studies, where users can provide a language instruction while operating a 7-DoF Franka Emika Panda Arm to complete a series of complex manipulation tasks. We show that LILA models are not only more sample efficient and performant than imitation learning and end-effector control baselines, but that they are also qualitatively preferred by users.

* Accepted at the 5th Conference on Robot Learning (CoRL). Joint first authorship. 21 Pages, 11 Figures

Via

Access Paper or Ask Questions

Large Language Models Can Be Strong Differentially Private Learners

Oct 12, 2021
Xuechen Li, Florian Tramèr, Percy Liang, Tatsunori Hashimoto

Figure 1 for Large Language Models Can Be Strong Differentially Private Learners

Figure 2 for Large Language Models Can Be Strong Differentially Private Learners

Figure 3 for Large Language Models Can Be Strong Differentially Private Learners

Figure 4 for Large Language Models Can Be Strong Differentially Private Learners

Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and attempts at straightforwardly applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained models; (2) hyperparameters that suit DP optimization; and (3) fine-tuning objectives aligned with the pretraining procedure. With these factors set right, we obtain private NLP models that outperform state-of-the-art private training approaches and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension) empirical results reveal that private learning with pretrained models tends to not suffer from dimension-dependent performance degradation.

* 29 pages

Via

Access Paper or Ask Questions