Jannis Bulian

Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$

Mar 31, 2022
Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, Andrea Gesmundo

Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: $\texttt{t5x}$ simplifies the process of building and training large language models at scale while maintaining ease of use, and $\texttt{seqio}$ provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data. Along with the libraries, we release configurations and instructions for T5-like encoder-decoder models as well as GPT-like decoder-only architectures. $\texttt{t5x}$ and $\texttt{seqio}$ are open source and available at https://github.com/google-research/t5x and https://github.com/google/seqio, respectively.
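
For readers who want a concrete picture of the task-based API, here is a minimal seqio sketch that registers a text-to-text task. The task name, TFDS dataset, field names, and vocabulary path are placeholders for illustration, not values from the paper.

```python
# Minimal seqio sketch: register a text-to-text task and read it back by name.
# The dataset, field names, and vocabulary path below are placeholders.
import functools
import seqio

vocab = seqio.SentencePieceVocabulary("/path/to/sentencepiece.model")  # placeholder path

seqio.TaskRegistry.add(
    "my_text_to_text_task",                                     # placeholder task name
    source=seqio.TfdsDataSource(tfds_name="my_dataset:1.0.0"),  # placeholder TFDS name
    preprocessors=[
        # Rename the raw text fields to the canonical "inputs"/"targets" keys.
        functools.partial(seqio.preprocessors.rekey,
                          key_map={"inputs": "source_text",
                                   "targets": "target_text"}),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
)

# Tasks are looked up by name, so a t5x training run can reference this task
# from its configuration without importing the objects defined above.
ds = seqio.get_mixture_or_task("my_text_to_text_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 128}, split="train")
```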

Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation

Feb 15, 2022
Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Boerschinger, Tal Schuster

The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with predefined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and collect over 26K human judgements for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, a missing dependence on the question, and more. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Because this is a simpler task than QA, we find that BEM provides significantly better AE approximations than F1 and more accurately reflects the performance of systems. Finally, we also demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to 2.6 times.
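
To make the BEM idea concrete, the sketch below frames answer equivalence as binary classification over a single BERT input that packs together the question, reference answer, and candidate answer. This is not the released BEM model: the base checkpoint and input formatting are assumptions, and the classification head would still need fine-tuning on the AE annotations before its scores mean anything.

```python
# Sketch of a BERT-based answer-equivalence classifier in the spirit of BEM.
# NOT the released BEM model; checkpoint and formatting are assumptions, and
# the classification head is untrained until fine-tuned on AE annotations.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # label 1 = candidate matches or improves on the reference

def equivalence_score(question: str, reference: str, candidate: str) -> float:
    # Pack question, reference, and candidate into one input so the classifier
    # can condition on all three, unlike token-level F1.
    first = f"question: {question} reference: {reference}"
    inputs = tokenizer(first, f"candidate: {candidate}",
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(equivalence_score(
    "Where is the Eiffel Tower?", "Paris", "in Paris, France"))
```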

Fool Me Twice: Entailment from Wikipedia Gamification

Apr 10, 2021
Julian Martin Eisenschlos, Bhuwan Dhingra, Jannis Bulian, Benjamin Börschinger, Jordan Boyd-Graber

We release FoolMeTwice (FM2 for short), a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using "shortcuts" compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second shows two plausible claims written by other players, one of which is false, and the goal is to identify it before time runs out. Players "pay" to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Gameplay between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher-quality data for the entailment and evidence retrieval tasks. We open-source the dataset and the game code.

* Published in NAACL 2021 

CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims

Jan 02, 2021
Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, Markus Leippold

We introduce CLIMATE-FEVER, a new publicly available dataset for the verification of climate change-related claims. By providing a dataset for the research community, we aim to facilitate and encourage work on improving algorithms for retrieving evidential support for climate-specific claims, to address the underlying language understanding challenges, and ultimately to help alleviate the impact of misinformation on climate change. We adapt the methodology of FEVER [1], the largest dataset of artificially designed claims, to real-life claims collected from the Internet. While we could rely on the expertise of renowned climate scientists during this process, it turned out to be no easy task. We discuss the surprising, subtle complexity of modeling real-world climate-related claims within the FEVER framework, which we believe provides a valuable challenge for general natural language understanding. We hope that our work will mark the beginning of an exciting long-term joint effort by the climate science and AI communities.
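
For a quick look at the data, the dataset also has a community copy on the Hugging Face Hub; the dataset id and field names in the snippet below are assumptions about that copy rather than something the paper specifies.

```python
# Inspect a few CLIMATE-FEVER records via the Hugging Face Hub copy.
# The dataset id "climate_fever" and the field names are assumptions.
from datasets import load_dataset

ds = load_dataset("climate_fever", split="test")
example = ds[0]
print(example["claim"])         # a real-world climate claim
print(example["claim_label"])   # FEVER-style verdict assigned to the claim
for evidence in example["evidences"]:
    print(evidence["evidence"], evidence["evidence_label"])
```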

* Accepted for the Tackling Climate Change with Machine Learning Workshop at NeurIPS 2020 

Meta Answering for Machine Reading

Nov 11, 2019
Benjamin Borschinger, Jordan Boyd-Graber, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Michelle Chen Huebscher, Wojciech Gajewski, Yannic Kilcher, Rodrigo Nogueira, Lierni Sestorain Saralegu

We investigate a framework for machine reading, inspired by real-world information-seeking problems, in which a meta question answering system interacts with a black-box environment. The environment encapsulates a competitive machine reader based on BERT, providing candidate answers to questions, and possibly some context. To validate the realism of our formulation, we ask humans to play the role of a meta-answerer. With just a small snippet of text around an answer, humans can outperform the machine reader, improving recall. Similarly, a simple machine meta-answerer outperforms the environment, improving both precision and recall on the Natural Questions dataset. The system relies on joint training of answer scoring and the selection of conditioning information.
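
The setup can be summarized with a small illustrative sketch: the environment proposes candidate answers together with short conditioning snippets, and the meta-answerer re-scores them and returns the best one. The interfaces and the trivial stand-in scorer below are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch of the meta-answering loop: a black-box environment
# supplies candidates (answer + snippet + reader score) and the meta-answerer
# re-scores them. All names and types here are assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    answer: str
    snippet: str         # small conditioning context around the answer
    reader_score: float  # score assigned by the black-box machine reader

def meta_answer(question: str,
                candidates: List[Candidate],
                scorer: Callable[[str, Candidate], float]) -> str:
    """Return the answer whose (question, candidate) pair the scorer likes best.

    `scorer` stands in for a learned model that jointly scores the answer and
    its conditioning snippet, as described in the abstract.
    """
    best = max(candidates, key=lambda c: scorer(question, c))
    return best.answer

# Toy usage with a stand-in scorer that simply trusts the reader score.
cands = [Candidate("1889", "The Eiffel Tower was completed in 1889.", 0.7),
         Candidate("Paris", "The tower is located in Paris.", 0.4)]
print(meta_answer("When was the Eiffel Tower completed?", cands,
                  scorer=lambda q, c: c.reader_score))
```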

Learning to Coordinate Multiple Reinforcement Learning Agents for Diverse Query Reformulation

Sep 27, 2018
Rodrigo Nogueira, Jannis Bulian, Massimiliano Ciaramita

We propose a method to efficiently learn diverse strategies in reinforcement learning for query reformulation in the tasks of document retrieval and question answering. In the proposed framework, an agent consists of multiple specialized sub-agents and a meta-agent that learns to aggregate the answers from the sub-agents to produce a final answer. Sub-agents are trained on disjoint partitions of the training data, while the meta-agent is trained on the full training set. Our method makes learning faster because it is highly parallelizable, and it generalizes better than strong baselines such as an ensemble of agents trained on the full data. We show that the improved performance is due to the increased diversity of reformulation strategies.
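
A minimal sketch of the decomposition, assuming simple function-valued agents and a majority-vote aggregator in place of the learned meta-agent, looks roughly like this:

```python
# Sub-agent / meta-agent decomposition: each sub-agent trains on its own
# partition of the data; a meta-agent aggregates their answers. The interfaces
# and the majority vote are illustrative assumptions; the paper learns the
# aggregation with reinforcement learning.
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def partition(data: Sequence[T], num_parts: int) -> List[List[T]]:
    """Split the training data into disjoint partitions, one per sub-agent."""
    return [list(data[i::num_parts]) for i in range(num_parts)]

def train_sub_agents(data: Sequence[T], num_agents: int,
                     train_fn: Callable[[List[T]], Callable[[str], str]]
                     ) -> List[Callable[[str], str]]:
    """Train each sub-agent on its own partition (trivially parallelizable)."""
    return [train_fn(part) for part in partition(data, num_agents)]

def meta_agent(query: str, sub_agents: List[Callable[[str], str]]) -> str:
    """Aggregate sub-agent answers; a simple vote stands in for the learned meta-agent."""
    answers = [agent(query) for agent in sub_agents]
    return Counter(answers).most_common(1)[0][0]
```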

Ask the Right Questions: Active Question Reformulation with Reinforcement Learning

Mar 02, 2018
Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Wojciech Gajewski, Andrea Gesmundo, Neil Houlsby, Wei Wang

We frame question answering (QA) as a reinforcement learning task, an approach that we call Active Question Answering. We propose an agent that sits between the user and a black-box QA system and learns to reformulate questions to elicit the best possible answers. The agent probes the system with potentially many natural language reformulations of an initial question and aggregates the returned evidence to yield the best answer. The reformulation system is trained end-to-end to maximize answer quality using policy gradient. We evaluate on SearchQA, a dataset of complex questions extracted from Jeopardy!. The agent outperforms a state-of-the-art base model, which plays the role of the environment, as well as other benchmarks. We also analyze the language that the agent has learned while interacting with the question answering system. We find that successful question reformulations look quite different from natural language paraphrases. The agent is able to discover non-trivial reformulation strategies that resemble classic information retrieval techniques such as term re-weighting (tf-idf) and stemming.
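
As a toy illustration of the training signal, the snippet below runs REINFORCE over a handful of fixed rewrite templates against a stub QA environment; the templates, environment, and reward are invented for the example and stand in for the paper's sequence-to-sequence reformulator, black-box QA system, and answer-quality metric.

```python
# Toy REINFORCE loop: the policy picks among rewrite templates, queries a stub
# QA environment, and is reinforced toward rewrites that yield better answers.
# Everything here is an illustrative stand-in for the paper's setup.
import numpy as np

TEMPLATES = ["{q}", "what is {q}", "{q} definition", "tell me about {q}"]
logits = np.zeros(len(TEMPLATES))        # policy parameters over rewrite templates

def qa_env(query: str) -> str:
    # Stub black-box QA system: only well-formed "what is ..." queries succeed.
    return "a fruit" if query.startswith("what is") else "unknown"

def reward(answer: str, gold: str) -> float:
    return 1.0 if answer == gold else 0.0  # stand-in for an answer-quality metric

rng = np.random.default_rng(0)
baseline = 0.0
for _ in range(500):
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    a = rng.choice(len(TEMPLATES), p=probs)            # sample a reformulation
    r = reward(qa_env(TEMPLATES[a].format(q="an apple")), "a fruit")
    grad = -probs; grad[a] += 1.0                      # grad of log pi(a) w.r.t. logits
    logits = logits + 0.1 * (r - baseline) * grad      # REINFORCE update with baseline
    baseline = 0.9 * baseline + 0.1 * r                # running-average baseline

print("preferred rewrite:", TEMPLATES[int(np.argmax(logits))])
```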

* Sixth International Conference on Learning Representations (ICLR), 2018  

Analyzing Language Learned by an Active Question Answering Agent

Jan 23, 2018
Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Wojciech Gajewski, Andrea Gesmundo, Neil Houlsby, Wei Wang

We analyze the language learned by an agent trained with reinforcement learning as a component of the ActiveQA system [Buck et al., 2017]. In ActiveQA, question answering is framed as a reinforcement learning task in which an agent sits between the user and a black-box question-answering system. The agent learns to reformulate the user's questions to elicit the optimal answers. It probes the system with many versions of a question, generated via a sequence-to-sequence question reformulation model, and then aggregates the returned evidence to find the best answer. This process is an instance of machine-machine communication. The question reformulation model must adapt its language to increase the quality of the answers returned, matching the language of the question answering system. We find that the agent does not learn transformations that align with semantic intuitions, but instead discovers classical information retrieval techniques such as tf-idf re-weighting and stemming.

* Emergent Communication Workshop, NIPS 2017 