Alane Suhr

NLVR2 Visual Bias Analysis

Sep 23, 2019
Alane Suhr, Yoav Artzi

NLVR2 (Suhr et al., 2019) was designed to be robust to language bias through a data collection process that resulted in each natural language sentence appearing with both true and false labels. The process did not provide a similar measure of control for visual bias. This technical report analyzes the potential for visual bias in NLVR2. We show that some amount of visual bias likely exists. Finally, we identify a subset of the test data that allows testing model performance in a way that is robust to such potential biases. We show that the performance of existing models (Li et al., 2019; Tan and Bansal, 2019) is relatively robust to this potential bias. We propose to add evaluation on this subset of the data to the NLVR2 evaluation protocol, and update the official release to include it. A notebook with an implementation of the code used to replicate this analysis is available at http://nlvr.ai/NLVR2BiasAnalysis.html.

* Corresponding notebook available at http://lil.nlp.cornell.edu/nlvr/NLVR2BiasAnalysis.html 
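
As a rough illustration of the kind of subset the report describes, the sketch below keeps only test examples whose images appear with both labels, so that memorizing image-to-label associations gives a model little advantage on the retained examples. The grouping key and field names (`image_pair_id`, `label`) are assumptions for illustration; the authoritative analysis is in the linked notebook.

    # Hedged sketch: keep only examples whose images appear with both labels.
    # The fields `image_pair_id` and `label` are assumptions for illustration;
    # see the linked notebook for the analysis actually used.
    from collections import defaultdict

    def balanced_subset(examples):
        labels_per_image = defaultdict(set)
        for ex in examples:
            labels_per_image[ex["image_pair_id"]].add(ex["label"])
        return [
            ex for ex in examples
            if len(labels_per_image[ex["image_pair_id"]]) > 1
        ]

On such a subset, a classifier that conditions only on the images cannot separate examples that share the same images but carry different labels.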

Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments

Nov 30, 2018
Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, Yoav Artzi

We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a real-life visual urban environment to a goal position, and then identify in the observed image a location described in natural language to find a hidden object. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations. We perform qualitative linguistic analysis, and show that the data displays richer use of spatial reasoning compared to related resources. Empirical analysis shows the data presents an open challenge to existing methods.
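
For concreteness, the sketch below shows one hypothetical way to represent a single Touchdown example as a record; all field names are illustrative assumptions and do not reflect the released data format.

    # A hedged sketch of one Touchdown example as a record. Field names are
    # illustrative assumptions only; consult the dataset release for the
    # actual format.
    from dataclasses import dataclass

    @dataclass
    class TouchdownExample:
        instruction: str        # English navigation and spatial description text
        start_panorama: str     # identifier of the starting street-view panorama
        goal_panorama: str      # identifier of the goal panorama
        target_location: tuple  # assumed (x, y) pixel location of the hidden
                                # object in the observed goal image

    # The task: follow `instruction` from `start_panorama` to `goal_panorama`,
    # then point to `target_location` in the observed image.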

A Corpus for Reasoning About Natural Language Grounded in Photographs

Nov 01, 2018
Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, Yoav Artzi

We introduce a new dataset for joint reasoning about language and vision. The data contains 107,296 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a photograph. We present an approach for finding visually complex images and crowdsourcing linguistically diverse captions. Qualitative analysis shows the data requires complex reasoning about quantities, comparisons, and relationships between objects. Evaluation of state-of-the-art visual reasoning methods shows the data is a challenge for current methods.
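
Since the task is a binary truth judgment per sentence-photograph pair, evaluation reduces to simple accuracy; work in this line also reports a per-sentence consistency measure (all examples of a sentence judged correctly). The sketch below computes both under assumed field names; it is not the official evaluation script.

    # Hedged sketch of accuracy and per-sentence consistency. The fields
    # ('id', 'sentence', 'label') and the `predictions` mapping are assumptions;
    # this is not the official evaluation script.
    from collections import defaultdict

    def accuracy_and_consistency(predictions, examples):
        correct = [predictions[ex["id"]] == ex["label"] for ex in examples]
        accuracy = sum(correct) / len(examples)
        # Consistency: fraction of unique sentences whose examples are all correct.
        by_sentence = defaultdict(list)
        for ex, ok in zip(examples, correct):
            by_sentence[ex["sentence"]].append(ok)
        consistency = sum(all(v) for v in by_sentence.values()) / len(by_sentence)
        return accuracy, consistency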

Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation

Jun 08, 2018
Alane Suhr, Yoav Artzi

We propose a learning approach for mapping context-dependent sequential instructions to actions. We address the problem of discourse and state dependencies with an attention-based model that considers both the history of the interaction and the state of the world. To train from start and goal states without access to demonstrations, we propose SESTRA, a learning algorithm that takes advantage of single-step reward observations and immediate expected reward maximization. We evaluate on the SCONE domains, and show absolute accuracy improvements of 9.8%-25.3% across the domains over approaches that use high-level logical representations.

* ACL 2018 Long Paper 
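
One way to read the abstract's learning signal is that, at each visited state, the learner observes a single-step reward for every available action and maximizes the expected immediate reward under its policy. The loss below is a schematic PyTorch-style rendering of that objective under this reading; it is not the paper's implementation of SESTRA.

    # Hedged sketch of an immediate expected-reward objective, assuming the
    # learner observes a single-step reward for every action at the current
    # state. This is one reading of the abstract, not the SESTRA implementation.
    import torch

    def expected_reward_loss(action_logits, single_step_rewards):
        # action_logits:       (num_actions,) unnormalized policy scores
        # single_step_rewards: (num_actions,) observed reward for each action
        probs = torch.softmax(action_logits, dim=-1)
        expected_reward = (probs * single_step_rewards).sum()
        return -expected_reward  # minimizing this maximizes expected reward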

Learning to Map Context-Dependent Sentences to Executable Formal Queries

Apr 25, 2018
Alane Suhr, Srinivasan Iyer, Yoav Artzi

We propose a context-dependent model to map utterances within an interaction to executable formal queries. To incorporate interaction history, the model maintains an interaction-level encoder that updates after each turn, and can copy sub-sequences of previously predicted queries during generation. Our approach combines implicit and explicit modeling of references between utterances. We evaluate our model on the ATIS flight planning interactions, and demonstrate the benefits of modeling context and explicit references.

* NAACL-HLT 2018 Long Paper 
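
The interaction-level encoder can be pictured as a recurrent state that is updated once per turn from an encoding of the latest utterance. The sketch below shows only that turn-level update, with dimensions and module choices as assumptions, and omits the mechanism for copying sub-sequences of previously predicted queries.

    # Hedged sketch of a turn-level interaction state update. Dimensions and
    # the GRU-cell choice are assumptions; query copying is omitted.
    import torch
    import torch.nn as nn

    class InteractionEncoder(nn.Module):
        def __init__(self, utterance_dim=256, interaction_dim=256):
            super().__init__()
            self.update = nn.GRUCell(utterance_dim, interaction_dim)
            self.interaction_dim = interaction_dim

        def initial_state(self):
            return torch.zeros(1, self.interaction_dim)

        def step(self, utterance_encoding, interaction_state):
            # Fold the encoding of the latest utterance (shape (1, utterance_dim))
            # into the running interaction state after each turn.
            return self.update(utterance_encoding, interaction_state)

The updated state then conditions decoding of the next query, so references to earlier turns can be resolved implicitly.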

Visual Reasoning with Natural Language

Oct 02, 2017
Stephanie Zhou, Alane Suhr, Yoav Artzi

Natural language provides a widely accessible and expressive interface for robotic agents. To understand language in complex environments, agents must reason about the full range of language inputs and their correspondence to the world. Such reasoning over language and vision is an open problem that is receiving increasing attention. While existing data sets focus on visual diversity, they do not display the full range of natural language expressions, such as counting, set reasoning, and comparisons. We propose a simple task for natural language visual reasoning, where images are paired with descriptive statements. The task is to predict whether a statement is true for the given scene. This abstract describes our existing corpus of synthetic images and our current work on collecting real vision data.

* AAAI NCHRC 2017 