
André F. T. Martins

Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms

Oct 05, 2018
Mathieu Blondel, André F. T. Martins, Vlad Niculae

In this paper, we study Fenchel-Young losses, a generic way to construct a convex loss function from a convex regularizer. We provide an in-depth analysis of their properties in a broad setting and show that they unify many well-known loss functions. When the regularizer is a generalized entropy, a family that includes the Shannon and Tsallis entropies, we show that Fenchel-Young losses induce a predictive probability distribution, and we develop an efficient algorithm to compute that distribution for separable entropies. We derive conditions under which generalized entropies yield a distribution with sparse support and losses with a separation margin. Finally, we present both primal and dual algorithms for learning predictive models with generic Fenchel-Young losses.
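
A sketch of the core construction, following the paper's general setup (notation may differ in details): given a convex regularizer $\Omega$, the Fenchel-Young loss it generates is

\[
L_\Omega(\theta; y) \;=\; \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle,
\qquad
\Omega^*(\theta) \;=\; \sup_{\mu \in \operatorname{dom} \Omega} \; \langle \theta, \mu \rangle - \Omega(\mu),
\]

which is non-negative and convex in $\theta$, with associated prediction $\widehat{y}_\Omega(\theta) = \arg\max_{\mu \in \operatorname{dom} \Omega} \langle \theta, \mu \rangle - \Omega(\mu)$. Choosing $\Omega$ as the negative Shannon entropy restricted to the simplex recovers the logistic (softmax) loss, while $\Omega(p) = \tfrac{1}{2}\|p\|^2$ on the simplex recovers the sparsemax loss.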

Towards Dynamic Computation Graphs via Sparse Latent Structure

Sep 03, 2018
Vlad Niculae, André F. T. Martins, Claire Cardie

Deep NLP models benefit from underlying structure in the data (e.g., parse trees), typically extracted using off-the-shelf parsers. Recent attempts to jointly learn the latent structure face a tradeoff: they either make factorization assumptions that limit expressiveness or sacrifice end-to-end differentiability. Using the recently proposed SparseMAP inference, which retrieves a sparse distribution over latent structures, we propose a novel approach for end-to-end learning of latent structure predictors jointly with a downstream predictor. To the best of our knowledge, our method is the first to enable unrestricted dynamic computation graph construction from the global latent structure while maintaining differentiability.

* EMNLP 2018; 9 pages (incl. appendix) 
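
The mechanism described in the abstract can be sketched in a few lines of Python; this is an illustrative sketch only, not the paper's implementation, and `sparsemap_posterior` and `run_downstream` are hypothetical placeholders for SparseMAP inference over latent structures and for the structure-conditioned downstream network.

    # Illustrative sketch: marginalize a downstream network over the sparse
    # support returned by SparseMAP (hypothetical helper functions).
    def sparse_latent_forward(scores, x, sparsemap_posterior, run_downstream):
        # sparsemap_posterior: scores -> list of (structure, probability) pairs,
        # with only a handful of structures receiving nonzero probability.
        posterior = sparsemap_posterior(scores)
        output = 0.0
        for structure, prob in posterior:
            if prob > 0.0:
                # Build and run the computation graph conditioned on this
                # structure (e.g., a TreeLSTM over a latent parse).
                output = output + prob * run_downstream(structure, x)
        return output

Because the posterior is sparse, only a few structure-specific computation graphs are ever built, which is what makes the dynamic construction tractable.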

Contextual Neural Model for Translating Bilingual Multi-Speaker Conversations

Sep 02, 2018
Sameen Maruf, André F. T. Martins, Gholamreza Haffari

Recent work in neural machine translation has begun to explore document-level translation. However, translating online multi-speaker conversations is still an open problem. In this work, we propose the task of translating Bilingual Multi-Speaker Conversations and explore neural architectures that exploit both source- and target-side conversation histories for this task. To initiate evaluation for this task, we introduce datasets extracted from Europarl v7 and OpenSubtitles2016. Our experiments on four language pairs confirm the benefit of leveraging conversation history, in terms of both BLEU and manual evaluation.

* WMT 2018 

SparseMAP: Differentiable Sparse Structured Inference

Jun 20, 2018
Vlad Niculae, André F. T. Martins, Mathieu Blondel, Claire Cardie

Structured prediction requires searching over a combinatorial number of structures. To tackle it, we introduce SparseMAP: a new method for sparse structured inference, and its natural loss function. SparseMAP automatically selects only a few global structures: it is situated between MAP inference, which picks a single structure, and marginal inference, which assigns probability mass to all structures, including implausible ones. Importantly, SparseMAP can be computed using only calls to a MAP oracle, making it applicable to problems with intractable marginal inference, e.g., linear assignment. Sparsity makes gradient backpropagation efficient regardless of the structure, enabling us to augment deep neural networks with generic and sparse structured hidden layers. Experiments in dependency parsing and natural language inference reveal competitive accuracy, improved interpretability, and the ability to capture natural language ambiguities, which is attractive for pipeline systems.

* Published in ICML 2018. 14 pages, including appendix 
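
To make the "MAP oracle only" property concrete, here is a minimal sketch of one way to compute SparseMAP, using vanilla conditional gradient (Frank-Wolfe) with exact line search; the paper itself uses an active-set method, and `map_oracle` is an assumed callback returning the 0/1 indicator vector of the highest-scoring structure for a given score vector.

    import numpy as np

    def sparsemap_fw(theta, map_oracle, n_iter=100, tol=1e-9):
        # Solves min_{mu in conv(structures)} 0.5 * ||mu - theta||^2 and returns
        # mu together with a (typically sparse) distribution over the structures
        # selected along the way.
        mu = np.asarray(map_oracle(theta), dtype=float)
        support = {tuple(mu): 1.0}                    # structure -> probability
        for _ in range(n_iter):
            s = np.asarray(map_oracle(theta - mu), dtype=float)  # LMO = one MAP call
            d = s - mu
            gap = (theta - mu) @ d                    # Frank-Wolfe duality gap
            if gap <= tol:
                break
            gamma = min(1.0, gap / (d @ d))           # exact line search (quadratic)
            mu = mu + gamma * d
            support = {k: (1.0 - gamma) * v for k, v in support.items()}
            support[tuple(s)] = support.get(tuple(s), 0.0) + gamma
        return mu, {k: v for k, v in support.items() if v > tol}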

Sparse and Constrained Attention for Neural Machine Translation

May 21, 2018
Chaitanya Malaviya, Pedro Ferreira, André F. T. Martins

In neural machine translation, words are sometimes dropped from the source or generated repeatedly in the translation. We explore novel strategies that address this coverage problem by changing only the attention transformation. Our approach allocates fertilities to source words, which are used to bound the attention each word can receive. We experiment with various sparse and constrained attention transformations and propose a new one, constrained sparsemax, which is shown to be both differentiable and sparse. Empirical evaluation is provided for three language pairs.

* Proceedings of ACL 2018 
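
For a rough picture of the transformation involved, the projection underlying constrained sparsemax can be posed as argmin_p ||p - z||^2 subject to p lying on the simplex and p <= u, where u holds the per-word fertility bounds. The sketch below solves this by bisection on the KKT threshold; it is an illustration under that formulation, not necessarily the algorithm used in the paper.

    import numpy as np

    def constrained_sparsemax(z, u, n_iter=60):
        # min_p 0.5 * ||p - z||^2  s.t.  sum(p) = 1,  0 <= p <= u,
        # assuming the fertility bounds satisfy sum(u) >= 1.
        # KKT conditions give p_i = clip(z_i - tau, 0, u_i) for some threshold tau.
        z, u = np.asarray(z, dtype=float), np.asarray(u, dtype=float)
        lo, hi = np.min(z - u), np.max(z)    # sum(clip) is sum(u) at lo and 0 at hi
        for _ in range(n_iter):
            tau = 0.5 * (lo + hi)
            if np.clip(z - tau, 0.0, u).sum() > 1.0:
                lo = tau                     # too much mass: raise the threshold
            else:
                hi = tau                     # too little mass: lower the threshold
        return np.clip(z - 0.5 * (lo + hi), 0.0, u)

With every bound set to 1, the box constraint is inactive on the simplex and the projection reduces to plain sparsemax.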

Marian: Fast Neural Machine Translation in C++

Apr 04, 2018
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, Alexandra Birch

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

* Demonstration paper 

From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Feb 08, 2016
André F. T. Martins, Ramón Fernandez Astudillo

We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. We then propose a new smooth and convex loss function, the sparsemax analogue of the logistic loss, and reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve performance similar to that of the traditional softmax, but with a more selective and compact attention focus.

* Minor corrections 
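
For concreteness, a sketch of the standard sort-based computation of sparsemax as the Euclidean projection onto the probability simplex (a common O(K log K) formulation, not the authors' reference implementation):

    import numpy as np

    def sparsemax(z):
        # argmin_p ||p - z||^2 over the probability simplex; unlike softmax,
        # the result can contain exact zeros.
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]                # scores in decreasing order
        cumsum = np.cumsum(z_sorted)
        k = np.arange(1, z.size + 1)
        in_support = 1.0 + k * z_sorted > cumsum   # entries that stay positive
        k_z = k[in_support][-1]                    # support size
        tau = (cumsum[k_z - 1] - 1.0) / k_z        # threshold
        return np.maximum(z - tau, 0.0)

    # Example: sparsemax([1.2, 0.8, -1.0]) returns [0.7, 0.3, 0.0], assigning
    # exactly zero probability to the last score, whereas softmax would not.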

Parsing as Reduction

Feb 27, 2015
Daniel Fernández-González, André F. T. Martins

We reduce phrase-representation parsing to dependency parsing. Our reduction is grounded on a new intermediate representation, "head-ordered dependency trees", shown to be isomorphic to constituent trees. By encoding order information in the dependency labels, we show that any off-the-shelf, trainable dependency parser can be used to produce constituents. When this parser is non-projective, we can perform discontinuous parsing in a very natural manner. Despite the simplicity of our approach, experiments show that the resulting parsers are on par with strong baselines, such as the Berkeley parser for English and the best single system in the SPMRL-2014 shared task. Results are particularly striking for discontinuous parsing of German, where we surpass the current state of the art by a wide margin.
