Simon Lacoste-Julien

DIRO, MILA

Adaptive Stochastic Dual Coordinate Ascent for Conditional Random Fields

Jul 10, 2018

Rémi Le Priol, Alexandre Piché, Simon Lacoste-Julien

Abstract: This work investigates the training of conditional random fields (CRFs) via the stochastic dual coordinate ascent (SDCA) algorithm of Shalev-Shwartz and Zhang (2016). SDCA enjoys a linear convergence rate and strong empirical performance for binary classification problems. However, it has never been used to train CRFs. Yet it benefits from an "exact" line search with a single marginalization oracle call, unlike previous approaches. In this paper, we adapt SDCA to train CRFs, and we enhance it with an adaptive non-uniform sampling strategy based on block duality gaps. We perform experiments on four standard sequence prediction tasks. SDCA demonstrates performance on par with the state of the art, and improves over it on three of the four datasets, which have in common the use of sparse features.

* Published as a conference paper at UAI 2018. 22 pages

Parametric Adversarial Divergences are Good Task Losses for Generative Modeling

Jun 27, 2018

Gabriel Huang, Hugo Berard, Ahmed Touati, Gauthier Gidel, Pascal Vincent, Simon Lacoste-Julien

Abstract: Generative modeling of high-dimensional data like images is a notoriously difficult and ill-defined problem. In particular, how to evaluate a learned generative model is unclear. In this position paper, we argue that adversarial learning, pioneered with generative adversarial networks (GANs), provides an interesting framework to implicitly define more meaningful task losses for generative modeling tasks, such as for generating "visually realistic" images. We refer to those task losses as parametric adversarial divergences and we give two main reasons why we think parametric divergences are good learning objectives for generative modeling. Additionally, we unify the processes of choosing a good structured loss (in structured prediction) and choosing a discriminator architecture (in generative modeling) using statistical decision theory; we are then able to formalize and quantify the intuition that "weaker" losses are easier to learn from, in a specific setting. Finally, we propose two new challenging tasks to evaluate parametric and nonparametric divergences: a qualitative task of generating very high-resolution digits, and a quantitative task of learning data that satisfies high-level algebraic constraints. We use two common divergences to train a generator and show that the parametric divergence outperforms the nonparametric divergence on both the qualitative and the quantitative task.

* 22 pages

Frank-Wolfe Splitting via Augmented Lagrangian Method

Apr 09, 2018

Gauthier Gidel, Fabian Pedregosa, Simon Lacoste-Julien

Abstract: Minimizing a function over an intersection of convex sets is an important task in optimization that is often much more challenging than minimizing it over each individual constraint set. While traditional methods such as Frank-Wolfe (FW) or proximal gradient descent assume access to a linear or quadratic oracle on the intersection, splitting techniques take advantage of the structure of each set, and only require access to the oracle on the individual constraints. In this work, we develop and analyze the Frank-Wolfe Augmented Lagrangian (FW-AL) algorithm, a method for minimizing a smooth function over convex compact sets related by a "linear consistency" constraint that only requires access to a linear minimization oracle over the individual constraints. It is based on the Augmented Lagrangian Method (ALM), also known as the Method of Multipliers, but unlike most existing splitting methods, it only requires access to linear (instead of quadratic) minimization oracles. We use recent advances in the analysis of Frank-Wolfe and the alternating direction method of multipliers algorithms to prove a sublinear convergence rate for FW-AL over general convex compact sets and a linear convergence rate for polytopes.

* Appears in: Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018). 30 pages

SEARNN: Training RNNs with Global-Local Losses

Mar 04, 2018

Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, Simon Lacoste-Julien

Abstract: We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the "learning to search" (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error. We first demonstrate improved performance over MLE on two different tasks: OCR and spelling correction. Then, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes. This allows us to validate the benefits of our approach on a machine translation task.

* Published as a conference paper at ICLR 2018, 16 pages

On Structured Prediction Theory with Calibrated Convex Surrogate Losses

Jan 29, 2018

Anton Osokin, Francis Bach, Simon Lacoste-Julien

Abstract: We provide novel theoretical insights on structured prediction in the context of efficient convex surrogate loss minimization with consistency guarantees. For any task loss, we construct a convex surrogate that can be optimized via stochastic gradient descent and we prove tight bounds on the so-called "calibration function" relating the excess surrogate risk to the actual risk. In contrast to prior related work, we carefully monitor the effect of the exponential number of classes in the learning guarantees as well as on the optimization complexity. As an interesting consequence, we formalize the intuition that some task losses make learning harder than others, and that the classical 0-1 loss is ill-suited for general structured prediction.

* Appears in: Advances in Neural Information Processing Systems 30 (NIPS 2017). 30 pages

Improved asynchronous parallel optimization analysis for stochastic incremental methods

Jan 12, 2018

Rémi Leblond, Fabian Pedregosa, Simon Lacoste-Julien

Abstract: As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. Unfortunately, conducting the theoretical analysis of asynchronous methods is difficult, notably due to the introduction of delay and inconsistency in inherently sequential algorithms. Handling these issues often requires resorting to simplifying but unrealistic assumptions. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. We are able to both remove problematic assumptions and obtain better theoretical results. Notably, we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedups as well as the hardware overhead. Finally, we investigate the overlap constant, an ill-understood but central quantity for the theoretical analysis of asynchronous parallel algorithms. We find that it encompasses much more complexity than suggested in previous work, and is often an order of magnitude bigger than traditionally thought.

* 67 pages

A3T: Adversarially Augmented Adversarial Training

Jan 12, 2018

Akram Erraqabi, Aristide Baratin, Yoshua Bengio, Simon Lacoste-Julien

Abstract: Recent research showed that deep neural networks are highly sensitive to so-called adversarial perturbations, which are tiny perturbations of the input data purposely designed to fool a machine learning classifier. Most classification models, including deep learning models, are highly vulnerable to adversarial attacks. In this work, we investigate a procedure to improve adversarial robustness of deep neural networks through enforcing representation invariance. The idea is to train the classifier jointly with a discriminator attached to one of its hidden layers and trained to filter the adversarial noise. We perform preliminary experiments to test the viability of the approach and to compare it to other standard adversarial training methods.

* Accepted for an oral presentation at the Machine Deception Workshop, NIPS 2017

ASAGA: Asynchronous Parallel SAGA

Nov 08, 2017

Rémi Leblond, Fabian Pedregosa, Simon Lacoste-Julien

Abstract: We describe ASAGA, an asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We thereby prove that ASAGA can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedup as well as the hardware overhead.

* Appears in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), 37 pages

Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization

Nov 05, 2017

Fabian Pedregosa, Rémi Leblond, Simon Lacoste-Julien

Abstract: Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance-reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.

* Appears in: Advances in Neural Information Processing Systems 30 (NIPS 2017). 28 pages

Joint Discovery of Object States and Manipulation Actions

Aug 28, 2017

Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

Abstract: Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations. We show that our joint formulation results in an improvement of object state discovery by action recognition and vice versa.

* Appears in: International Conference on Computer Vision 2017 (ICCV 2017). 15 pages
