Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Youssef Mroueh

IBM Research, USA

Learning Implicit Text Generation via Feature Matching

May 09, 2020

Inkit Padhi, Pierre Dognin, Ke Bai, Cicero Nogueira dos Santos, Vijil Chenthamarakshan, Youssef Mroueh, Payel Das

Figure 1 for Learning Implicit Text Generation via Feature Matching

Figure 2 for Learning Implicit Text Generation via Feature Matching

Figure 3 for Learning Implicit Text Generation via Feature Matching

Figure 4 for Learning Implicit Text Generation via Feature Matching

Abstract:Generative feature matching network (GFMN) is an approach for training implicit generative models for images by performing moment matching on features from pre-trained neural networks. In this paper, we present new GFMN formulations that are effective for sequential data. Our experimental results show the effectiveness of the proposed method, SeqGFMN, for three distinct generation tasks in English: unconditional text generation, class-conditional text generation, and unsupervised text style transfer. SeqGFMN is stable to train and outperforms various adversarial approaches for text generation and text style transfer.

* ACL 2020

Via

Access Paper or Ask Questions

Generative Modeling with Denoising Auto-Encoders and Langevin Sampling

Feb 17, 2020

Adam Block, Youssef Mroueh, Alexander Rakhlin

Abstract:We study convergence of a generative modeling method that first estimates the score function of the distribution using Denoising Auto-Encoders (DAE) or Denoising Score Matching (DSM) and then employs Langevin diffusion for sampling. We show that both DAE and DSM provide estimates of the score of the Gaussian smoothed population density, allowing us to apply the machinery of Empirical Processes. We overcome the challenge of relying only on $L^2$ bounds on the score estimation error and provide finite-sample bounds in the Wasserstein distance between the law of the population distribution and the law of this sampling scheme. We then apply our results to the homotopy method of arXiv:1907.05600 and provide theoretical justification for its empirical success.

* 22 pages

Via

Access Paper or Ask Questions

Improving Efficiency in Large-Scale Decentralized Distributed Training

Feb 04, 2020

Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das(+2 more)

Figure 1 for Improving Efficiency in Large-Scale Decentralized Distributed Training

Figure 2 for Improving Efficiency in Large-Scale Decentralized Distributed Training

Figure 3 for Improving Efficiency in Large-Scale Decentralized Distributed Training

Figure 4 for Improving Efficiency in Large-Scale Decentralized Distributed Training

Abstract:Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Parallel SGD (AD-PSGD) is a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One drawback of (A)D-PSGD is that the spectral gap of the mixing matrix decreases when the number of learners in the system increases, which hampers convergence. In this paper, we investigate techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost. We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task. On an IBM P9 supercomputer, our system is able to train an LSTM acoustic model in 2.28 hours with 7.5% WER on the Hub5-2000 Switchboard (SWB) test set and 13.3% WER on the CallHome (CH) test set using 64 V100 GPUs and in 1.98 hours with 7.7% WER on SWB and 13.3% WER on CH using 128 V100 GPUs, the fastest training time reported to date.

* 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2020) Oral

Via

Access Paper or Ask Questions

Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

Dec 26, 2019

Mingrui Liu, Youssef Mroueh, Jerret Ross, Wei Zhang, Xiaodong Cui, Payel Das, Tianbao Yang

Figure 1 for Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

Figure 2 for Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

Figure 3 for Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

Figure 4 for Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

Abstract:Adaptive gradient algorithms perform gradient-based updates using the history of gradients and are ubiquitous in training deep neural networks. While adaptive gradient methods theory is well understood for minimization problems, the underlying factors driving their empirical success in min-max problems such as GANs remain unclear. In this paper, we aim at bridging this gap from both theoretical and empirical perspectives. First, we analyze a variant of Optimistic Stochastic Gradient (OSG) proposed in~\citep{daskalakis2017training} for solving a class of non-convex non-concave min-max problem and establish $O(\epsilon^{-4})$ complexity for finding $\epsilon$-first-order stationary point, in which the algorithm only requires invoking one stochastic first-order oracle while enjoying state-of-the-art iteration complexity achieved by stochastic extragradient method by~\citep{iusem2017extragradient}. Then we propose an adaptive variant of OSG named Optimistic Adagrad (OAdagrad) and reveal an \emph{improved} adaptive complexity $\widetilde{O}\left(\epsilon^{-\frac{2}{1-\alpha}}\right)$~\footnote{Here $\widetilde{O}(\cdot)$ compresses a logarithmic factor of $\epsilon$.}, where $\alpha$ characterizes the growth rate of the cumulative stochastic gradient and $0\leq \alpha\leq 1/2$. To the best of our knowledge, this is the first work for establishing adaptive complexity in non-convex non-concave min-max optimization. Empirically, our experiments show that indeed adaptive gradient algorithms outperform their non-adaptive counterparts in GAN training. Moreover, this observation can be explained by the slow growth rate of the cumulative stochastic gradient, as observed empirically.

* Accepted by ICLR 2020

Via

Access Paper or Ask Questions

Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces

Nov 06, 2019

David Alvarez-Melis, Youssef Mroueh, Tommi S. Jaakkola

Figure 1 for Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces

Figure 2 for Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces

Figure 3 for Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces

Figure 4 for Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces

Abstract:This paper focuses on the problem of unsupervised alignment of hierarchical data such as ontologies or lexical databases. This is a problem that appears across areas, from natural language processing to bioinformatics, and is typically solved by appeal to outside knowledge bases and label-textual similarity. In contrast, we approach the problem from a purely geometric perspective: given only a vector-space representation of the items in the two hierarchies, we seek to infer correspondences across them. Our work derives from and interweaves hyperbolic-space representations for hierarchical data, on one hand, and unsupervised word-alignment methods, on the other. We first provide a set of negative results showing how and why Euclidean methods fail in this hyperbolic setting. We then propose a novel approach based on optimal transport over hyperbolic spaces, and show that it outperforms standard embedding alignment techniques in various experiments on cross-lingual WordNet alignment and ontology matching tasks.

Via

Access Paper or Ask Questions

Sobolev Independence Criterion

Oct 31, 2019

Youssef Mroueh, Tom Sercu, Mattia Rigotti, Inkit Padhi, Cicero Dos Santos

Figure 1 for Sobolev Independence Criterion

Figure 2 for Sobolev Independence Criterion

Figure 3 for Sobolev Independence Criterion

Figure 4 for Sobolev Independence Criterion

Abstract:We propose the Sobolev Independence Criterion (SIC), an interpretable dependency measure between a high dimensional random variable X and a response variable Y . SIC decomposes to the sum of feature importance scores and hence can be used for nonlinear feature selection. SIC can be seen as a gradient regularized Integral Probability Metric (IPM) between the joint distribution of the two random variables and the product of their marginals. We use sparsity inducing gradient penalties to promote input sparsity of the critic of the IPM. In the kernel version we show that SIC can be cast as a convex optimization problem by introducing auxiliary variables that play an important role in feature selection as they are normalized feature importance scores. We then present a neural version of SIC where the critic is parameterized as a homogeneous neural network, improving its representation power as well as its interpretability. We conduct experiments validating SIC for feature selection in synthetic and real-world experiments. We show that SIC enables reliable and interpretable discoveries, when used in conjunction with the holdout randomization test and knockoffs to control the False Discovery Rate. Code is available at http://github.com/ibm/sic.

* NeurIPS 2019

Via

Access Paper or Ask Questions

Decentralized Parallel Algorithm for Training Generative Adversarial Nets

Oct 30, 2019

Mingrui Liu, Youssef Mroueh, Wei Zhang, Xiaodong Cui, Jerret Ross, Tianbao Yang, Payel Das

Figure 1 for Decentralized Parallel Algorithm for Training Generative Adversarial Nets

Figure 2 for Decentralized Parallel Algorithm for Training Generative Adversarial Nets

Abstract:Generative Adversarial Networks (GANs) are powerful class of generative models in the deep learning community. Current practice on large-scale GAN training \cite{brock2018large} utilizes large models and distributed large-batch training strategies, and is implemented on deep learning frameworks (e.g., TensorFlow, PyTorch, etc.) designed in a centralized manner. In the centralized network topology, every worker needs to communicate with the central node. However, when the network bandwidth is low or network latency is high, the performance would be significantly degraded. Despite recent progress on decentralized algorithms for training deep neural networks, it remains unclear whether it is possible to train GANs in a decentralized manner. In this paper, we design a decentralized algorithm for solving a class of non-convex non-concave min-max problem with provable guarantee. Experimental results on GANs demonstrate the effectiveness of the proposed algorithm.

* Accepted by NeurIPS Smooth Games Optimization and Machine Learning Workshop: bridging game theory and deep learning, 2019

Via

Access Paper or Ask Questions

Wasserstein Style Transfer

May 30, 2019

Youssef Mroueh

Abstract:We propose Gaussian optimal transport for Image style transfer in an Encoder/Decoder framework. Optimal transport for Gaussian measures has closed forms Monge mappings from source to target distributions. Moreover interpolates between a content and a style image can be seen as geodesics in the Wasserstein Geometry. Using this insight, we show how to mix different target styles , using Wasserstein barycenter of Gaussian measures. Since Gaussians are closed under Wasserstein barycenter, this allows us a simple style transfer and style mixing and interpolation. Moreover we show how mixing different styles can be achieved using other geodesic metrics between gaussians such as the Fisher Rao metric, while the transport of the content to the new interpolate style is still performed with Gaussian OT maps. Our simple methodology allows to generate new stylized content interpolating between many artistic styles. The metric used in the interpolation results in different stylizations.

Via

Access Paper or Ask Questions

Learning Implicit Generative Models by Matching Perceptual Features

Apr 04, 2019

Cicero Nogueira dos Santos, Youssef Mroueh, Inkit Padhi, Pierre Dognin

Figure 1 for Learning Implicit Generative Models by Matching Perceptual Features

Figure 2 for Learning Implicit Generative Models by Matching Perceptual Features

Figure 3 for Learning Implicit Generative Models by Matching Perceptual Features

Figure 4 for Learning Implicit Generative Models by Matching Perceptual Features

Abstract:Perceptual features (PFs) have been used with great success in tasks such as transfer learning, style transfer, and super-resolution. However, the efficacy of PFs as key source of information for learning generative models is not well studied. We investigate here the use of PFs in the context of learning implicit generative models through moment matching (MM). More specifically, we propose a new effective MM approach that learns implicit generative models by performing mean and covariance matching of features extracted from pretrained ConvNets. Our proposed approach improves upon existing MM methods by: (1) breaking away from the problematic min/max game of adversarial learning; (2) avoiding online learning of kernel functions; and (3) being efficient with respect to both number of used moments and required minibatch size. Our experimental results demonstrate that, due to the expressiveness of PFs from pretrained deep ConvNets, our method achieves state-of-the-art results for challenging benchmarks.

* 16 pages

Via

Access Paper or Ask Questions

Implicit Kernel Learning

Feb 26, 2019

Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, Barnabás Póczos

Abstract:Kernels are powerful and versatile tools in machine learning and statistics. Although the notion of universal kernels and characteristic kernels has been studied, kernel selection still greatly influences the empirical performance. While learning the kernel in a data driven way has been investigated, in this paper we explore learning the spectral distribution of kernel via implicit generative models parametrized by deep neural networks. We called our method Implicit Kernel Learning (IKL). The proposed framework is simple to train and inference is performed via sampling random Fourier features. We investigate two applications of the proposed IKL as examples, including generative adversarial networks with MMD (MMD GAN) and standard supervised learning. Empirically, MMD GAN with IKL outperforms vanilla predefined kernels on both image and text generation benchmarks; using IKL with Random Kitchen Sinks also leads to substantial improvement over existing state-of-the-art kernel learning algorithms on popular supervised learning benchmarks. Theory and conditions for using IKL in both applications are also studied as well as connections to previous state-of-the-art methods.

* In the Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019)

Via

Access Paper or Ask Questions