Chris Dyer

Sparse Overcomplete Word Vector Representations

Jun 05, 2015

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, Noah Smith

Abstract: Current distributed representations of words show little resemblance to theories of lexical semantics. The former are dense and uninterpretable, the latter largely based on familiar, discrete classes (e.g., supersenses) and relations (e.g., synonymy and hypernymy). We propose methods that transform word vectors into sparse (and optionally binary) vectors. The resulting representations are more similar to the interpretable features typically used in NLP, though they are discovered automatically from raw corpora. Because the vectors are highly sparse, they are computationally easy to work with. Most importantly, we find that they outperform the original vectors on benchmark tasks.

* Proceedings of ACL 2015

Transition-Based Dependency Parsing with Stack Long Short-Term Memory

May 29, 2015

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith

Abstract: We propose a technique for learning representations of parser states in transition-based dependency parsers. Our primary innovation is a new control structure for sequence-to-sequence neural networks: the stack LSTM. Like the conventional stack data structures used in transition-based parsing, elements can be pushed to or popped from the top of the stack in constant time, but, in addition, an LSTM maintains a continuous space embedding of the stack contents. This lets us formulate an efficient parsing model that captures three facets of a parser's state: (i) unbounded look-ahead into the buffer of incoming words, (ii) the complete history of actions taken by the parser, and (iii) the complete contents of the stack of partially built tree fragments, including their internal structures. Standard backpropagation techniques are used for training and yield state-of-the-art parsing performance.

* Proceedings of ACL 2015
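
The constant-time push and pop with a running summary described in the abstract can be illustrated with a small sketch. This is not the authors' model: the class name `SummaryStack` is hypothetical, and a fixed tanh update stands in for a trained LSTM cell. The point is only that keeping one summary vector per stack depth lets a pop restore the previous summary without recomputation.

```python
import math


class SummaryStack:
    """Stack that stores one summary vector per depth, so pop() is O(1)."""

    def __init__(self, dim):
        self.items = []
        # states[d] summarizes the bottom d items; states[0] is the empty stack.
        self.states = [[0.0] * dim]

    def push(self, x):
        prev = self.states[-1]
        # Stand-in recurrent update (assumption): a real stack LSTM would
        # apply a learned LSTM cell to (prev, x) here.
        new_state = [math.tanh(0.5 * p + 0.5 * xi) for p, xi in zip(prev, x)]
        self.states.append(new_state)
        self.items.append(x)

    def pop(self):
        if not self.items:
            raise IndexError("pop from empty stack")
        # Dropping the top state exposes the summary of the remaining items.
        self.states.pop()
        return self.items.pop()

    def summary(self):
        return self.states[-1]


stack = SummaryStack(dim=2)
stack.push([1.0, 0.0])
after_first = stack.summary()
stack.push([0.0, 1.0])
stack.pop()
# The summary is again the one computed after the first push.
```

Because earlier states are never overwritten, the summary after a pop is exactly the summary that existed before the matching push.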

Unsupervised POS Induction with Word Embeddings

Mar 23, 2015

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, Lori Levin

Abstract: Unsupervised word embeddings have been shown to be valuable as features in supervised learning problems; however, their role in unsupervised problems has been less thoroughly explored. In this paper, we show that embeddings can likewise add value to the problem of unsupervised POS induction. In two representative models of POS induction, we replace multinomial distributions over the vocabulary with multivariate Gaussian distributions over word embeddings and observe consistent improvements in eight languages. We also analyze the effect of various choices while inducing word embeddings on "downstream" POS induction results.

* NAACL 2015
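
To make the substitution in the abstract concrete, the sketch below scores a word's embedding under a per-tag Gaussian instead of looking up a per-tag multinomial probability. It assumes a diagonal covariance for brevity (the abstract only says "multivariate Gaussian"), and the tag names, means, and variances are hypothetical values chosen for illustration.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of vector x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Hypothetical 2-dimensional emission parameters: tag -> (mean, variance).
tag_params = {
    "NOUN": ([1.0, 0.0], [0.5, 0.5]),
    "VERB": ([-1.0, 0.5], [0.5, 0.5]),
}

embedding = [0.9, 0.1]  # hypothetical embedding of some word
scores = {
    tag: diag_gaussian_logpdf(embedding, mean, var)
    for tag, (mean, var) in tag_params.items()
}
best_tag = max(scores, key=scores.get)
```

In an induction model these log densities would play the role of emission scores, so words with nearby embeddings receive similar scores under every tag.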

Retrofitting Word Vectors to Semantic Lexicons

Mar 22, 2015

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, Noah A. Smith

Abstract: Vector space word representations are learned from distributional information of words in large corpora. Although such statistics are semantically informative, they disregard the valuable information that is contained in semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database. This paper proposes a method for refining vector space representations using relational information from semantic lexicons by encouraging linked words to have similar vector representations, and it makes no assumptions about how the input vectors were constructed. Evaluated on a battery of standard lexical semantic evaluation tasks in several languages, we obtain substantial improvements starting with a variety of word vector models. Our refinement method outperforms prior techniques for incorporating semantic lexicons into the word vector training algorithms.

* Proceedings of NAACL 2015
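
One simple way to encourage linked words to have similar vectors, while staying close to the input vectors, is iterative neighbor averaging. The sketch below is an illustration under assumptions, not a statement of the paper's exact procedure: the function name `retrofit`, the weights (`alpha` on the original vector, `1/degree` on each neighbor), and the toy lexicon are all chosen here for the example.

```python
def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    """Pull each word toward its lexicon neighbors and its original vector.

    vectors: dict mapping word -> list of floats (left unmodified)
    lexicon: dict mapping word -> list of linked words
    """
    new = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, neighbors in lexicon.items():
            linked = [n for n in neighbors if n in new]
            if word not in new or not linked:
                continue
            beta = 1.0 / len(linked)  # assumed per-neighbor weight
            total_weight = alpha + beta * len(linked)
            new[word] = [
                (alpha * vectors[word][d] + sum(beta * new[n][d] for n in linked))
                / total_weight
                for d in range(len(vectors[word]))
            ]
    return new


# Toy input: "happy" and "glad" are linked, "car" has no lexicon entry.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "car": [5.0, 5.0]}
lexicon = {"happy": ["glad"], "glad": ["happy"]}
refined = retrofit(vectors, lexicon)
```

The update only reads the input vectors, so it can be applied to embeddings from any model; words without lexicon links keep their original vectors.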

Conditional Random Field Autoencoders for Unsupervised Structured Prediction

Nov 10, 2014

Waleed Ammar, Chris Dyer, Noah A. Smith

Abstract: We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field. Then a reconstruction of the input is (re)generated, conditional on the latent structure, using models for which maximum likelihood estimation has a closed form. Our autoencoder formulation enables efficient learning without making unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate insightful connections to traditional autoencoders, posterior regularization, and multi-view learning. We show competitive results with instantiations of the model for two canonical NLP tasks, part-of-speech induction and bitext word alignment, and show that training our model can be substantially more efficient than comparable feature-rich baselines.

Learning Word Representations with Hierarchical Sparse Coding

Nov 06, 2014

Dani Yogatama, Manaal Faruqui, Chris Dyer, Noah A. Smith

Abstract: We propose a new method for learning word representations using hierarchical regularization in sparse coding inspired by the linguistic study of word meanings. We show an efficient learning algorithm based on stochastic proximal methods that is significantly faster than previous approaches, making it possible to perform hierarchical sparse coding on a corpus of billions of word tokens. Experiments on various benchmark tasks (word similarity ranking, analogies, sentence completion, and sentiment analysis) demonstrate that the method outperforms or is competitive with state-of-the-art methods. Our word representations are available at http://www.ark.cs.cmu.edu/dyogatam/wordvecs/.

Notes on Noise Contrastive Estimation and Negative Sampling

Oct 30, 2014

Chris Dyer

Abstract: Estimating the parameters of probabilistic models of language such as maxent models and probabilistic neural models is computationally difficult since it involves evaluating partition functions by summing over an entire vocabulary, which may be millions of word types in size. Two closely related strategies, noise contrastive estimation (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013; Vaswani et al., 2013) and negative sampling (Mikolov et al., 2012; Goldberg and Levy, 2014), have emerged as popular solutions to this computational problem, but some confusion remains as to which is more appropriate and when. This document explicates their relationships to each other and to other estimation techniques. The analysis shows that, although they are superficially similar, NCE is a general parameter estimation technique that is asymptotically unbiased, while negative sampling is best understood as a family of binary classification models that are useful for learning word representations but not as a general-purpose estimator.

* 4 pages
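
The superficial similarity mentioned in the abstract can be seen by comparing the binary "is this word real data or noise?" probability each method assigns. The sketch below uses the standard textbook forms as an assumption about the setup: `u` is an unnormalized model score for a word in context, `q` is the noise distribution's probability of that word, and `k` is the number of noise samples per data sample. The function names are hypothetical.

```python
import math


def nce_posterior(u, k, q):
    """P(real data | word) under NCE: the noise term k * q is accounted for."""
    return u / (u + k * q)


def negative_sampling_posterior(u):
    """P(real data | word) under negative sampling: noise term fixed at 1."""
    return u / (u + 1.0)


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


score = 0.7  # hypothetical log-score s, so u = exp(s)
u = math.exp(score)

# Negative sampling is a logistic classifier on the raw score.
ns = negative_sampling_posterior(u)

# The two coincide only in the special case k * q == 1
# (for example, a uniform q over a vocabulary of size k).
nce_special_case = nce_posterior(u, k=4, q=0.25)
nce_general = nce_posterior(u, k=4, q=0.01)
```

When `k * q` differs from 1 the two classifiers disagree, which is one way to see why only the NCE form carries the information needed to recover a properly normalized model.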

Language Modeling with Power Low Rank Ensembles

Oct 03, 2014

Ankur P. Parikh, Avneesh Saluja, Chris Dyer, Eric P. Xing

Abstract: We present power low rank ensembles (PLRE), a flexible framework for n-gram language modeling where ensembles of low rank matrices and tensors are used to obtain smoothed probability estimates of words in context. Our method can be understood as a generalization of n-gram modeling to non-integer n, and includes standard techniques such as absolute discounting and Kneser-Ney smoothing as special cases. PLRE training is efficient and our approach outperforms state-of-the-art modified Kneser-Ney baselines in terms of perplexity on large corpora as well as on BLEU score in a downstream machine translation task.

Predicting the NFL using Twitter

Oct 25, 2013

Shiladitya Sinha, Chris Dyer, Kevin Gimpel, Noah A. Smith

Abstract: We study the relationship between social media output and National Football League (NFL) games, using a dataset containing messages from Twitter and NFL game statistics. Specifically, we consider tweets pertaining to specific teams and games in the NFL season and use them alongside statistical game data to build predictive models for future game outcomes (which team will win?) and sports betting outcomes (which team will win with the point spread? will the total points be over/under the line?). We experiment with several feature sets and find that simple features using large volumes of tweets can match or exceed the performance of more traditional features that use game statistics.

* Presented at ECML/PKDD 2013 Workshop on Machine Learning and Data Mining for Sports Analytics

Minimum Error Rate Training and the Convex Hull Semiring

Jul 13, 2013

Chris Dyer

Abstract: We describe the line search used in the minimum error rate training algorithm MERT as the "inside score" of a weighted proof forest under a semiring defined in terms of well-understood operations from computational geometry. This conception leads to a straightforward complexity analysis of the dynamic programming MERT algorithms of Macherey et al. (2008) and Kumar et al. (2009) and practical approaches to implementation.
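
The geometric object behind the MERT line search can be sketched for the simplest case, a flat list of hypotheses. Along a search direction, each hypothesis's model score is a line in the step size, and the hypotheses that can ever be the model's top choice are the ones on the upper envelope of those lines. The code below is a minimal illustration of that envelope computation only; it is not the paper's semiring over proof forests, and the function name and toy lines are assumptions.

```python
def upper_envelope(lines):
    """Return the (slope, intercept) lines on the upper envelope, by slope.

    Each line gives a hypothesis's score as slope * step + intercept.
    """
    # Among lines with equal slope, only the highest intercept can win.
    best = {}
    for slope, intercept in lines:
        if slope not in best or intercept > best[slope]:
            best[slope] = intercept

    hull = []
    for m, b in sorted(best.items()):
        while len(hull) >= 2:
            (m1, b1), (m2, b2) = hull[-2], hull[-1]
            # The last line is never strictly on top if the new line overtakes
            # hull[-2] no later than the last line does (cross-multiplied
            # comparison of the two intersection points; slopes increase).
            if (b1 - b) * (m2 - m1) <= (b1 - b2) * (m - m1):
                hull.pop()
            else:
                break
        hull.append((m, b))
    return hull


# y = -x and y = x dominate y = -1 everywhere, so it is dropped.
dominated = upper_envelope([(-1.0, 0.0), (0.0, -1.0), (1.0, 0.0)])

# y = 1 lies above both other lines for -1 < x < 1, so it is kept.
kept = upper_envelope([(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0)])
```

The breakpoints between consecutive envelope lines are the only step sizes at which the top hypothesis changes, so the error count only needs to be evaluated once per interval.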
