Steven Van Canneyt

Representation learning for very short texts using weighted word embedding aggregation

Jul 02, 2016

Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt

Abstract: Short text messages such as tweets are very noisy and sparse in their use of vocabulary. Traditional textual representations, such as tf-idf, have difficulty grasping the semantic meaning of such texts, which is important in applications such as event detection, opinion mining, and news recommendation. We constructed a method based on semantic word embeddings and frequency information to arrive at low-dimensional representations for short texts designed to capture semantic similarity. For this purpose we designed a weight-based model and a learning procedure based on a novel median-based loss function. This paper discusses the details of our model and the optimization methods, together with the experimental results on both Wikipedia and Twitter data. We find that our method outperforms the baseline approaches in the experiments, and that it generalizes well on different word embeddings without retraining. Our method is therefore capable of retaining most of the semantic information in the text, and is applicable out-of-the-box.

* 8 pages, 3 figures, 2 tables, appears in Pattern Recognition Letters
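
As a rough illustration of what "weighted word embedding aggregation" can look like, the sketch below ranks the words of a short text by idf and takes a position-weighted average of their vectors. Everything in it is an assumption made for illustration: the function name `aggregate`, the toy two-dimensional embeddings, the idf values, and the hand-picked `position_weights`. The abstract says the weights are learned with a median-based loss; that training step is not reproduced here.

```python
def aggregate(tokens, embeddings, idf, position_weights):
    """Position-weighted average of word vectors, with words ranked by idf.

    Hypothetical helper: words with higher idf (rarer words) come first,
    so they receive the earlier entries of position_weights.
    """
    known = [t for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    ranked = sorted(known, key=lambda t: idf.get(t, 0.0), reverse=True)
    dim = len(embeddings[ranked[0]])
    total = [0.0] * dim
    for position, token in enumerate(ranked):
        # Texts longer than the weight list reuse the last weight.
        w = position_weights[min(position, len(position_weights) - 1)]
        for i, x in enumerate(embeddings[token]):
            total[i] += w * x
    return [x / len(ranked) for x in total]


# Toy data, invented for this example only.
embeddings = {"storm": [1.0, 0.0], "the": [0.5, 0.5], "coast": [0.0, 1.0]}
idf = {"storm": 3.0, "coast": 2.5, "the": 0.1}
position_weights = [1.0, 0.5, 0.1]  # hand-picked here, learned in the paper

vector = aggregate(["the", "storm"], embeddings, idf, position_weights)
```

With these weights the frequent word "the" contributes half as much as "storm", so the resulting vector stays close to the informative word.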

Learning Semantic Similarity for Very Short Texts

Dec 02, 2015

Cedric De Boom, Steven Van Canneyt, Steven Bohez, Thomas Demeester, Bart Dhoedt

Abstract: Leveraging data on social media, such as Twitter and Facebook, requires information retrieval algorithms to be able to relate very short text fragments to each other. Traditional text similarity methods such as tf-idf cosine similarity, based on word overlap, mostly fail to produce good results in this case, since word overlap is small or non-existent. Recently, distributed word representations, or word embeddings, have been shown to successfully allow words to match on the semantic level. In order to pair short text fragments, treated as a concatenation of separate words, an adequate distributed sentence representation is needed, which in the existing literature is often obtained by naively combining the individual word representations. We therefore investigated several text representations as a combination of word embeddings in the context of semantic pair matching. This paper investigates the effectiveness of several such naive techniques, as well as traditional tf-idf similarity, for fragments of different lengths. Our main contribution is a first step towards a hybrid method that combines the strength of dense distributed representations, as opposed to sparse term matching, with the strength of tf-idf based methods to automatically reduce the impact of less informative terms. Our new approach outperforms the existing techniques in a toy experimental set-up, leading to the conclusion that the combination of word embeddings and tf-idf information might lead to a better model for semantic content within very short text fragments.

* 6 pages, 5 figures, 3 tables, ReLSD workshop at ICDM 15
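
The abstract's point about word overlap can be seen in a small sketch. Two fragments that share no words get a tf-idf cosine similarity of zero, while averaging their word embeddings can still place them close together. The helper names, the toy two-dimensional embeddings, and the uniform idf values below are all assumptions for illustration; this is the naive averaging baseline, not the hybrid method the paper proposes.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector as a dict from term to weight."""
    counts = Counter(tokens)
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_sparse(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_embedding(tokens, embeddings):
    """Unweighted average of the vectors of all known tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]


def cosine_dense(u, v):
    """Cosine similarity between two dense list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data, invented for this example only.
embeddings = {"storm": [0.9, 0.1], "hurricane": [0.8, 0.2],
              "coast": [0.1, 0.9], "shore": [0.2, 0.8]}
idf = {word: 1.0 for word in embeddings}
first, second = ["storm", "coast"], ["hurricane", "shore"]

overlap_score = cosine_sparse(tfidf_vector(first, idf), tfidf_vector(second, idf))
embedding_score = cosine_dense(mean_embedding(first, embeddings),
                               mean_embedding(second, embeddings))
```

Here `overlap_score` is zero because the fragments have no term in common, whereas `embedding_score` is high because the toy vectors for "storm"/"hurricane" and "coast"/"shore" were chosen to be near each other.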
