Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Identity-sensitive Word Embedding through Heterogeneous Networks

Nov 29, 2016
Jian Tang, Meng Qu, Qiaozhu Mei

Most existing word embedding approaches do not distinguish the same words in different contexts, therefore ignoring their contextual meanings. As a result, the learned embeddings of these words are usually a mixture of multiple meanings. In this paper, we acknowledge multiple identities of the same word in different contexts and learn the \textbf{identity-sensitive} word embeddings. Based on an identity-labeled text corpora, a heterogeneous network of words and word identities is constructed to model different-levels of word co-occurrences. The heterogeneous network is further embedded into a low-dimensional space through a principled network embedding approach, through which we are able to obtain the embeddings of words and the embeddings of word identities. We study three different types of word identities including topics, sentiments and categories. Experimental results on real-world data sets show that the identity-sensitive word embeddings learned by our approach indeed capture different meanings of words and outperforms competitive methods on tasks including text classification and word similarity computation.

  Access Paper or Ask Questions

Is a picture worth a thousand words? A Deep Multi-Modal Fusion Architecture for Product Classification in e-commerce

Nov 29, 2016
Tom Zahavy, Alessandro Magnani, Abhinandan Krishnan, Shie Mannor

Classifying products into categories precisely and efficiently is a major challenge in modern e-commerce. The high traffic of new products uploaded daily and the dynamic nature of the categories raise the need for machine learning models that can reduce the cost and time of human editors. In this paper, we propose a decision level fusion approach for multi-modal product classification using text and image inputs. We train input specific state-of-the-art deep neural networks for each input source, show the potential of forging them together into a multi-modal architecture and train a novel policy network that learns to choose between them. Finally, we demonstrate that our multi-modal network improves the top-1 accuracy % over both networks on a real-world large-scale product classification dataset that we collected While we focus on image-text fusion that characterizes e-commerce domains, our algorithms can be easily applied to other modalities such as audio, video, physical sensors, etc.

  Access Paper or Ask Questions

A Theme-Rewriting Approach for Generating Algebra Word Problems

Oct 19, 2016
Rik Koncel-Kedziorski, Ioannis Konstas, Luke Zettlemoyer, Hannaneh Hajishirzi

Texts present coherent stories that have a particular theme or overall setting, for example science fiction or western. In this paper, we present a text generation method called {\it rewriting} that edits existing human-authored narratives to change their theme without changing the underlying story. We apply the approach to math word problems, where it might help students stay more engaged by quickly transforming all of their homework assignments to the theme of their favorite movie without changing the math concepts that are being taught. Our rewriting method uses a two-stage decoding process, which proposes new words from the target theme and scores the resulting stories according to a number of factors defining aspects of syntactic, semantic, and thematic coherence. Experiments demonstrate that the final stories typically represent the new theme well while still testing the original math concepts, outperforming a number of baselines. We also release a new dataset of human-authored rewrites of math word problems in several themes.

* To appear EMNLP 2016 

  Access Paper or Ask Questions

Using Semantic Similarity for Input Topic Identification in Crawling-based Web Application Testing

Aug 23, 2016
Jun-Wei Lin, Farn Wang

To automatically test web applications, crawling-based techniques are usually adopted to mine the behavior models, explore the state spaces or detect the violated invariants of the applications. However, in existing crawlers, rules for identifying the topics of input text fields, such as login ids, passwords, emails, dates and phone numbers, have to be manually configured. Moreover, the rules for one application are very often not suitable for another. In addition, when several rules conflict and match an input text field to more than one topics, it can be difficult to determine which rule suggests a better match. This paper presents a natural-language approach to automatically identify the topics of encountered input fields during crawling by semantically comparing their similarities with the input fields in labeled corpus. In our evaluation with 100 real-world forms, the proposed approach demonstrated comparable performance to the rule-based one. Our experiments also show that the accuracy of the rule-based approach can be improved by up to 19% when integrated with our approach.

* 11 pages 

  Access Paper or Ask Questions

Redundancy-free Verbalization of Individuals for Ontology Validation

Jul 24, 2016
E. V. Vinu, P Sreenivasa Kumar

We investigate the problem of verbalizing Web Ontology Language (OWL) axioms of domain ontologies in this paper. The existing approaches address the problem of fidelity of verbalized OWL texts to OWL semantics by exploring different ways of expressing the same OWL axiom in various linguistic forms. They also perform grouping and aggregating of the natural language (NL) sentences that are generated corresponding to each OWL statement into a comprehensible structure. However, no efforts have been taken to try out a semantic reduction at logical level to remove redundancies and repetitions, so that the reduced set of axioms can be used for generating a more meaningful and human-understandable (what we call redundancy-free) text. Our experiments show that, formal semantic reduction at logical level is very helpful to generate redundancy-free descriptions of ontology entities. In this paper, we particularly focus on generating descriptions of individuals of SHIQ based ontologies. The details of a case study are provided to support the usefulness of the redundancy-free NL descriptions of individuals, in knowledge validation application.

* Under review 

  Access Paper or Ask Questions

Generating Navigable Semantic Maps from Social Sciences Corpora

Jul 08, 2015
Thierry Poibeau, Pablo Ruiz

It is now commonplace to observe that we are facing a deluge of online information. Researchers have of course long acknowledged the potential value of this information since digital traces make it possible to directly observe, describe and analyze social facts, and above all the co-evolution of ideas and communities over time. However, most online information is expressed through text, which means it is not directly usable by machines, since computers require structured, organized and typed information in order to be able to manipulate it. Our goal is thus twofold: 1. Provide new natural language processing techniques aiming at automatically extracting relevant information from texts, especially in the context of social sciences, and connect these pieces of information so as to obtain relevant socio-semantic networks; 2. Provide new ways of exploring these socio-semantic networks, thanks to tools allowing one to dynamically navigate these networks, de-construct and re-construct them interactively, from different points of view following the needs expressed by domain experts.

* in Digital Humanities 2015, Jun 2015, Sydney, Australia. Actes de la Conf{\'e}rence Digital Humanities 2015. arXiv admin note: text overlap with arXiv:1406.4211 

  Access Paper or Ask Questions

Unveiling the relationship between complex networks metrics and word senses

Feb 18, 2013
Diego R. Amancio, Osvaldo N. Oliveira Jr., Luciano da F. Costa

The automatic disambiguation of word senses (i.e., the identification of which of the meanings is used in a given context for a word that has multiple meanings) is essential for such applications as machine translation and information retrieval, and represents a key step for developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion, but this does not apply to computers. In this paper we address the problem of Word Sense Disambiguation (WSD) by treating texts as complex networks, and show that word senses can be distinguished upon characterizing the local structure around ambiguous words. Our goal was not to obtain the best possible disambiguation system, but we nevertheless found that in half of the cases our approach outperforms traditional shallow methods. We show that the hierarchical connectivity and clustering of words are usually the most relevant features for WSD. The results reported here shine light on the relationship between semantic and structural parameters of complex networks. They also indicate that when combined with traditional techniques the complex network approach may be useful to enhance the discrimination of senses in large texts

* Europhysics Letters (2012) 98 18002 
* The Supplementary Information (SI) is available from 

  Access Paper or Ask Questions

A New Similarity Measure for Taxonomy Based on Edge Counting

Nov 20, 2012
Manjula Shenoy. K, K. C. Shet, U. Dinesh Acharya

This paper introduces a new similarity measure based on edge counting in a taxonomy like WorldNet or Ontology. Measurement of similarity between text segments or concepts is very useful for many applications like information retrieval, ontology matching, text mining, and question answering and so on. Several measures have been developed for measuring similarity between two concepts: out of these we see that the measure given by Wu and Palmer [1] is simple, and gives good performance. Our measure is based on their measure but strengthens it. Wu and Palmer [1] measure has a disadvantage that it does not consider how far the concepts are semantically. In our measure we include the shortest path between the concepts and the depth of whole taxonomy together with the distances used in Wu and Palmer [1]. Also the measure has following disadvantage i.e. in some situations, the similarity of two elements of an IS-A ontology contained in the neighborhood exceeds the similarity value of two elements contained in the same hierarchy. Our measure introduces a penalization factor for this case based upon shortest length between the concepts and depth of whole taxonomy.

  Access Paper or Ask Questions

Distant finetuning with discourse relations for stance classification

Apr 27, 2022
Lifeng Jin, Kun Xu, Linfeng Song, Dong Yu

Approaches for the stance classification task, an important task for understanding argumentation in debates and detecting fake news, have been relying on models which deal with individual debate topics. In this paper, in order to train a system independent from topics, we propose a new method to extract data with silver labels from raw text to finetune a model for stance classification. The extraction relies on specific discourse relation information, which is shown as a reliable and accurate source for providing stance information. We also propose a 3-stage training framework where the noisy level in the data used for finetuning decreases over different stages going from the most noisy to the least noisy. Detailed experiments show that the automatically annotated dataset as well as the 3-stage training help improve model performance in stance classification. Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater, which confirms the effectiveness of our approach.

* NLPCC 2021 

  Access Paper or Ask Questions

Automatic Detection of Expressed Emotion from Five-Minute Speech Samples: Challenges and Opportunities

Mar 30, 2022
Bahman Mirheidari, André Bittar, Nicholas Cummins, Johnny Downs, Helen L. Fisher, Heidi Christensen

We present a novel feasibility study on the automatic recognition of Expressed Emotion (EE), a family environment concept based on caregivers speaking freely about their relative/family member. We describe an automated approach for determining the \textit{degree of warmth}, a key component of EE, from acoustic and text features acquired from a sample of 37 recorded interviews. These recordings, collected over 20 years ago, are derived from a nationally representative birth cohort of 2,232 British twin children and were manually coded for EE. We outline the core steps of extracting usable information from recordings with highly variable audio quality and assess the efficacy of four machine learning approaches trained with different combinations of acoustic and text features. Despite the challenges of working with this legacy data, we demonstrated that the degree of warmth can be predicted with an $F_{1}$-score of \textbf{61.5\%}. In this paper, we summarise our learning and provide recommendations for future work using real-world speech samples.

* Submitted to Interspeech 2022 

  Access Paper or Ask Questions