Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

On the "Calligraphy" of Books

May 29, 2017
Vanessa Q. Marinho, Henrique F. de Arruda, Thales S. Lima, Luciano F. Costa, Diego R. Amancio

Authorship attribution is a natural language processing task that has been widely studied, often by considering small order statistics. In this paper, we explore a complex network approach to assign the authorship of texts based on their mesoscopic representation, in an attempt to capture the flow of the narrative. Indeed, as reported in this work, such an approach allowed the identification of the dominant narrative structure of the studied authors. This has been achieved due to the ability of the mesoscopic approach to take into account relationships between different, not necessarily adjacent, parts of the text, which is able to capture the story flow. The potential of the proposed approach has been illustrated through principal component analysis, a comparison with the chance baseline method, and network visualization. Such visualizations reveal individual characteristics of the authors, which can be understood as a kind of calligraphy.

* TextGraphs ACL 2017 (to appear) 

  Access Paper or Ask Questions

Nonparametric Bayesian Topic Modelling with the Hierarchical Pitman-Yor Processes

Sep 22, 2016
Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du

The Dirichlet process and its extension, the Pitman-Yor process, are stochastic processes that take probability distributions as a parameter. These processes can be stacked up to form a hierarchical nonparametric Bayesian model. In this article, we present efficient methods for the use of these processes in this hierarchical context, and apply them to latent variable models for text analytics. In particular, we propose a general framework for designing these Bayesian models, which are called topic models in the computer science community. We then propose a specific nonparametric Bayesian topic model for modelling text from social media. We focus on tweets (posts on Twitter) in this article due to their ease of access. We find that our nonparametric model performs better than existing parametric models in both goodness of fit and real world applications.

* International Journal of Approximate Reasoning, Volume 78, pp. 172-191. Elsevier. 2016 
* Preprint for International Journal of Approximate Reasoning 

  Access Paper or Ask Questions

Regular expressions for decoding of neural network outputs

Feb 22, 2016
Tobias Strauß, Gundram Leifert, Tobias Grüning, Roger Labahn

This article proposes a convenient tool for decoding the output of neural networks trained by Connectionist Temporal Classification (CTC) for handwritten text recognition. We use regular expressions to describe the complex structures expected in the writing. The corresponding finite automata are employed to build a decoder. We analyze theoretically which calculations are relevant and which can be avoided. A great speed-up results from an approximation. We conclude that the approximation most likely fails if the regular expression does not match the ground truth which is not harmful for many applications since the low probability will be even underestimated. The proposed decoder is very efficient compared to other decoding methods. The variety of applications reaches from information retrieval to full text recognition. We refer to applications where we integrated the proposed decoder successfully.

* 21 pages, 8 (+2) figures, 2 tables 

  Access Paper or Ask Questions

Model Selection for Topic Models via Spectral Decomposition

Feb 17, 2015
Dehua Cheng, Xinran He, Yan Liu

Topic models have achieved significant successes in analyzing large-scale text corpus. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following recent advances in topic model inference via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. Under mild conditions, we derive the upper bound and lower bound on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are accurate and tight. Furthermore, using Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis for other latent models.

* accepted in AISTATS 2015 

  Access Paper or Ask Questions

FrameNet CNL: a Knowledge Representation and Information Extraction Language

Jun 10, 2014
Guntis Barzdins

The paper presents a FrameNet-based information extraction and knowledge representation framework, called FrameNet-CNL. The framework is used on natural language documents and represents the extracted knowledge in a tailor-made Frame-ontology from which unambiguous FrameNet-CNL paraphrase text can be generated automatically in multiple languages. This approach brings together the fields of information extraction and CNL, because a source text can be considered belonging to FrameNet-CNL, if information extraction parser produces the correct knowledge representation as a result. We describe a state-of-the-art information extraction parser used by a national news agency and speculate that FrameNet-CNL eventually could shape the natural language subset used for writing the newswire articles.

* CNL-2014 camera-ready version. The final publication is available at 

  Access Paper or Ask Questions

Language Invariant Properties in Natural Language Processing

Oct 01, 2021
Federico Bianchi, Debora Nozza, Dirk Hovy

Meaning is context-dependent, but many properties of language (should) remain the same even if we transform the context. For example, sentiment, entailment, or speaker properties should be the same in a translation and original of a text. We introduce language invariant properties: i.e., properties that should not change when we transform text, and how they can be used to quantitatively evaluate the robustness of transformation algorithms. We use translation and paraphrasing as transformation examples, but our findings apply more broadly to any transformation. Our results indicate that many NLP transformations change properties like author characteristics, i.e., make them sound more male. We believe that studying these properties will allow NLP to address both social factors and pragmatic aspects of language. We also release an application suite that can be used to evaluate the invariance of transformation applications.

  Access Paper or Ask Questions

nmT5 -- Is parallel data still relevant for pre-training massively multilingual language models?

Jun 03, 2021
Mihir Kale, Aditya Siddhant, Noah Constant, Melvin Johnson, Rami Al-Rfou, Linting Xue

Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.

* Accepted at ACL-IJCNLP 2021 

  Access Paper or Ask Questions

EmotionGIF-IITP-AINLPML: Ensemble-based Automated Deep Neural System for predicting category(ies) of a GIF response

Dec 23, 2020
Soumitra Ghosh, Arkaprava Roy, Asif Ekbal, Pushpak Bhattacharyya

In this paper, we describe the systems submitted by our IITP-AINLPML team in the shared task of SocialNLP 2020, EmotionGIF 2020, on predicting the category(ies) of a GIF response for a given unlabelled tweet. For the round 1 phase of the task, we propose an attention-based Bi-directional GRU network trained on both the tweet (text) and their replies (text wherever available) and the given category(ies) for its GIF response. In the round 2 phase, we build several deep neural-based classifiers for the task and report the final predictions through a majority voting based ensemble technique. Our proposed models attain the best Mean Recall (MR) scores of 52.92% and 53.80% in round 1 and round 2, respectively.

* EmotionGIF 2020, the shared task of SocialNLP 2020 (in conjunction with ACL 2020) 

  Access Paper or Ask Questions

NLNDE: Enhancing Neural Sequence Taggers with Attention and Noisy Channel for Robust Pharmacological Entity Detection

Jul 02, 2020
Lukas Lange, Heike Adel, Jannik Strötgen

Named entity recognition has been extensively studied on English news texts. However, the transfer to other domains and languages is still a challenging problem. In this paper, we describe the system with which we participated in the first subtrack of the PharmaCoNER competition of the BioNLP Open Shared Tasks 2019. Aiming at pharmacological entity detection in Spanish texts, the task provides a non-standard domain and language setting. However, we propose an architecture that requires neither language nor domain expertise. We treat the task as a sequence labeling task and experiment with attention-based embedding selection and the training on automatically annotated data to further improve our system's performance. Our system achieves promising results, especially by combining the different techniques, and reaches up to 88.6% F1 in the competition.

* Published at [email protected] 2019 

  Access Paper or Ask Questions

Towards Better Graph Representation: Two-Branch Collaborative Graph Neural Networks for Multimodal Marketing Intention Detection

May 22, 2020
Lu Zhang, Jian Zhang, Zhibin Li, Jingsong Xu

Inspired by the fact that spreading and collecting information through the Internet becomes the norm, more and more people choose to post for-profit contents (images and texts) in social networks. Due to the difficulty of network censors, malicious marketing may be capable of harming society. Therefore, it is meaningful to detect marketing intentions online automatically. However, gaps between multimodal data make it difficult to fuse images and texts for content marketing detection. To this end, this paper proposes Two-Branch Collaborative Graph Neural Networks to collaboratively represent multimodal data by Graph Convolution Networks (GCNs) in an end-to-end fashion. Experimental results demonstrate that our proposed method achieves superior graph classification performance for marketing intention detection.

* We want to withdraw this paper. There are some insufficient explanations and bad figure presentation 

  Access Paper or Ask Questions