"Text": models, code, and papers

What If We Simply Swap the Two Text Fragments? A Straightforward yet Effective Way to Test the Robustness of Methods to Confounding Signals in Nature Language Inference Tasks

Sep 07, 2018
Haohan Wang, Da Sun, Eric P. Xing

Natural language inference (NLI) is the predictive task of determining the inference relationship between a pair of natural language sentences. With the increasing popularity of NLI, many state-of-the-art predictive models have been proposed with impressive performance. However, several works have noticed statistical irregularities in the collected NLI data sets that may result in over-estimated performance of these models, and have proposed remedies. In this paper, we further investigate these statistical irregularities, which we refer to as confounding factors, in the NLI data sets. With the belief that some NLI labels should be preserved under swapping operations, we propose a simple yet effective way (swapping the two text fragments) of evaluating NLI predictive models that naturally mitigates the observed problems. Further, we continue to train the predictive models in our swapping manner and propose to use the deviation of a model's evaluation performance under different percentages of swapped training text fragments to describe the robustness of a predictive model. Our evaluation metrics lead to some interesting insights into recently published NLI methods. Finally, we also apply the swapping operation to NLI models to see the effectiveness of this straightforward method in mitigating the confounding factor problems in training generic sentence embeddings for other NLP transfer tasks.

* 8 pages 
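
The swap test described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the names `swap_pair`, `swap_gap`, and the toy predictors are hypothetical, and the choice of which labels count as swap-invariant (`contradiction` and `neutral` here) is an assumption made for illustration.

```python
from typing import Callable, FrozenSet, List, Tuple

Example = Tuple[str, str, str]  # (premise, hypothesis, gold label)
Predictor = Callable[[str, str], str]

# Assumption: these labels are treated as unchanged when the pair is swapped.
SWAP_INVARIANT: FrozenSet[str] = frozenset({"contradiction", "neutral"})


def swap_pair(example: Example) -> Example:
    """Exchange premise and hypothesis, keeping the gold label."""
    premise, hypothesis, label = example
    return (hypothesis, premise, label)


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for p, h, y in examples if predict(p, h) == y)
    return correct / len(examples)


def swap_gap(predict: Predictor, examples: List[Example],
             invariant: FrozenSet[str] = SWAP_INVARIANT) -> float:
    """Accuracy on original pairs minus accuracy on swapped pairs,
    restricted to examples whose label is assumed swap-invariant."""
    kept = [ex for ex in examples if ex[2] in invariant]
    swapped = [swap_pair(ex) for ex in kept]
    return accuracy(predict, kept) - accuracy(predict, swapped)


# Toy data and toy predictors, for illustration only.
TOY_DATA: List[Example] = [
    ("A man sleeps", "A man is not asleep", "contradiction"),
    ("A dog runs", "A cat sits", "neutral"),
]


def hypothesis_only(premise: str, hypothesis: str) -> str:
    """Looks only at the hypothesis, mimicking a confounded model."""
    return "contradiction" if "not" in hypothesis.split() else "neutral"


def symmetric(premise: str, hypothesis: str) -> str:
    """Uses both fragments symmetrically, so swapping changes nothing."""
    words = (premise + " " + hypothesis).split()
    return "contradiction" if "not" in words else "neutral"
```

A large gap suggests the predictor relies on a cue tied to one position in the pair; the symmetric toy predictor shows no gap on the same data.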

About Norms and Causes

Jul 18, 2006
Daniel Kayser, Farid Nouioua

Knowing the norms of a domain is crucial, but there exists no repository of norms. We propose a method to extract them from texts: texts generally do not describe a norm, but rather how a state-of-affairs differs from it. Answers concerning the cause of the state-of-affairs described often reveal the implicit norm. We apply this idea to the domain of driving, and validate it by designing algorithms that identify, in a text, the "basic" norms to which it refers implicitly.

* The 17th FLAIRS'04 Conference (2004) 502-507 

Textual Data Distributions: Kullback Leibler Textual Distributions Contrasts on GPT-2 Generated Texts, with Supervised, Unsupervised Learning on Vaccine & Market Topics & Sentiment

Jun 15, 2021
Jim Samuel, Ratnakar Palle, Eduardo Correa Soares

Efficient textual data distributions (TDD) alignment and generation are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study focuses on addressing a segment of the broader problem described above by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment and (ii) sentiment alignment. Furthermore, we use multiple text generation methods, including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally, we develop a unique process-driven variation of the Kullback-Leibler divergence (KLD) application to TDD, named KL Textual Distributions Contrasts (KL-TDC), to identify the alignment of machine-generated textual corpora with naturally occurring textual corpora. This study thus identifies a unique approach for generating and validating TDD by topic and sentiment, which can be used to help address sparse data problems and other research, practice and classroom situations in need of artificially generated topic- or sentiment-aligned textual data.
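
For readers unfamiliar with the underlying measure, the sketch below computes plain Kullback-Leibler divergence between two smoothed unigram distributions. It is not the KL-TDC procedure itself, which the abstract does not specify; the function names and the add-alpha smoothing are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def unigram_dist(tokens: List[str], vocab: Iterable[str],
                 alpha: float = 1.0) -> Dict[str, float]:
    """Add-alpha smoothed unigram distribution over a shared vocabulary.

    Smoothing keeps every probability positive, so the divergence is finite.
    """
    vocab = list(vocab)
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; p and q must share the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def corpus_contrast(natural: List[str], generated: List[str]) -> float:
    """Divergence of a generated corpus from a natural one (toy version)."""
    vocab = sorted(set(natural) | set(generated))
    return kl_divergence(unigram_dist(natural, vocab),
                         unigram_dist(generated, vocab))
```

For example, `corpus_contrast("the market is up".split(), "the market is down".split())` returns a small positive value, and identical corpora give zero.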

Deep Learning Neural Networks for Emotion Classification from Text: Enhanced Leaky Rectified Linear Unit Activation and Weighted Loss

Mar 04, 2022
Hui Yang, Abeer Alsadoon, P. W. C. Prasad, Thair Al-Dala'in, Tarik A. Rashid, Angelika Maag, Omar Hisham Alsadoon

Accurate emotion classification for online reviews is vital for business organizations to gain deeper insights into markets. Although deep learning has been successfully implemented in this area, accuracy and processing time are still major problems preventing it from reaching its full potential. This paper proposes an Enhanced Leaky Rectified Linear Unit activation and Weighted Loss (ELReLUWL) algorithm for enhanced text emotion classification and faster parameter convergence speed. This algorithm includes the definition of the inflection point and the slope for inputs on the left side of the inflection point to avoid gradient saturation. It also considers the weight of samples belonging to each class to compensate for the influence of data imbalance. A Convolutional Neural Network (CNN) is combined with the proposed algorithm to increase the classification accuracy and decrease the processing time by eliminating the gradient saturation problem and minimizing the negative effect of data imbalance, as demonstrated on a binary sentiment problem. The results show that the proposed solution achieves better classification performance in different data scenarios and different review types. The proposed model converges in seven epochs, against the current average of 11.5 epochs. The proposed solution improves accuracy and reduces the processing time of text emotion classification. The solution provides an average class accuracy of 96.63% against a current average accuracy of 91.56%. It also provides a processing time of 23.3 milliseconds compared to the current average processing time of 33.2 milliseconds. Finally, this study solves the issues of gradient saturation and data imbalance. It enhances overall average class accuracy and decreases processing time.

* Multimed Tools Appl.,2022 
* 28 pages 
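
The two ingredients named in this abstract, a leaky activation with a configurable inflection point and a class-weighted loss, can be sketched generically as follows. The abstract does not give the ELReLUWL formulas, so the piecewise-linear activation, the inverse-frequency weighting, and all function names below are assumptions for illustration, not the authors' definitions.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List


def leaky_activation(x: float, inflection: float = 0.0,
                     slope: float = 0.01) -> float:
    """Identity at or above the inflection point; a small non-zero slope
    below it, so the gradient never vanishes on the left side."""
    if x >= inflection:
        return x
    return inflection + slope * (x - inflection)


def inverse_frequency_weights(labels: List[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(probs: List[Dict[Hashable, float]],
                           labels: List[Hashable],
                           weights: Dict[Hashable, float]) -> float:
    """Weighted mean of -log p(gold class) over the samples."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)
```

With labels `[0, 0, 0, 1]`, the minority class receives weight 2.0 and the majority class about 0.67, so errors on the minority class contribute more to the loss.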

$ \text{T}^3 $OMVP: A Transformer-based Time and Team Reinforcement Learning Scheme for Observation-constrained Multi-Vehicle Pursuit in Urban Area

Mar 04, 2022
Zheng Yuan, Tianhao Wu, Qinwen Wang, Yiying Yang, Lei Li, Lin Zhang

Smart Internet of Vehicles (IoVs) combined with Artificial Intelligence (AI) will contribute to vehicle decision-making in the Intelligent Transportation System (ITS). Multi-Vehicle Pursuit (MVP) games, in which multiple vehicles cooperate to capture mobile targets, are gradually becoming a hot research topic. Although there are some achievements in the field of MVP in the open space environment, the urban area brings complicated road structures and restricted moving spaces as challenges to the resolution of MVP games. We define an Observation-constrained MVP (OMVP) problem in this paper and propose a Transformer-based Time and Team Reinforcement Learning scheme ($ \text{T}^3 $OMVP) to address the problem. First, a new multi-vehicle pursuit model is constructed based on decentralized partially observed Markov decision processes (Dec-POMDP) to instantiate this problem. Second, by introducing and modifying the transformer-based observation sequence, QMIX is redefined to adapt to the complicated road structure, restricted moving spaces and constrained observations, so as to control vehicles to pursue the target by combining the vehicles' observations. Third, a multi-intersection urban environment is built to verify the proposed scheme. Extensive experimental results demonstrate that the proposed $ \text{T}^3 $OMVP scheme achieves significant improvements relative to state-of-the-art QMIX approaches by 9.66%~106.25%. Code is available at

[email protected]: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media text

Feb 19, 2021
Varsha Pathak, Manish Joshi, Prasad Joshi, Monica Mundada, Tanmay Joshi

This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), at the Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India. Datasets for two Dravidian languages, viz. Malayalam and Tamil, of 4000 observations each, were shared by the HASOC organizers. These datasets are used to train the machine using different machine learning algorithms, based on classification and regression models. The datasets consist of tweets or YouTube comments with two class labels, offensive and not offensive. The machine is trained to classify such social media messages into these two categories. Appropriate n-gram feature sets are extracted to learn the specific characteristics of the hate speech text messages. These feature models are based on TF-IDF weights of n-grams. The referred work and respective experiments show that features such as word, character, and combined word and character n-grams could be used to identify the term patterns of offensive text content. As a part of the HASOC shared task, the test datasets are made available by the HASOC track organizers. The best performing classification models developed for both languages are applied to the test datasets. The model that gives the highest accuracy on the training dataset for Malayalam was used to predict the categories of the respective test data. This system obtained an F1 score of 0.77. Similarly, the best performing model for Tamil obtained an F1 score of 0.87. This work received 2nd and 3rd rank in shared Task 2 for Malayalam and Tamil respectively. The proposed system is named HASOC_kbcnmujal.
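
The feature extraction described here, TF-IDF weights over word and character n-grams, can be sketched in plain Python as below. This is an illustrative sketch, not the submitted system: the function names are hypothetical, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is an assumed choice, since the abstract does not state which variant was used.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> List[str]:
    """All contiguous word n-grams, splitting on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """One sparse TF-IDF vector per document.

    tf is the relative frequency within the document; idf is smoothed
    (an assumed variant) so that no feature gets zero weight.
    """
    n_docs = len(docs)
    doc_freq = Counter(f for feats in docs for f in set(feats))
    vectors = []
    for feats in docs:
        tf = Counter(feats)
        vectors.append({
            f: (c / len(feats)) * (math.log((1 + n_docs) / (1 + doc_freq[f])) + 1)
            for f, c in tf.items()
        })
    return vectors


def combined_features(text: str) -> List[str]:
    """Word unigrams plus character trigrams, as one combined feature list."""
    return word_ngrams(text, 1) + char_ngrams(text, 3)
```

The resulting vectors could then be passed to any classifier; features that occur in fewer documents receive a higher weight than equally frequent features that occur everywhere.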

Classification, Slippage, Failure and Discovery

Apr 08, 2021
Marc Böhlen

This text argues for the potential of machine learning infused classification systems as vectors for a technically-engaged and constructive technology critique. The text describes this potential with several experiments in image data creation and neural network based classification. The text considers varying aspects of slippage in classification and considers the potential for discovery - as opposed to disaster - stemming from machine learning systems when they fail to perform as anticipated.

* 9th Conference on Computation, Communication, Aesthetics & X 2021 

Quantitative Entropy Study of Language Complexity

Jan 15, 2017
R. R. Xie, W. B. Deng, D. J. Wang, L. P. Csernai

We study the entropy of Chinese and English texts, based on characters in the case of Chinese texts and based on words for both languages. Significant differences are found between the languages and between different personal styles of debating partners. The entropy analysis points in the direction of lower entropy, that is of higher complexity. Such a text analysis could be applied to individuals of different styles, a single individual at different ages, as well as different groups of the population.
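
As a point of reference, the Shannon entropy of a text over characters or words can be computed as in the sketch below. The abstract does not specify the exact estimator used in the study, so this plain frequency-based estimate and the function names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(symbols: Sequence[Hashable]) -> float:
    """Entropy in bits per symbol, estimated from relative frequencies."""
    if not symbols:
        raise ValueError("need at least one symbol")
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def char_entropy(text: str) -> float:
    """Entropy over characters, ignoring whitespace."""
    return shannon_entropy([ch for ch in text if not ch.isspace()])


def word_entropy(text: str) -> float:
    """Entropy over whitespace-separated words."""
    return shannon_entropy(text.split())
```

Character-level entropy suits scripts without word delimiters, while word-level entropy requires a prior segmentation step for such texts; the whitespace split above is only adequate for already segmented input.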

Linguistic Structure as Composition and Perturbation

Jun 21, 1996
Carl de Marcken

This paper discusses the problem of learning language from unprocessed text and speech signals, concentrating on the problem of learning a lexicon. In particular, it argues for a representation of language in which linguistic parameters like words are built by perturbing a composition of existing parameters. The power of this representation is demonstrated by several examples in text segmentation and compression, acquisition of a lexicon from raw speech, and the acquisition of mappings between text and artificial representations of meaning.

* 7 pages 

Towards Structuring Real-World Data at Scale: Deep Learning for Extracting Key Oncology Information from Clinical Text with Patient-Level Supervision

Mar 20, 2022
Sam Preston, Mu Wei, Rajesh Rao, Robert Tinn, Naoto Usuyama, Michael Lucas, Roshanthi Weerasinghe, Soohee Lee, Brian Piening, Paul Tittel, Naveen Valluri, Tristan Naumann, Carlo Bifulco, Hoifung Poon

Objective: The majority of detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents. Manual curation is expensive and time-consuming. Developing natural language processing (NLP) methods for structuring RWD is thus essential for scaling real-world evidence generation. Materials and Methods: Traditional rule-based systems are vulnerable to the prevalent linguistic variations and ambiguities in clinical text, and prior applications of machine-learning methods typically require sentence-level or report-level labeled examples that are hard to produce at scale. We propose leveraging patient-level supervision from medical registries, which are often readily available and capture key patient information, for general RWD applications. To combat the lack of sentence-level or report-level annotations, we explore advanced deep-learning methods by combining domain-specific pretraining, recurrent neural networks, and hierarchical attention. Results: We conduct an extensive study on 135,107 patients from the cancer registry of a large integrated delivery network (IDN) comprising healthcare systems in five western US states. Our deep learning methods attain test AUROC of 94-99% for key tumor attributes and comparable performance on held-out data from separate health systems and states. Discussion and Conclusion: Ablation results demonstrate clear superiority of these advanced deep-learning methods over prior approaches. Error analysis shows that our NLP system sometimes even corrects errors in registrar labels. We also conduct a preliminary investigation in accelerating registry curation and general RWD structuring via assisted curation for over 1.2 million cancer patients in this healthcare network.