In the text classification problem, the imbalance of labels in datasets affect the performance of the text-classification models. Practically, the data about user comments on social networking sites not altogether appeared - the administrators often only allow positive comments and hide negative comments. Thus, when collecting the data about user comments on the social network, the data is usually skewed about one label, which leads the dataset to become imbalanced and deteriorate the model's ability. The data augmentation techniques are applied to solve the imbalance problem between classes of the dataset, increasing the prediction model's accuracy. In this paper, we performed augmentation techniques on the VLSP2019 Hate Speech Detection on Vietnamese social texts and the UIT - VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis. The result of augmentation increases by about 1.5% in the F1-macro score on both corpora.
We propose Masker, an unsupervised text-editing method for style transfer. To tackle cases when no parallel source-target pairs are available, we train masked language models (MLMs) for both the source and the target domain. Then we find the text spans where the two models disagree the most in terms of likelihood. This allows us to identify the source tokens to delete to transform the source text to match the style of the target domain. The deleted tokens are replaced with the target MLM, and by using a padded MLM variant, we avoid having to predetermine the number of inserted tokens. Our experiments on sentence fusion and sentiment transfer demonstrate that Masker performs competitively in a fully unsupervised setting. Moreover, in low-resource settings, it improves supervised methods' accuracy by over 10 percentage points when pre-training them on silver training data generated by Masker.
The current policy of removing drill music videos from social media platforms such as YouTube remains controversial because it risks conflating the co-occurrence of drill rap and violence with a causal chain of the two. Empirically, we revisit the question of whether there is evidence to support the conjecture that drill music and gang violence are linked. We provide new empirical insights suggesting that: i) drill music lyrics have not become more negative over time if anything they have become more positive; ii) individual drill artists have similar sentiment trajectories to other artists in the drill genre, and iii) there is no meaningful relationship between drill music and real-life violence when compared to three kinds of police-recorded violent crime data in London. We suggest ideas for new work that can help build a much-needed evidence base around the problem.
Distributional representations of words, also known as word vectors, have become crucial for modern natural language processing tasks due to their wide applications. Recently, a growing body of word vector postprocessing algorithm has emerged, aiming to render off-the-shelf word vectors even stronger. In line with these investigations, we introduce a novel word vector postprocessing scheme under a causal inference framework. Concretely, the postprocessing pipeline is realized by Half-Sibling Regression (HSR), which allows us to identify and remove confounding noise contained in word vectors. Compared to previous work, our proposed method has the advantages of interpretability and transparency due to its causal inference grounding. Evaluated on a battery of standard lexical-level evaluation tasks and downstream sentiment analysis tasks, our method reaches state-of-the-art performance.
Face detection is a very important task and a necessary pre-processing step for many applications such as facial landmark detection, pose estimation, sentiment analysis and face recognition. Not only is face detection an important pre-processing step in computer vision applications but also in computational psychology, behavioral imaging and other fields where researchers might not be initiated in computer vision frameworks and state-of-the-art detection applications. A large part of existing research that includes face detection as a pre-processing step uses existing out-of-the-box detectors such as the HoG-based dlib and the OpenCV Haar face detector which are no longer state-of-the-art - they are primarily used because of their ease of use and accessibility. We introduce Dockerface, a very accurate Faster R-CNN face detector in a Docker container which requires no training and is easy to install and use.
Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces portmanteau of South Asian languages with English. The blend of multiple languages as code-mixed data has recently become popular in research communities for various NLP tasks. Code-mixed data consist of anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words where the different spelling variation of words share similar context in a large noisy social media text. We capture different variations of words belonging to same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing of the code-mixed dataset based on our approach improves the performance in state-of-the-art part-of-speech tagging (POS-tagging) and sentiment analysis tasks.
Word-level adversarial attacks have shown success in NLP models, drastically decreasing the performance of transformer-based models in recent years. As a countermeasure, adversarial defense has been explored, but relatively few efforts have been made to detect adversarial examples. However, detecting adversarial examples may be crucial for automated tasks (e.g. review sentiment analysis) that wish to amass information about a certain population and additionally be a step towards a robust defense system. To this end, we release a dataset for four popular attack methods on four datasets and four models to encourage further research in this field. Along with it, we propose a competitive baseline based on density estimation that has the highest AUC on 29 out of 30 dataset-attack-model combinations. Source code is available in https://github.com/anoymous92874838/text-adv-detection.
Modern classification algorithms are susceptible to adversarial examples--perturbations to inputs that cause the algorithm to produce undesirable behavior. In this work, we seek to understand and extend adversarial examples across domains in which inputs are discrete, particularly across new domains, such as computational biology. As a step towards this goal, we formalize a notion of synonymous adversarial examples that applies in any discrete setting and describe a simple domain-agnostic algorithm to construct such examples. We apply this algorithm across multiple domains--including sentiment analysis and DNA sequence classification--and find that it consistently uncovers adversarial examples. We seek to understand their prevalence theoretically and we attribute their existence to spurious token correlations, a statistical phenomenon that is specific to discrete spaces. Our work is a step towards a domain-agnostic treatment of discrete adversarial examples analogous to that of continuous inputs.
The study presented here provides numerical insight into ghazal -- the most appreciated genre in Urdu poetry. Using 48,761 poetic works from 4,754 poets produced over a period of 800 years, this study explores the main features of Urdu ghazal that make it popular and admired more than other forms. A detailed explanation is provided as to the types of words used for expressing love, nature, birds, and flowers etc. Also considered is the way in which the poets addressed their loved ones in their poetry. The style of poetry is numerically analyzed using Multi Dimensional Scaling to reveal the lexical diversity and similarities/differences between the different poetic works that have drawn the attention of critics, such as Iqbal and Ghalib, Mir Taqi Mir and Mir Dard. The analysis produced here is particularly helpful for research in computational stylistics, neurocognitive poetics, and sentiment analysis.