Legislation and public sentiment throughout the world have promoted fairness metrics, explainability, and interpretability as prescriptions for the responsible development of ethical artificial intelligence systems. Despite the importance of these three pillars in the foundation of the field, they can be challenging to operationalize, and attempts to solve the problems in production environments often feel Sisyphean. This difficulty stems from several factors. Fairness metrics are computationally difficult to incorporate into training and rarely alleviate all of the harms perpetrated by these systems. Interpretability and explainability can be gamed to appear fair, may inadvertently reduce the privacy of personal information contained in training data, and can increase user confidence in predictions even when the explanations are wrong. In this work, we propose a framework for responsibly developing artificial intelligence systems by incorporating lessons from the field of information security and the secure development lifecycle to overcome the challenges of protecting users in adversarial settings. In particular, we propose leveraging the concepts of threat modeling, design review, penetration testing, and incident response in the context of developing AI systems as ways to resolve the shortcomings of the aforementioned methods.
Text classification, the task of assigning a label to open-ended text, is one of the fundamental tasks in natural language processing and is useful for various applications such as sentiment analysis. In this paper, we discuss various classification approaches for Khmer text, ranging from a classical TF-IDF algorithm with a support vector machine classifier to modern word-embedding-based neural network classifiers, including a linear-layer model, a recurrent neural network, and a convolutional neural network. A Khmer word embedding model is trained on a 30-million-word Khmer corpus to construct word vector representations that are used to train three different neural network classifiers. We evaluate the performance of the different approaches on a news article dataset for both multi-class and multi-label text classification tasks. The results suggest that neural network classifiers using the word embedding model consistently outperform the traditional classifier using TF-IDF. The recurrent neural network classifier provides slightly better results than the convolutional and linear-layer networks.
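The TF-IDF weighting underlying the classical baseline above can be sketched in a few lines. This is a generic illustration on a toy English corpus standing in for tokenized Khmer articles, not the paper's implementation:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # Term frequency scaled by inverse document frequency.
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy corpus: "news" appears everywhere and so carries no weight.
docs = [["news", "economy", "market"],
        ["news", "sport", "match"],
        ["news", "sport", "score"]]
w = tfidf(docs)
```

Terms present in every document receive zero weight, while rarer terms dominate; the resulting sparse vectors are what a support vector machine classifier would then be trained on.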
The explosion in the availability of natural language data in the era of social media has given rise to a host of applications such as sentiment analysis and opinion mining. Simultaneously, the growing availability of precise geolocation information is enabling visualization of global phenomena such as environmental changes and disease propagation. Opportunities for tracking spatial variations in language use, however, have largely been overlooked, especially on small spatial scales. Here we explore the use of Twitter data with precise geolocation information to resolve spatial variations in language use on an urban scale down to single city blocks. We identify several categories of language tokens likely to show distinctive patterns of use and develop quantitative methods to visualize the spatial distributions associated with these patterns. Our analysis concentrates on comparison of contrasting pairs of Tweet distributions from the same category, each defined by a set of tokens. Our work shows that analysis of small-scale variations can provide unique information on correlations between language use and social context, which are highly valuable to a wide range of fields from linguistic science and commercial advertising to social services.
Weighting strategies prevail in machine learning. For example, a common approach in robust machine learning is to assign lower weights to samples that are likely to be noisy or hard. This study reveals another, largely undiscussed strategy, namely compensating, that has also been widely used in machine learning. Learning with compensating is called compensation learning, and we construct a systematic taxonomy for it in this study. In our taxonomy, compensation learning is divided on the basis of compensation targets, inference manners, and granularity levels. Many existing learning algorithms, including some classical ones, can be seen as special cases of compensation learning or as partially leveraging compensating. Furthermore, a family of new learning algorithms can be obtained by plugging compensation learning into existing learning algorithms. Specifically, three concrete new learning algorithms are proposed for robust machine learning. Extensive experiments on text sentiment analysis, image classification, and graph classification verify the effectiveness of the three new algorithms. Compensation learning can also be used in various other learning scenarios, such as imbalanced learning, clustering, and regression.
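The contrast between weighting and compensating can be made concrete on a per-sample loss. In the hypothetical sketch below (not the paper's algorithms), weighting rescales each sample's loss, while compensating instead adds an offset to the logit before the loss is computed, which is one possible compensation target among those the taxonomy distinguishes:

```python
import math

def bce(logit, label):
    """Binary cross-entropy computed from a raw logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def weighted_loss(logits, labels, weights):
    # Weighting: down-scale the loss of suspect (e.g. noisy) samples.
    return sum(w * bce(z, y)
               for z, y, w in zip(logits, labels, weights)) / len(logits)

def compensated_loss(logits, labels, offsets):
    # Compensating: shift each logit before the loss; all weights stay at 1.
    return sum(bce(z + b, y)
               for z, y, b in zip(logits, labels, offsets)) / len(logits)

logits, labels = [2.0, -1.0], [1, 1]
base = weighted_loss(logits, labels, [1.0, 1.0])     # plain average loss
down = weighted_loss(logits, labels, [1.0, 0.1])     # down-weight the hard sample
comp = compensated_loss(logits, labels, [0.0, 1.5])  # compensate its logit instead
```

Both mechanisms reduce the influence of the hard second sample, but compensation does so by moving its prediction rather than by discounting its loss.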
Most previous studies integrate cognitive language processing signals (e.g., eye-tracking or EEG data) into neural models of natural language processing (NLP) simply by directly concatenating word embeddings with cognitive features, ignoring the gap between the two modalities (i.e., textual vs. cognitive) and the noise in cognitive features. In this paper, we propose CogAlign, an approach that addresses these issues by learning to align textual neural representations with cognitive features. In CogAlign, we use a shared encoder equipped with a modality discriminator to alternately encode textual and cognitive inputs, capturing their differences and commonalities. Additionally, a text-aware attention mechanism is proposed to detect task-related information and to avoid noise in cognitive features. Experimental results on three NLP tasks, namely named entity recognition, sentiment analysis, and relation extraction, show that CogAlign achieves significant improvements with multiple cognitive features over state-of-the-art models on public datasets. Moreover, our model is able to transfer cognitive information to datasets that do not have any cognitive processing signals.
While the long-term effects of COVID-19 are yet to be determined, its immediate impact on crowdfunding is nonetheless significant. This study takes a computational approach to better understand this change. Using a unique dataset of all the campaigns published over the past two years on GoFundMe, we explore the factors that have led to the successful funding of a crowdfunding project. In particular, we study a corpus of crowdfunded projects, analyzing cover images and other variables commonly present on crowdfunding sites. Furthermore, we construct a classifier and a regression model based on XGBoost to assess the significance of features. In addition, we employ counterfactual analysis to investigate the causality between features and the success of crowdfunding. Most importantly, sentiment analysis and a paired-sample t-test are performed to examine the differences in crowdfunding campaigns before and after the COVID-19 outbreak that started in March 2020. First, we note that there is a significant racial disparity in crowdfunding success. Second, we find that sadness expressed through a campaign's description became a significant predictor of success after the COVID-19 outbreak. Considering all these factors, our findings shed light on the impact of COVID-19 on crowdfunding campaigns.
The Tsetlin Machine (TM) is an interpretable pattern recognition algorithm based on propositional logic. The algorithm has demonstrated competitive performance in many Natural Language Processing (NLP) tasks, including sentiment analysis, text classification, and Word Sense Disambiguation (WSD). To obtain human-level interpretability, the legacy TM employs Boolean input features such as bag-of-words (BOW). However, the BOW representation makes it difficult to use pre-trained information such as word2vec and GloVe word representations. This restriction has constrained the performance of the TM compared to deep neural networks (DNNs) in NLP. To reduce the performance gap, in this paper, we propose a novel way of using pre-trained word representations for the TM. The approach significantly enhances TM performance while maintaining interpretability. We achieve this by extracting semantically related words from pre-trained word representations as input features to the TM. Our experiments show that the accuracy of the proposed approach is significantly higher than that of the previous BOW-based TM, reaching the level of DNN-based models.
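Extracting semantically related words from pre-trained vectors amounts to a cosine-similarity nearest-neighbour lookup. A toy sketch with hypothetical 3-dimensional vectors (real word2vec/GloVe embeddings have hundreds of dimensions, and a library call such as gensim's `most_similar` would normally be used):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def related_words(word, vectors, k=2):
    """Return the k words most similar to `word` under cosine similarity."""
    target = vectors[word]
    others = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(others, key=lambda x: -x[1])[:k]]

# Tiny stand-in embedding table (invented values, not real word2vec).
vectors = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
    "table": [0.0, 0.0, 1.0],
}
neighbours = related_words("good", vectors)
```

The neighbour lists obtained this way can then be Booleanized into TM input features, preserving the propositional-logic interface while injecting distributional semantics.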
In this paper, we study a bidirectional LSTM network for the task of text classification using both supervised and semi-supervised approaches. Several prior works have suggested that either complex pretraining schemes using unsupervised methods such as language modeling (Dai and Le 2015; Miyato, Dai, and Goodfellow 2016) or complicated models (Johnson and Zhang 2017) are necessary to achieve high classification accuracy. However, we develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results compared with more complex approaches. Furthermore, in addition to cross-entropy loss, by using a combination of entropy minimization, adversarial, and virtual adversarial losses for both labeled and unlabeled data, we report state-of-the-art results for text classification on several benchmark datasets. In particular, on the ACL-IMDB sentiment analysis and AG-News topic classification datasets, our method outperforms current approaches by a substantial margin. We also show the generality of the mixed objective function by improving performance on a relation extraction task.
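Of the auxiliary losses above, entropy minimization is the simplest to sketch: it penalizes uncertain predictions on unlabeled data by the entropy of the model's softmax output. The following is a generic formulation of that term, not the paper's exact code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_loss(logits):
    """Entropy of the predicted distribution; minimizing it sharpens predictions."""
    p = softmax(logits)
    return -sum(q * math.log(q) for q in p if q > 0)

confident = entropy_loss([4.0, -2.0])  # peaked prediction -> low entropy
uncertain = entropy_loss([0.1, 0.0])   # near-uniform -> high entropy
```

Added to the supervised cross-entropy objective (with some weight), this term pushes the classifier toward confident decisions on unlabeled examples, complementing the adversarial and virtual adversarial losses.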
Regularization is used to avoid overfitting when training a neural network; unfortunately, this reduces the attainable level of detail, hindering the ability to capture high-frequency information present in the training data. Even though various approaches may be used to re-introduce high-frequency detail, it typically does not match the training data and is often not time coherent. In the case of network-inferred cloth, these shortcomings manifest themselves as either a lack of detailed wrinkles or unnaturally appearing and/or time-incoherent surrogate wrinkles. Thus, we propose a general strategy whereby high-frequency information is procedurally embedded into low-frequency data so that when the latter is smeared out by the network, the former still retains its high-frequency detail. We illustrate this approach by learning texture coordinates which, when smeared, do not in turn smear out the high-frequency detail in the texture itself but merely smoothly distort it. Notably, we prescribe perturbed texture coordinates that are subsequently used to correct the over-smoothed appearance of inferred cloth, and correcting the appearance from multiple camera views naturally recovers lost geometric information.