Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT (Bi-directional Encoders for Transformers), which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies showed that BERT models trained on a single language significantly outperform the multilingual results. Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks. While previous approaches have used earlier implementations of BERT to train their Dutch BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT. We show that RobBERT improves state of the art results in Dutch-specific language tasks, and also outperforms other existing Dutch BERT-based models in sentiment analysis. These results indicate that RobBERT is a powerful pre-trained model for fine-tuning for a large variety of Dutch language tasks. We publicly release this pre-trained model in hope of supporting further downstream Dutch NLP applications.
The text classification is one of the most critical areas in machine learning and artificial intelligence research. It has been actively adopted in many business applications such as conversational intelligence systems, news articles categorizations, sentiment analysis, emotion detection systems, and many other recommendation systems in our daily life. One of the problems in supervised text classification models is that the models performance depend heavily on the quality of data labeling that are typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build a network with sentences as the network nodes and pairwise cosine similarities between TFIDF vector representations of the sentences as the network link weights. We use the Louvain method to detect the communities in the sentence network. We train and test Support vector machine and Random forest models on both the human labeled data and network community detection labeled data. Results showed that models with the data labeled by network community detection outperformed the models with the human-labeled data by 2.68-3.75% of classification accuracy. Our method may help development of a more accurate conversational intelligence system and other text classification systems.
K-fold cross validation (CV) is a popular method for estimating the true performance of machine learning models, allowing model selection and parameter tuning. However, the very process of CV requires random partitioning of the data and so our performance estimates are in fact stochastic, with variability that can be substantial for natural language processing tasks. We demonstrate that these unstable estimates cannot be relied upon for effective parameter tuning. The resulting tuned parameters are highly sensitive to how our data is partitioned, meaning that we often select sub-optimal parameter choices and have serious reproducibility issues. Instead, we propose to use the less variable J-K-fold CV, in which J independent K-fold cross validations are used to assess performance. Our main contributions are extending J-K-fold CV from performance estimation to parameter tuning and investigating how to choose J and K. We argue that variability is more important than bias for effective tuning and so advocate lower choices of K than are typically seen in the NLP literature, instead use the saved computation to increase J. To demonstrate the generality of our recommendations we investigate a wide range of case-studies: sentiment classification (both general and target-specific), part-of-speech tagging and document classification.
The investment on the stock market is prone to be affected by the Internet. For the purpose of improving the prediction accuracy, we propose a multi-task stock prediction model that not only considers the stock correlations but also supports multi-source data fusion. Our proposed model first utilizes tensor to integrate the multi-sourced data, including financial Web news, investors' sentiments extracted from the social network and some quantitative data on stocks. In this way, the intrinsic relationships among different information sources can be captured, and meanwhile, multi-sourced information can be complemented to solve the data sparsity problem. Secondly, we propose an improved sub-mode coordinate algorithm (SMC). SMC is based on the stock similarity, aiming to reduce the variance of their subspace in each dimension produced by the tensor decomposition. The algorithm is able to improve the quality of the input features, and thus improves the prediction accuracy. And the paper utilizes the Long Short-Term Memory (LSTM) neural network model to predict the stock fluctuation trends. Finally, the experiments on 78 A-share stocks in CSI 100 and thirteen popular HK stocks in the year 2015 and 2016 are conducted. The results demonstrate the improvement on the prediction accuracy and the effectiveness of the proposed model.
Deep multimodal learning has achieved great progress in recent years. However, current fusion approaches are static in nature, i.e., they process and fuse multimodal inputs with identical computation, without accounting for diverse computational demands of different multimodal data. In this work, we propose dynamic multimodal fusion (DynMM), a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference. DynMM can reduce redundant computations for "easy" multimodal inputs (that can be predicted correctly using only one modality or simple fusion techniques) and retain representation power for "hard" samples by adopting all modalities and complex fusion operations for prediction. Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach. For instance, DynMM can reduce the computation cost by 46.5% with a negligible accuracy loss on CMU-MOSEI sentiment analysis. For RGB-D semantic segmentation on NYU Depth data, DynMM achieves a +0.7% mIoU improvement with over 21% reductions for the depth encoder when compared with strong baselines. We believe this opens a novel direction towards dynamic multimodal network design, with applications to a wide range of multimodal tasks.
Federated Learning (FL) facilitates distributed model learning to protect users' privacy. In the absence of labels for a new user's data, the knowledge transfer in FL allows a learned global model to adapt to the new samples quickly. The multi-source domain adaptation in FL aims to improve the model's generality in a target domain by learning domain-invariant features from different clients. In this paper, we propose Federated Knowledge Alignment (FedKA) that aligns features from different clients and those of the target task. We identify two types of negative transfer arising in multi-source domain adaptation of FL and demonstrate how FedKA can alleviate such negative transfers with the help of a global features disentangler enhanced by embedding matching. To further facilitate representation learning of the target task, we devise a federated voting mechanism to provide labels for samples from the target domain via a consensus from querying local models and fine-tune the global model with these labeled samples. Extensive experiments, including an ablation study, on an image classification task of Digit-Five and a text sentiment classification task of Amazon Review, show that FedKA could be augmented to existing FL algorithms to improve the generality of the learned model for tackling a new task.
Sentiment analysis is the most basic NLP task to determine the polarity of text data. There has been a significant amount of work in the area of multilingual text as well. Still hate and offensive speech detection faces a challenge due to inadequate availability of data, especially for Indian languages like Hindi and Marathi. In this work, we consider hate and offensive speech detection in Hindi and Marathi texts. The problem is formulated as a text classification task using the state of the art deep learning approaches. We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa. The basic models based on CNN and LSTM are augmented with fast text word embeddings. We use the HASOC 2021 Hindi and Marathi hate speech datasets to compare these algorithms. The Marathi dataset consists of binary labels and the Hindi dataset consists of binary as well as more-fine grained labels. We show that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance. Moreover, with normal hyper-parameter tuning, the basic models perform better than BERT-based models on the fine-grained Hindi dataset.
Twitter is a useful resource to analyze peoples' opinions on various topics. Often these topics are correlated or associated with locations from where these Tweet posts are made. For example, restaurant owners may need to know where their target customers eat with respect to the sentiment of the posts made related to food, policy planners may need to analyze citizens' opinion on relevant issues such as crime, safety, congestion, etc. with respect to specific parts of the city, or county or state. As promising as this is, less than $1\%$ of the crawled Tweet posts come with geolocation tags. That makes accurate prediction of Tweet posts for the non geo-tagged tweets very critical to analyze data in various domains. In this research, we utilized millions of Twitter posts and end-users domain expertise to build a set of deep neural network models using natural language processing (NLP) techniques, that predicts the geolocation of non geo-tagged Tweet posts at various level of granularities such as neighborhood, zipcode, and longitude with latitudes. With multiple neural architecture experiments, and a collaborative human-machine workflow design, our ongoing work on geolocation detection shows promising results that empower end-users to correlate relationship between variables of choice with the location information.
Disinformation entails the purposeful dissemination of falsehoods towards a greater dubious agenda and the chaotic fracturing of a society. The general public has grown aware of the misuse of social media towards these nefarious ends, where even global public health crises have not been immune to misinformation (deceptive content spread without intended malice). In this paper, we examine nearly 505K COVID-19-related tweets from the initial months of the pandemic to understand misinformation as a function of bot-behavior and engagement. Using a correlation-based feature selection method, we selected the 11 most relevant feature subsets among over 170 features to distinguish misinformation from facts, and to predict highly engaging misinformation tweets about COVID-19. We achieved an average F-score of at least 72\% with ten popular multi-class classifiers, reinforcing the relevance of the selected features. We found that (i) real users tweet both facts and misinformation, while bots tweet proportionally more misinformation; (ii) misinformation tweets were less engaging than facts; (iii) the textual content of a tweet was the most important to distinguish fact from misinformation while (iv) user account metadata and human-like activity were most important to predict high engagement in factual and misinformation tweets; and (v) sentiment features were not relevant.