We present the shared task on artificial text detection in Russian, organized as part of the Dialogue Evaluation initiative held in 2022. The shared task dataset includes texts from 14 text generators, i.e., one human writer and 13 text generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, text summarization, and text simplification. We also consider back-translation and zero-shot generation approaches. The human-written texts are collected from publicly available resources across multiple domains. The shared task consists of two sub-tasks: (i) to determine whether a given text is automatically generated or written by a human; (ii) to identify the author of a given text. The first sub-task is framed as a binary classification problem, and the second as a multi-class classification problem. We provide count-based and BERT-based baselines, along with a human evaluation on the first sub-task. A total of 30 and 8 systems were submitted to the binary and multi-class sub-tasks, respectively. Most teams outperform the baselines by a wide margin. We publicly release our codebase, human evaluation results, and other materials in our GitHub repository (https://github.com/dialogue-evaluation/RuATD).
Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over the previous best unsupervised methods on the MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. As with text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over the prior best work on code search.
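The contrastive objective mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not state the exact loss, so the following assumes a symmetric cross-entropy over cosine similarities with in-batch negatives; the function names and the `temperature` value are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def in_batch_contrastive_loss(xs, ys, temperature=0.1):
    """Assumed objective: (xs[i], ys[i]) is a positive pair and every
    other ys[j] in the batch acts as a negative for xs[i]."""
    n = len(xs)
    logits = [[cosine(x, y) / temperature for y in ys] for x in xs]
    # Direction x -> y: each row should put its mass on the diagonal.
    loss_xy = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction y -> x: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_yx = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_xy + loss_yx)

# Toy embeddings: aligned pairs versus deliberately mismatched pairs.
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
mismatched = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                       [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

In this sketch the loss is small when each paired embedding is closer to its partner than to the other items in the batch, which is the property a contrastive pre-training objective rewards.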
The volume of text-based documents is increasing day by day, and medical documents form part of this growing body of text. In this study, text classification techniques were applied to medical documents and their classification performance was evaluated. The data sets used are multi-class and multi-label. The Chi-Square (CHI) technique was used for feature selection, and the SMO, NB, C4.5, RF and KNN algorithms were used for classification. The aim of this study is to evaluate the success of various classifiers on multi-class and multi-label data sets consisting of medical documents. Up to the first 400 features, KNN was the most successful classifier; from 400 features onward, SMO became the most successful classifier.
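As an illustration of Chi-Square feature selection on multi-label documents, the sketch below scores each term against each class with the standard 2x2 contingency statistic and keeps the top-k terms. Taking the maximum score over classes, representing documents as term sets, and the toy data are assumptions made for this example; the study does not describe its exact procedure.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.
    a: docs in class with term      b: docs outside class with term
    c: docs in class without term   d: docs outside class without term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is constant, so it carries no signal
    return n * (a * d - c * b) ** 2 / denom

def select_features(docs, labels, k):
    """docs: list of term sets; labels: list of label sets (multi-label).
    Returns the k best terms and the full score table."""
    vocab = sorted(set().union(*docs))
    classes = sorted(set().union(*labels))
    scores = {}
    for term in vocab:
        best = 0.0
        for cls in classes:
            a = b = c = d = 0
            for doc, doc_labels in zip(docs, labels):
                has_term = term in doc
                in_class = cls in doc_labels
                if has_term and in_class:
                    a += 1
                elif has_term:
                    b += 1
                elif in_class:
                    c += 1
                else:
                    d += 1
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best  # assumption: keep the max over classes
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k], scores

# Hypothetical toy corpus, not taken from the study's data sets.
docs = [{"fever", "cough", "patient"}, {"fever", "patient"},
        {"fracture", "bone", "patient"}, {"fracture", "patient"}]
labels = [{"infection"}, {"infection"}, {"orthopedics"}, {"orthopedics"}]
selected, scores = select_features(docs, labels, k=2)
print(selected)
```

A term such as "patient", which occurs in every document, receives a score of zero and is never selected, which is the behavior one wants from a discriminative feature filter.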
Text classification is one of the most widely studied tasks in natural language processing. Motivated by the principle of compositionality, large multilayer neural network models have been employed for this task in an attempt to effectively utilize the constituent expressions. Almost all reported work trains large networks using discriminative approaches, which come with the caveat of having no proper capacity control, as they tend to latch on to any signal, including signals that may not generalize. Using various recent state-of-the-art approaches for text classification, we explore whether these models actually learn to compose the meaning of sentences or still just focus on some keywords or lexicons when classifying a document. To test our hypothesis, we carefully construct datasets in which the training and test splits have no direct overlap of such lexicons, while the overall language structure remains similar. We study various text classifiers and observe a large performance drop on these datasets. Finally, we show that even simple models with our proposed regularization techniques, which disincentivize focusing on key lexicons, can substantially improve classification accuracy.
Semisupervised text classification has become a major focus of research over the past few years. Hitherto, most of the research has been based on supervised learning, whose main drawback is the unavailability of labeled data samples in practical applications. Training deep generative models and learning comprehensive representations without supervision remains a key challenge. Even though continuous latent variables are primarily employed in deep latent variable models, discrete latent variables, with their enhanced understandability and better compressed representations, are also used effectively by researchers. In this paper, we propose a semisupervised discrete latent variable model for multi-class text classification and text generation. The proposed model employs transfer learning to train a quantized transformer model, which is able to learn competently from fewer labeled instances. The model applies a decomposed vector quantization technique to overcome problems such as posterior collapse and index collapse. Shannon entropy is used for the decomposed sub-encoders, on which a variable DropConnect is applied, to retain maximum information. Moreover, gradients of the loss function are adaptively modified during backpropagation from decoder to encoder to enhance the performance of the model. Three conventional datasets covering a diverse range have been used to validate the proposed model on a variable number of labeled instances. Experimental results indicate that the proposed model substantially surpasses state-of-the-art models.
This paper introduces "Czech Text Document Corpus v 2.0", a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes at http://ctdc.kiv.zcu.cz/. This corpus was created to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly suited to the evaluation of multi-label document classification approaches, because a document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer. This paper further reports the results of selected state-of-the-art methods on this corpus to allow an easy comparison with these approaches.
Adversarial training is the most empirically successful approach to improving the robustness of deep neural networks for image classification. For text classification, however, existing synonym substitution based adversarial attacks are effective but not efficient enough to be incorporated into practical text adversarial training. Gradient-based attacks, which are very efficient for images, are hard to implement for synonym substitution based text attacks due to the lexical, grammatical and semantic constraints and the discrete text input space. We therefore propose a fast text adversarial attack method called Fast Gradient Projection Method (FGPM) based on synonym substitution, which is about 20 times faster than existing text attack methods and can achieve similar attack performance. We then incorporate FGPM into adversarial training and propose a text defense method called Adversarial Training with FGPM enhanced by Logit pairing (ATFL). Experiments show that ATFL can significantly improve model robustness and block the transferability of adversarial examples.
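The abstract above does not spell out how a gradient can guide a discrete synonym substitution. One plausible reading, shown here purely as an assumption and not as the authors' FGPM, is a first-order estimate: the change in loss from replacing a word is approximated by the dot product of the embedding difference and the loss gradient at that position. All names, embeddings, and gradient values below are hypothetical.

```python
def projection_score(e_word, e_syn, grad):
    """First-order estimate of the loss change when the embedding
    e_word is replaced by e_syn: (e_syn - e_word) . grad"""
    return sum((s - w) * g for w, s, g in zip(e_word, e_syn, grad))

def best_substitution(tokens, embeddings, synonyms, grads):
    """Return (score, position, synonym) with the highest estimated
    loss increase, or None if no token has a synonym candidate.
    grads[i] is the loss gradient w.r.t. the embedding at position i."""
    best = None
    for pos, tok in enumerate(tokens):
        for syn in synonyms.get(tok, []):
            score = projection_score(embeddings[tok], embeddings[syn],
                                     grads[pos])
            if best is None or score > best[0]:
                best = (score, pos, syn)
    return best

# Hypothetical 2-d embeddings and gradients for a two-word input.
embeddings = {
    "good": [1.0, 0.0], "fine": [0.5, 0.5], "great": [1.5, -0.5],
    "movie": [0.0, 1.0], "film": [0.1, 0.9],
}
synonyms = {"good": ["fine", "great"], "movie": ["film"]}
grads = [[-1.0, 1.0], [0.2, 0.0]]
choice = best_substitution(["good", "movie"], embeddings, synonyms, grads)
print(choice)
```

The appeal of such a scheme is that one backward pass yields scores for every candidate substitution, instead of one forward pass per candidate, which is consistent with the speed-up the abstract reports.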
Objective. Epidemiological studies require data that are aligned with the classifications established for occupations or economic activities. These classifications usually include hundreds of codes and titles. Manual coding of raw data may result in misclassification and is time consuming. The goal was to develop and test a web tool, named Procode, for coding free-texts against classifications and recoding between different classifications. Methods. Three text classifiers, i.e. Complement Naive Bayes (CNB), Support Vector Machine (SVM) and Random Forest Classifier (RFC), were investigated using k-fold cross-validation. A total of 30 000 free-texts with manually assigned classification codes from the French classification of occupations (PCS) and the French classification of activities (NAF) were available. For recoding, Procode integrated a workflow that converts codes of one classification to another according to existing crosswalks. Since this is a straightforward operation, only the recoding time was measured. Results. Among the three investigated text classifiers, CNB gave the best performance, accurately predicting 57-81% and 63-83% of classification codes for PCS and NAF, respectively. SVM led to somewhat lower results (by 1-2%), while RFC accurately coded up to 30% of the data. The coding operation required one minute per 10 000 records, while recoding was faster, i.e. 5-10 seconds. Conclusion. The algorithm integrated in Procode showed satisfactory performance, given that the tool had to assign the right code by choosing among 500-700 different choices. Based on the results, the authors decided to implement CNB in Procode. In future, if another classifier shows superior performance, an update will include the required modifications.
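To make the Methods concrete, here is a minimal pure-Python sketch of Complement Naive Bayes evaluated with k-fold cross-validation. It is not Procode's implementation: whitespace tokenization, the smoothing value `alpha`, the omission of weight normalization, the round-robin fold assignment, and the labels (which are hypothetical and not real PCS or NAF codes) are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_cnb(texts, codes, alpha=1.0):
    """For each class, estimate term weights from all documents that
    do NOT belong to it (the class complement)."""
    vocab = sorted({t for text in texts for t in tokenize(text)})
    per_class = defaultdict(Counter)
    total = Counter()
    for text, code in zip(texts, codes):
        counts = Counter(tokenize(text))
        per_class[code].update(counts)
        total.update(counts)
    weights = {}
    for code in sorted(set(codes)):
        comp = {t: total[t] - per_class[code][t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[code] = {t: math.log((comp[t] + alpha) / denom)
                         for t in vocab}
    return weights

def predict_cnb(weights, text):
    """Pick the class whose complement explains the text worst;
    tokens unseen in training are ignored."""
    tokens = tokenize(text)
    def score(code):
        return sum(weights[code][t] for t in tokens if t in weights[code])
    return min(sorted(weights), key=score)

def k_fold_accuracy(texts, codes, k):
    """Round-robin k-fold cross-validation; returns overall accuracy."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(texts)) if i % k != fold]
        test = [i for i in range(len(texts)) if i % k == fold]
        model = train_cnb([texts[i] for i in train],
                          [codes[i] for i in train])
        correct += sum(predict_cnb(model, texts[i]) == codes[i]
                       for i in test)
    return correct / len(texts)

# Hypothetical free-texts and labels, for illustration only.
texts = ["hospital nurse", "night nurse hospital", "truck driver",
         "delivery driver truck", "school teacher",
         "primary school teacher"]
codes = ["HEALTH", "HEALTH", "TRANSPORT", "TRANSPORT",
         "EDUCATION", "EDUCATION"]
model = train_cnb(texts, codes)
print(predict_cnb(model, "nurse"), k_fold_accuracy(texts, codes, k=3))
```

Estimating weights from the complement of each class tends to be more stable than standard Naive Bayes when classes are numerous and unevenly populated, which fits a setting with hundreds of candidate codes.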
Extracting text objects from PDF images is a challenging problem. The text present in PDF images contains useful information for automatic annotation, indexing, etc. However, variations in text style, font, size, orientation and alignment, as well as complex document structure, make automatic text extraction extremely difficult. This paper presents two techniques under block-based classification. After a brief introduction to the classification methods, the two methods are enhanced and their results evaluated. Segmentation performance and time consumption are measured for both models.
One of the principal tasks of machine learning with major applications is text classification. This paper focuses on the legal domain and, in particular, on the classification of lengthy legal documents. The main challenge that this study addresses is the limitation that current models impose on the length of the input text. In addition, the present paper shows that dividing the text into segments and later combining the resulting embeddings with a BiLSTM architecture to form a single document embedding can improve results. These advancements are achieved by utilising a simpler structure, rather than an increasingly complex one, which is often the case in NLP research. The dataset used in this paper is obtained from an online public database containing lengthy legal documents with highly domain-specific vocabulary and thus, the comparison of our results to the ones produced by models implemented on the commonly used datasets would be unjustified. This work provides the foundation for future work in document classification in the legal field.
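The segment-then-combine idea in the abstract above can be sketched as follows. Two substitutions are made to keep the example self-contained and should not be read as the paper's method: a toy deterministic bucket embedder stands in for a learned segment encoder, and plain mean pooling stands in for the BiLSTM that combines segment embeddings. Function names, `segment_len`, `stride`, and `dim` are illustrative assumptions.

```python
def split_into_segments(tokens, segment_len, stride):
    """Split a long token list into windows of at most segment_len
    tokens, advancing by stride (stride < segment_len gives overlap)."""
    if segment_len <= 0 or stride <= 0:
        raise ValueError("segment_len and stride must be positive")
    if not tokens:
        return []
    segments = []
    start = 0
    while True:
        segments.append(tokens[start:start + segment_len])
        if start + segment_len >= len(tokens):
            break
        start += stride
    return segments

def toy_embed(segment, dim=8):
    """Stand-in encoder (assumption): count tokens per hash bucket."""
    vec = [0.0] * dim
    for tok in segment:
        vec[sum(ord(ch) for ch in tok) % dim] += 1.0
    return vec

def document_embedding(tokens, segment_len, stride, dim=8):
    """Embed each segment, then mean-pool into one document vector.
    Mean pooling replaces the BiLSTM combiner in this sketch."""
    segments = split_into_segments(tokens, segment_len, stride)
    if not segments:
        return [0.0] * dim
    vectors = [toy_embed(seg, dim) for seg in segments]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

tokens = "the court finds that the contract was void from inception".split()
segments = split_into_segments(tokens, segment_len=4, stride=4)
doc_vec = document_embedding(tokens, segment_len=4, stride=4)
print(len(segments), len(doc_vec))
```

Unlike mean pooling, a recurrent combiner such as a BiLSTM is sensitive to segment order, which may matter for long documents whose sections play different roles.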