"Text": models, code, and papers

Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Jun 08, 2021
Marcely Zanon Boito, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier

When documenting oral languages, Unsupervised Word Segmentation (UWS) from speech is a useful, yet challenging, task. It can be performed from phonetic transcriptions or, in the absence of these, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units which can be applied for downstream (text-based) tasks. In this paper, we compare five of these models: three Bayesian and two neural approaches, with regard to the exploitability of the produced units for UWS. We experiment with two UWS models and report results for Finnish, Hungarian, Mboshi, Romanian and Russian in a low-resource setting (using only 5k sentences). Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using the SHMM and H-SHMM Bayesian models, which produce high quality, yet compressed, discrete representations of the input speech signal.
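
The abstract above does not say how segmentations are scored. Word segmentation output is commonly compared with a reference through a boundary F-score, so the following is a minimal sketch of that metric under stated assumptions: each segmentation is a list of words, each word is a list of discrete unit labels, and the unit names (`u1`, `u3`, ...) and function names are hypothetical. It is not taken from the paper.

```python
def boundaries(segmentation):
    """Return internal word-boundary positions, counted in units."""
    positions, offset = set(), 0
    for word in segmentation[:-1]:  # the end of the utterance is not a boundary
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_f_score(gold, predicted):
    """Harmonic mean of boundary precision and recall."""
    gold_b, pred_b = boundaries(gold), boundaries(predicted)
    correct = len(gold_b & pred_b)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_b)
    recall = correct / len(gold_b)
    return 2 * precision * recall / (precision + recall)


# Hypothetical utterance of six discrete units, segmented two ways.
gold = [["u3", "u7"], ["u1"], ["u4", "u4", "u2"]]
predicted = [["u3", "u7"], ["u1", "u4"], ["u4", "u2"]]
score = boundary_f_score(gold, predicted)
```

Here the reference has boundaries after units 2 and 3, and the prediction after units 2 and 4, so one of two boundaries matches in each direction.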

Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods

May 05, 2021
Lifeng Han, Gareth J. F. Jones, Alan F. Smeaton

To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) itself is a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which we classify into further detailed sub-categories. We hope that this work will be an asset for both translation model researchers and quality assessment researchers. In addition, we hope that it will enable practitioners to quickly develop a better understanding of the conventional TQA field, and to find closely relevant evaluation solutions for their own needs. This work may also serve to inspire further development of quality assessment and evaluation methodologies for other natural language processing (NLP) tasks in addition to machine translation (MT), such as automatic text summarization (ATS), natural language understanding (NLU) and natural language generation (NLG).

* Accepted to 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021): Workshop on Modelling Translation: Translatology in the Digital Age (MoTra21). arXiv admin note: substantial text overlap with arXiv:1605.04515 

Active learning for medical code assignment

Apr 12, 2021
Martha Dais Ferreira, Michal Malyska, Nicola Sahar, Riccardo Miotto, Fernando Paulovich, Evangelos Milios

Machine Learning (ML) is widely used to automatically extract meaningful information from Electronic Health Records (EHR) to support operational, clinical, and financial decision-making. However, ML models require a large number of annotated examples to provide satisfactory results, which is not possible in most healthcare scenarios due to the high cost of clinician-labeled data. Active Learning (AL) is a process of selecting the most informative instances to be labeled by an expert to further train a supervised algorithm. We demonstrate the effectiveness of AL in multi-label text classification in the clinical domain. In this context, we apply a set of well-known AL methods to help automatically assign ICD-9 codes on the MIMIC-III dataset. Our results show that the selection of informative instances provides satisfactory classification with a significantly reduced training set (8.3% of the total instances). We conclude that AL methods can significantly reduce the manual annotation cost while preserving model performance.

* Accepted at the ACM CHIL 2021 workshop track 
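
The abstract does not name the AL methods it applies. As an illustration of the selection step only, here is a minimal sketch of uncertainty sampling, one well-known strategy, adapted to the multi-label case by averaging per-label uncertainty. The document ids, probabilities, and function names are hypothetical, and the probabilities stand in for the output of whatever classifier is being trained.

```python
def uncertainty(probs):
    """Mean closeness of per-label probabilities to 0.5 (1.0 = least certain)."""
    return sum(1.0 - abs(2.0 * p - 1.0) for p in probs) / len(probs)


def select_for_labeling(pool_probs, k):
    """Pick the k unlabeled documents the model is least certain about."""
    ranked = sorted(
        pool_probs,
        key=lambda doc_id: uncertainty(pool_probs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predicted probabilities for three codes on three unlabeled notes.
pool_probs = {
    "note_a": [0.95, 0.02, 0.90],  # confident on every label
    "note_b": [0.55, 0.45, 0.50],  # close to 0.5 everywhere
    "note_c": [0.80, 0.30, 0.10],
}
selected = select_for_labeling(pool_probs, k=1)
```

In a full loop, the selected notes would be sent to an expert for labeling, added to the training set, and the model retrained before the next round of selection.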

Part of speech and gramset tagging algorithms for unknown words based on morphological dictionaries of the Veps and Karelian languages

Mar 22, 2021
Andrew Krizhanovsky, Natalia Krizhanovsky, Irina Novak

This research is devoted to the low-resource Veps and Karelian languages. Algorithms for assigning part of speech tags and grammatical properties to words are presented in the article. These algorithms use our morphological dictionaries, where the lemma, part of speech and a set of grammatical features (gramset) are known for each word form. The algorithms are based on the analogy hypothesis that words with the same suffixes are likely to have the same inflectional models, the same part of speech and gramset. The accuracy of these algorithms was evaluated and compared. 313 thousand Vepsian and 66 thousand Karelian words were used to verify the accuracy of these algorithms. Special functions were designed to assess the quality of the results of the developed algorithms. 92.4% of Vepsian words and 86.8% of Karelian words were assigned a correct part of speech by the developed algorithm. 95.3% of Vepsian words and 90.7% of Karelian words were assigned a correct gramset by our algorithm. Morphological and semantic tagging of texts, which are closely related and inseparable in our corpus processes, are described in the paper.

* 17 pages, 4 tables, 7 figures, published in the conference proceedings 
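
The analogy hypothesis can be illustrated with a longest-suffix lookup. This is a minimal sketch, not the authors' algorithm: the toy dictionary uses English words rather than Veps or Karelian data, the tag names and the `max_len` cutoff are assumptions, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_suffix_index(dictionary, max_len=5):
    """Map every suffix (up to max_len characters) to counts of POS tags."""
    index = defaultdict(Counter)
    for word, pos in dictionary.items():
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][pos] += 1
    return index


def guess_pos(word, index, max_len=5):
    """Return the most frequent POS for the longest suffix seen in the index."""
    for n in range(min(max_len, len(word)), 0, -1):
        counts = index.get(word[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return None  # no suffix of the word occurs in the dictionary


# Toy dictionary standing in for a morphological dictionary of known word forms.
known = {
    "walking": "VERB",
    "talking": "VERB",
    "morning": "NOUN",
    "quickly": "ADV",
    "slowly": "ADV",
}
index = build_suffix_index(known)
tag_jumping = guess_pos("jumping", index)
tag_sadly = guess_pos("sadly", index)
```

The same lookup could return a gramset instead of a part of speech by storing gramsets as the counted values.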

MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition

Feb 25, 2021
Linghui Meng, Jin Xu, Xu Tan, Jindong Wang, Tao Qin, Bo Xu

In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of two different speech features (e.g., mel-spectrograms or MFCC) as the input, and recognizing both text sequences, where the two recognition losses use the same combination weight. We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on several low-resource datasets including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms SpecAugment, a strong data augmentation method, on these recognition tasks. Specifically, MixSpeech outperforms SpecAugment with a relative PER improvement of 10.6% on the TIMIT dataset, and achieves a strong WER of 4.7% on the WSJ dataset.

* To appear at ICASSP 2021 
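
The mixing step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: features are plain lists of frames already padded to the same shape, the function names `mix_features` and `mix_losses` are hypothetical, and the two loss values are placeholders for the recognition losses an ASR model would compute against each transcript. In mixup the weight is commonly drawn from a Beta distribution; it is fixed here for clarity.

```python
def mix_features(x_a, x_b, lam):
    """Weighted combination of two feature matrices (frames x dimensions)."""
    if len(x_a) != len(x_b):
        raise ValueError("inputs must have the same number of frames")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(x_a, x_b)
    ]


def mix_losses(loss_a, loss_b, lam):
    """Combine the two recognition losses with the same weight as the inputs."""
    return lam * loss_a + (1.0 - lam) * loss_b


# Two toy utterances with two frames of two feature dimensions each.
x_a = [[1.0, 2.0], [3.0, 4.0]]
x_b = [[5.0, 6.0], [7.0, 8.0]]
lam = 0.25

mixed = mix_features(x_a, x_b, lam)
# Placeholder losses: the model would score `mixed` against each transcript.
total_loss = mix_losses(2.0, 4.0, lam)
```

Using one weight for both the inputs and the losses keeps the training signal proportional to how much each utterance contributes to the mixed input.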

ICodeNet -- A Hierarchical Neural Network Approach for Source Code Author Identification

Jan 30, 2021
Pranali Bora, Tulika Awalgaonkar, Himanshu Palve, Raviraj Joshi, Purvi Goel

With the open-source revolution, source code is now more easily accessible than ever. This has, however, made it easier for malicious users and institutions to copy code without regard to the license or credit to the original author. Therefore, source code author identification is a critical task. In this paper, we propose ICodeNet, a hierarchical neural network that can be used for source code file-level tasks. ICodeNet processes source code in image format and is employed for the task of per-file author identification. ICodeNet consists of an ImageNet-trained VGG encoder followed by a shallow neural network. The shallow network is based on either a CNN or an LSTM. Different variations of the models are evaluated on a source code author classification dataset. We have also compared our image-based hierarchical neural network model with a simple image-based CNN architecture and text-based CNN and LSTM models to highlight its novelty and efficiency.

* Accepted at ICMLC 2021 

Explanation as a Defense of Recommendation

Jan 24, 2021
Aobo Yang, Nan Wang, Hongbo Deng, Hongning Wang

Textual explanations have been shown to improve user satisfaction with machine-made recommendations. However, current mainstream solutions loosely connect the learning of explanation with the learning of recommendation: for example, they are often separately modeled as rating prediction and content generation tasks. In this work, we propose to strengthen their connection by enforcing the idea of sentiment alignment between a recommendation and its corresponding explanation. At training time, the two learning tasks are joined by a latent sentiment vector, which is encoded by the recommendation module and used to make word choices for explanation generation. At both training and inference time, the explanation module is required to generate explanation text that matches the sentiment predicted by the recommendation module. Extensive experiments demonstrate that our solution outperforms a rich set of baselines in both recommendation and explanation tasks, especially in the quality of its generated explanations. More importantly, our user studies confirm that our generated explanations help users better recognize the differences between recommended items and understand why an item is recommended.

* WSDM 2021 

A System for Efficiently Hunting for Cyber Threats in Computer Systems Using Threat Intelligence

Jan 17, 2021
Peng Gao, Fei Shao, Xiaoyuan Liu, Xusheng Xiao, Haoyuan Liu, Zheng Qin, Fengyuan Xu, Prateek Mittal, Sanjeev R. Kulkarni, Dawn Song

Log-based cyber threat hunting has emerged as an important solution to counter sophisticated cyber attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external knowledge about threat behaviors provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we build ThreatRaptor, a system that facilitates cyber threat hunting in computer systems using OSCTI. Built upon mature system auditing frameworks, ThreatRaptor provides (1) an unsupervised, light-weight, and accurate NLP pipeline that extracts structured threat behaviors from unstructured OSCTI text, (2) a concise and expressive domain-specific query language, TBQL, to hunt for malicious system activities, (3) a query synthesis mechanism that automatically synthesizes a TBQL query from the extracted threat behaviors, and (4) an efficient query execution engine to search the big system audit logging data.

* Accepted paper at ICDE 2021 demonstrations track. arXiv admin note: substantial text overlap with arXiv:2010.13637 

Polyjuice: Automated, General-purpose Counterfactual Generation

Jan 01, 2021
Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, Daniel S. Weld

Counterfactual examples have been shown to be useful for many applications, including calibrating, evaluating, and explaining model decision boundaries. However, previous methods for generating such counterfactual examples have been tightly tailored to a specific application, have used a limited range of linguistic patterns, or are hard to scale. We propose to disentangle counterfactual generation from its use cases, i.e., gather general-purpose counterfactuals first, and then select them for specific applications. We frame automated counterfactual generation as text generation, and finetune GPT-2 into a generator, Polyjuice, which produces fluent and diverse counterfactuals. Our method also allows control over where perturbations happen and what they do. We show Polyjuice supports multiple use cases: by generating diverse counterfactuals for humans to label, Polyjuice helps produce high-quality datasets for model training and evaluation, requiring 40% less human effort. When used to generate explanations, Polyjuice helps augment feature attribution methods to reveal models' erroneous behaviors.

Experiments on transfer learning architectures for biomedical relation extraction

Nov 24, 2020
Walid Hafiane, Joel Legrand, Yannick Toussaint, Adrien Coulet

Relation extraction (RE) consists of automatically identifying and structuring relations of interest from texts. Recently, BERT improved the top performances for several NLP tasks, including RE. However, the best way to use BERT within a machine learning architecture and within a transfer learning strategy is still an open question, since it is highly dependent on each specific task and domain. Here, we explore various BERT-based architectures and transfer learning strategies (i.e., frozen or fine-tuned) for the task of biomedical RE on two corpora. Among the tested architectures and strategies, our *BERT-segMCNN with fine-tuning reaches performances higher than the state-of-the-art on the two corpora (1.73% and 32.77% absolute improvement on the ChemProt and PGxCorpus corpora, respectively). More generally, our experiments illustrate the expected interest of fine-tuning with BERT, but also the unexplored advantage of using structural information (with sentence segmentation), in addition to the context classically leveraged by BERT.

* 12 pages, 2 figures, Extraction et Gestion des Connaissances (EGC) 
