Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Med7: a transferable clinical natural language processing model for electronic health records

Mar 03, 2020
Andrey Kormilitzin, Nemanja Vaci, Qiang Liu, Alejo Nevado-Holgado

The field of clinical natural language processing has been advanced significantly since the introduction of deep learning models. The self-supervised representation learning and the transfer learning paradigm became the methods of choice in many natural language processing application, in particular in the settings with the dearth of high quality manually annotated data. Electronic health record systems are ubiquitous and the majority of patients' data are now being collected electronically and in particular in the form of free text. Identification of medical concepts and information extraction is a challenging task, yet important ingredient for parsing unstructured data into structured and tabulated format for downstream analytical tasks. In this work we introduced a named-entity recognition model for clinical natural language processing. The model is trained to recognise seven categories: drug names, route, frequency, dosage, strength, form, duration. The model was first self-supervisedly pre-trained by predicting the next word, using a collection of 2 million free-text patients' records from MIMIC-III corpora and then fine-tuned on the named-entity recognition task. The model achieved a lenient (strict) micro-averaged F1 score of 0.957 (0.893) across all seven categories. Additionally, we evaluated the transferability of the developed model using the data from the Intensive Care Unit in the US to secondary care mental health records (CRIS) in the UK. A direct application of the trained NER model to CRIS data resulted in reduced performance of F1=0.762, however after fine-tuning on a small sample from CRIS, the model achieved a reasonable performance of F1=0.944. This demonstrated that despite a close similarity between the data sets and the NER tasks, it is essential to fine-tune on the target domain data in order to achieve more accurate results.

* 16 pages, 1 figure, 15 tables 

  Access Paper or Ask Questions

Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training

Aug 13, 2021
Conglong Li, Minjia Zhang, Yuxiong He

Recent works have demonstrated great success in training high-capacity autoregressive language models (GPT, GPT-2, GPT-3) on a huge amount of unlabeled text corpus for text generation. Despite showing great results, this generates two training efficiency challenges. First, training large corpora can be extremely timing consuming, and how to present training samples to the model to improve the token-wise convergence speed remains a challenging and open question. Second, many of these large models have to be trained with hundreds or even thousands of processors using data-parallelism with a very large batch size. Despite of its better compute efficiency, it has been observed that large-batch training often runs into training instability issue or converges to solutions with bad generalization performance. To overcome these two challenges, we present a study of a curriculum learning based approach, which helps improves the pre-training convergence speed of autoregressive models. More importantly, we find that curriculum learning, as a regularization method, exerts a gradient variance reduction effect and enables to train autoregressive models with much larger batch sizes and learning rates without training instability, further improving the training speed. Our evaluations demonstrate that curriculum learning enables training GPT-2 models (with up to 1.5B parameters) with 8x larger batch size and 4x larger learning rate, whereas the baseline approach struggles with training divergence. To achieve the same validation perplexity targets during pre-training, curriculum learning reduces the required number of tokens and wall clock time by up to 59% and 54%, respectively. To achieve the same or better zero-shot WikiText-103/LAMBADA evaluation results at the end of pre-training, curriculum learning reduces the required number of tokens and wall clock time by up to 13% and 61%, respectively.

  Access Paper or Ask Questions

DeepTriage: Exploring the Effectiveness of Deep Learning for Bug Triaging

Jan 04, 2018
Senthil Mani, Anush Sankaran, Rahul Aralikatte

For a given software bug report, identifying an appropriate developer who could potentially fix the bug is the primary task of a bug triaging process. A bug title (summary) and a detailed description is present in most of the bug tracking systems. Automatic bug triaging algorithm can be formulated as a classification problem, with the bug title and description as the input, mapping it to one of the available developers (classes). The major challenge is that the bug description usually contains a combination of free unstructured text, code snippets, and stack trace making the input data noisy. The existing bag-of-words (BOW) feature models do not consider the syntactical and sequential word information available in the unstructured text. We propose a novel bug report representation algorithm using an attention based deep bidirectional recurrent neural network (DBRNN-A) model that learns a syntactic and semantic feature from long word sequences in an unsupervised manner. Instead of BOW features, the DBRNN-A based bug representation is then used for training the classifier. Using an attention mechanism enables the model to learn the context representation over a long word sequence, as in a bug report. To provide a large amount of data to learn the feature learning model, the unfixed bug reports (~70% bugs in an open source bug tracking system) are leveraged, which were completely ignored in the previous studies. Another contribution is to make this research reproducible by making the source code available and creating a public benchmark dataset of bug reports from three open source bug tracking system: Google Chromium (383,104 bug reports), Mozilla Core (314,388 bug reports), and Mozilla Firefox (162,307 bug reports). Experimentally we compare our approach with BOW model and machine learning approaches and observe that DBRNN-A provides a higher rank-10 average accuracy.

  Access Paper or Ask Questions

LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval

May 02, 2022
Yifan Wang, Haodi Ma, Daisy Zhe Wang

Text retrieval using dense embeddings generated from deep neural models is called "dense passage retrieval". Dense passage retrieval systems normally deploy a deep neural model followed by an approximate nearest neighbor (ANN) search module. The model generates text embeddings, which are then indexed by the ANN module. With the increasing data scale, the ANN module unavoidably becomes the bottleneck on efficiency, because of its linear or sublinear time complexity with data scale. An alternative is the learned index which has a theoretically constant time complexity. But most of the existing learned indexes are designed for low dimensional data. Thus they are not suitable for dense passage retrieval tasks with high-dimensional dense embeddings. We propose LIDER, an efficient high-dimensional Learned Index for large-scale DEnse passage Retrieval. LIDER has a clustering-based hierarchical architecture formed by two layers of core models. As the basic unit of LIDER to index and search data, each core model includes an adapted recursive model index (RMI) and a dimension reduction component which consists of an extended SortingKeys-LSH (SK-LSH) and a key re-scaling module. The dimension reduction component reduces the high-dimensional dense embeddings into one-dimensional keys and sorts them in a specific order, which are then used by the RMI. And the RMI consists of multiple simple linear regression models that make fast prediction in only O(1) time. We successfully optimize and combine SK-LSH and RMI together into the core model, and organize multiple core models into a two-layer structure based on a clustering-based partitioning of the whole data space. Experiments show that LIDER has a higher search speed with high retrieval quality comparing to the state-of-the-art ANN indexes commonly used in dense passage retrieval. Furthermore, LIDER has a better capability of speed-quality trade-off.

  Access Paper or Ask Questions

Neural Language Model for Automated Classification of Electronic Medical Records at the Emergency Room. The Significant Benefit of Unsupervised Generative Pre-training

Sep 13, 2019
Binbin Xu, Cédric Gil-Jardiné, Frantz Thiessard, Eric Tellier, Marta Avalos, Emmanuel Lagarde

In order to build a national injury surveillance system based on emergency room (ER) visits we are developing a coding system to classify their causes from clinical notes content. Supervised learning techniques have shown good results in this area but require to manually build a large learning annotated dataset. New levels of performance have been recently achieved in neural language models (NLM) with the use of models based on the Transformer architecture with an unsupervised generative pre-training step. Our hypothesis is that methods involving a generative self-supervised pre-training step significantly reduce the number of annotated samples required for supervised fine-tuning. In this case study, we assessed whether we could predict from free text clinical notes whether a visit was the consequence of a traumatic or a non-traumatic event. We compared two strategies: Strategy A consisted in training the GPT-2 NLM on the full 161 930 samples dataset with all labels (trauma/non-trauma). In Strategy B, we split the training dataset in two parts, a large one of 151 930 samples without any label for the self-supervised pre-training phase and a smaller one (up to 10 000 samples) for the supervised fine-tuning with labels. While strategy A needed to process 40 000 samples to achieve good performance (AUC>0.95), strategy B needed only 500 samples, a gain of 80. Moreover, an AUC of 0.93 was measured with only 30 labeled samples processed 3 times (3 epochs). To conclude, it is possible to adapt a multi-purpose NLM model such as the GPT-2 to create a powerful tool for classification of free-text notes with the need of a very small number of labeled samples. Only two modalities (trauma/non-trauma) were predicted for this case study but the same method can be applied for multimodal classification tasks such as diagnosis/disease terminologies.

* 8 pages, 5 figures 

  Access Paper or Ask Questions

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Jul 13, 2020
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Debsindhu Bhowmik, Burkhard Rost

Motivation: NLP continues improving substantially through auto-regressive and auto-encoding Language Models. These LMs require expensive computing resources for self-supervised or un-supervised learning from huge unlabelled text corpora. The information learned is transferred through so-called embeddings to downstream prediction tasks. Bioinformatics provide vast gold-mines of structured and sequentially ordered text data leading to extraordinarily successful protein sequence LMs that promise new frontiers for generative and predictive tasks at low inference cost. Here, we addressed two questions: (1) To which extent can HPC up-scale protein LMs to larger databases and larger models? (2) To which extent can LMs extract features from single proteins to get closer to the performance of methods using evolutionary information? Methodology: Here, we trained two auto-regressive language models (Transformer-XL and XLNet) and two auto-encoder models (BERT and Albert) using 80 billion amino acids from 200 million protein sequences (UniRef100) and 393 billion amino acids from 2.1 billion protein sequences (BFD). The LMs were trained on the Summit supercomputer, using 5616 GPUs and one TPU Pod, using V3-512 cores. Results: The results of training these LMs on proteins was assessed by predicting secondary structure in three- and eight-states (Q3=75-83, Q8=63-72), localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabelled data (only protein sequences) captured important biophysical properties of the protein alphabet, namely the amino acids, and their well orchestrated interplay in governing the shape of proteins. In the analogy of NLP, this implied having learned some of the grammar of the language of life realized in protein sequences.

  Access Paper or Ask Questions

Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon

Feb 15, 2021
Hadi Veisi, Hawre Hosseini, Mohammad Mohammadamini, Wirya Fathy, Aso Mahmudi

In this paper, we introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira. The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries, but due to the lack of speech and text resources, there is no speech recognition system for this language. To fill this gap, we introduce the first speech corpus and pronunciation lexicon for the Kurdish language. Regarding speech corpus, we designed a sentence collection in which the ratio of di-phones in the collection resembles the real data of the Central Kurdish language. The designed sentences are uttered by 576 speakers in a controlled environment with noise-free microphones (called AsoSoft Speech-Office) and in Telegram social network environment using mobile phones (denoted as AsoSoft Speech-Crowdsourcing), resulted in 43.68 hours of speech. Besides, a test set including 11 different document topics is designed and recorded in two corresponding speech conditions (i.e., Office and Crowdsourcing). Furthermore, a 60K pronunciation lexicon is prepared in this research in which we faced several challenges and proposed solutions for them. The Kurdish language has several dialects and sub-dialects that results in many lexical variations. Our methods for script standardization of lexical variations and automatic pronunciation of the lexicon tokens are presented in detail. To setup the recognition engine, we used the Kaldi toolkit. A statistical tri-gram language model that is extracted from the AsoSoft text corpus is used in the system. Several standard recipes including HMM-based models (i.e., mono, tri1, tr2, tri2, tri3), SGMM, and DNN methods are used to generate the acoustic model. These methods are trained with AsoSoft Speech-Office and AsoSoft Speech-Crowdsourcing and a combination of them. The best performance achieved by the SGMM acoustic model which results in 13.9% of the average word error rate (on different document topics) and 4.9% for the general topic.

  Access Paper or Ask Questions

Escaping Local Optima using Crossover with Emergent or Reinforced Diversity

Aug 10, 2016
Duc-Cuong Dang, Tobias Friedrich, Timo Kötzing, Martin S. Krejca, Per Kristian Lehre, Pietro S. Oliveto, Dirk Sudholt, Andrew M. Sutton

Population diversity is essential for avoiding premature convergence in Genetic Algorithms (GAs) and for the effective use of crossover. Yet the dynamics of how diversity emerges in populations are not well understood. We use rigorous runtime analysis to gain insight into population dynamics and GA performance for the ($\mu$+1) GA and the $\text{Jump}_k$ test function. We show that the interplay of crossover and mutation may serve as a catalyst leading to a sudden burst of diversity. This leads to improvements of the expected optimisation time of order $\Omega(n/\log n)$ compared to mutation-only algorithms like (1+1) EA. Moreover, increasing the mutation rate by an arbitrarily small constant factor can facilitate the generation of diversity, leading to speedups of order $\Omega(n)$. We also compare seven commonly used diversity mechanisms and evaluate their impact on runtime bounds for the ($\mu$+1) GA. All previous results in this context only hold for unrealistically low crossover probability $p_c=O(k/n)$, while we give analyses for the setting of constant $p_c < 1$ in all but one case. For the typical case of constant $k > 2$ and constant $p_c$, we can compare the resulting expected runtimes for different diversity mechanisms assuming an optimal choice of $\mu$: $O(n^{k-1})$ for duplicate elimination/minim., $O(n^2\log n)$ for maximising the convex hull, $O(n\log n)$ for deterministic crowding (assuming $p_c = k/n$), $O(n\log n)$ for maximising Hamming distance, $O(n\log n)$ for fitness sharing, $O(n\log n)$ for single-receiver island model. This proves a sizeable advantage of all variants of the ($\mu$+1) GA compared to (1+1) EA, which requires time $\Theta(n^k)$. Experiments complement our theoretical findings and further highlight the benefits of crossover and diversity on $\text{Jump}_k$.

  Access Paper or Ask Questions

Dictionary LASSO: Guaranteed Sparse Recovery under Linear Transformation

Jul 20, 2013
Ji Liu, Lei Yuan, Jieping Ye

We consider the following signal recovery problem: given a measurement matrix $\Phi\in \mathbb{R}^{n\times p}$ and a noisy observation vector $c\in \mathbb{R}^{n}$ constructed from $c = \Phi\theta^* + \epsilon$ where $\epsilon\in \mathbb{R}^{n}$ is the noise vector whose entries follow i.i.d. centered sub-Gaussian distribution, how to recover the signal $\theta^*$ if $D\theta^*$ is sparse {\rca under a linear transformation} $D\in\mathbb{R}^{m\times p}$? One natural method using convex optimization is to solve the following problem: $$\min_{\theta} {1\over 2}\|\Phi\theta - c\|^2 + \lambda\|D\theta\|_1.$$ This paper provides an upper bound of the estimate error and shows the consistency property of this method by assuming that the design matrix $\Phi$ is a Gaussian random matrix. Specifically, we show 1) in the noiseless case, if the condition number of $D$ is bounded and the measurement number $n\geq \Omega(s\log(p))$ where $s$ is the sparsity number, then the true solution can be recovered with high probability; and 2) in the noisy case, if the condition number of $D$ is bounded and the measurement increases faster than $s\log(p)$, that is, $s\log(p)=o(n)$, the estimate error converges to zero with probability 1 when $p$ and $s$ go to infinity. Our results are consistent with those for the special case $D=\bold{I}_{p\times p}$ (equivalently LASSO) and improve the existing analysis. The condition number of $D$ plays a critical role in our analysis. We consider the condition numbers in two cases including the fused LASSO and the random graph: the condition number in the fused LASSO case is bounded by a constant, while the condition number in the random graph case is bounded with high probability if $m\over p$ (i.e., $#text{edge}\over #text{vertex}$) is larger than a certain constant. Numerical simulations are consistent with our theoretical results.

* 26 pages, 3 figures, ICML2013 

  Access Paper or Ask Questions

Design and implementation of audio communication system for social-humanoid robot Lumen as an exhibition guide in Electrical Engineering Days 2015

Jul 16, 2016
Putri Nhirun Rikasofiadewi, Ary Setijadi Prihatmanto

Social Robot Lumen is a humanoid robot created to act like human and be human friend. In this study, Lumen scenario is limited on Lumen as an exhibition guide in Electrical Engineering Days 2015, a seminar and exhibition of electrical engineering undergraduate and graduate student of Bandung Institute of Technology. To be an exhibition guide, Lumen is equipped by Nao robot, a server, and processing applications. Audio communication system is one of the processing applications. The purpose of the system is to create verbal communication that allow Lumen to receive human voice and respond naturally to it. To be able to communicate like a human, audio communication system is built with speech recognition module to transform speech data into text, speech synthesizer module to transform text data into speech, and gender identification module to distinguish adult female and male voice. Speech recognition module is implemented using Google Speech Recognition API, speech synthesizer module is implemented using Acapela engine, and gender identification module implemented by utilizing speech signal feature that is extracted using Fast Fourier Transform algorithm. Hardware used for implementation are Nao robot, computer, and wireless modem. ----- Lumen Robot Sosial Robot merupakan robot humanoid yang diciptakan agar dapat bersikap seperti manusia dan menjadi teman bagi manusia. Sistem komunikasi audio merupakan salah satu aplikasi pengolah yang bertujuan agar Lumen dapat menerima suara manusia dan meresponnya dengan natural, yaitu seperti cara manusia merespon manusia lainnya. Untuk dapat berkomunikasi seperti manusia, sistem komunikasi audio dilengkapi dengan tiga buah modul: speech recognition untuk mengubah data suara menjadi teks, speech synthesizer untuk mengubah data teks menjadi suara, dan gender identification untuk membedakan suara wanita dan pria.

* Keywords: robot, audio, communication system, speech recognition, speech synthesizer, gender identification, Fast Fourier Transform. Kata kunci: robot, audio, system komunikasi, speech recognition, speech synthesizer, gender identification, Fast Fourier Transform 

  Access Paper or Ask Questions