"Text": models, code, and papers

Statistical Sign Language Machine Translation: from English written text to American Sign Language Gloss

Dec 01, 2011
Achraf Othman, Mohamed Jemni

This work aims to design a statistical machine translation system from English text to American Sign Language (ASL). The system is based on the Moses tool with some modifications, and the results are synthesized through a 3D avatar for interpretation. First, we translate the input text to gloss, a written form of ASL. Second, we pass the output to the WebSign plug-in to play the sign. The contributions of this work are the use of a new language pair, English/ASL, and an improvement of statistical machine translation based on string matching using the Jaro distance.

* IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No 3, 2011, 65-73 
* 9 pages 
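
The abstract credits part of the improvement to string matching with the Jaro distance, but it does not say how the score is used inside the system. As a minimal sketch, the standard Jaro similarity can be computed as below. The helper `closest_entry` and its vocabulary argument are hypothetical illustrations, not part of the authors' system.

```python
from typing import Sequence


def jaro_similarity(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and close enough.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def closest_entry(word: str, vocabulary: Sequence[str]) -> str:
    """Hypothetical helper: pick the vocabulary entry most similar to `word`."""
    return max(vocabulary, key=lambda entry: jaro_similarity(word, entry))


if __name__ == "__main__":
    print(round(jaro_similarity("MARTHA", "MARHTA"), 3))
    print(closest_entry("MARHTA", ["MARTHA", "DIXON"]))
```

One plausible use, assumed here rather than taken from the paper, is to map an unseen input word to its nearest known entry before translation.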


Use of 'off-the-shelf' information extraction algorithms in clinical informatics: a feasibility study of MetaMap annotation of Italian medical notes

Apr 02, 2021
Emma Chiaramello, Francesco Pinciroli, Alberico Bonalumi, Angelo Caroli, Gabriella Tognola

Information extraction from narrative clinical notes is useful for patient care, as well as for secondary use of medical data for research or clinical purposes. Many studies have focused on information extraction from English clinical texts, but fewer have dealt with clinical notes in languages other than English. This study tested the feasibility of using 'off the shelf' information extraction algorithms to identify medical concepts in Italian clinical notes. We used MetaMap to map medical concepts to the Unified Medical Language System (UMLS). The study addressed two questions: (Q1) whether it would be possible to properly map medical terms found in clinical notes and related to the semantic group of 'Disorders' to the Italian UMLS resources; (Q2) whether it would be feasible to use MetaMap as it is to extract these medical concepts from Italian clinical notes. Results in EXP1 showed that the Italian UMLS Metathesaurus sources covered 91% of the medical terms of the 'Disorders' semantic group, as found in the studied dataset. Even though MetaMap was built to analyze texts written in English, it also worked properly with texts written in Italian. MetaMap correctly identified about half of the concepts in the Italian clinical notes. Using MetaMap's annotation on Italian clinical notes instead of a simple text search improved our results by about 15 percentage points. MetaMap showed recall, precision and F-measure of 0.53, 0.98 and 0.69, respectively. Most of the failures were due to MetaMap's inability to generate meaningful Italian variants. MetaMap's performance in annotating automatically translated English clinical notes was in line with findings in the literature, with similar recall (0.75), F-measure (0.83) and even higher precision (0.95).

* Journal of biomedical informatics, Volume 63, October 2016, Pages 22-32 
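
The F-measure figures in the abstract follow from the reported precision and recall. As a small sketch, assuming the balanced F1 (harmonic mean) is the measure meant, the arithmetic can be checked as follows. The function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Italian clinical notes: precision 0.98, recall 0.53 (from the abstract).
    print(round(f_measure(0.98, 0.53), 2))
    # Automatically translated notes: precision 0.95, recall 0.75.
    print(round(f_measure(0.95, 0.75), 2))
```

For the Italian notes this reproduces the reported 0.69. For the translated notes the rounded inputs give about 0.84, which is within rounding of the reported 0.83 if the published precision and recall were themselves rounded.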


ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems

Feb 17, 2021
Yi Lin, Bo Yang, Linchao Li, Dongyue Guo, Jianwei Zhang, Hu Chen, Yi Zhang

In this paper, a multilingual end-to-end framework, called ATCSpeechNet, is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control (ATC) systems. In the proposed framework, we focus on integrating multilingual automatic speech recognition (ASR) into one model, in which an end-to-end paradigm is developed to convert the speech waveform into text directly, without any feature engineering or lexicon. To make up for the deficiency of handcrafted feature engineering caused by ATC challenges, a speech representation learning (SRL) network is proposed to capture robust and discriminative speech representations from the raw wave. A self-supervised training strategy is adopted to optimize the SRL network on unlabeled data and then to predict the speech features, i.e., wave-to-feature. An end-to-end architecture is improved to complete the ASR task, in which a grapheme-based modeling unit is applied to address the multilingual ASR issue. To cope with the small number of transcribed samples in the ATC domain, an unsupervised approach with mask prediction is applied to pre-train the backbone network of the ASR model on unlabeled data by a feature-to-feature process. Finally, by integrating the SRL with the ASR, an end-to-end multilingual ASR framework is formulated in a supervised manner, which is able to translate the raw wave into text in one model, i.e., wave-to-text. Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves high performance with a very small labeled corpus and lower resource consumption: a label error rate of only 4.20% on the 58-hour transcribed corpus. Compared to the baseline model, the proposed approach obtains over 100% relative performance improvement, which can be further enhanced as the number of transcribed samples increases.

* An improved work based on our previous Interspeech 2020 paper (
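
The abstract reports a 4.20% label error rate without defining the metric. A common definition, assumed here, is the edit distance between predicted and reference label sequences divided by the total reference length. Treating each grapheme as one label is likewise an assumption based on the grapheme-based modeling unit mentioned above.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_error_rate(refs: Sequence[Sequence], hyps: Sequence[Sequence]) -> float:
    """Total edit distance over the corpus divided by total reference labels."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_labels = sum(len(r) for r in refs)
    if total_labels == 0:
        raise ValueError("reference corpus contains no labels")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total_labels


if __name__ == "__main__":
    # Toy strings invented for illustration; each character is one label.
    print(label_error_rate(["climb"], ["climp"]))
```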


Heterogeneous Graph Neural Networks for Multi-label Text Classification

Mar 26, 2021
Irene Li, Tianxiao Li, Yixin Li, Ruihai Dong, Toyotaro Suzumura

Multi-label text classification (MLTC) is an attractive and challenging task in natural language processing (NLP). Compared with single-label text classification, MLTC has a wider range of applications in practice. In this paper, we propose a heterogeneous graph convolutional network model to solve the MLTC problem by modeling tokens and labels as nodes in a heterogeneous graph. In this way, we are able to take into account multiple relationships, including token-level relationships. In addition, the model offers good explainability because the token-label edges are exposed. We evaluate our method on three real-world datasets, and the experimental results show that it achieves significant improvements and outperforms state-of-the-art comparison methods.

* 8 tables, 4 figures 


Generating Adversarial Examples in Chinese Texts Using Sentence-Pieces

Dec 29, 2020
Linyang Li, Yunfan Shao, Demin Song, Xipeng Qiu, Xuanjing Huang

Adversarial attacks on texts are mostly substitution-based methods that replace words or characters in the original texts to achieve successful attacks. Recent methods use pre-trained language models as the substitute generator. In Chinese, however, such methods are not applicable because words must first be segmented. In this paper, we propose a pre-trained language model as the substitute generator, using sentence-pieces to craft adversarial examples in Chinese. The substitutions in the generated adversarial examples are not characters or words but 'pieces', which are more natural to Chinese readers. Experimental results show that the generated adversarial samples can mislead strong target models while remaining fluent and semantically preserved.

* pre-print 


Few-Shot and Zero-Shot Learning for Historical Text Normalization

Mar 12, 2019
Marcel Bollmann, Natalia Korchagina, Anders Søgaard

Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can sometimes lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of multi-task learning strategies across different datasets from different languages. This paper evaluates 63 multi-task learning strategies for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. Finally, we show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.
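
The identity baseline mentioned in the abstract leaves every historical token unchanged. A minimal sketch of how such a baseline could be scored, assuming word-level accuracy as the metric (the abstract does not name one) and using toy token pairs invented purely for illustration:

```python
from typing import Iterable, Tuple


def identity_baseline_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of tokens whose historical form already equals the normalized form."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (historical, normalized) pair")
    # The identity baseline predicts the historical token itself.
    correct = sum(1 for historical, normalized in pairs if historical == normalized)
    return correct / len(pairs)


if __name__ == "__main__":
    # Invented examples, not drawn from the datasets used in the paper.
    toy_pairs = [("vnd", "und"), ("der", "der"), ("jn", "in"), ("haus", "haus")]
    print(identity_baseline_accuracy(toy_pairs))
```

Because many historical spellings already match their modern form, this baseline can be relatively strong, which is consistent with how the abstract describes it.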


Using Text Analytics for Health to Get Meaningful Insights from a Corpus of COVID Scientific Papers

Oct 28, 2021
Dmitry Soshnikov, Vickie Soshnikova

Since the beginning of the COVID pandemic, around 700,000 scientific papers have been published on the subject. A human researcher cannot possibly get acquainted with such a huge text corpus, so AI-based tools that help navigate this corpus and derive useful insights from it are highly needed. In this paper, we use the Text Analytics for Health pre-trained service together with some cloud tools to extract knowledge from scientific papers, gain insights, and build a tool to help researchers navigate the paper collection in a meaningful way.

* 13 pages, 11 figures 


Monitoring Energy Trends through Automatic Information Extraction

Jan 05, 2022
Dilek Küçük

Energy research is of crucial public importance, but the use of computer science technologies like automatic text processing and data management in the energy domain is still rare. Employing these technologies in the energy domain will be a significant contribution to the interdisciplinary topic of "energy informatics", just like the related progress within the interdisciplinary area of "bioinformatics". In this paper, we present the architecture of a Web-based semantic system called EneMonIE (Energy Monitoring through Information Extraction) for monitoring up-to-date energy trends through the use of automatic, continuous, and guided information extraction from diverse types of media available on the Web. The types of media handled by the system will include online news articles, social media texts, online news videos, and open-access scholarly papers and technical reports, as well as various numeric energy data made publicly available by energy organizations. The system will utilize and contribute to energy-related ontologies, and its ultimate form will comprise components for (i) text categorization, (ii) named entity recognition, (iii) temporal expression extraction, (iv) event extraction, (v) social network construction, (vi) sentiment analysis, (vii) information fusion and summarization, (viii) media interlinking, and (ix) Web-based information retrieval and visualization. With its diverse data sources, automatic text processing capabilities, and presentation facilities open for public use, EneMonIE will be an important source of distilled and concise information for decision-makers including energy generation, transmission, and distribution system operators, energy research centres, related investors and entrepreneurs, as well as for academicians, students, and other individuals interested in the pace of energy events and technologies.

* 5 pages 


Semi-Automating Knowledge Base Construction for Cancer Genetics

May 26, 2020
Somin Wadhwa, Kanhua Yin, Kevin S. Hughes, Byron C. Wallace

In this work, we consider the exponentially growing subarea of genetics in cancer. The need to synthesize and centralize this evidence for dissemination has motivated a team of physicians to manually construct and maintain a knowledge base that distills key results reported in the literature. This is a laborious process that entails reading through full-text articles to understand the study design, assess study quality, and extract the reported cancer risk estimates associated with particular hereditary cancer genes (i.e., penetrance). In this work, we propose models to automatically surface key elements from full-text cancer genetics articles, with the ultimate aim of expediting the manual workflow currently in place. We propose two challenging tasks that are critical for characterizing the findings reported in cancer genetics studies: (i) extracting snippets of text that describe ascertainment mechanisms, which in turn inform whether the population studied may introduce bias owing to deviations from the target population; (ii) extracting reported risk estimates (e.g., odds or hazard ratios) associated with specific germline mutations. The latter task may be viewed as a joint entity tagging and relation extraction problem. To train models for these tasks, we induce distant supervision over tokens and snippets in full-text articles using the manually constructed knowledge base. We propose and evaluate several model variants, including a transformer-based joint entity and relation extraction model to extract (germline mutation, risk estimate) pairs. We observe strong empirical performance, highlighting the practical potential for such models to aid KB construction in this space. We ablate components of our model, observing, e.g., that a joint model for (germline mutation, risk estimate) pairs fares substantially better than a pipelined approach.

* In proceedings of the Conference on Automated Knowledge Base Construction (AKBC), 2020 
