"Text": models, code, and papers

Towards Arabic Sentence Simplification via Classification and Generative Approaches

Apr 20, 2022
Nouran Khallaf, Serge Sharoff

This paper presents an attempt to build a Modern Standard Arabic (MSA) sentence-level simplification system. We experimented with sentence simplification using two approaches: (i) a classification approach leading to lexical simplification pipelines which use Arabic-BERT, a pre-trained contextualised model, as well as a model of fastText word embeddings; and (ii) a generative approach, a Seq2Seq technique applying the multilingual Text-to-Text Transfer Transformer (mT5). We developed our training corpus by aligning the original and simplified sentences from the internationally acclaimed Arabic novel "Saaq al-Bambuu". We evaluate the effectiveness of these methods by comparing the generated simple sentences to the target simple sentences using the BERTScore evaluation metric. The simple sentences produced by the mT5 model achieve P 0.72, R 0.68 and F-1 0.70 via BERTScore, while combining Arabic-BERT and fastText achieves P 0.97, R 0.97 and F-1 0.97. In addition, we report a manual error analysis for these experiments.
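As a quick sanity check on the figures quoted above, the following minimal sketch recomputes F-1 as the harmonic mean of the reported precision and recall. The helper name `f1_from_pr` is hypothetical, and the check is only approximate: BERTScore normally computes F-1 per sentence pair before averaging, so the harmonic mean of averaged P and R need not match the reported F-1 exactly.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract (corpus-level averages, as reported).
mt5_f1 = f1_from_pr(0.72, 0.68)
combined_f1 = f1_from_pr(0.97, 0.97)

print(f"mT5: {mt5_f1:.2f}, Arabic-BERT + fastText: {combined_f1:.2f}")
```

Under these assumptions, both recomputed values round to the F-1 scores stated in the abstract.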

A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus

Apr 03, 2022
Seth Kulick, Neville Ryant, Beatrice Santorini, Joel Wallenberg

We describe the construction and evaluation of a part-of-speech tagger for Yiddish (the first one, to the best of our knowledge). This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work - an 80K word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word embeddings on the YBC corpus, and these embeddings are used with a tagger model trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We evaluate the tagger performance on a 10-fold cross-validation split, with and without the embeddings, showing that the embeddings improve tagger performance. However, a great deal of work remains to be done, and we conclude by discussing some next steps, including the need for additional annotated training and test data.
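The 10-fold cross-validation mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' setup: the abstract does not say whether folds are drawn over sentences, documents, or tokens, nor how they are shuffled, and the function name `k_fold_splits` is hypothetical.

```python
import random


def k_fold_splits(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible folds
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


# Example: 25 annotated items (e.g. sentences) split into 10 folds.
splits = list(k_fold_splits(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each item is held out exactly once, so a tagger evaluated this way is scored on every annotated item while never being tested on data it was trained on in that fold.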

Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis

Mar 29, 2022
Kei Furukawa, Takeshi Kishiyama, Satoshi Nakamura

End-to-end text-to-speech synthesis (TTS), which generates speech sounds directly from strings of texts or phonemes, has improved the quality of speech synthesis over the conventional TTS. However, most previous studies have been evaluated based on subjective naturalness and have not objectively examined whether they can reproduce pitch patterns of phonological phenomena such as downstep, rhythmic boost, and initial lowering that reflect syntactic structures in Japanese. These phenomena can be linguistically explained by phonological constraints and the syntax–prosody mapping hypothesis (SPMH), which assumes projections from syntactic structures to phonological hierarchy. Although some experiments in psycholinguistics have verified the validity of the SPMH, it is crucial to investigate whether it can be implemented in TTS. To synthesize linguistic phenomena involving syntactic or phonological constraints, we propose a model using phonological symbols based on the SPMH and prosodic well-formedness constraints. Experimental results showed that the proposed method synthesized similar pitch patterns to those reported in linguistics experiments for the phenomena of initial lowering and rhythmic boost. The proposed model efficiently synthesizes phonological phenomena in the test data that were not explicitly included in the training data.

* Submitted to INTERSPEECH 2022 

Ontology-enhanced Prompt-tuning for Few-shot Learning

Jan 27, 2022
Hongbin Ye, Ningyu Zhang, Shumin Deng, Xiang Chen, Hui Chen, Feiyu Xiong, Xi Chen, Huajun Chen

Few-shot Learning (FSL) aims to make predictions based on a limited number of samples. Structured data such as knowledge graphs and ontology libraries have been leveraged to benefit the few-shot setting in various tasks. However, the priors adopted by existing methods suffer from missing knowledge, knowledge noise, and knowledge heterogeneity, which hinder few-shot learning performance. In this study, we explore knowledge injection for FSL with pre-trained language models and propose ontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop an ontology transformation based on the external knowledge graph to address the missing-knowledge issue, which completes structured knowledge and converts it to text. We further introduce span-sensitive knowledge injection via a visible matrix to select informative knowledge and handle the knowledge noise issue. To bridge the gap between knowledge and text, we propose a collective training algorithm to optimize representations jointly. We evaluate our proposed OntoPrompt in three tasks, including relation extraction, event extraction, and knowledge graph completion, with eight datasets. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.

* Accepted by WWW2022 

Factorized Neural Transducer for Efficient Language Model Adaptation

Oct 18, 2021
Xie Chen, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li

In recent years, end-to-end (E2E) based automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance. Neural Transducer based models are increasingly popular in streaming E2E based ASR systems and have been reported to outperform the traditional hybrid system in some scenarios. However, the joint optimization of acoustic model, lexicon and language model in neural Transducer also brings about challenges to utilize pure text for language model adaptation. This drawback might prevent their potential applications in practice. In order to address this issue, in this paper, we propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction, and adopting a standalone language model for the vocabulary prediction. It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition, which allows various language model adaptation techniques to be applied. We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation, at the cost of a minor degradation in WER on a general test set.
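For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) is conventionally computed from a word-level edit distance, and how a relative improvement between two WER values could be derived. The function names and the example sentences are hypothetical, and reading the quoted 15% to 20% as a relative reduction is an assumption, since the abstract does not say whether the figure is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Fractional WER reduction of the adapted system relative to the baseline."""
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical transcripts: one substitution and one deletion over four words.
example_wer = word_error_rate("a b c d", "a x c")
print(example_wer)
```

With this definition, a drop from a baseline WER of 10.0 to 8.0 would count as a 20% relative improvement.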

WOVe: Incorporating Word Order in GloVe Word Embeddings

May 18, 2021
Mohammed Ibrahim, Susan Gauch, Tyler Gerth, Brandon Cox

Word vector representations open up new opportunities to extract useful information from unstructured text. Defining a word as a vector makes it easy for machine learning algorithms to understand a text and extract information from it. Word vector representations have been used in many applications, such as word synonyms, word analogy, syntactic parsing, and many others. GloVe, based on word contexts and matrix vectorization, is an effective vector-learning algorithm. It improves on previous vector-learning algorithms. However, the GloVe model fails to explicitly consider the order in which words appear within their contexts. In this paper, multiple methods of incorporating word order in GloVe word embeddings are proposed. Experimental results show that our Word Order Vector (WOVe) word embeddings approach outperforms unmodified GloVe on the natural language tasks of analogy completion and word similarity. WOVe with direct concatenation slightly outperformed GloVe on the word similarity task, increasing average rank by 2%. However, it greatly improved on the GloVe baseline on a word analogy task, achieving an average 36.34% improvement in accuracy.

Towards Semi-Supervised Semantics Understanding from Speech

Nov 11, 2020
Cheng-I Lai, Jin Cao, Sravan Bodapati, Shang-Wen Li

Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected the Automatic Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. To address these shortcomings, we proposed a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed speech. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus. In parallel, we identified two inadequate settings under which SLU models have been tested: noise-robustness and E2E semantics evaluation. We tested the proposed framework under realistic environmental noises and with a new metric, the slots edit F1 score, on two public SLU corpora. Experiments show that our SLU framework with speech as input can perform on par with those with oracle text as input in semantics understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.

* arXiv admin note: text overlap with arXiv:2010.13826 

Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages

Nov 04, 2020
Kushal Jain, Adwait Deshpande, Kumar Shridhar, Felix Laumann, Ayushman Dash

Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks such as text classification, question-answering, and token classification. However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German. Indian languages, on the other hand, are underrepresented in such benchmarks. Despite some Indian languages being included in training multilingual Transformer models, they have not been the primary focus of such work. In order to evaluate the performance on Indian languages specifically, we analyze these language models through extensive experiments on multiple downstream tasks in Hindi, Bengali, and Telugu. Here, we compare the efficacy of fine-tuning model parameters of pre-trained models against that of training a language model from scratch. Moreover, we empirically argue against a strict dependency between dataset size and model performance, and instead encourage task-specific model and method selection. We achieve state-of-the-art performance on Hindi and Bengali for the text classification task. Finally, we present effective strategies for handling the modeling of Indian languages and we release our model checkpoints for the community.

* Accepted at ML-RSA @ NeurIPS 2020 

MINVO Basis: Finding Simplexes with Minimum Volume Enclosing Polynomial Curves

Oct 21, 2020
Jesus Tordesillas, Jonathan P. How

Outer polyhedral representations of a given polynomial curve are extensively exploited in computer graphics rendering, computer gaming, path planning for robots, and finite element simulations. Bézier curves (which use the Bernstein basis) or B-Splines are a very common choice for these polyhedral representations because their non-negativity and partition-of-unity properties guarantee that each interval of the curve is contained inside the convex hull of its control points. However, the convex hull provided by these bases is not the one with smallest volume, producing therefore undesirable levels of conservatism in all of the applications mentioned above. This paper presents the MINVO basis, a polynomial basis that generates the smallest $n$-simplex that encloses any given $n^\text{th}$-order polynomial curve. The results obtained for $n=3$ show that, for any given $3^{\text{rd}}$-order polynomial curve, the MINVO basis is able to obtain an enclosing simplex whose volume is $2.36$ and $254.9$ times smaller than the ones obtained by the Bernstein and B-Spline bases, respectively. When $n=7$, these ratios increase to $902.7$ and $2.997\cdot10^{21}$, respectively.
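The non-negativity and partition-of-unity properties cited above are what make every point of a Bézier curve a convex combination of its control points. The minimal sketch below checks both properties numerically for the Bernstein basis on the interval [0, 1]; it illustrates only this background fact and does not implement the MINVO basis. The function names and the sample control points are hypothetical.

```python
from math import comb


def bernstein_basis(n: int, t: float) -> list[float]:
    """Evaluate the n+1 Bernstein basis polynomials of degree n at t."""
    return [comb(n, i) * t**i * (1 - t) ** (n - i) for i in range(n + 1)]


def bezier_point(control_points: list[tuple[float, ...]], t: float) -> tuple[float, ...]:
    """Evaluate a Bezier curve at t as a weighted sum of its control points."""
    n = len(control_points) - 1
    weights = bernstein_basis(n, t)
    dim = len(control_points[0])
    return tuple(
        sum(w * p[d] for w, p in zip(weights, control_points)) for d in range(dim)
    )


# Hypothetical cubic curve in 3D (n = 3, four control points).
controls = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 2.0, 1.0), (4.0, 0.0, 2.0)]
samples = [k / 10 for k in range(11)]
weights_ok = all(
    min(bernstein_basis(3, t)) >= 0.0 and abs(sum(bernstein_basis(3, t)) - 1.0) < 1e-12
    for t in samples
)
print(weights_ok, bezier_point(controls, 0.5))
```

Because the weights are non-negative and sum to one for every t in [0, 1], each evaluated point lies inside the convex hull of the control points, which is the enclosure whose volume the abstract compares against.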

* 17 pages, 16 figures 
