Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Text": models, code, and papers

Cross-lingual Transfer Learning for COVID-19 Outbreak Alignment

Jun 05, 2020
Sharon Levy, William Yang Wang

The spread of COVID-19 has become a significant and troubling aspect of society in 2020. With millions of cases reported across countries, new outbreaks have occurred and followed patterns of previously affected areas. Many disease detection models do not incorporate the wealth of social media data that can be utilized for modeling and predicting its spread. In this case, it is useful to ask, can we utilize this knowledge in one country to model the outbreak in another? To answer this, we propose the task of cross-lingual transfer learning for epidemiological alignment. Utilizing both macro and micro text features, we train on Italy's early COVID-19 outbreak through Twitter and transfer to several other countries. Our experiments show strong results with up to 0.85 Spearman correlation in cross-country predictions.

  Access Paper or Ask Questions

Self-Supervised Representation Learning on Document Images

May 27, 2020
Adrian Cosma, Mihai Ghidoveanu, Michael Panaitescu-Liess, Marius Popescu

This work analyses the impact of self-supervised pre-training on document images in the context of document image classification. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based pre-training performs poorly on document images because of their different structural properties and poor intra-sample semantic information. We propose two context-aware alternatives to improve performance on the Tobacco-3482 image classification task. We also propose a novel method for self-supervision, which makes use of the inherent multi-modality of documents (image and text), which performs better than other popular self-supervised methods, including supervised ImageNet pre-training, on document image classification scenarios with a limited amount of data.

* 15 pages, 5 figures. Accepted at DAS 2020: IAPR International Workshop on Document Analysis Systems 

  Access Paper or Ask Questions

BERTweet: A pre-trained language model for English Tweets

May 20, 2020
Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet is trained using the RoBERTa pre-training procedure (Liu et al., 2019), with the same model configuration as BERT-base (Devlin et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet to facilitate future research and downstream applications on Tweet data. Our BERTweet is available at:

  Access Paper or Ask Questions

Real-time information retrieval from Identity cards

Mar 26, 2020
Niloofar Tavakolian, Azadeh Nazemi, Donal Fitzpatrick

Information is frequently retrieved from valid personal ID cards by the authorised organisation to address different purposes. The successful information retrieval (IR) depends on the accuracy and timing process. A process which necessitates a long time to respond is frustrating for both sides in the exchange of data. This paper aims to propose a series of state-of-the-art methods for the journey of an Identification card (ID) from the scanning or capture phase to the point before Optical character recognition (OCR). The key factors for this proposal are the accuracy and speed of the process during the journey. The experimental results of this research prove that utilising the methods based on deep learning, such as Efficient and Accurate Scene Text (EAST) detector and Deep Neural Network (DNN) for face detection, instead of traditional methods increase the efficiency considerably.

* 6pages,10 figures,conference 

  Access Paper or Ask Questions

Data Augmentation using Pre-trained Transformer Models

Mar 04, 2020
Varun Kumar, Ashutosh Choudhary, Eunah Cho

Language model based pre-trained models such as BERT have provided significant gains across different NLP tasks. In this paper, we study different types of pre-trained transformer based models such as auto-regressive models (GPT-2), auto-encoder models (BERT), and seq2seq models (BART) for conditional data augmentation. We show that prepending the class labels to text sequences provides a simple yet effective way to condition the pre-trained models for data augmentation. On three classification benchmarks, pre-trained Seq2Seq model outperforms other models. Further, we explore how different pre-trained model based data augmentation differs in-terms of data diversity, and how well such methods preserve the class-label information.

* 7 pages 

  Access Paper or Ask Questions

Parameter Sharing Decoder Pair for Auto Composing

Nov 09, 2019
Xu Zhao

Auto Composing is an active and appealing research area in the past few years, and lots of efforts have been put into inventing more robust models to solve this problem. With the fast evolution of deep learning techniques, some deep neural network-based language models are becoming dominant. Notably, the transformer structure has been proven to be very efficient and promising in modeling texts. However, the transformer-based language models usually contain huge number of parameters and the size of the model is usually too large to put in production for some storage limited applications. In this paper, we propose a parameter sharing decoder pair (PSDP), which reduces the number of parameters dramatically and at the same time maintains the capability of generating understandable and reasonable compositions. Works created by the proposed model are presented to demonstrate the effectiveness of the model.

* The author information of the old version of this paper is wrong. Removed it. Please use this version if need to cite 

  Access Paper or Ask Questions

Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Nov 08, 2019
Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, Alfons Juan

Current research into spoken language translation (SLT) is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable.

* Submitted to ICASSP2020 

  Access Paper or Ask Questions

Invariance and identifiability issues for word embeddings

Nov 06, 2019
Rachel Carrington, Karthik Bharath, Simon Preston

Word embeddings are commonly obtained as optimizers of a criterion function $f$ of a text corpus, but assessed on word-task performance using a different evaluation function $g$ of the test data. We contend that a possible source of disparity in performance on tasks is the incompatibility between classes of transformations that leave $f$ and $g$ invariant. In particular, word embeddings defined by $f$ are not unique; they are defined only up to a class of transformations to which $f$ is invariant, and this class is larger than the class to which $g$ is invariant. One implication of this is that the apparent superiority of one word embedding over another, as measured by word task performance, may largely be a consequence of the arbitrary elements selected from the respective solution sets. We provide a formal treatment of the above identifiability issue, present some numerical examples, and discuss possible resolutions.

* NIPS 2019 

  Access Paper or Ask Questions

Big Bidirectional Insertion Representations for Documents

Oct 29, 2019
Lala Li, William Chan

The Insertion Transformer is well suited for long form text generation due to its parallel generation capabilities, requiring $O(\log_2 n)$ generation steps to generate $n$ tokens. However, modeling long sequences is difficult, as there is more ambiguity captured in the attention mechanism. This work proposes the Big Bidirectional Insertion Representations for Documents (Big BIRD), an insertion-based model for document-level translation tasks. We scale up the insertion-based models to long form documents. Our key contribution is introducing sentence alignment via sentence-positional embeddings between the source and target document. We show an improvement of +4.3 BLEU on the WMT'19 English$\rightarrow$German document-level translation task compared with the Insertion Transformer baseline.

  Access Paper or Ask Questions