Stefan Langer

Domain Adaptive Pretraining for Multilingual Acronym Extraction

Jun 30, 2022
Usama Yaseen, Stefan Langer

This paper presents our findings from participating in the multilingual acronym extraction shared task SDU@AAAI-22. The task consists of acronym extraction from documents in six languages within the scientific and legal domains. To address multilingual acronym extraction, we employed a BiLSTM-CRF model with multilingual XLM-RoBERTa embeddings. We pretrained the XLM-RoBERTa model on the shared-task corpus to further adapt its embeddings to the task domains. Our system (team: SMR-NLP) achieved competitive performance for acronym extraction across all the languages.
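
The domain-adaptive pretraining step described above can be illustrated with a short sketch: continued masked-language-model training of XLM-RoBERTa on an in-domain corpus via the Hugging Face transformers library. The corpus path and hyperparameters below are placeholder assumptions, not values reported in the paper.

    # Hedged sketch: continued MLM pretraining of XLM-RoBERTa on an
    # in-domain corpus (file name and hyperparameters are illustrative).
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)
    from datasets import load_dataset

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

    # One sentence per line; "shared_task_corpus.txt" is a placeholder path.
    corpus = load_dataset("text", data_files={"train": "shared_task_corpus.txt"})
    tokenized = corpus.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    # Randomly mask 15% of tokens, the standard MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="xlmr-domain-adapted",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=tokenized["train"],
        data_collator=collator)
    trainer.train()

The adapted encoder would then supply the embeddings consumed by the BiLSTM-CRF tagger.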

* SDU@AAAI-22 

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Dec 06, 2021
Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Srivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, Samuel Cahyawijaya, Emile Chapuis, Wanxiang Che, Mukund Choudhary, Christian Clauss, Pierre Colombo, Filip Cornell, Gautier Dagan, Mayukh Das, Tanay Dixit, Thomas Dopierre, Paul-Alexis Dray, Suchitra Dubey, Tatiana Ekeinhor, Marco Di Giovanni, Rishabh Gupta, Rishabh Gupta, Louanes Hamla, Sang Han, Fabrice Harel-Canada, Antoine Honore, Ishan Jindal, Przemyslaw K. Joniak, Denis Kleyko, Venelin Kovatchev, Kalpesh Krishna, Ashutosh Kumar, Stefan Langer, Seungjae Ryan Lee, Corey James Levinson, Hualou Liang, Kaizhao Liang, Zhexiong Liu, Andrey Lukyanenko, Vukosi Marivate, Gerard de Melo, Simon Meoni, Maxime Meyer, Afnan Mir, Nafise Sadat Moosavi, Niklas Muennighoff, Timothy Sum Hon Mun, Kenton Murray, Marcin Namysl, Maria Obedkova, Priti Oli, Nivranshu Pasricha, Jan Pfister, Richard Plant, Vinay Prabhu, Vasile Pais, Libo Qin, Shahab Raji, Pawan Kumar Rajpoot, Vikas Raunak, Roy Rinberg, Nicolas Roberts, Juan Diego Rodriguez, Claude Roux, Vasconcellos P. H. S., Ananya B. Sai, Robin M. Schmidt, Thomas Scialom, Tshephisho Sefara, Saqib N. Shamsi, Xudong Shen, Haoyue Shi, Yiwen Shi, Anna Shvets, Nick Siegel, Damien Sileo, Jamie Simon, Chandan Singh, Roman Sitelew, Priyank Soni, Taylor Sorensen, William Soto, Aman Srivastava, KV Aditya Srivatsa, Tony Sun, Mukund Varma T, A Tabassum, Fiona Anting Tan, Ryan Teehan, Mo Tiwari, Marie Tolkiehn, Athena Wang, Zijian Wang, Gloria Wang, Zijie J. Wang, Fuxuan Wei, Bryan Wilie, Genta Indra Winata, Xinyi Wu, Witold Wydmański, Tianbao Xie, Usama Yaseen, M. Yee, Jing Zhang, Yue Zhang

Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework that supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, data cards and robustness analysis results are publicly available in the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).
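
To make the transformation/filter distinction concrete, here is a minimal self-contained sketch of the two operation types. The classes below are stand-ins written for illustration; they mimic the pattern of NL-Augmenter's operation interface rather than importing its actual base classes.

    # Self-contained sketch of the two NL-Augmenter operation types.
    import random
    from typing import List

    class WordSwapTransformation:
        """Transformation: returns modified copies of the input."""
        def generate(self, sentence: str) -> List[str]:
            words = sentence.split()
            if len(words) < 2:
                return [sentence]
            i = random.randrange(len(words) - 1)
            words[i], words[i + 1] = words[i + 1], words[i]  # swap neighbours
            return [" ".join(words)]

    class LengthFilter:
        """Filter: selects a data split by a property of the input."""
        def __init__(self, max_words: int = 10):
            self.max_words = max_words
        def filter(self, sentence: str) -> bool:
            return len(sentence.split()) <= self.max_words

    print(WordSwapTransformation().generate("the quick brown fox"))
    print(LengthFilter(max_words=3).filter("the quick brown fox"))  # False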

* 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter 

Data Augmentation for Low-Resource Named Entity Recognition Using Backtranslation

Aug 26, 2021
Usama Yaseen, Stefan Langer

State-of-the-art natural language processing systems rely on sizable training datasets to achieve high performance. The lack of such datasets in specialized low-resource domains leads to suboptimal performance. In this work, we adapt backtranslation to generate high-quality and linguistically diverse synthetic data for low-resource named entity recognition. We perform experiments on two datasets from the materials science (MaSciP) and biomedical (S800) domains. The empirical results demonstrate the effectiveness of our proposed augmentation strategy, particularly in the low-resource scenario.
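
A minimal sketch of the backtranslation idea, not the paper's exact pipeline: round-trip a sentence through a pivot language with MarianMT models from Hugging Face to obtain a paraphrase. Projecting the entity labels onto the generated text, which NER additionally requires, is omitted here.

    # Hedged sketch of backtranslation (English -> German -> English)
    # using MarianMT; entity-label projection is omitted.
    from transformers import MarianMTModel, MarianTokenizer

    def translate(texts, model_name):
        tok = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tok(texts, return_tensors="pt", padding=True)
        out = model.generate(**batch)
        return [tok.decode(t, skip_special_tokens=True) for t in out]

    src = ["The enzyme is expressed in Saccharomyces cerevisiae."]
    german = translate(src, "Helsinki-NLP/opus-mt-en-de")        # forward pass
    paraphrase = translate(german, "Helsinki-NLP/opus-mt-de-en")  # back pass
    print(paraphrase)  # a linguistically varied restatement of src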

Neural Text Classification and Stacked Heterogeneous Embeddings for Named Entity Recognition in SMM4H 2021

Jun 11, 2021
Usama Yaseen, Stefan Langer

This paper presents our findings from participating in the SMM4H Shared Task 2021. We addressed Named Entity Recognition (NER) and Text Classification. To address NER, we explored a BiLSTM-CRF model with stacked heterogeneous embeddings and linguistic features. For text classification, we investigated various machine learning algorithms (logistic regression, Support Vector Machines (SVM) and neural networks). Our proposed approaches generalize to different languages, and we have shown their effectiveness for English and Spanish. Our text classification submissions (team: MIC-NLP) achieved competitive performance, with F1-scores of 0.46 and 0.90 on ADE Classification (Task 1a) and Profession Classification (Task 7a) respectively. For NER, our submissions scored F1-scores of 0.50 and 0.82 on ADE Span Detection (Task 1b) and Profession Span Detection (Task 7b) respectively.
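
The phrase "stacked heterogeneous embeddings" maps naturally onto the Flair library's StackedEmbeddings; the following is a plausible sketch of such a BiLSTM-CRF setup, with the corpus layout and the exact embedding mix being assumptions rather than details taken from the paper.

    # Plausible Flair sketch: BiLSTM-CRF over stacked heterogeneous
    # embeddings (corpus path and embedding choices are assumptions).
    from flair.datasets import ColumnCorpus
    from flair.embeddings import (FlairEmbeddings, StackedEmbeddings,
                                  WordEmbeddings)
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # "data/" with CoNLL-style columns is a placeholder corpus layout.
    corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
    tag_dict = corpus.make_label_dictionary(label_type="ner")

    # Stack heterogeneous embeddings: classic word vectors plus
    # contextual character-level language-model embeddings.
    embeddings = StackedEmbeddings([
        WordEmbeddings("glove"),
        FlairEmbeddings("news-forward"),
        FlairEmbeddings("news-backward"),
    ])

    tagger = SequenceTagger(hidden_size=256,
                            embeddings=embeddings,
                            tag_dictionary=tag_dict,
                            tag_type="ner",
                            use_crf=True)  # BiLSTM-CRF sequence tagger
    ModelTrainer(tagger, corpus).train("ner-model", max_epochs=10)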

* NAACL 2021 

Difficulty Classification of Mountainbike Downhill Trails utilizing Deep Neural Networks

Aug 05, 2019
Stefan Langer, Robert Müller, Kyrill Schmid, Claudia Linnhoff-Popien

The difficulty of mountainbike downhill trails is a subjective perception. However, sports associations and mountainbike park operators attempt to group trails into different levels of difficulty with scales like the Singletrail-Skala (S0-S5) or colored scales (blue, red, black, ...) as proposed by the International Mountain Bicycling Association. Inconsistencies in difficulty grading occur due to the various scales, different people grading the trails, differences in topography, and more. We propose an end-to-end deep learning approach to classify trails into three difficulty levels (easy, medium, and hard) using sensor data. With mbientlab Meta Motion r0.2 sensor units, we record accelerometer and gyroscope data of one rider on multiple trail segments. A 2D convolutional neural network is trained with a stacked and concatenated representation of this data as its input. We run experiments with five different sample sizes and five different kernel sizes, and achieve a maximum Sparse Categorical Accuracy of 0.9097. To the best of our knowledge, this is the first work targeting computational difficulty classification of mountainbike downhill trails.
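
A hedged Keras sketch of the described setup: windows of accelerometer and gyroscope readings stacked into a 2D input for a small convolutional classifier over three classes. Window length and layer sizes are illustrative assumptions, not the paper's configuration.

    # Sketch: 2D CNN over stacked accelerometer + gyroscope windows
    # (3 accel + 3 gyro channels), three difficulty classes.
    import numpy as np
    import tensorflow as tf

    SAMPLES, CHANNELS = 256, 6  # window length x (accel xyz + gyro xyz)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SAMPLES, CHANNELS, 1)),
        tf.keras.layers.Conv2D(32, (8, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((4, 1)),
        tf.keras.layers.Conv2D(64, (8, 2), activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(3, activation="softmax"),  # easy/medium/hard
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

    # Dummy batch standing in for recorded trail segments.
    x = np.random.randn(16, SAMPLES, CHANNELS, 1).astype("float32")
    y = np.random.randint(0, 3, size=16)
    model.fit(x, y, epochs=1)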

* 11 pages, 5 figures 

Soccer Team Vectors

Jul 30, 2019
Robert Müller, Stefan Langer, Fabian Ritz, Christoph Roch, Steffen Illium, Claudia Linnhoff-Popien

In this work we present STEVE - Soccer TEam VEctors, a principled approach for learning real-valued vectors for soccer teams, where similar teams are close to each other in the resulting vector space. STEVE relies only on freely available information about the matches teams have played in the past. These vectors can serve as input to various machine learning tasks. Evaluated on the task of team market value estimation, STEVE outperforms all of its competitors. Moreover, we use STEVE for similarity search and to rank soccer teams.
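
The abstract does not spell out the training objective, so the following is only a speculative sketch of one way such vectors could be learned: an embedding table trained so that a pair of team vectors predicts the match outcome, which pushes behaviourally similar teams together.

    # Speculative sketch: learn team vectors from match results.
    # The objective (predicting home win/draw/away win from the two
    # team embeddings) is an assumption, not the method from the paper.
    import torch
    import torch.nn as nn

    class TeamVectors(nn.Module):
        def __init__(self, n_teams: int, dim: int = 32):
            super().__init__()
            self.emb = nn.Embedding(n_teams, dim)
            self.head = nn.Linear(2 * dim, 3)  # home win / draw / away win

        def forward(self, home, away):
            pair = torch.cat([self.emb(home), self.emb(away)], dim=-1)
            return self.head(pair)

    model = TeamVectors(n_teams=100)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Dummy matches: (home_id, away_id, outcome in {0, 1, 2}).
    home = torch.randint(0, 100, (64,))
    away = torch.randint(0, 100, (64,))
    outcome = torch.randint(0, 3, (64,))

    opt.zero_grad()
    loss = loss_fn(model(home, away), outcome)
    loss.backward()
    opt.step()
    # After training, model.emb.weight rows are the team vectors; nearest
    # neighbours in this space support similarity search and ranking.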

* 11 pages, 1 figure; This paper was accepted as a workshop paper at the 6th Workshop on Machine Learning and Data Mining for Sports Analytics at ECML/PKDD 2019, Würzburg, Germany; DOI will be added after publication

Deep Neural Baselines for Computational Paralinguistics

Jul 05, 2019
Daniel Elsner, Stefan Langer, Fabian Ritz, Robert Müller, Steffen Illium

Detecting sleepiness from spoken language is an ambitious task, which is addressed by the Interspeech 2019 Computational Paralinguistics Challenge (ComParE). We propose an end-to-end deep learning approach to detect and classify patterns reflecting sleepiness in the human voice. Our approach is based solely on a moderately complex deep neural network architecture. It can be applied directly to the audio data without requiring any task-specific feature engineering, and thus remains transferable to other audio classification tasks. Nevertheless, our approach performs on par with state-of-the-art machine learning models.
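
As a rough sketch of an end-to-end audio classifier without hand-crafted features (a plausible reading of the abstract, not the paper's actual architecture), a small 1D CNN can operate directly on raw waveform windows:

    # Sketch: end-to-end 1D CNN on raw audio windows, no hand-crafted
    # features; architecture, sizes, and the binary sleepy-vs-alert
    # target are illustrative assumptions.
    import numpy as np
    import tensorflow as tf

    SAMPLE_RATE, SECONDS = 16000, 4  # 4-second raw-waveform windows

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SAMPLE_RATE * SECONDS, 1)),
        tf.keras.layers.Conv1D(16, 64, strides=8, activation="relu"),
        tf.keras.layers.MaxPooling1D(4),
        tf.keras.layers.Conv1D(32, 32, strides=4, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # sleepy vs. alert
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    # Dummy batch standing in for recorded speech windows.
    x = np.random.randn(8, SAMPLE_RATE * SECONDS, 1).astype("float32")
    y = np.random.randint(0, 2, size=8)
    model.fit(x, y, epochs=1)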

* 5 pages, 3 figures; This paper was accepted at INTERSPEECH 2019, Graz, 15-19th September 2019. DOI will be added after publication of the accepted paper