
Thamar Solorio


OATS: Opinion Aspect Target Sentiment Quadruple Extraction Dataset for Aspect-Based Sentiment Analysis

Sep 23, 2023
Siva Uday Sampreeth Chebolu, Franck Dernoncourt, Nedim Lipka, Thamar Solorio

Figures 1–4

Aspect-based Sentiment Analysis (ABSA) delves into understanding sentiments specific to distinct elements within textual content. It aims to analyze user-generated reviews to determine a) the target entity being reviewed, b) the high-level aspect to which it belongs, c) the sentiment words used to express the opinion, and d) the sentiment expressed toward the targets and the aspects. While various benchmark datasets have fostered advancements in ABSA, they often come with domain limitations and data granularity challenges. Addressing these, we introduce the OATS dataset, which encompasses three fresh domains and consists of 20,000 sentence-level quadruples and 13,000 review-level tuples. Our initiative seeks to bridge specific observed gaps: the recurrent focus on familiar domains like restaurants and laptops, limited data for intricate quadruple extraction tasks, and an occasional oversight of the synergy between sentence and review-level sentiments. Moreover, to elucidate OATS's potential and shed light on the various ABSA subtasks that OATS can address, we conducted in-domain and cross-domain experiments, establishing initial baselines. We hope the OATS dataset augments current resources, paving the way for an encompassing exploration of ABSA.
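The four elements a)–d) can be pictured as one quadruple per expressed opinion. A minimal sketch of that structure follows; the field names and the example review are illustrative, not the actual OATS schema or data:

```python
from typing import NamedTuple

class Quadruple(NamedTuple):
    """One opinion quadruple; fields mirror a)-d) above."""
    target: str     # a) the entity being reviewed
    aspect: str     # b) the high-level aspect category it belongs to
    opinion: str    # c) the sentiment words expressing the opinion
    sentiment: str  # d) the polarity toward the target and aspect

review = "The pasta was delicious but the service was painfully slow."
quads = [
    Quadruple("pasta", "food quality", "delicious", "positive"),
    Quadruple("service", "service general", "painfully slow", "negative"),
]

# A review-level view can aggregate sentence-level quadruples per aspect.
by_aspect = {q.aspect: q.sentiment for q in quads}
```

A review-level tuple would summarize such per-sentence quadruples; the per-aspect aggregation above is only one possible scheme.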

* Initial submission 

Positive and Risky Message Assessment for Music Products

Sep 18, 2023
Yigeng Zhang, Mahsa Shafaei, Fabio Gonzalez, Thamar Solorio

In this work, we propose a novel research problem: assessing positive and risky messages from music products. We first establish a benchmark for multi-angle, multi-level music content assessment and then present an effective multi-task prediction model with ordinality-enforcement to solve this problem. Our results show that the proposed method not only significantly outperforms strong task-specific counterparts but can also concurrently evaluate multiple aspects.
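The abstract does not spell out how ordinality is enforced, but one common way to make a classifier respect ordered levels (e.g., risk ratings) is a cumulative binary encoding, sketched below as a hypothetical illustration rather than the paper's actual mechanism:

```python
def ordinal_targets(level: int, num_levels: int) -> list:
    # Encode ordinal level k of K as K-1 cumulative binary targets:
    # level 2 of 4 -> [1, 1, 0], i.e. "above thresholds 0 and 1, not 2".
    return [1 if level > t else 0 for t in range(num_levels - 1)]

def decode_level(targets) -> int:
    # The predicted level is the number of thresholds passed.
    return sum(targets)
```

Training one sigmoid per threshold on such targets penalizes predictions that jump across levels, which is what an ordinality-enforcing head adds over a plain softmax classifier.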


Context-aware Adversarial Attack on Named Entity Recognition

Sep 16, 2023
Shuguang Chen, Leonardo Neves, Thamar Solorio

Figures 1–4

In recent years, large pre-trained language models (PLMs) have achieved remarkable performance on many natural language processing benchmarks. Despite their success, prior studies have shown that PLMs are vulnerable to attacks from adversarial examples. In this work, we focus on the named entity recognition task and study context-aware adversarial attack methods to examine the model's robustness. Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples and investigate different candidate replacement methods to generate natural and plausible adversarial examples. Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.
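One generic way to find the "most informative words" is a leave-one-out score drop: mask each token and measure how much the model's confidence falls. The sketch below substitutes a toy scorer for a real NER model and illustrates only the general idea, not the authors' exact attack procedure:

```python
def word_importance(tokens, score_fn):
    """Rank token indices by how much masking each token lowers the
    scorer's confidence -- a simple leave-one-out importance proxy."""
    base = score_fn(tokens)
    drops = {}
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        drops[i] = base - score_fn(masked)
    return sorted(drops, key=drops.get, reverse=True)

# Toy stand-in for a real NER model: confidence in tagging "Tim Cook"
# as PERSON is high only while the contextual cue "CEO" is present.
def toy_score(tokens):
    return 0.9 if "CEO" in tokens else 0.4

tokens = "Tim Cook , CEO of Apple , spoke".split()
ranking = word_importance(tokens, toy_score)
```

Here masking "CEO" causes the largest confidence drop, so it ranks first; an attacker would then perturb such words with natural-sounding replacements.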


Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code Switching Analysis

Sep 12, 2023
Luis Chiruzzo, Marvin Agüero-Torales, Gustavo Giménez-Lugo, Aldo Alvarez, Yliana Rodríguez, Santiago Góngora, Thamar Solorio

Figures 1–3

We present GUA-SPA at IberLEF 2023, the first shared task for detecting and analyzing code-switching in Guarani and Spanish. The challenge consisted of three tasks: identifying the language of a token, NER, and a novel task of classifying the way a Spanish span is used in a code-switched context. We annotated a corpus of 1500 texts (around 25 thousand tokens) extracted from news articles and tweets with the information for the tasks. Three teams took part in the evaluation phase, obtaining generally good results for Task 1 and more mixed results for Tasks 2 and 3.
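Task 1 (token-level language identification) can be pictured with a minimal lexicon lookup. The word lists and label names below are invented for illustration and are far cruder than the systems actually submitted to the shared task:

```python
# Tiny illustrative lexicons -- not the shared-task data.
GUARANI = {"che", "nde", "pora", "ha"}
SPANISH = {"el", "la", "es", "muy", "pero"}

def tag_tokens(tokens):
    """Assign a hypothetical language tag to each token by lookup."""
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low in GUARANI:
            tags.append("gn")       # Guarani
        elif low in SPANISH:
            tags.append("es")       # Spanish
        else:
            tags.append("other")    # punctuation, names, unknown words
    return tags
```

Real systems replace the lookup with contextual classifiers, since many tokens (names, punctuation, shared vocabulary) are ambiguous between the two languages.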

* Procesamiento del Lenguaje Natural, Revista no. 71, septiembre de 2023, pp. 321-328  

SafeWebUH at SemEval-2023 Task 11: Learning Annotator Disagreement in Derogatory Text: Comparison of Direct Training vs Aggregation

May 01, 2023
Sadat Shahriar, Thamar Solorio

Figures 1–2

Subjectivity and difference of opinion are key social phenomena, and it is crucial to take them into account in the annotation and detection of derogatory textual content. In this paper, we use four datasets provided by SemEval-2023 Task 11 and fine-tune a BERT model to capture the disagreement in the annotations. We find that modeling individual annotators and aggregating their predictions lowers the Cross-Entropy score by an average of 0.21 compared to training directly on the soft labels. Our findings further demonstrate that annotator metadata contributes an average 0.029 reduction in the Cross-Entropy score.
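The Cross-Entropy score here compares a model's predicted distribution with the soft label derived from annotator votes. A minimal sketch of that metric (the class split and numbers are invented for illustration):

```python
import math

def cross_entropy(soft_label, prediction, eps=1e-12):
    """CE between an annotator-derived soft label and a predicted
    distribution over the same classes; lower is better."""
    return -sum(p * math.log(q + eps) for p, q in zip(soft_label, prediction))

# Four annotators, three of whom labeled the text derogatory:
soft = (0.75, 0.25)
close_pred = (0.70, 0.30)   # roughly matches the annotator split
far_pred = (0.30, 0.70)     # contradicts the majority
```

A prediction that mirrors the annotators' disagreement scores a lower CE than one that contradicts it, which is the scale on which the 0.21 and 0.029 reductions above are measured.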

* SemEval Task 11 paper (System) 

Distillation of encoder-decoder transformers for sequence labelling

Feb 10, 2023
Marco Farina, Duccio Pappadopulo, Anant Gupta, Leslie Huang, Ozan İrsoy, Thamar Solorio

Figures 1–4

Driven by encouraging results on a wide range of tasks, the field of NLP is experiencing an accelerated race to develop bigger language models. This race has also underscored the need to continue pursuing practical distillation approaches that can leverage the knowledge acquired by these big models in a compute-efficient manner. With this goal in mind, we build on recent work to propose a hallucination-free framework for sequence tagging that is especially suited for distillation. We show empirical results of new state-of-the-art performance across multiple sequence labelling datasets and validate the usefulness of this framework for distilling a large model in a few-shot learning scenario.
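The abstract leaves the distillation objective implicit; the standard soft-target loss that such frameworks typically build on can be sketched in a few lines, assuming the usual temperature-softened teacher distribution (a generic illustration, not this paper's specific objective):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: the student
    is rewarded for matching the teacher's soft targets per token."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)
```

For sequence tagging, such a loss would be applied per token over the label distribution; higher temperatures expose more of the teacher's "dark knowledge" about near-miss labels.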

* Accepted to Findings of EACL 2023 

The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

Dec 19, 2022
Genta Indra Winata, Alham Fikri Aji, Zheng-Xin Yong, Thamar Solorio

Figures 1–4

Code-switching, a common phenomenon in written text and conversation, has been studied for decades by the natural language processing (NLP) research community. Initially, code-switching was explored intensively through linguistic theories; currently, more machine-learning-oriented approaches are used to develop models. We introduce a comprehensive systematic survey of code-switching research in natural language processing to understand the progress of the past decades and conceptualize the challenges and tasks on the code-switching topic. Finally, we summarize the trends and findings and conclude with a discussion of future directions and open questions for further investigation.

* Preprint 

Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition

Oct 14, 2022
Shuguang Chen, Leonardo Neves, Thamar Solorio

Figures 1–4

In this work, we take the named entity recognition task in the English language as a case study and explore style transfer as a data augmentation method to increase the size and diversity of training data in low-resource scenarios. We propose a new method to effectively transform the text from a high-resource domain to a low-resource domain by changing its style-related attributes to generate synthetic data for training. Moreover, we design a constrained decoding algorithm along with a set of key ingredients for data selection to guarantee the generation of valid and coherent data. Experiments and analysis on five different domain pairs under different data regimes demonstrate that our approach can significantly improve results compared to current state-of-the-art data augmentation methods. Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.

* To appear at EMNLP 2022 main conference 

Survey of Aspect-based Sentiment Analysis Datasets

Apr 11, 2022
Siva Uday Sampreeth Chebolu, Franck Dernoncourt, Nedim Lipka, Thamar Solorio

Figures 1–4

Aspect-based sentiment analysis (ABSA) is a natural language processing problem that requires analyzing user-generated reviews in order to determine: a) the target entity being reviewed, b) the high-level aspect to which it belongs, and c) the sentiment expressed toward the targets and the aspects. Numerous yet scattered corpora for ABSA make it difficult for researchers to quickly identify those best suited for a specific ABSA subtask. This study aims to present a database of corpora that can be used to train and assess autonomous ABSA systems. Additionally, we provide an overview of the major corpora concerning ABSA and its various subtasks and highlight several corpus features that researchers should consider when selecting a corpus. We conclude that further large-scale ABSA corpora are required. Additionally, because each corpus is constructed differently, it is time-consuming for researchers to experiment with a novel ABSA algorithm on many corpora, and they often employ just one or a few. The field would benefit from an agreement on a data standard for ABSA corpora. Finally, we discuss the advantages and disadvantages of current collection approaches and make recommendations for future ABSA dataset gathering.


CALCS 2021 Shared Task: Machine Translation for Code-Switched Data

Feb 19, 2022
Shuguang Chen, Gustavo Aguilar, Anirudh Srinivasan, Mona Diab, Thamar Solorio

Figures 1–4

To date, efforts in the code-switching literature have focused for the most part on language identification, POS tagging, NER, and syntactic parsing. In this paper, we address machine translation for code-switched social media data and create a community shared task with two modalities for participation: supervised and unsupervised. For the supervised setting, participants are challenged to translate English into Hindi-English (Eng-Hinglish) in a single direction. For the unsupervised setting, we provide the following language pairs: English and Spanish-English (Eng-Spanglish), and English and Modern Standard Arabic-Egyptian Arabic (Eng-MSAEA), in both directions. We share insights and challenges in curating the "into" code-switching language evaluation data. Further, we provide baselines for all language pairs in the shared task. The leaderboard comprises 12 individual system submissions from 5 different teams. The best performance achieved is a 12.67% BLEU score for English to Hinglish and a 25.72% BLEU score for MSAEA to English.
