Mark Cieliebak

SDS-200: A Swiss German Speech to Standard German Text Corpus

May 19, 2022

Michel Plüss, Manuela Hürlimann, Marc Cuny, Alla Stöckli, Nikolaos Kapotis, Julia Hartmann, Malgorzata Anna Ulasik, Christian Scheller, Yanick Schraner, Amit Jain (+3 more)

Abstract: We present SDS-200, a corpus of Swiss German dialectal speech with Standard German text translations, annotated with the speakers' dialect, age, and gender. The dataset allows for training speech translation, dialect recognition, and speech synthesis systems, among others. The data was collected using a web recording tool that is open to the public. Each participant was given a text in Standard German and asked to translate it into their Swiss German dialect before recording it. To increase corpus quality, recordings were validated by other participants. The data consists of 200 hours of speech by around 4000 different speakers and covers a large part of the Swiss German dialect landscape. We release SDS-200 alongside a baseline speech translation model, which achieves a word error rate (WER) of 30.3 and a BLEU score of 53.1 on the SDS-200 test set. Furthermore, we use SDS-200 to fine-tune a pre-trained XLS-R model, achieving 21.6 WER and 64.0 BLEU.
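WER, as reported above, is the word-level edit distance between a system output and a reference, divided by the number of reference words. A minimal sketch of that computation follows. It is not the evaluation script used for SDS-200: it assumes plain whitespace tokenization with no text normalization, and the function name is illustrative. The abstract appears to report WER on a 0-100 scale, so multiply the returned fraction by 100 to compare.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumes whitespace tokenization; real evaluations usually normalize
    casing and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match if cost == 0)
            )
    # Guard against an empty reference to avoid division by zero.
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `word_error_rate("ich gehe nach Hause", "ich gehe heim")` returns 0.5: one substitution plus one deletion over four reference words.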

Probing the Robustness of Trained Metrics for Conversational Dialogue Systems

Feb 28, 2022

Jan Deriu, Don Tuggener, Pius von Däniken, Mark Cieliebak

Abstract: This paper introduces an adversarial method to stress-test trained metrics for evaluating conversational dialogue systems. The method leverages Reinforcement Learning to find response strategies that elicit optimal scores from the trained metrics. We apply our method to test recently proposed trained metrics. We find that all of them are susceptible to giving high scores to responses generated by relatively simple and obviously flawed strategies that our method converges on. For instance, simply copying parts of the conversation context to form a response yields competitive scores or even outperforms responses written by humans.

Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

Oct 05, 2020

Jan Deriu, Don Tuggener, Pius von Däniken, Jon Ander Campos, Alvaro Rodrigo, Thiziri Belkacem, Aitor Soroa, Eneko Agirre, Mark Cieliebak

Abstract: The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations that require humans to converse with chatbots are time- and cost-intensive, place high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate, for each entity in a conversation, whether they think it is human or not (assuming there are human participants in these conversations). These annotations then allow us to rank chatbots by their ability to mimic the conversational behavior of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chatbot can uphold human-like behavior the longest, i.e., Survival Analysis. This metric can correlate a bot's performance with some of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chatbots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.
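The abstract names survival analysis without fixing an estimator. As one illustration, the Kaplan-Meier estimator below computes, for a single bot, the estimated probability that it still passes as human after t turns, treating conversations in which the bot was never spotted as right-censored. This is an assumption for illustration only: the paper's actual model, which also relates survival to characteristics such as fluency, may differ, and the function and argument names are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, detected):
    """Kaplan-Meier survival curve for one bot (illustrative sketch).

    durations[k]: turn at which conversation k stopped counting for this bot,
        either because the bot was spotted or because the conversation ended.
    detected[k]: True if the bot was spotted at durations[k]; False if it was
        never spotted (right-censored observation).
    Returns a list of (turn, survival_probability) at each detection turn.
    """
    # Number of detections (events) observed at each turn.
    events = Counter(t for t, d in zip(durations, detected) if d)
    survival = 1.0
    curve = []
    for t in sorted(events):
        # Conversations still "at risk" at turn t: not yet spotted or censored before t.
        at_risk = sum(1 for u in durations if u >= t)
        survival *= 1.0 - events[t] / at_risk
        curve.append((t, survival))
    return curve
```

For durations `[2, 3, 3, 5, 5]` with detections `[True, True, False, True, False]`, the curve drops to about 0.8, 0.6, and 0.3 at turns 2, 3, and 5. Under this reading, a bot whose curve stays higher for longer upholds human-like behavior better.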

DoQA -- Accessing Domain-Specific FAQs via Conversational QA

May 18, 2020

Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre

Abstract: The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available on FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and it is multi-domain. In addition, we introduce a more realistic information retrieval (IR) scenario where the system needs to find the answer in any of the FAQ documents. The results of an existing strong system show that, thanks to transfer learning from a Wikipedia QA dataset and fine-tuning on a single FAQ domain, it is possible to build high-quality conversational QA systems for FAQs without in-domain training data. The good results carry over into the more challenging IR scenario. In both cases, there is still ample room for improvement, as indicated by the higher human upper bound.

* Accepted at ACL 2020. 13 pages, 4 figures

A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation

Apr 16, 2020

Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak

Abstract: In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation called Operation Trees (OTs), which is based on the logical query plan of a database. This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of query tokens to OT operations. In our method, we randomly generate OTs from a context-free grammar. Afterwards, annotators write the natural language question that the OT represents. Finally, the annotators assign the tokens to the OT operations. We apply the method to create a new corpus, OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases. We compare OTTA to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. We also train a state-of-the-art semantic parsing model on our data and show that our corpus is challenging and that the token alignment can be leveraged to increase performance significantly.
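Randomly generating trees from a context-free grammar, the first step of the method described above, can be sketched as recursive expansion. The grammar, operation names, and table names below are hypothetical toy stand-ins, far simpler than the grammar used for OTTA. A depth limit forces the first (shortest) production once it is reached, so generation always terminates.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of productions.
# Any symbol that is not a key is a terminal (an operation or table name).
GRAMMAR = {
    "Query": [["Project", "Set"], ["Count", "Set"]],
    "Set": [["Table"], ["Filter", "Set"], ["Join", "Set", "Set"]],
    "Table": [["person"], ["movie"], ["cast"]],
}


def generate(symbol="Query", rng=None, depth=0, max_depth=4):
    """Randomly expand `symbol` into a nested (label, children) tree."""
    rng = rng or random.Random()
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: a leaf of the tree
    options = GRAMMAR[symbol]
    # Past max_depth, always take the first production, which leads to a table.
    production = options[0] if depth >= max_depth else rng.choice(options)
    return (symbol, [generate(s, rng, depth + 1, max_depth) for s in production])


def leaves(tree):
    """Flatten a generated tree into its terminal symbols, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [leaf for child in children for leaf in leaves(child)]
```

For instance, `leaves(generate(max_depth=0))` yields `["Project", "person"]`, the smallest tree this toy grammar allows. In the described workflow, such a tree would then be shown to an annotator, who writes the matching question and aligns its tokens to the operations.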

Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

Sep 26, 2019

Jan Deriu, Mark Cieliebak

Abstract: We present "AutoJudge", an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e., a dialogue system talking to itself. It then uses human ratings of these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even deployed ones. In a second part, we attempt to apply AutoJudge to improve existing systems. This works well for re-ranking a set of candidate utterances. However, our experiments show that AutoJudge cannot be applied as a reward for reinforcement learning, although the metric can distinguish good from bad dialogues. We discuss potential reasons, but note that this remains an open question for further research.

* 8 pages, to be published at the INLG 2019 conference

Correlating Twitter Language with Community-Level Health Outcomes

Jun 24, 2019

Arno Schneuwly, Ralf Grubenmann, Séverine Rion Logean, Mark Cieliebak, Martin Jaggi

Abstract: We study how language on social media is linked to diseases such as atherosclerotic heart disease (AHD), diabetes, and various types of cancer. Our proposed model leverages state-of-the-art sentence embeddings, followed by a regression model and clustering, without the need for additional labelled data. It allows us to predict community-level medical outcomes from language, and thereby potentially translate these to the individual level. The method is applicable to a wide range of target variables and allows us to discover known and potentially novel correlations of medical outcomes with lifestyle aspects and other socioeconomic risk factors.

* ACL SMM4H Workshop (Social Media Mining for Health Applications)

Towards Integration of Statistical Hypothesis Tests into Deep Neural Networks

Jun 15, 2019

Ahmad Aghaebrahimian, Mark Cieliebak

Abstract: We report our ongoing work on a new deep architecture that works in tandem with a statistical test procedure for jointly training texts and their label descriptions for multi-label and multi-class classification tasks. A statistical hypothesis testing method is used to extract the most informative words for each class. These words are used as a class description for more label-aware text classification. The intuition is to help the model concentrate on more informative words rather than more frequent ones. The model uses label descriptions in addition to the input text to enhance text classification performance. Our method is entirely data-driven, depends on no sources of information other than the training data, and can be adapted to different classification problems by providing appropriate training data, without major hyper-parameter tuning. We trained and tested our system on several publicly available datasets, improving the state of the art on one of them by a large margin and obtaining competitive results on all the others.
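The abstract does not name the hypothesis test. Assuming a chi-square test of independence between word presence and class membership (a common choice for this purpose, but an assumption here), extracting the most informative words per class could look like the sketch below. Documents are pre-tokenized lists, counts are per document rather than per token, and the function name is hypothetical.

```python
def chi_square_keywords(docs, labels, target, top_k=3):
    """Rank words by chi-square association with class `target` (sketch).

    docs: list of token lists; labels: one class label per document.
    Only words positively associated with the class are returned,
    since chi-square alone also rewards words that avoid the class.
    """
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    in_class = [label == target for label in labels]
    n_class = sum(in_class)
    scores = {}
    for word in set().union(*doc_sets):
        # 2x2 contingency table of (word present?) x (document in class?)
        a = sum(1 for s, inc in zip(doc_sets, in_class) if inc and word in s)
        b = sum(1 for s, inc in zip(doc_sets, in_class) if not inc and word in s)
        c = n_class - a       # in class, word absent
        d = n - n_class - b   # other classes, word absent
        if a * d <= b * c:
            continue  # not positively associated with the target class
        # a*d > b*c implies a, d > 0, so every factor below is positive.
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = n * (a * d - b * c) ** 2 / denom
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

With four toy documents labelled "sport" and "politics", a word that appears in every sport document and in no other document (for example "goal") ranks first for "sport", while a word spread evenly across both classes is filtered out. The top-ranked words would then serve as the class description fed to the classifier.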

* Accepted to ACL 2019

Survey on Evaluation Methods for Dialogue Systems

May 10, 2019

Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, Mark Cieliebak

Abstract: In this paper, we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part of the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very costly and time-intensive. Thus, much work has been put into finding methods that reduce the involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented dialogue systems, conversational dialogue systems, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for its dialogue systems and then presenting the evaluation methods for that class.

Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification

Mar 07, 2017

Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, Martin Jaggi

Abstract: This paper presents a novel approach for multilingual sentiment classification in short texts. This is a challenging task, as the amount of training data in languages other than English is very limited. Previously proposed multilingual approaches typically require establishing a correspondence to English, for which powerful classifiers are already available. In contrast, our method does not require such supervision. We leverage large amounts of weakly supervised data in various languages to train a multi-layer convolutional network and demonstrate the importance of pre-training such networks. We thoroughly evaluate our approach on various multilingual datasets, including the recent SemEval-2016 sentiment prediction benchmark (Task 4), where we achieved state-of-the-art performance. We also compare the performance of our model trained individually for each language to a variant trained for all languages at once. We show that the latter model reaches slightly worse, but still acceptable, performance compared to the single-language model, while benefiting from better generalization properties across languages.

* Appearing at WWW 2017, the 26th International World Wide Web Conference
