Alexander Waibel

The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?

Apr 21, 2025

Fabian Retkowski, Andreas Sudmann, Alexander Waibel

Abstract:Qualitative research often involves labor-intensive processes that are difficult to scale while preserving analytical depth. This paper introduces The AI Co-Ethnographer (AICoE), a novel end-to-end pipeline developed for qualitative research and designed to move beyond the limitations of simply automating code assignments, offering a more integrated approach. AICoE organizes the entire process, encompassing open coding, code consolidation, code application, and even pattern discovery, leading to a comprehensive analysis of qualitative data.

* Accepted to NLP4DH 2025

From Speech to Summary: A Comprehensive Survey of Speech Summarization

Apr 10, 2025

Fabian Retkowski, Maike Züfle, Andreas Sudmann, Dinah Pfau, Jan Niehues, Alexander Waibel

Abstract:Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization is still not clearly defined and intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation methodologies, which are crucial for assessing the effectiveness of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions.

Zero-Shot Strategies for Length-Controllable Summarization

Dec 31, 2024

Fabian Retkowski, Alexander Waibel

Abstract:Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings. We conduct a comprehensive study evaluating LLMs' length control capabilities across multiple measures and propose practical methods to improve controllability. Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model. To address these challenges, we introduce a set of methods: length approximation, target adjustment, sample filtering, and automated revisions. By combining these methods, we demonstrate substantial improvements in length compliance while maintaining or enhancing summary quality, providing highly effective zero-shot strategies for precise length control without the need for model fine-tuning or architectural changes. With our work, we not only advance our understanding of LLM behavior in controlled text generation but also pave the way for more reliable and adaptable summarization systems in real-world applications.
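
Two of the methods named in the abstract, target adjustment and sample filtering, can be illustrated with a minimal Python sketch. The function names, the word-based length measure, and the idea of dividing the target by an observed over-length ratio are assumptions made for illustration; the paper's actual procedures may differ.

```python
from typing import List


def word_count(text: str) -> int:
    """Length measure used in this sketch: whitespace-separated words."""
    return len(text.split())


def adjust_target(target: int, observed_ratio: float) -> int:
    """Target adjustment (assumed form): if the model's outputs average
    observed_ratio times the requested length, request target / ratio."""
    return max(1, round(target / observed_ratio))


def filter_samples(candidates: List[str], target: int) -> str:
    """Sample filtering (assumed form): from several sampled summaries,
    keep the one whose length is closest to the target."""
    return min(candidates, key=lambda c: abs(word_count(c) - target))


if __name__ == "__main__":
    # Hypothetical case: the model tends to overshoot by 25%.
    print(adjust_target(100, 1.25))
    samples = ["one two three", "one two three four five", "one"]
    print(filter_samples(samples, 5))
```

Both helpers work on already generated text, so they need no access to model weights, which matches the zero-shot setting the abstract describes.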

MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Nov 27, 2024

Thai-Binh Nguyen, Alexander Waibel

Abstract:Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate competitive performance compared to strong baselines, highlighting the model's robustness and potential for practical applications.

Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS

Oct 19, 2024

Tuan Nam Nguyen, Seymanur Akti, Ngoc Quan Pham, Alexander Waibel

Abstract:Previous approaches to accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only focuses on accent conversion but also improves the pronunciation of non-native accented speakers. By providing the non-native audio and the corresponding transcript, we generate the ideal ground-truth audio with native-like pronunciation and the original duration and prosody. This ground-truth data aids the model in learning a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles native accents while retaining the original speaker's identity but also improves pronunciation, as demonstrated by evaluation results.

* Submitted to ICASSP 2025

Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Oct 15, 2024

Fevziye Irem Eyiokur, Christian Huber, Thai-Binh Nguyen, Tuan-Nam Nguyen, Fabian Retkowski, Enes Yavuz Ugan, Dogucan Yaman, Alexander Waibel

Abstract:In this paper, we report on communication experiments conducted in the summer of 2022 during a deep dive to the wreck of the Titanic. Radio transmission is not possible in deep sea water, and communication links rely on sonar signals. Due to the low bandwidth of sonar signals and the need to communicate readable data, text messaging is used in deep-sea missions. In this paper, we report results and experiences from a messaging system that converts speech to text in a submarine, sends text messages to the surface, and reconstructs those messages as synthetic lip-synchronous videos of the speakers. The resulting system was tested during an actual dive to Titanic in the summer of 2022. We achieved an acceptable latency for a system of such complexity as well as good quality. The system demonstration video can be found at the following link: https://youtu.be/C4lyM86-5Ig

Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems

Sep 30, 2024

Oswald Zink, Yosuke Higuchi, Carlos Mullov, Alexander Waibel, Tetsunori Kobayashi

Abstract:Effective spoken dialog systems should facilitate natural interactions with quick and rhythmic timing, mirroring human communication patterns. To reduce response times, previous efforts have focused on minimizing the latency in automatic speech recognition (ASR) to optimize system efficiency. However, this approach requires waiting for ASR to complete processing until a speaker has finished speaking, which limits the time available for natural language processing (NLP) to formulate accurate responses. As humans, we continuously anticipate and prepare responses even while the other party is still speaking. This allows us to respond appropriately without missing the optimal time to speak. In this work, as a pioneering study toward a conversational system that simulates such human anticipatory behavior, we aim to realize a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance (EOU), using the middle portion of an utterance. To achieve this, we propose a training strategy for an encoder-decoder-based ASR system, which involves masking future segments of an utterance and prompting the decoder to predict the words in the masked audio. Additionally, we develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information to accurately detect the EOU. The experimental results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU. Moreover, the proposed training strategy exhibits general improvements in ASR performance.
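
The training idea described in the abstract, hiding the future part of an utterance so the decoder must predict the words in it, can be sketched roughly as follows. This is an illustrative sketch only: the zero-frame masking, the `visible` cut point, and the 10 ms frame shift are assumptions, not details taken from the paper.

```python
from typing import List

Frame = List[float]


def mask_future(frames: List[Frame], visible: int) -> List[Frame]:
    """Keep the first `visible` feature frames and replace every later
    frame with zeros of the same dimension (assumed masking scheme)."""
    if not 0 <= visible <= len(frames):
        raise ValueError("visible must lie between 0 and len(frames)")
    return [
        list(frame) if i < visible else [0.0] * len(frame)
        for i, frame in enumerate(frames)
    ]


def time_to_eou_ms(total_frames: int, visible: int, frame_shift_ms: int = 10) -> int:
    """Time remaining until the end of the utterance, measured from the
    last visible frame, assuming a fixed frame shift."""
    return (total_frames - visible) * frame_shift_ms


if __name__ == "__main__":
    utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mask_future(utterance, 2))
    print(time_to_eou_ms(len(utterance), 2))
```

In a setup like this, the masked features would be the model input, while the full transcript and the remaining time would serve as training targets.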

* Submitted to ICASSP 2025

Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages

Aug 05, 2024

Carlos Mullov, Ngoc-Quan Pham, Alexander Waibel

Abstract:Multilingual neural machine translation systems learn to map sentences of different languages into a common representation space. Intuitively, with a growing number of seen languages the encoder sentence representation grows more flexible and easily adaptable to new languages. In this work, we test this hypothesis by zero-shot translating from unseen languages. To deal with unknown vocabularies from unknown languages we propose a setup where we decouple learning of vocabulary and syntax, i.e., for each language we learn word representations in a separate step (using cross-lingual word embeddings), and then train to translate while keeping those word representations frozen. We demonstrate that this setup enables zero-shot translation from entirely unseen languages. Zero-shot translating with a model trained on Germanic and Romance languages, we achieve scores of 42.6 BLEU for Portuguese-English and 20.7 BLEU for Russian-English on the TED domain. We explore how this zero-shot translation capability develops with a varying number of languages seen by the encoder. Lastly, we explore the effectiveness of our decoupled learning strategy for unsupervised machine translation. By exploiting our model's zero-shot translation capability for iterative back-translation we attain near parity with a supervised setting.

* Accepted to ACL 2024

Handling Numeric Expressions in Automatic Speech Recognition

Jul 18, 2024

Christian Huber, Alexander Waibel

Abstract:This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging since the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize and format numeric expressions, such as years, timestamps, currency amounts, and quantities. For the end-to-end approach we employed a data generation strategy using a large language model (LLM) together with a text-to-speech (TTS) model to generate adaptation data. The results on our test dataset show that while approaches based on LLMs perform well on recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
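
The context dependence mentioned in the abstract (1945 as a year vs. 19:45 as a timestamp) can be made concrete with a toy Python formatter. This is not the paper's cascaded or end-to-end system: the `kind` label is a hypothetical input that stands in for whatever component decides the context.

```python
def format_numeric(digits: str, kind: str) -> str:
    """Format a recognized four-digit string according to its context.

    `kind` is a hypothetical label ("year" or "time") assumed to come
    from an upstream context classifier.
    """
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError("this toy example expects exactly four digits")
    if kind == "year":
        return digits
    if kind == "time":
        return f"{digits[:2]}:{digits[2:]}"
    raise ValueError(f"unknown kind: {kind}")


if __name__ == "__main__":
    print(format_numeric("1945", "year"))
    print(format_numeric("1945", "time"))
```

The same digit string yields two different transcripts, which is why formatting cannot be decided from the digits alone.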

Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024

Jun 24, 2024

Sai Koneru, Thai-Binh Nguyen, Ngoc-Quan Pham, Danni Liu, Zhaolin Li, Alexander Waibel, Jan Niehues

Abstract:Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation system. Specifically, we integrate Mistral-7B (mistralai/Mistral-7B-Instruct-v0.1) into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of 0.3% in Word Error Rate and 0.65% in COMET on the tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating the LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.
