Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Thai-Binh Nguyen

ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition

Jun 05, 2025

Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong

Abstract:Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving competitive performance with robust ASR in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

* Accepted at Interspeech 2025

Via

Access Paper or Ask Questions

KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

May 19, 2025

Sai Koneru, Maike Züfle, Thai-Binh Nguyen, Seymanur Akti, Jan Niehues, Alexander Waibel

Figure 1 for KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Figure 2 for KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Figure 3 for KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Figure 4 for KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Abstract:The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly with the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process, incorporating additional refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage to further enhance output quality by using contextual information.

Via

Access Paper or Ask Questions

MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Nov 27, 2024

Thai-Binh Nguyen, Alexander Waibel

Figure 1 for MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Figure 2 for MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Figure 3 for MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Figure 4 for MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Abstract:Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate competitive performance compared to strong baselines, highlighting the model's robustness and potential for practical applications.

Via

Access Paper or Ask Questions

Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Oct 15, 2024

Fevziye Irem Eyiokur, Christian Huber, Thai-Binh Nguyen, Tuan-Nam Nguyen, Fabian Retkowski, Enes Yavuz Ugan, Dogucan Yaman, Alexander Waibel

Figure 1 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Figure 2 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Figure 3 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Figure 4 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Abstract:In this paper, we report on communication experiments conducted in the summer of 2022 during a deep dive to the wreck of the Titanic. Radio transmission is not possible in deep sea water, and communication links rely on sonar signals. Due to the low bandwidth of sonar signals and the need to communicate readable data, text messaging is used in deep-sea missions. In this paper, we report results and experiences from a messaging system that converts speech to text in a submarine, sends text messages to the surface, and reconstructs those messages as synthetic lip-synchronous videos of the speakers. The resulting system was tested during an actual dive to Titanic in the summer of 2022. We achieved an acceptable latency for a system of such complexity as well as good quality. The system demonstration video can be found at the following link: https://youtu.be/C4lyM86-5Ig

Via

Access Paper or Ask Questions

Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024

Jun 24, 2024

Sai Koneru, Thai-Binh Nguyen, Ngoc-Quan Pham, Danni Liu, Zhaolin Li, Alexander Waibel, Jan Niehues

Abstract:Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we integrate Mistral-7B\footnote{mistralai/Mistral-7B-Instruct-v0.1} into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of $0.3\%$ in Word Error Rate and $0.65\%$ in COMET for tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.

Via

Access Paper or Ask Questions

Convoifilter: A case study of doing cocktail party speech recognition

Aug 22, 2023

Thai-Binh Nguyen, Alexander Waibel

Abstract:This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model utilizes a single-channel speech enhancement module that isolates the speaker's voice from background noise, along with an ASR module. Through this approach, the model is able to decrease the word error rate (WER) of ASR from 80% to 26.4%. Typically, these two components are adjusted independently due to variations in data requirements. However, speech enhancement can create anomalies that decrease ASR efficiency. By implementing a joint fine-tuning strategy, the model can reduce the WER from 26.4% in separate tuning to 14.5% in joint tuning.

* 6 pages

Via

Access Paper or Ask Questions

NOWJ at COLIEE 2023 -- Multi-Task and Ensemble Approaches in Legal Information Processing

Jun 08, 2023

Thi-Hai-Yen Vuong, Hai-Long Nguyen, Tan-Minh Nguyen, Hoang-Trung Nguyen, Thai-Binh Nguyen, Ha-Thanh Nguyen

Abstract:This paper presents the NOWJ team's approach to the COLIEE 2023 Competition, which focuses on advancing legal information processing techniques and applying them to real-world legal scenarios. Our team tackles the four tasks in the competition, which involve legal case retrieval, legal case entailment, statute law retrieval, and legal textual entailment. We employ state-of-the-art machine learning models and innovative approaches, such as BERT, Longformer, BM25-ranking algorithm, and multi-task learning models. Although our team did not achieve state-of-the-art results, our findings provide valuable insights and pave the way for future improvements in legal information processing.

* COLIEE 2023

Via

Access Paper or Ask Questions

LBMT team at VLSP2022-Abmusu: Hybrid method with text correlation and generative models for Vietnamese multi-document summarization

Apr 11, 2023

Tan-Minh Nguyen, Thai-Binh Nguyen, Hoang-Trung Nguyen, Hai-Long Nguyen, Tam Doan Thanh, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong

Figure 1 for LBMT team at VLSP2022-Abmusu: Hybrid method with text correlation and generative models for Vietnamese multi-document summarization

Figure 2 for LBMT team at VLSP2022-Abmusu: Hybrid method with text correlation and generative models for Vietnamese multi-document summarization

Figure 3 for LBMT team at VLSP2022-Abmusu: Hybrid method with text correlation and generative models for Vietnamese multi-document summarization

Figure 4 for LBMT team at VLSP2022-Abmusu: Hybrid method with text correlation and generative models for Vietnamese multi-document summarization

Abstract:Multi-document summarization is challenging because the summaries should not only describe the most important information from all documents but also provide a coherent interpretation of the documents. This paper proposes a method for multi-document summarization based on cluster similarity. In the extractive method we use hybrid model based on a modified version of the PageRank algorithm and a text correlation considerations mechanism. After generating summaries by selecting the most important sentences from each cluster, we apply BARTpho and ViT5 to construct the abstractive models. Both extractive and abstractive approaches were considered in this study. The proposed method achieves competitive results in VLSP 2022 competition.

* In Proceedings of the 9th International Workshop on Vietnamese Language and Speech Processing (VLSP 2022)

Via

Access Paper or Ask Questions