Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Slurp

QUADS: QUAntized Distillation Framework for Efficient Speech Language Understanding

May 19, 2025

Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam

Abstract:Spoken Language Understanding (SLU) systems must balance performance and efficiency, particularly in resource-constrained environments. Existing methods apply distillation and quantization separately, leading to suboptimal compression as distillation ignores quantization constraints. We propose QUADS, a unified framework that optimizes both through multi-stage training with a pre-tuned model, enhancing adaptability to low-bit regimes while maintaining accuracy. QUADS achieves 71.13\% accuracy on SLURP and 99.20\% on FSC, with only minor degradations of up to 5.56\% compared to state-of-the-art models. Additionally, it reduces computational complexity by 60--73$\times$ (GMACs) and model size by 83--700$\times$, demonstrating strong robustness under extreme quantization. These results establish QUADS as a highly efficient solution for real-world, resource-constrained SLU applications.

* INTERSPEECH, 2025

Via

Access Paper or Ask Questions

Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant

Apr 25, 2025

Lei Shen, Xiaoyu Shen

Abstract:In recent years, multi-agent frameworks powered by large language models (LLMs) have advanced rapidly. Despite this progress, there is still a notable absence of benchmark datasets specifically tailored to evaluate their performance. To bridge this gap, we introduce Auto-SLURP, a benchmark dataset aimed at evaluating LLM-based multi-agent frameworks in the context of intelligent personal assistants. Auto-SLURP extends the original SLURP dataset -- initially developed for natural language understanding tasks -- by relabeling the data and integrating simulated servers and external services. This enhancement enables a comprehensive end-to-end evaluation pipeline, covering language understanding, task execution, and response generation. Our experiments demonstrate that Auto-SLURP presents a significant challenge for current state-of-the-art frameworks, highlighting that truly reliable and intelligent multi-agent personal assistants remain a work in progress. The dataset and related code are available at https://github.com/lorashen/Auto-SLURP/.

Via

Access Paper or Ask Questions

Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Jan 13, 2025

Jiliang Hu, Zuchao Li, Mengjia Shen, Haojun Ai, Sheng Li, Jun Zhang

Figure 1 for Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Figure 2 for Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Figure 3 for Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Figure 4 for Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Abstract:Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works on SLU that treat it as a sequence-to-sequence task have achieved great success. However, This method is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on name entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on the two datasets.

* 5 pages, 2 figures, accepted by ICASSP 2025

Via

Access Paper or Ask Questions

WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Aug 29, 2024

Mohan Li, Cong-Thanh Do, Simon Keizer, Youmna Farag, Svetlana Stoyanchev, Rama Doddipatla

Figure 1 for WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Figure 2 for WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Figure 3 for WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Figure 4 for WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Abstract:Speech large language models (speech-LLMs) integrate speech and text-based foundation models to provide a unified framework for handling a wide range of downstream tasks. In this paper, we introduce WHISMA, a speech-LLM tailored for spoken language understanding (SLU) that demonstrates robust performance in various zero-shot settings. WHISMA combines the speech encoder from Whisper with the Llama-3 LLM, and is fine-tuned in a parameter-efficient manner on a comprehensive collection of SLU-related datasets. Our experiments show that WHISMA significantly improves the zero-shot slot filling performance on the SLURP benchmark, achieving a relative gain of 26.6% compared to the current state-of-the-art model. Furthermore, to evaluate WHISMA's generalisation capabilities to unseen domains, we develop a new task-agnostic benchmark named SLU-GLUE. The evaluation results indicate that WHISMA outperforms an existing speech-LLM (Qwen-Audio) with a relative gain of 33.0%.

* accepted to SLT 2024

Via

Access Paper or Ask Questions

Out-of-distribution generalisation in spoken language understanding

Jul 10, 2024

Dejan Porjazovski, Anssi Moisio, Mikko Kurimo

Figure 1 for Out-of-distribution generalisation in spoken language understanding

Figure 2 for Out-of-distribution generalisation in spoken language understanding

Figure 3 for Out-of-distribution generalisation in spoken language understanding

Abstract:Test data is said to be out-of-distribution (OOD) when it unexpectedly differs from the training data, a common challenge in real-world use cases of machine learning. Although OOD generalisation has gained interest in recent years, few works have focused on OOD generalisation in spoken language understanding (SLU) tasks. To facilitate research on this topic, we introduce a modified version of the popular SLU dataset SLURP, featuring data splits for testing OOD generalisation in the SLU task. We call our modified dataset SLURP For OOD generalisation, or SLURPFOOD. Utilising our OOD data splits, we find end-to-end SLU models to have limited capacity for generalisation. Furthermore, by employing model interpretability techniques, we shed light on the factors contributing to the generalisation difficulties of the models. To improve the generalisation, we experiment with two techniques, which improve the results on some, but not all the splits, emphasising the need for new techniques.

* Accepted for INTERSPEECH 2024

Via

Access Paper or Ask Questions

Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Jun 21, 2024

Mohan Li, Simon Keizer, Rama Doddipatla

Figure 1 for Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Figure 2 for Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Figure 3 for Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Figure 4 for Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding

Abstract:Zero-shot spoken language understanding (SLU) enables systems to comprehend user utterances in new domains without prior exposure to training data. Recent studies often rely on large language models (LLMs), leading to excessive footprints and complexity. This paper proposes the use of Whisper, a standalone speech processing model, for zero-shot end-to-end (E2E) SLU. To handle unseen semantic labels, SLU tasks are integrated into a question-answering (QA) framework, which prompts the Whisper decoder for semantics deduction. The system is efficiently trained with prefix-tuning, optimising a minimal set of parameters rather than the entire Whisper model. We show that the proposed system achieves a 40.7% absolute gain for slot filling (SLU-F1) on SLURP compared to a recently introduced zero-shot benchmark. Furthermore, it performs comparably to a Whisper-GPT-2 modular system under both in-corpus and cross-corpus evaluation settings, but with a relative 34.8% reduction in model parameters.

* Accepted to Interspeech 2024

Via

Access Paper or Ask Questions

SYN2REAL: Leveraging Task Arithmetic for Mitigating Synthetic-Real Discrepancies in ASR Domain Adaptation

Jun 05, 2024

Hsuan Su, Hua Farn, Shang-Tse Chen, Hung-yi Lee

Figure 1 for SYN2REAL: Leveraging Task Arithmetic for Mitigating Synthetic-Real Discrepancies in ASR Domain Adaptation

Figure 2 for SYN2REAL: Leveraging Task Arithmetic for Mitigating Synthetic-Real Discrepancies in ASR Domain Adaptation

Figure 3 for SYN2REAL: Leveraging Task Arithmetic for Mitigating Synthetic-Real Discrepancies in ASR Domain Adaptation

Figure 4 for SYN2REAL: Leveraging Task Arithmetic for Mitigating Synthetic-Real Discrepancies in ASR Domain Adaptation

Abstract:Recent advancements in large language models (LLMs) have introduced the 'task vector' concept, which has significantly impacted various domains but remains underexplored in speech recognition. This paper presents a novel 'SYN2REAL' task vector for domain adaptation in automatic speech recognition (ASR), specifically targeting text-only domains. Traditional fine-tuning on synthetic speech often results in performance degradation due to acoustic mismatches. To address this issue, we propose creating a 'SYN2REAL' vector by subtracting the parameter differences between models fine-tuned on real and synthetic speech. This vector effectively bridges the gap between the two domains. Experiments on the SLURP dataset demonstrate that our approach yields an average improvement of 11.15% in word error rate for unseen target domains, highlighting the potential of task vectors in enhancing speech domain adaptation.

Via

Access Paper or Ask Questions

Two-pass Endpoint Detection for Speech Recognition

Jan 17, 2024

Anirudh Raju, Aparna Khare, Di He, Ilya Sklyar, Long Chen, Sam Alptekin, Viet Anh Trinh, Zhe Zhang, Colin Vaz, Venkatesh Ravichandran(+2 more)

Abstract:Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade-off between accuracy and latency, since waiting longer reduces the cases of users being cut-off early. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected from a first pass endpointer is verified by a 2nd-pass model termed EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method shows improvements regardless of the first-pass EP model used.

* ASRU 2023

Via

Access Paper or Ask Questions

Speech-based Slot Filling using Large Language Models

Nov 13, 2023

Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang, Milica Gašić, Philip C. Woodland

Figure 1 for Speech-based Slot Filling using Large Language Models

Figure 2 for Speech-based Slot Filling using Large Language Models

Figure 3 for Speech-based Slot Filling using Large Language Models

Figure 4 for Speech-based Slot Filling using Large Language Models

Abstract:Recently, advancements in large language models (LLMs) have shown an unprecedented ability across various language tasks. This paper investigates the potential application of LLMs to slot filling with noisy ASR transcriptions, via both in-context learning and task-specific fine-tuning. Dedicated prompt designs and fine-tuning approaches are proposed to improve the robustness of LLMs for slot filling with noisy ASR transcriptions. Moreover, a linearised knowledge injection (LKI) scheme is also proposed to integrate dynamic external knowledge into LLMs. Experiments were performed on SLURP to quantify the performance of LLMs, including GPT-3.5-turbo, GPT-4, LLaMA-13B and Vicuna-13B (v1.1 and v1.5) with different ASR error rates. The use of the proposed fine-tuning together with the LKI scheme for LLaMA-13B achieved an 8.3% absolute SLU-F1 improvement compared to the strong Flan-T5-base baseline system on a limited data setup.

Via

Access Paper or Ask Questions

1SPU: 1-step Speech Processing Unit

Nov 10, 2023

Karan Singla, Shahab Jalalvand, Yeon-Jun Kim, Antonio Moreno Daniel, Srinivas Bangalore, Andrej Ljolje, Ben Stern

Abstract:Recent studies have made some progress in refining end-to-end (E2E) speech recognition encoders by applying Connectionist Temporal Classification (CTC) loss to enhance named entity recognition within transcriptions. However, these methods have been constrained by their exclusive use of the ASCII character set, allowing only a limited array of semantic labels. We propose 1SPU, a 1-step Speech Processing Unit which can recognize speech events (e.g: speaker change) or an NL event (Intent, Emotion) while also transcribing vocal content. It extends the E2E automatic speech recognition (ASR) system's vocabulary by adding a set of unused placeholder symbols, conceptually akin to the <pad> tokens used in sequence modeling. These placeholders are then assigned to represent semantic events (in form of tags) and are integrated into the transcription process as distinct tokens. We demonstrate notable improvements on the SLUE benchmark and yields results that are on par with those for the SLURP dataset. Additionally, we provide a visual analysis of the system's proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the utilization of supplementary semantic tags.

* It's a work in progress. More tasks and interesting experiments planned

Via

Access Paper or Ask Questions

Topic:Slurp

Papers and Code