



Abstract: While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding model to create a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields strong performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
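The abstract names two ingredients, hierarchical feature fusion and joint multi-modal contrastive training, without giving details. The sketch below is a minimal, hypothetical PyTorch illustration of those two ideas, not WAVE's actual architecture: the toy encoders, the feature dimensions, and the concat-then-project form of the fusion are all assumptions for illustration.

```python
# Minimal sketch (not the authors' code): toy joint multimodal contrastive
# training with hierarchical feature fusion. Dimensions and encoders are
# hypothetical stand-ins for the paper's LLM-based towers.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 256

class ToyModalityEncoder(nn.Module):
    """Stand-in for one modality tower (text, audio, or video)."""
    def __init__(self, in_dim):
        super().__init__()
        # Two "layers" whose outputs are fused hierarchically; the
        # concatenate-then-project fusion here is an assumption, used
        # only to illustrate combining multi-level features.
        self.layer1 = nn.Linear(in_dim, EMBED_DIM)
        self.layer2 = nn.Linear(EMBED_DIM, EMBED_DIM)
        self.fuse = nn.Linear(2 * EMBED_DIM, EMBED_DIM)

    def forward(self, x):
        h1 = torch.relu(self.layer1(x))
        h2 = torch.relu(self.layer2(h1))
        return self.fuse(torch.cat([h1, h2], dim=-1))  # hierarchical fusion

def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 paired (video, audio) feature vectors of made-up sizes.
video_enc, audio_enc = ToyModalityEncoder(512), ToyModalityEncoder(128)
video_feats, audio_feats = torch.randn(8, 512), torch.randn(8, 128)
loss = info_nce(video_enc(video_feats), audio_enc(audio_feats))
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```

A joint multi-task version would sum such losses over several modality pairs (text-video, text-audio, video-audio) in the same batch, which is one plausible reading of the "joint multi-modal, multi-task training" described above.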




Abstract: Silent speech interfaces (SSI) are being actively developed to assist individuals with communication impairments, who face daily hardships and a reduced quality of life. However, silent sentences are difficult to segment and recognize due to elision and linking. A novel silent speech sentence recognition method is proposed to convert the facial motion signals collected by six-axis accelerometers into transcribed words and sentences. A Conformer-based neural network with the connectionist temporal classification (CTC) algorithm is used to gain contextual understanding and translate the non-acoustic signals into word sequences, requiring only that the constituent words appear in the database. Test results show that the proposed method achieves 97.17% accuracy in sentence recognition, surpassing existing silent speech recognition methods, whose typical accuracies range from 85% to 95%, and demonstrating the potential of accelerometers as a viable SSI modality for high-accuracy silent speech sentence recognition.
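Since the abstract specifies CTC training over six-axis motion signals, a minimal PyTorch sketch of that setup follows. A small BiLSTM stands in for the paper's Conformer encoder, and the vocabulary size, sequence lengths, and batch shapes are hypothetical; only the CTC formulation itself comes from the abstract.

```python
# Minimal sketch (assumptions flagged): CTC training over six-axis
# accelerometer frames. A BiLSTM replaces the paper's Conformer here;
# word vocabulary and dimensions are made up for illustration.
import torch
import torch.nn as nn

NUM_WORDS = 20               # hypothetical word vocabulary
NUM_CLASSES = NUM_WORDS + 1  # +1 for the CTC blank symbol (index 0)

class SilentSpeechCTC(nn.Module):
    def __init__(self, in_dim=6, hidden=128):
        super().__init__()
        # 6 input channels per frame, matching the six-axis sensors
        # (typically 3-axis acceleration + 3-axis angular rate).
        self.encoder = nn.LSTM(in_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, x):                     # x: (batch, time, 6)
        h, _ = self.encoder(x)
        return self.head(h).log_softmax(-1)   # (batch, time, classes)

model = SilentSpeechCTC()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: 4 sentences, 200 frames each, 5 words per sentence.
signals = torch.randn(4, 200, 6)
targets = torch.randint(1, NUM_CLASSES, (4, 5))  # word indices, blank excluded
log_probs = model(signals).transpose(0, 1)       # nn.CTCLoss expects (T, B, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 5))
loss.backward()
print(f"CTC loss: {loss.item():.3f}")
```

CTC is a natural fit for the segmentation problem the abstract raises: it marginalizes over all frame-to-word alignments, so the network never needs explicit word boundaries in the elided, linked motion signal.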