Xiangzhu Kong

Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform

Jun 13, 2025

Xiangzhu Kong, Huang Hao, Zhijian Ou

Abstract: This paper presents SHTNet, a lightweight framework based on the spherical harmonic transform (SHT), designed to address cross-array generalization challenges in multi-channel automatic speech recognition (ASR) through three key innovations. First, SHT-based spatial sound field decomposition converts microphone signals into geometry-invariant spherical harmonic coefficients, isolating signal processing from array geometry. Second, the Spatio-Spectral Attention Fusion Network (SSAFN) combines coordinate-aware spatial modeling, a refined self-attention channel combinator, and spectral noise suppression without conventional beamforming. Third, Rand-SHT training enhances robustness through random channel selection and array geometry reconstruction. The system achieves 39.26% average CER across heterogeneous arrays (e.g., circular, square, and binaural) on datasets including Aishell-4, Alimeeting, and XMOS, with 97.1% fewer computations than conventional neural beamformers.

* Interspeech 2025
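
The idea of geometry-invariant coefficients in the abstract above can be illustrated with a small sketch. This is not the SHTNet implementation: it assumes an idealized, order-1 sound field sampled directly at the microphone directions, ignores the frequency-dependent terms a real array would need, and the function names `real_sh_order1` and `sht_encode` are hypothetical. It only shows that two arrays with different microphone counts and layouts can map to the same fixed-size coefficient set.

```python
import numpy as np

def real_sh_order1(azimuth, elevation):
    """Real spherical harmonics up to order 1 at the given directions (radians).

    Returns an array of shape (num_mics, 4) with columns Y00, Y1-1, Y10, Y11.
    """
    azimuth = np.asarray(azimuth, dtype=float)
    elevation = np.asarray(elevation, dtype=float)
    # Unit direction vectors of the microphones.
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    c0 = 0.5 / np.sqrt(np.pi)          # normalization of Y00
    c1 = np.sqrt(3.0 / (4.0 * np.pi))  # normalization of the order-1 terms
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=-1)

def sht_encode(signals, azimuth, elevation):
    """Least-squares SH coefficients (hypothetical helper, illustration only).

    signals: (num_mics, num_samples) -> coefficients: (4, num_samples)
    """
    basis = real_sh_order1(azimuth, elevation)
    return np.linalg.pinv(basis) @ signals

# A made-up order-1 sound field with 8 samples (assumption for the demo).
rng = np.random.default_rng(0)
true_coeffs = rng.standard_normal((4, 8))

# Array A: 4 microphones at the vertices of a tetrahedron.
tetra_az = np.deg2rad([45.0, -45.0, 135.0, -135.0])
tetra_el = np.arcsin(np.array([1.0, -1.0, -1.0, 1.0]) / np.sqrt(3.0))

# Array B: 6 microphones at the vertices of an octahedron.
octa_az = np.deg2rad([0.0, 90.0, 180.0, 270.0, 0.0, 0.0])
octa_el = np.deg2rad([0.0, 0.0, 0.0, 0.0, 90.0, -90.0])

# Each array observes the same field at its own microphone directions.
tetra_signals = real_sh_order1(tetra_az, tetra_el) @ true_coeffs
octa_signals = real_sh_order1(octa_az, octa_el) @ true_coeffs

tetra_coeffs = sht_encode(tetra_signals, tetra_az, tetra_el)
octa_coeffs = sht_encode(octa_signals, octa_az, octa_el)
```

Both `tetra_coeffs` and `octa_coeffs` have shape `(4, 8)` even though the arrays have 4 and 6 microphones, so a downstream model would see the same input layout for either array. Planar arrays such as circular ones cannot resolve the vertical component in this simplified setup; the pseudo-inverse then returns the minimum-norm estimate.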

A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations

Jul 13, 2024

Xiangzhu Kong, Tianqi Ning, Hao Huang, Zhijian Ou

Abstract: Multi-channel end-to-end (ME2E) ASR systems have recently emerged. While streaming single-channel end-to-end ASR has been extensively studied, streaming ME2E ASR has seen limited exploration. Additionally, recent studies call attention to the gap between in-distribution (ID) and out-of-distribution (OOD) tests and to the need for realistic evaluations. This paper focuses on two research problems: realizing streaming ME2E ASR and improving OOD generalization. We propose the CUSIDE-array method, which integrates the recent CUSIDE methodology (Chunking, Simulating Future Context and Decoding) into the neural beamformer approach of ME2E ASR. It enables streaming processing of both the front-end and the back-end with a total latency of 402 ms. The CUSIDE-array ME2E models are shown to achieve superior streaming results in both ID and OOD tests. Realistic evaluations confirm the advantage of CUSIDE-array in its capability to consume single-channel data to improve OOD generalization via back-end pre-training and ME2E fine-tuning.
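
The chunking step named in the CUSIDE acronym can be sketched roughly as follows. This is only an illustration of chunk-wise streaming input with left context, not the CUSIDE-array method itself: the future-context simulation and the decoding are not shown, and the helper name `split_into_chunks` as well as the chunk and context sizes are assumptions, not values from the paper.

```python
import numpy as np

def split_into_chunks(frames, chunk_size, left_context):
    """Split a frame sequence into consecutive chunks with left context.

    Returns a list of (segment, offset) pairs, where `segment` holds up to
    `left_context` history frames followed by the chunk itself, and `offset`
    is the number of history frames at the front of `segment`.
    """
    chunks = []
    for start in range(0, len(frames), chunk_size):
        ctx_start = max(0, start - left_context)      # history is clipped at 0
        end = min(start + chunk_size, len(frames))    # last chunk may be short
        chunks.append((frames[ctx_start:end], start - ctx_start))
    return chunks

# Hypothetical sizes: 10 frames, chunks of 4 frames, 2 frames of history.
frames = np.arange(10)
chunks = split_into_chunks(frames, chunk_size=4, left_context=2)

# Dropping the history frames of every segment recovers the original sequence.
reassembled = np.concatenate([segment[offset:] for segment, offset in chunks])
```

In a streaming system each segment would be processed as soon as its chunk has arrived, so the chunk length is one of the contributors to overall latency.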
