Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kangwook Jang

Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Jan 12, 2025

Minu Kim, Kangwook Jang, Hoirin Kim

Figure 1 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Figure 2 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Figure 3 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Figure 4 for Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis

Abstract:This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages, emphasizing effective source language selection. Previous cross-lingual research has used various source languages to enhance performance for the target low-resource language without thorough consideration of selection. Our study stands out by providing an in-depth analysis of language selection, supported by a practical approach to assess phonetic proximity among multiple language families. We investigate how within-family similarity impacts performance in multilingual training, which aids in understanding language dynamics. We also evaluate the effect of using phonologically similar languages, regardless of family. For the phoneme recognition task, utilizing phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training, even surpassing the performance of a large-scale self-supervised learning model. Multilingual training within the same language family demonstrates that higher phonological similarity enhances performance, while lower similarity results in degraded performance compared to monolingual training.

* 10 pages, 5 figures, accepted to ICASSP 2025

Via

Access Paper or Ask Questions

Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Jul 04, 2024

Sungnyun Kim, Kangwook Jang, Sangmin Bae, Hoirin Kim, Se-Young Yun

Figure 1 for Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Figure 2 for Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Figure 3 for Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Figure 4 for Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Abstract:Audio-visual speech recognition (AVSR) aims to transcribe human speech using both audio and video modalities. In practical environments with noise-corrupted audio, the role of video information becomes crucial. However, prior works have primarily focused on enhancing audio features in AVSR, overlooking the importance of video features. In this study, we strengthen the video features by learning three temporal dynamics in video data: context order, playback direction, and the speed of video frames. Cross-modal attention modules are introduced to enrich video features with audio information so that speech variability can be taken into account when training on the video temporal dynamics. Based on our approach, we achieve the state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks for the noise-dominant settings. Our approach excels in scenarios especially for babble and speech noise, indicating the ability to distinguish the speech signal that should be recognized from lip movements in the video modality. We support the validity of our methodology by offering the ablation experiments for the temporal dynamics losses and the cross-modal attention architecture design.

Via

Access Paper or Ask Questions

One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Jun 24, 2024

Hyun Myung Kim, Kangwook Jang, Hoirin Kim

Abstract:As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafide samples to define their centroid, which can yield a specialized centroid for one-class learning. Integrating our ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models

Dec 14, 2023

Kangwook Jang, Sungnyun Kim, Hoirin Kim

Abstract:Albeit great performance of Transformer-based speech selfsupervised learning (SSL) models, their large parameter size and computational cost make them unfavorable to utilize. In this study, we propose to compress the speech SSL models by distilling speech temporal relation (STaR). Unlike previous works that directly match the representation for each speech frame, STaR distillation transfers temporal relation between speech frames, which is more suitable for lightweight student with limited capacity. We explore three STaR distillation objectives and select the best combination as the final STaR loss. Our model distilled from HuBERT BASE achieves an overall score of 79.8 on SUPERB benchmark, the best performance among models with up to 27 million parameters. We show that our method is applicable across different speech SSL models and maintains robust performance with further reduced parameters.

* Accepted at ICASSP 2024

Via

Access Paper or Ask Questions

Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

May 19, 2023

Kangwook Jang, Sungnyun Kim, Se-Young Yun, Hoirin Kim

Abstract:Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, huge number of parameters in speech SSL models necessitate the compression to a more compact model for wider usage in academia or small companies. In this study, we suggest to reuse attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields the student model that achieves phoneme error rate (PER) of 7.72% and word error rate (WER) of 9.96% on the SUPERB benchmark.

* Interspeech 2023. Code URL: https://github.com/sungnyun/ARMHuBERT

Via

Access Paper or Ask Questions

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Jul 01, 2022

Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, Hoirin Kim

Figure 1 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Figure 2 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Figure 3 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Figure 4 for FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Abstract:Large-scale speech self-supervised learning (SSL) has emerged to the main field of speech processing, however, the problem of computational cost arising from its vast size makes a high entry barrier to academia. In addition, existing distillation techniques of speech SSL models compress the model by reducing layers, which induces performance degradation in linguistic pattern recognition tasks such as phoneme recognition (PR). In this paper, we propose FitHuBERT, which makes thinner in dimension throughout almost all model components and deeper in layer compared to prior speech SSL distillation works. Moreover, we employ a time-reduction layer to speed up inference time and propose a method of hint-based distillation for less performance degradation. Our method reduces the model to 23.8% in size and 35.9% in inference time compared to HuBERT. Also, we achieve 12.1% word error rate and 13.3% phoneme error rate on the SUPERB benchmark which is superior than prior work.

* Accepted to Interspeech 2022

Via

Access Paper or Ask Questions