Yerin Choi

FiLoRA: Focus-and-Ignore LoRA for Controllable Feature Reliance

Feb 02, 2026

Hyunsuk Chung, Caren Han, Yerin Choi, Seungyeon Ji, Jinwoo Kim, Eun-Jung Holden, Kyungreem Han

Abstract: Multimodal foundation models integrate heterogeneous signals across modalities, yet it remains poorly understood how their predictions depend on specific internal feature groups and whether such reliance can be deliberately controlled. Existing studies of shortcut and spurious behavior largely rely on post hoc analyses or feature removal, offering limited insight into whether reliance can be modulated without altering task semantics. We introduce FiLoRA (Focus-and-Ignore LoRA), an instruction-conditioned, parameter-efficient adaptation framework that enables explicit control over internal feature reliance while keeping the predictive objective fixed. FiLoRA decomposes adaptation into feature group-aligned LoRA modules and applies instruction-conditioned gating, allowing natural language instructions to act as computation-level control signals rather than task redefinitions. Across text-image and audio-visual benchmarks, we show that instruction-conditioned gating induces consistent and causal shifts in internal computation, selectively amplifying or suppressing core and spurious feature groups without modifying the label space or training objective. Further analyses demonstrate that FiLoRA yields improved robustness under spurious feature interventions, revealing a principled mechanism to regulate reliance beyond correlation-driven learning.
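
The idea of feature group-aligned LoRA modules combined with gating can be pictured with a small sketch. This is not the FiLoRA implementation: the function names, the two hypothetical groups ("core" and "spurious"), the rank-1 shapes, and the fixed gate values (which stand in for whatever an instruction-conditioned gate would output) are all assumptions made for illustration.

```python
# Illustrative sketch only: names, shapes, and gate values are assumptions,
# not the FiLoRA implementation described in the paper.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_lora_forward(x, base_weight, lora_modules, gates):
    """Compute W x + sum over groups g of gate_g * B_g (A_g x)."""
    output = matvec(base_weight, x)
    for name, (a_down, b_up) in lora_modules.items():
        low_rank = matvec(a_down, x)      # project the input down to rank r
        update = matvec(b_up, low_rank)   # project back up to the output size
        gate = gates.get(name, 0.0)       # groups without a gate are switched off
        output = [o + gate * u for o, u in zip(output, update)]
    return output


# Hypothetical 2-d layer with one rank-1 module per feature group.
base_weight = [[1.0, 0.0], [0.0, 1.0]]
lora_modules = {
    "core": ([[1.0, 0.0]], [[0.5], [0.0]]),
    "spurious": ([[0.0, 1.0]], [[0.0], [0.5]]),
}
x = [2.0, 4.0]

# Gate values a "focus on core, ignore spurious" instruction might produce.
focus_core = gated_lora_forward(x, base_weight, lora_modules,
                                {"core": 1.0, "spurious": 0.0})
# With every gate closed, only the frozen base weight contributes.
ignore_all = gated_lora_forward(x, base_weight, lora_modules,
                                {"core": 0.0, "spurious": 0.0})
```

In this toy setup the label space and base weights never change; only the gates decide which low-rank update reaches the output.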

Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech

Dec 05, 2024

Yerin Choi, Jeehyun Lee, Myoung-Wan Koo

Abstract: Because current clinical evaluation is subjective, automatic severity evaluation of dysarthric speech is needed. DNN models outperform ML models but lack user-friendly explainability, whereas ML models offer explainable feature-level results at comparatively lower performance. Current ML models extract various features from raw waveforms to predict severity, but existing methods do not cover all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss, introducing ASR transcription as a novel feature extraction source. We fine-tune an ASR model on dysarthric speech, then use it to transcribe dysarthric speech and extract word segment boundary information. This captures finer pronunciation features and broader prosodic features. These features improved severity prediction over existing features, reaching a balanced accuracy of 83.72%.

* Accepted to SLT 2024
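
The abstract reports balanced accuracy without defining it. Under the usual definition (the mean of per-class recall), the metric can be sketched as follows; the severity labels and predictions are invented for illustration and are not data from the paper.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not from the paper).
y_true = ["mild", "mild", "mild", "mild", "severe", "severe"]
y_pred = ["mild", "mild", "mild", "severe", "severe", "severe"]

score = balanced_accuracy(y_true, y_pred)  # recall: mild 3/4, severe 2/2
```

Averaging recall per class keeps a majority severity level from dominating the score, which matters when severity classes are unevenly represented.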

Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition

Feb 29, 2024

Jeehyun Lee, Yerin Choi, Tae-Jin Song, Myoung-Wan Koo

Abstract: Dysarthria, a common issue among stroke patients, severely impairs speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose extending a large-scale speech recognition model to detect inappropriate pauses in dysarthric speech. To this end, we propose a task design, a labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. First, we treat pause detection as speech recognition, using an automatic speech recognition (ASR) model to convert speech into text with pause tags. Following this task design, we label pause locations at the text level along with their appropriateness. We collaborate with speech-language pathologists to establish labeling criteria, ensuring high-quality annotated data. Finally, we extend the ASR model with an inappropriate pause prediction layer for end-to-end inappropriate pause detection. We also propose a task-tailored metric that evaluates inappropriate pause detection independently of ASR performance. Our experiments show that the proposed method detects inappropriate pauses in dysarthric speech better than baselines (Inappropriate Pause Error Rate: 14.47%).

* Accepted to ICASSP 2024
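
Treating pause detection as speech recognition means pause locations can be read directly off the transcript. The sketch below shows one way to do that; the `<pause>` tag string, the function name, and the example sentence are assumptions, and the appropriateness labels and the paper's error-rate metric are not modeled.

```python
def pause_positions(transcript, pause_tag="<pause>"):
    """Split a tagged transcript into words and text-level pause locations.

    Each location is the number of words that precede the pause.
    """
    words, positions = [], []
    for token in transcript.split():
        if token == pause_tag:
            positions.append(len(words))
        else:
            words.append(token)
    return words, positions


# Hypothetical ASR output with pause tags (not from the paper's data).
words, positions = pause_positions("I want <pause> to go <pause> home")
```

Because the locations are expressed in word counts rather than timestamps, they can be compared with reference annotations at the text level.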

DC CoMix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer

Jun 12, 2023

Yerin Choi, Myoung-Wan Koo

Abstract: Despite the huge successes of neutral TTS, content leakage remains a challenge. In this paper, we propose a new input representation and a simple architecture to improve prosody modeling. Inspired by the recent success of discrete code in TTS, we introduce discrete code as the input of the reference encoder. Specifically, we leverage the vector quantizer from an audio compression model to exploit the diverse acoustic information it has already been trained on. In addition, we apply a modified MLP-Mixer to the reference encoder, making the architecture lighter. As a result, we train the prosody transfer TTS in an end-to-end manner. We demonstrate the effectiveness of our method through both subjective and objective evaluations. Experiments show that the reference encoder learns better speaker-independent prosody when discrete code is used as input. In addition, we obtain comparable results even with fewer parameters.

* need revision
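
As a rough picture of what "discrete code" from a vector quantizer means, the sketch below maps each acoustic frame to the index of its nearest codebook vector. It is a plain single-codebook nearest-neighbor lookup, not the quantizer of any particular audio compression model; the codebook, frames, and function name are invented for illustration.

```python
def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))
        for frame in frames
    ]


# Hypothetical 2-d codebook and frames (real codebooks are learned and much larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8]]

codes = quantize(frames, codebook)  # one integer code per frame
```

The resulting integer sequence is the kind of discrete representation that could then be fed to a reference encoder in place of continuous features.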
