Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jaehyuk Jang

T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models

Jun 22, 2026

Jaehyuk Jang, Minseok Seo. Seungju Cho, Kangwook Ko, Changick Kim

Abstract:Vision-language models (VLMs) achieve strong zero-shot recognition, but they remain highly vulnerable to adversarial perturbations. Recent test-time adaptations improve robustness without retraining, but they do not directly adapt the corrupted visual representation itself. Prompt-based methods adapt the learnable text prompts, while input-space methods optimize pixels or padding at test time. These approaches can improve predictions, but they do so through an indirect and expensive optimization path. We propose Test-time Visual Subspace Steering (T-VSS), a lightweight defense that performs test-time adaptation directly in the visual feature space. T-VSS first builds a sample-specific low-rank subspace from multi-view feature residuals anchored at the attacked image. It then learns a shared feature correction within this subspace using reliability-weighted entropy minimization. By constraining adaptation to a compact visual geometry, T-VSS steers attacked features toward more stable and discriminative predictions while avoiding noisy full-space updates. Experiments on fine-grained, ImageNet, and ImageNet-OOD benchmarks show that T-VSS improves adversarial robustness while maintaining competitive clean accuracy and better efficiency than prior test-time adaptations.

Via

Access Paper or Ask Questions

Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models

Jun 14, 2026

Hyebin Cho, Jaehyuk Jang, Changick Kim, Joon Son Chung

Abstract:Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance focus on learning optimal text prompts. However, previous approaches focus on the text encoder, leaving the potential of learnable prompts within the audio encoder unexplored. In this paper, we propose a novel framework that introduces trainable prompts into the audio encoder to capture task-specific acoustic features. We demonstrate that integrating audio-side prompt learning with existing text-side approaches enhances few-shot adaptation. Through extensive experiments across 11 datasets show that integrating our method as a plug-and-play module alongside existing text prompt tuning generally leads to performance improvements. These findings suggest that explicitly modulating the audio representation space effectively complements text-only prompting approaches. The code is available at https://github.com/hyebin-c/aspl.

* Accepted to INTERSPEECH 2026

Via

Access Paper or Ask Questions

Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

May 23, 2026

Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim

Abstract:Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but its mechanism remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.

* ICML 2026 Spotlight

Via

Access Paper or Ask Questions

Multimodal Self-Attention Network with Temporal Alignment for Audio-Visual Emotion Recognition

Mar 11, 2026

Inyong Koo, yeeun Seong, Minseok Son, Jaehyuk Jang, Changick Kim

Abstract:Audio-visual emotion recognition (AVER) methods typically fuse utterance-level features, and even frame-level attention models seldom address the frame-rate mismatch across modalities. In this paper, we propose a Transformer-based framework focusing on the temporal alignment of multimodal features. Our design employs a multimodal self-attention encoder that simultaneously captures intra- and inter-modal dependencies within a shared feature space. To address heterogeneous sampling rates, we incorporate Temporally-aligned Rotary Position Embeddings (TaRoPE), which implicitly synchronize audio and video tokens. Furthermore, we introduce a Cross-Temporal Matching (CTM) loss that enforces consistency among temporally proximate pairs, guiding the encoder toward better alignment. Experiments on CREMA-D and RAVDESS datasets demonstrate consistent improvements over recent baselines, suggesting that explicitly addressing frame-rate mismatch helps preserve temporal cues and enhances cross-modal fusion.

* 5 pages, 3 figures, accepted to ICASSP 2026

Via

Access Paper or Ask Questions

Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation

Mar 03, 2026

Minseok Seo, Wonjun Lee, Jaehyuk Jang, Changick Kim

Abstract:Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward--backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.

* 17 pages, 7 figures [We achieved a new Pareto frontier in test-time depth completion.]

Via

Access Paper or Ask Questions

Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

May 28, 2024

Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, Changick Kim

Abstract:This study addresses the issue observed in Large Vision Language Models (LVLMs), where excessive attention on a few image tokens, referred to as blind tokens, leads to hallucinatory responses in tasks requiring fine-grained understanding of visual objects. We found that tokens receiving lower attention weights often hold essential information for identifying nuanced object details -- ranging from merely recognizing object existence to identifying their attributes (color, position, etc.) and understanding their relationships. To counteract the over-emphasis on blind tokens and to accurately respond to user queries, we introduce a technique called Attentional Vision Calibration (AVC). During the decoding phase, AVC identifies blind tokens by analyzing the image-related attention distribution. It then dynamically adjusts the logits for the next token prediction by contrasting the logits conditioned on the original visual tokens with those conditioned on the blind tokens. This effectively lowers the dependency on blind tokens and promotes a more balanced consideration of all tokens. We validate AVC on benchmarks such as POPE, MME, and AMBER, where it consistently outperforms existing decoding techniques in mitigating object hallucinations in LVLMs.

* Project page: https://sangminwoo.github.io/AvisC/

Via

Access Paper or Ask Questions

RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

May 28, 2024

Sangmin Woo, Jaehyuk Jang, Donguk Kim, Yubin Choi, Changick Kim

Figure 1 for RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

Figure 2 for RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

Figure 3 for RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

Figure 4 for RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

Abstract:Recent advancements in Large Vision Language Models (LVLMs) have revolutionized how machines understand and generate textual responses based on visual inputs. Despite their impressive capabilities, they often produce "hallucinatory" outputs that do not accurately reflect the visual information, posing challenges in reliability and trustworthiness. Current methods such as contrastive decoding have made strides in addressing these issues by contrasting the original probability distribution of generated tokens with distorted counterparts; yet, generating visually-faithful outputs remains a challenge. In this work, we shift our focus to the opposite: What could serve as a complementary enhancement to the original probability distribution? We propose a simple, training-free method termed RITUAL to enhance robustness against hallucinations in LVLMs. Our approach employs random image transformations as complements to the original probability distribution, aiming to mitigate the likelihood of hallucinatory visual explanations by enriching the model's exposure to varied visual scenarios. Our empirical results show that while the isolated use of transformed images initially degrades performance, strategic implementation of these transformations can indeed serve as effective complements. Notably, our method is compatible with current contrastive decoding methods and does not require external models or costly self-feedback mechanisms, making it a practical addition. In experiments, RITUAL significantly outperforms existing contrastive decoding methods across several object hallucination benchmarks, including POPE, CHAIR, and MME.

* Project page: https://sangminwoo.github.io/RITUAL/

Via

Access Paper or Ask Questions

Towards Robust Multimodal Prompting With Missing Modalities

Dec 27, 2023

Jaehyuk Jang, Yooseung Wang, Changick Kim

Figure 1 for Towards Robust Multimodal Prompting With Missing Modalities

Figure 2 for Towards Robust Multimodal Prompting With Missing Modalities

Figure 3 for Towards Robust Multimodal Prompting With Missing Modalities

Figure 4 for Towards Robust Multimodal Prompting With Missing Modalities

Abstract:Recently, multimodal prompting, which introduces learnable missing-aware prompts for all missing modality cases, has exhibited impressive performance. However, it encounters two critical issues: 1) The number of prompts grows exponentially as the number of modalities increases; and 2) It lacks robustness in scenarios with different missing modality settings between training and inference. In this paper, we propose a simple yet effective prompt design to address these challenges. Instead of using missing-aware prompts, we utilize prompts as modality-specific tokens, enabling them to capture the unique characteristics of each modality. Furthermore, our prompt design leverages orthogonality between prompts as a key element to learn distinct information across different modalities and promote diversity in the learned representations. Extensive experiments demonstrate that our prompt design enhances both performance and robustness while reducing the number of prompts.

* Accepted to ICASSP 2024

Via

Access Paper or Ask Questions