Minseok Kim

Multimodal Distribution Matching for Vision-Language Dataset Distillation

May 22, 2026

Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

Abstract:Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook cross-modal correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

* Accepted for publication at CVPR 2026. Project Page: https://andyj1.github.io/mdm
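
The abstract above mentions symmetric contrastive learning on the unit hypersphere. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss over L2-normalized image and text embeddings. It is not the MDM objective: the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE losses.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature is an assumed, illustrative value.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy data: correctly paired embeddings versus the same embeddings with pairs swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(images, [[2.0, 0.1], [0.1, 3.0]])
swapped = symmetric_contrastive_loss(images, [[0.1, 3.0], [2.0, 0.1]])
```

In this toy case the correctly paired set gives a much smaller loss than the swapped set, which is the behavior a contrastive alignment term is meant to reward.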

Environment-Aware Channel Prediction for Vehicular Communications: A Multimodal Visual Feature Fusion Framework

Apr 02, 2026

Xuejian Zhang, Ruisi He, Minseok Kim, Inocent Calist, Mi Yang, Ziyi Qi

Abstract:The deep integration of communication with intelligence and sensing, as a defining vision of 6G, renders environment-aware channel prediction a key enabling technology. As a representative 6G application, vehicular communications require accurate and forward-looking channel prediction under stringent reliability, latency, and adaptability demands. Traditional empirical and deterministic models remain limited in balancing accuracy, generalization, and deployability, while the growing availability of onboard and roadside sensing devices offers a promising source of environmental priors. This paper proposes an environment-aware channel prediction framework based on multimodal visual feature fusion. Using GPS data and vehicle-side panoramic RGB images, together with semantic segmentation and depth estimation, the framework extracts semantic, depth, and position features through a three-branch architecture and performs adaptive multimodal fusion via a squeeze-excitation attention gating module. For 360-dimensional angular power spectrum (APS) prediction, a dedicated regression head and a composite multi-constraint loss are further designed. As a result, joint prediction of path loss (PL), delay spread (DS), azimuth spread of arrival (ASA), azimuth spread of departure (ASD), and APS is achieved. Experiments on a synchronized urban V2I measurement dataset yield the best root mean square error (RMSE) of 3.26 dB for PL, RMSEs of 37.66 ns, 5.05 degrees, and 5.08 degrees for DS, ASA, and ASD, respectively, and mean/median APS cosine similarities of 0.9342/0.9571, demonstrating strong accuracy, generalization, and practical potential for intelligent channel prediction in 6G vehicular communications.

* 13 pages, 14 figures
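
The results above are reported as RMSE values and as cosine similarities between angular power spectra. For readers unfamiliar with these metrics, here is a minimal sketch of both, applied to a made-up 360-bin spectrum. Whether the paper computes the similarity on linear or dB-scaled power is not stated in the abstract, so the linear scale and the synthetic data used here are assumptions.

```python
import math

def rmse(predicted, measured):
    """Root mean square error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)

def cosine_similarity(a, b):
    """Cosine of the angle between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Synthetic 360-bin spectra (one bin per degree): a single lobe at 90 degrees,
# and a "prediction" whose lobe is shifted by 5 degrees.
measured_aps = [math.exp(-((k - 90) / 15.0) ** 2) for k in range(360)]
predicted_aps = [math.exp(-((k - 95) / 15.0) ** 2) for k in range(360)]

aps_similarity = cosine_similarity(predicted_aps, measured_aps)
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```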

HarassGuard: Detecting Harassment Behaviors in Social Virtual Reality with Vision-Language Models

Apr 01, 2026

Junhee Lee, Minseok Kim, Hwanjo Heo, Seungwon Woo, Jinwoo Kim

Abstract:Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.

* To appear in the 2026 TVCG Special Issue on the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)

Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Mar 16, 2026

Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai (+6 more)

Abstract:Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds--crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-preserving Transformations

Feb 13, 2026

Jiyong Uhm, Minseok Kim, Michalis Polychronakis, Hyungjoon Koo

Abstract:Binary code analysis plays an essential role in cybersecurity, facilitating reverse engineering to reveal the inner workings of programs in the absence of source code. Traditional approaches, such as static and dynamic analysis, extract valuable insights from stripped binaries, but often demand substantial expertise and manual effort. Recent advances in deep learning have opened promising opportunities to enhance binary analysis by capturing latent features and disclosing underlying code semantics. Despite the growing number of binary analysis models based on machine learning, their robustness to adversarial code transformations at the binary level remains underexplored. We evaluate the robustness of deep learning models for the task of binary code similarity detection (BCSD) under semantics-preserving transformations. The unique nature of machine instructions presents distinct challenges compared to the typical input perturbations found in other domains. We introduce asmFooler, a system that evaluates the resilience of BCSD models using a diverse set of adversarial code transformations that preserve functional semantics. We construct a dataset of 9,565 binary variants from 620 baseline samples by applying eight semantics-preserving transformations across six representative BCSD models. Our major findings highlight several key insights: i) model robustness relies on the processing pipeline, including code pre-processing, architecture, and feature selection; ii) adversarial transformation effectiveness is bounded by a budget shaped by model-specific constraints like input size and instruction expressive capacity; iii) well-crafted transformations can be highly effective with minimal perturbations; and iv) such transformations efficiently disrupt model decisions (e.g., misleading to false positives or false negatives) by focusing on semantically significant instructions.

* 23 pages, 9 figures, 5 tables. The paper has been accepted by The ACM International Conference on the Foundations of Software Engineering (FSE 2026)

Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks

Feb 06, 2026

Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song

Abstract:Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26; and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.

* Accepted at ICLR 2026

Synthesized-Isotropic Narrowband Channel Parameter Extraction from Angle-Resolved Wideband Channel Measurements

Feb 02, 2026

Minseok Kim, Masato Yomoda

Abstract:Angle-resolved channel sounding using antenna arrays or mechanically steered high-gain antennas is widely employed at millimeter-wave and terahertz bands. To extract antenna-independent large-scale channel parameters such as path loss, delay spread, and angular spread, the radiation-pattern effects embedded in the measured responses must be properly compensated. This paper revisits the technical challenges of path-gain calculation from angle-resolved wideband measurements, with emphasis on angular-domain power integration where the scan beams are inherently non-orthogonal and simple power summation leads to biased omni-equivalent power estimates. We first formulate the synthesized-isotropic narrowband power in a unified matrix form and introduce a beam-accumulation correction factor, including an offset-averaged variant to mitigate scalloping due to off-grid angles. The proposed framework is validated through simulations using channel models and 154 GHz corridor measurements.
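
To make the bias from simple power summation concrete, the sketch below scans an assumed Gaussian beam over a single propagation path and adds up the power received in every beam. The beam shape, the 10 degree half-power beamwidth, the 5 degree scan step, and the way the correction factor is computed are all assumptions chosen for illustration; the paper's matrix formulation and its actual correction factor are not given in the abstract and may differ.

```python
import math

def beam_gain(offset_deg, hpbw_deg):
    """Gaussian power pattern with unit peak gain (an assumed beam shape)."""
    return math.exp(-4.0 * math.log(2.0) * (offset_deg / hpbw_deg) ** 2)

def wrap(angle_deg):
    """Wrap an angle difference into [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0

def summed_power(path_angle_deg, path_power, hpbw_deg, step_deg):
    """Naive omni estimate: add the power received in every scan beam."""
    n_beams = int(round(360.0 / step_deg))
    return sum(path_power * beam_gain(wrap(path_angle_deg - k * step_deg), hpbw_deg)
               for k in range(n_beams))

def offset_averaged_correction(hpbw_deg, step_deg, n_offsets=50):
    """Reciprocal of the summed beam gain, averaged over off-grid path offsets."""
    sums = [summed_power(step_deg * i / n_offsets, 1.0, hpbw_deg, step_deg)
            for i in range(n_offsets)]
    return len(sums) / sum(sums)

# One unit-power path at an off-grid angle, scanned with overlapping beams.
naive = summed_power(path_angle_deg=37.3, path_power=1.0, hpbw_deg=10.0, step_deg=5.0)
corrected = naive * offset_averaged_correction(hpbw_deg=10.0, step_deg=5.0)
```

With these assumed settings the overlapping beams count the same path roughly twice, so the naive sum overestimates the unit path power by a factor of about 2.13, while the corrected value returns to approximately 1.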

Device-Free Localization Using Multi-Link MIMO Channels in Distributed Antenna Networks

May 07, 2025

Minseok Kim, Gesi Teng, Keita Nishi, Togo Ikegami, Masamune Sato

Abstract:This paper presents a novel device-free localization (DFL) framework based on distributed antenna networks (DANs), targeting integrated sensing and communication (ISAC) in future 6G radio access networks (RANs). In the proposed approach, radio tomographic imaging (RTI) leverages the spatial and temporal diversity of multi-link multiple-input multiple-output (MIMO) channels in DANs to improve localization accuracy. Furthermore, a prototype system was developed using software-defined radios (SDRs) operating in the sub-6 GHz band, and comprehensive evaluations were conducted under indoor conditions involving varying node densities and target types. The results demonstrate that the framework achieves sub-meter localization accuracy in most scenarios and maintains robust performance under complex multipath environments. In addition, the use of Bayesian optimization to fine-tune key parameters, such as sparsity and path thickness, led to significant improvements in image reconstruction quality and target estimation accuracy. These results demonstrate the feasibility and effectiveness of DAN-based DFL systems for accurate, robust, and scalable localization.

Physics-Informed Neural Networks for Optimal Vaccination Plan in SIR Epidemic Models

Feb 27, 2025

Minseok Kim, Yeongjong Kim, Yeoneung Kim

Abstract:This work focuses on understanding the minimum eradication time for the controlled Susceptible-Infectious-Recovered (SIR) model in the time-homogeneous setting, where the infection and recovery rates are constant. The eradication time is defined as the earliest time the infectious population drops below a given threshold and remains below it. For time-homogeneous models, the eradication time is well-defined due to the predictable dynamics of the infectious population, and optimal control strategies can be systematically studied. We utilize Physics-Informed Neural Networks (PINNs) to solve the partial differential equation (PDE) governing the eradication time and derive the corresponding optimal vaccination control. The PINN framework enables a mesh-free solution to the PDE by embedding the dynamics directly into the loss function of a deep neural network. We use a variable scaling method to ensure stable training of the PINN and show mathematically that this method is effective in our setting. This approach provides an efficient computational alternative to traditional numerical methods, allowing for an approximation of the eradication time and the optimal control strategy. Through numerical experiments, we validate the effectiveness of the proposed method in computing the minimum eradication time and achieving optimal control. This work offers a novel application of PINNs to epidemic modeling, bridging mathematical theory and computational practice for time-homogeneous SIR models.
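
The eradication time defined above can be illustrated with a plain forward-Euler simulation, which is a much simpler technique than the PINN approach the paper uses. The sketch assumes one common form of the controlled SIR dynamics, with a constant vaccination rate removing susceptibles; the exact model, the parameter values, and the function name are assumptions for this example and are not taken from the paper.

```python
def eradication_time(s0, i0, beta, gamma, vaccination_rate,
                     threshold=0.01, dt=1e-3, t_max=50.0):
    """Forward-Euler estimate of the earliest time after which I stays below threshold.

    Assumed dynamics (a hypothetical form of the controlled model):
        dS/dt = -beta * S * I - vaccination_rate * S
        dI/dt =  beta * S * I - gamma * I
    "Stays below" is only checked up to t_max. Returns None if I is still
    at or above the threshold at the end of the simulated horizon.
    """
    steps = int(round(t_max / dt))
    s, i = s0, i0
    last_above = -1  # index of the last step at which I >= threshold
    for k in range(steps):
        if i >= threshold:
            last_above = k
        ds = -beta * s * i - vaccination_rate * s
        di = beta * s * i - gamma * i
        s, i = s + dt * ds, i + dt * di
    if last_above == steps - 1:
        return None
    return (last_above + 1) * dt

# Illustrative parameters: population fractions, constant infection and recovery rates.
without_vax = eradication_time(0.9, 0.1, beta=2.0, gamma=1.0, vaccination_rate=0.0)
with_vax = eradication_time(0.9, 0.1, beta=2.0, gamma=1.0, vaccination_rate=0.5)
```

Under these assumed parameters, vaccinating at a constant rate shortens the eradication time relative to no vaccination. Finding the control that minimizes this time is the optimization problem the paper addresses.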

THz Channels for Short-Range Mobile Networks: Multipath Clusters and Human Body Shadowing

Dec 18, 2024

Minseok Kim, Jun-ichi Takada, Minghe Mao, Che Chia Kang, Xin Du, Anirban Ghosh

Abstract:The THz band (0.1-10 THz) is emerging as a crucial enabler for sixth-generation (6G) mobile communication systems, overcoming the limitations of current technologies and unlocking new opportunities for low-latency and ultra-high-speed communications by utilizing several tens of GHz transmission bandwidths. However, extremely high spreading losses and other interaction losses pose significant challenges to establishing wide-area communication coverage, while human body shadowing further complicates maintaining stable communication links. Although point-to-point (P2P) fixed wireless access in the THz band has been successfully demonstrated, realizing fully mobile and reliable wireless access remains a challenge due to numerous issues to be solved for highly directional communication. To provide insights into the design of THz communication systems, this article addresses the challenges associated with THz short-range mobile access networks. It offers an overview of recent findings on the environment-dependence of multipath cluster channel properties and the impact of human body shadowing, based on measurements at 300 GHz using a double-directional high-resolution channel sounder and a motion capture-integrated channel sounder.
