Jaemin Jung

Inference-Time Scaling for Joint Audio-Video Generation

Jun 02, 2026

Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung

Abstract:Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.

* Accepted by Transactions on Machine Learning Research (TMLR). Project page: https://jung-jaemin.github.io/ITS-AVGen-Proj/
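The multi-verifier selection problem described in this abstract can be illustrated with a minimal best-of-N sketch. It is not the paper's ARW algorithm, which the abstract says learns its calibration parameters online. Instead it uses plain per-verifier z-score normalization as a stand-in, so that verifiers with different score scales contribute comparably. The verifier names, the scores, and the function names are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_select(rewards):
    """Pick the candidate with the highest sum of per-verifier z-scores.

    rewards maps a verifier name to a list of raw scores, one per candidate.
    This is a simple stand-in for reward calibration, not ARW itself.
    """
    n = len(next(iter(rewards.values())))
    totals = [0.0] * n
    for scores in rewards.values():
        mu = mean(scores)
        sigma = pstdev(scores) or 1.0  # constant scores express no preference
        for i, s in enumerate(scores):
            totals[i] += (s - mu) / sigma
    best = max(range(n), key=lambda i: totals[i])
    return best, totals


def raw_sum_select(rewards):
    """Baseline: add raw scores, letting large-scale verifiers dominate."""
    n = len(next(iter(rewards.values())))
    sums = [sum(scores[i] for scores in rewards.values()) for i in range(n)]
    return max(range(n), key=lambda i: sums[i])


# Hypothetical scores for three generated candidates.
rewards = {
    "text_alignment": [0.10, 0.30, 0.28],  # roughly in [0, 1]
    "audio_quality": [9.0, 6.0, 8.0],      # roughly in [0, 10]
    "av_sync": [0.2, 0.9, 0.8],            # roughly in [0, 1]
}

best, totals = zscore_select(rewards)
raw_best = raw_sum_select(rewards)
print("normalized choice:", best, "raw-sum choice:", raw_best)
```

With these made-up numbers the raw sum selects candidate 0, whose only strength is the large-scale audio score, while the normalized sum selects candidate 2, which is reasonably good on every verifier.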

RIS-assisted ISAC Systems for Industrial Revolution 6.0: Exploring the Near-field and Far-field Coexistence

Jul 10, 2025

Seonghoon Yoo, Jaemin Jung, Seongah Jeong, Jinkyu Kang, Markku Juntti, Joonhyuk Kang

Abstract:The Industrial Internet of Things (IIoT) has emerged as a key technology for realizing the vision of Industry 6.0, requiring the seamless integration of diverse connected devices. In particular, integrated sensing and communication (ISAC) plays a critical role in supporting real-time control and automation within IIoT systems. In this paper, we explore reconfigurable intelligent surface (RIS)-assisted ISAC systems for IIoT in the coexistence of near-field and far-field regions. The system consists of a full-duplex access point (AP), a RIS and multiple IIoT devices, where the near-field devices simultaneously perform sensing and communication, while the far-field devices rely on RIS-assisted communication. To enhance spectral efficiency for both sensing and communication functionalities, we consider the use of both traditional sensing-only (SO) and ISAC frequency bands. Moreover, uplink non-orthogonal multiple access (NOMA) is employed to facilitate the sequential decoding of superimposed communication and sensing signals from IIoT devices. To maximize sensing accuracy in terms of the Cramér-Rao bound (CRB), we formulate a joint optimization of RIS phase shift, bandwidth splitting ratio and receive beamforming vector subject to the minimum data rate requirements of IIoT devices and resource budget constraints. The algorithmic solution is developed via the successive convex approximation (SCA)-based alternating optimization (AO) method with the semi-definite relaxation (SDR) technique. Numerical results demonstrate that the proposed method significantly outperforms conventional methods relying solely on either the ISAC or SO band by achieving superior performance across RIS and device configurations, while ensuring robust ISAC performance under the near-field and far-field coexistence scenarios.

Test-Time Augmentation for Pose-invariant Face Recognition

May 14, 2025

Jaemin Jung, Youngjoon Jang, Joon Son Chung

Abstract:The goal of this paper is to enhance face recognition performance by augmenting head poses during the testing phase. Existing methods often rely on training on frontalised images or learning pose-invariant representations, yet both approaches typically require re-training and testing for each dataset, involving a substantial amount of effort. In contrast, this study proposes Pose-TTA, a novel approach that aligns faces at inference time without additional training. To achieve this, we employ a portrait animator that transfers the source image identity into the pose of a driving image. Instead of frontalising a side-profile face -- which can introduce distortion -- Pose-TTA generates matching side-profile images for comparison, thereby reducing identity information loss. Furthermore, we propose a weighted feature aggregation strategy to address any distortions or biases arising from the synthetic data, thus enhancing the reliability of the augmented images. Extensive experiments on diverse datasets and with various pre-trained face recognition models demonstrate that Pose-TTA consistently improves inference performance. Moreover, our method is straightforward to integrate into existing face recognition pipelines, as it requires no retraining or fine-tuning of the underlying recognition models.
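The weighted feature aggregation idea in this abstract can be sketched as follows. The abstract does not say how the weights are computed, so this example assumes one plausible rule: each embedding is weighted by a softmax over its cosine similarity to the original image's embedding, so that heavily distorted synthetic views contribute less. The function names, the `temperature` parameter, and the toy vectors are hypothetical and are not taken from Pose-TTA.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def aggregate(original, augmented, temperature=0.1):
    """Fuse the original embedding with embeddings of pose-augmented images.

    Assumed weighting rule: softmax of cosine similarity to the original.
    """
    feats = [l2_normalize(f) for f in [original] + augmented]
    sims = [cosine(original, f) for f in feats]
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * f[i] for w, f in zip(weights, feats))
        for i in range(len(original))
    ]
    return l2_normalize(fused), weights


# Toy 2-D embeddings: one faithful augmentation and one distorted one.
original = [1.0, 0.0]
augmented = [[0.9, 0.1], [0.0, 1.0]]
fused, weights = aggregate(original, augmented)
print("weights:", [round(w, 4) for w in weights])
```

The distorted view (orthogonal to the original) receives a much smaller weight than the faithful one, so the fused embedding stays close to the original identity direction.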

VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

Dec 26, 2024

Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung

Abstract:We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.

* Accepted to ICASSP 2025

Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting

Aug 07, 2024

Youkyum Kim, Jaemin Jung, Jihwan Park, Byeong-Yeol Kim, Joon Son Chung

Abstract:This paper proposes a novel user-defined keyword spotting framework that accurately detects audio keywords based on text enrollment. Since audio data possesses additional acoustic information compared to text, there are discrepancies between these two modalities. To address this challenge, we present ParallelKWS, which utilises self- and cross-attention in a parallel architecture to effectively capture information both within and across the two modalities. We further propose a phoneme duration-based alignment loss that enforces the sequential correspondence between audio and text features. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art performance on several benchmark datasets in both seen and unseen domains, without incorporating extra data beyond the dataset used in previous studies.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Disentangling Structure and Style: Political Bias Detection in News by Inducing Document Hierarchy

Apr 05, 2023

Jiwoo Hong, Yejin Cho, Jaemin Jung, Jiyoung Han, James Thorne

Abstract:We address an important gap in detection of political bias in news articles. Previous works that perform supervised document classification can be biased towards the writing style of each news outlet, leading to overfitting and limited generalizability. Our approach overcomes this limitation by considering both the sentence-level semantics and the document-level rhetorical structure, resulting in a more robust and style-agnostic approach to detecting political bias in news articles. We introduce a novel multi-head hierarchical attention model that effectively encodes the structure of long documents through a diverse ensemble of attention heads. While journalism follows a formalized rhetorical structure, the writing style may vary by news outlet. We demonstrate that our method overcomes this domain dependency and outperforms previous approaches for robustness and accuracy. Further analysis demonstrates the ability of our model to capture the discourse structures commonly used in the journalism domain.

* Preprint. Under review

Metric Learning for User-defined Keyword Spotting

Nov 01, 2022

Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung

Abstract:The goal of this work is to detect new spoken terms defined by users. While most previous works address Keyword Spotting (KWS) as a closed-set classification problem, this limits their transferability to unseen terms. The ability to define custom keywords has advantages in terms of user experience. In this paper, we propose a metric learning-based training strategy for user-defined keyword spotting. In particular, we make the following contributions: (1) we construct a large-scale keyword dataset with an existing speech corpus and propose a filtering method to remove data that degrade model training; (2) we propose a metric learning-based two-stage training strategy, and demonstrate that the proposed method improves the performance on the user-defined keyword spotting task by enriching the keyword representations; (3) to facilitate fair comparison in the user-defined KWS field, we propose a unified evaluation protocol and metrics. Our proposed system does not require incremental training on the user-defined keywords, and outperforms previous works by a significant margin on the Google Speech Commands dataset using the proposed as well as the existing metrics.
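To illustrate why a metric-learned embedding removes the need for incremental training on new keywords, here is a minimal enrollment-and-detection sketch. It assumes an already trained encoder whose outputs are stood in for by hard-coded toy vectors. The prototype averaging, the cosine threshold of 0.7, and all function names are hypothetical choices for illustration, not the paper's evaluation protocol.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def build_prototype(enrollment_embeddings):
    """Average the normalized enrollment embeddings into one keyword prototype."""
    normed = [l2_normalize(e) for e in enrollment_embeddings]
    dim = len(normed[0])
    mean_vec = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean_vec)


def detect(query_embedding, prototype, threshold=0.7):
    """Return (is_keyword, score) using cosine similarity to the prototype."""
    q = l2_normalize(query_embedding)
    score = sum(a * b for a, b in zip(q, prototype))
    return score >= threshold, score


# Toy embeddings standing in for encoder outputs of a user's enrollment clips.
enrollment = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
proto = build_prototype(enrollment)

hit, hit_score = detect([0.95, 0.05, 0.05], proto)   # same keyword
miss, miss_score = detect([0.0, 1.0, 0.0], proto)    # different word
print(hit, round(hit_score, 3), miss, round(miss_score, 3))
```

Adding a new keyword only requires computing a new prototype from a few enrollment clips; the encoder weights are never updated.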
