Abstract:Zero-shot domain adaptation is a method for adapting a model to a target domain without utilizing target domain image data. To enable adaptation without target images, existing studies utilize CLIP's embedding space and text descriptions to simulate target-like style features. Despite these previous achievements in zero-shot domain adaptation, we observe that text-driven methods struggle to capture complex real-world variations and significantly increase adaptation time due to their alignment process. Instead of relying on text descriptions, we explore solutions leveraging image data, which provides diverse and more fine-grained style cues. In this work, we propose SIDA, a novel and efficient zero-shot domain adaptation method leveraging synthetic images. To generate synthetic images, we first create detailed, source-like images and apply image translation to reflect the style of the target domain. We then utilize the style features of these synthetic images as a proxy for the target domain. Based on these features, we introduce Domain Mix and Patch Style Transfer modules, which enable effective modeling of real-world variations. In particular, Domain Mix blends multiple styles to expand the intra-domain representations, and Patch Style Transfer assigns different styles to individual patches. We demonstrate the effectiveness of our method by showing state-of-the-art performance in diverse zero-shot adaptation scenarios, particularly in challenging domains. Moreover, our approach achieves high efficiency by significantly reducing the overall adaptation time.
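As a rough, illustrative sketch of how the two modules described above could operate on feature maps (an assumption-laden toy example, not the authors' implementation): Domain Mix is shown as a random blend of channel-wise feature statistics from several synthetic-image features, and Patch Style Transfer as an AdaIN-style re-stylization applied per spatial patch. The tensor shapes, patch size, and use of PyTorch are all assumptions.

```python
# Illustrative sketch only: AdaIN-style statistics mixing, assuming the style of a
# feature map is summarized by its channel-wise mean and std (not the paper's code).
import torch

def channel_stats(feat):                      # feat: (C, H, W)
    mu = feat.mean(dim=(1, 2), keepdim=True)
    sigma = feat.std(dim=(1, 2), keepdim=True) + 1e-6
    return mu, sigma

def domain_mix(style_feats):
    """Blend the styles of several synthetic-image feature maps with random weights."""
    w = torch.rand(len(style_feats))
    w = w / w.sum()
    stats = [channel_stats(f) for f in style_feats]
    mu = sum(wi * m for wi, (m, _) in zip(w, stats))
    sigma = sum(wi * s for wi, (_, s) in zip(w, stats))
    return mu, sigma

def patch_style_transfer(content, style_feats, patch=8):
    """Assign a different mixed style to each spatial patch of the content features."""
    out = content.clone()
    C, H, W = content.shape
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            block = content[:, y:y + patch, x:x + patch]
            c_mu, c_sigma = channel_stats(block)
            s_mu, s_sigma = domain_mix(style_feats)      # a new mixed style per patch
            out[:, y:y + patch, x:x + patch] = s_sigma * (block - c_mu) / c_sigma + s_mu
    return out

content = torch.randn(64, 32, 32)                         # toy source feature map
styles = [torch.randn(64, 32, 32) for _ in range(4)]      # toy synthetic-image features
stylized = patch_style_transfer(content, styles)
```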
Abstract:Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
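A minimal sketch of the one-to-many retrieval plus cycle-consistency-style check described above, assuming precomputed, L2-normalized caption and image embeddings from a CLIP-like encoder. The scoring rule used here (rank of the original caption under image-to-text retrieval) is an illustrative choice, not necessarily SynC's exact scorer.

```python
# Sketch of one-to-many retrieval with a cycle-consistency-style check, assuming
# precomputed L2-normalized caption/image embeddings from a CLIP-like encoder.
import torch

def reassign_captions(cap_emb, img_emb, k=5):
    """For each caption, pick the candidate image that best retrieves that caption back."""
    sim = cap_emb @ img_emb.T                     # (num_captions, num_images)
    topk = sim.topk(k, dim=1).indices             # k candidate images per caption
    assignments = []
    for c, cands in enumerate(topk):
        best_img, best_rank = cands[0].item(), float("inf")
        for i in cands.tolist():
            # image-to-text retrieval: where does caption c rank for candidate image i?
            rank = (img_emb[i] @ cap_emb.T).argsort(descending=True).tolist().index(c)
            if rank < best_rank:
                best_img, best_rank = i, rank
        assignments.append(best_img)
    return assignments                            # image index assigned to each caption

caps = torch.nn.functional.normalize(torch.randn(100, 512), dim=1)   # toy embeddings
imgs = torch.nn.functional.normalize(torch.randn(100, 512), dim=1)
pairs = reassign_captions(caps, imgs)
```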
Abstract:Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.
Abstract:Traditional microlensing event vetting methods require highly trained human experts, and the process is both complex and time-consuming. This reliance on manual inspection often leads to inefficiencies and constrains the ability to scale for widespread exoplanet detection, ultimately hindering discovery rates. To address these limitations, we have developed LensNet, a machine learning pipeline specifically designed to distinguish legitimate microlensing events from false positives caused by instrumental artifacts, such as pixel bleed trails and diffraction spikes. Our system operates in conjunction with a preliminary algorithm that detects increasing trends in flux. These flagged instances are then passed to LensNet for further classification, allowing for timely alerts and follow-up observations. Tailored for the multi-observatory setup of the Korea Microlensing Telescope Network (KMTNet) and trained on a rich dataset of manually classified events, LensNet is optimized for early detection and warning of microlensing occurrences, enabling astronomers to organize follow-up observations promptly. The internal model of the pipeline employs a multi-branch Recurrent Neural Network (RNN) architecture that evaluates time-series flux data with contextual information, including sky background, the full width at half maximum (FWHM) of the target star, flux errors, PSF quality flags, and air mass for each observation. We demonstrate a classification accuracy above 87.5% and anticipate further improvements as we expand our training set and continue to refine the algorithm.
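A minimal sketch of a multi-branch RNN of the kind described above, with one branch for the flux time series and one for the contextual channels (sky background, FWHM, flux errors, PSF quality flags, air mass). The branch layout, hidden sizes, and choice of GRUs are assumptions, not LensNet's actual architecture.

```python
# Minimal sketch of a multi-branch RNN classifier for flux time series plus
# contextual channels. Branch layout, hidden sizes, and GRU cells are assumptions.
import torch
import torch.nn as nn

class MultiBranchRNN(nn.Module):
    def __init__(self, n_context=5, hidden=64):
        super().__init__()
        self.flux_branch = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.ctx_branch = nn.GRU(input_size=n_context, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, flux, context):
        _, h_flux = self.flux_branch(flux)        # flux: (B, T, 1)
        _, h_ctx = self.ctx_branch(context)       # context: (B, T, n_context)
        h = torch.cat([h_flux[-1], h_ctx[-1]], dim=1)
        return self.head(h)                       # logits: real event vs. artifact

model = MultiBranchRNN()
logits = model(torch.randn(8, 200, 1), torch.randn(8, 200, 5))   # toy batch
```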
Abstract:Recent lightweight image captioning models using retrieved data mainly focus on text prompts. However, previous works utilize the retrieved text only as text prompts, while the visual information relies solely on the CLIP visual embedding. As a result, the image descriptions inherent in the prompt are not sufficiently reflected in the visual embedding space. To tackle this issue, we propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning. ViPCap leverages the retrieved text with image information as visual prompts to enhance the ability of the model to capture relevant visual information. By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, our method leverages sampling to explore randomly augmented distributions and effectively retrieves the semantic features that contain image information. These retrieved features are integrated into the image and designated as the visual prompt, leading to performance improvements on datasets such as COCO, Flickr30k, and NoCaps. Experimental results demonstrate that ViPCap significantly outperforms prior lightweight captioning models in efficiency and effectiveness, showing its potential as a plug-and-play solution.
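A toy sketch of the sampling idea described above: perturb a retrieved-text embedding with Gaussian noise in CLIP space and keep the samples closest to the image embedding as a visual prompt. The noise scale, sample count, and selection rule are assumptions rather than ViPCap's actual design.

```python
# Toy sketch: sample randomized Gaussian perturbations around a retrieved-text
# embedding and keep the samples closest to the image embedding as a visual prompt.
import torch
import torch.nn.functional as F

def visual_prompt_from_text(text_emb, image_emb, n_samples=64, sigma=0.1, top_m=8):
    noise = torch.randn(n_samples, text_emb.shape[-1]) * sigma
    samples = F.normalize(text_emb.unsqueeze(0) + noise, dim=-1)    # randomized distribution
    sim = samples @ F.normalize(image_emb, dim=-1)                  # similarity to the image
    keep = sim.topk(top_m).indices
    return samples[keep].mean(dim=0)            # aggregated feature used as a visual prompt

text_emb = F.normalize(torch.randn(512), dim=-1)    # toy CLIP-space embeddings
image_emb = F.normalize(torch.randn(512), dim=-1)
prompt = visual_prompt_from_text(text_emb, image_emb)
```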
Abstract:In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes away HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.
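A simplified illustration of what a local hard negative loss can look like: compare image patch tokens with text tokens instead of global embeddings and apply a margin between the original caption and its hard negative. This is not FSC-CLIP's exact formulation; the aggregation rule, margin, and shapes are assumptions.

```python
# Simplified illustration (not FSC-CLIP's exact loss): a "local" hard-negative loss
# over token-to-patch similarities with a max-over-patches / mean-over-tokens score.
import torch
import torch.nn.functional as F

def local_alignment(patch_tokens, text_tokens):
    # patch_tokens: (P, D), text_tokens: (T, D); both assumed L2-normalized
    sim = text_tokens @ patch_tokens.T            # (T, P) token-to-patch similarities
    return sim.max(dim=1).values.mean()           # each text token attends to its best patch

def local_hn_loss(patch_tokens, pos_tokens, hn_tokens, margin=0.2):
    pos = local_alignment(patch_tokens, pos_tokens)
    neg = local_alignment(patch_tokens, hn_tokens)
    return F.relu(margin - (pos - neg))           # penalize only when the HN text wins

patches = F.normalize(torch.randn(49, 512), dim=-1)     # toy image patch embeddings
pos_txt = F.normalize(torch.randn(12, 512), dim=-1)     # original caption tokens
hn_txt = F.normalize(torch.randn(12, 512), dim=-1)      # hard-negative caption tokens
loss = local_hn_loss(patches, pos_txt, hn_txt)
```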
Abstract:Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ($\textbf{I}$mage-like Retrieval and $\textbf{F}$requency-based Entity Filtering for Zero-shot $\textbf{Cap}$tioning). Through extensive experimentation, our straightforward yet powerful approach demonstrates its efficacy, outperforming state-of-the-art zero-shot captioning methods based on text-only training by a significant margin in both image captioning and video captioning.
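A minimal sketch of frequency-based entity filtering in the spirit described above: keep an entity only if it occurs in enough of the retrieved captions. Entity extraction (e.g., noun-phrase parsing) is assumed to happen upstream, and the threshold is an illustrative choice, not IFCap's tuned value.

```python
# Minimal sketch of frequency-based entity filtering: keep an entity for the prompt
# only if it appears in at least `min_count` of the retrieved captions.
from collections import Counter

def filter_entities(entities_per_caption, min_count=2):
    counts = Counter(e for ents in entities_per_caption for e in set(ents))
    return [e for e, c in counts.items() if c >= min_count]

retrieved = [
    ["dog", "frisbee", "park"],
    ["dog", "grass"],
    ["dog", "frisbee"],
]
print(filter_entities(retrieved))   # ['dog', 'frisbee'] survive the frequency check
```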
Abstract:The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition. These include the need for complex multi-scale features, such as hands, face, and mouth, for understanding, and the absence of frame-level annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv), which extracts both manual and non-manual features without the need for additional networks or annotations, and (2) Dense Pseudo-Label Refinement (DPLR), which propagates non-spiky frame-level pseudo-labels by combining the ground-truth gloss sequence labels with the predicted sequence. We demonstrate that our model achieves state-of-the-art performance among RGB-based methods on large-scale CSLR benchmarks, PHOENIX-2014 and PHOENIX-2014-T, while showing comparable results with better efficiency compared to approaches that use multi-modality or extra annotations.
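As one illustrative way (not necessarily DPLR itself) to combine a ground-truth gloss sequence with frame-wise predictions into dense, non-spiky frame-level pseudo-labels, the sketch below runs a monotonic forced alignment that segments the frames along the gloss sequence; the dynamic program and toy inputs are assumptions.

```python
# Sketch: monotonic forced alignment of frames to a ground-truth gloss sequence,
# producing one contiguous segment per gloss (hence dense, non-spiky frame labels).
import numpy as np

def dense_pseudo_labels(log_probs, gloss_seq):
    # log_probs: (T, V) frame-wise log-probabilities; gloss_seq: list of gloss ids
    T, L = log_probs.shape[0], len(gloss_seq)
    dp = np.full((T, L), -np.inf)
    back = np.zeros((T, L), dtype=int)
    dp[0, 0] = log_probs[0, gloss_seq[0]]
    for t in range(1, T):
        for l in range(min(t + 1, L)):
            stay = dp[t - 1, l]
            move = dp[t - 1, l - 1] if l > 0 else -np.inf
            back[t, l] = 0 if stay >= move else 1
            dp[t, l] = max(stay, move) + log_probs[t, gloss_seq[l]]
    labels, l = [], L - 1
    for t in range(T - 1, -1, -1):            # backtrack the best monotonic path
        labels.append(gloss_seq[l])
        l -= back[t, l]
    return labels[::-1]                        # one gloss label per frame, in order

probs = np.random.dirichlet(np.ones(10), size=50)          # toy (T=50, V=10) predictions
print(dense_pseudo_labels(np.log(probs + 1e-12), [3, 7, 1]))
```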
Abstract:We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In contrast to manually annotating all the training samples, separately collecting uni-modal datasets is immensely easier, e.g., a large-scale image dataset and a sentence dataset. We leverage such massive unpaired image and caption data on top of standard paired data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples in an adversarial learning fashion, where the joint distribution of image and caption is learned. Our method trains a captioner to learn from paired data and to progressively associate unpaired data. This approach shows noticeable performance improvement even in challenging scenarios, including out-of-task data (i.e., relational captioning, where the target task is different from the unpaired data) and web-crawled data. We also show that our proposed method is theoretically well-motivated and has a favorable global optimality property. Our extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired COCO dataset, demonstrate the consistent effectiveness of our semi-supervised learning method with unpaired data compared to competing methods.
Abstract:The goal of this work is background-robust continuous sign language recognition. Most existing Continuous Sign Language Recognition (CSLR) benchmarks are filmed in studios with a fixed, static monochromatic background. In the real world, however, signing is not limited to studios. In order to analyze the robustness of CSLR models under background shifts, we first evaluate existing state-of-the-art CSLR models on diverse backgrounds. To synthesize sign videos with a variety of backgrounds, we propose a pipeline that automatically generates a benchmark dataset from existing CSLR benchmarks. Our newly constructed benchmark dataset consists of diverse scenes that simulate real-world environments. We observe that even the most recent CSLR methods cannot recognize glosses well on our new dataset with changed backgrounds. In this regard, we also propose a simple yet effective training scheme including (1) background randomization and (2) feature disentanglement for CSLR models. The experimental results on our dataset demonstrate that our method generalizes well to other unseen background data with minimal additional training images.
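A minimal sketch of background randomization for sign videos: composite the signer region of each frame onto a randomly chosen scene using a foreground mask. The mask source (e.g., an off-the-shelf person-segmentation model) and the per-clip background choice are assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of background randomization: alpha-composite the signer onto a
# randomly chosen background scene, keeping one background per clip.
import numpy as np

def randomize_background(frames, masks, backgrounds):
    # frames: (T, H, W, 3) uint8, masks: (T, H, W) in [0, 1], backgrounds: list of (H, W, 3)
    bg = backgrounds[np.random.randint(len(backgrounds))].astype(np.float32)
    m = masks[..., None].astype(np.float32)
    out = frames.astype(np.float32) * m + bg * (1.0 - m)
    return out.astype(np.uint8)

frames = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)   # toy sign clip
masks = np.random.rand(16, 64, 64) > 0.5                              # toy person masks
bgs = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(4)]
augmented = randomize_background(frames, masks, bgs)
```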