Dongjin Kim

Continuous Degradation Modeling via Latent Flow Matching for Real-World Super-Resolution

Feb 04, 2026

Hyeonjae Kim, Dongjin Kim, Eugene Jin, Tae Hyun Kim

Abstract: While deep learning-based super-resolution (SR) methods have shown impressive outcomes with synthetic degradation scenarios such as bicubic downsampling, they frequently struggle to perform well on real-world images that feature complex, nonlinear degradations like noise, blur, and compression artifacts. Recent efforts to address this issue have involved the painstaking compilation of real low-resolution (LR) and high-resolution (HR) image pairs, usually limited to several specific downscaling factors. To address these challenges, our work introduces a novel framework capable of synthesizing authentic LR images from a single HR image by leveraging the latent degradation space with flow matching. Our approach generates LR images with realistic artifacts at unseen degradation levels, which facilitates the creation of large-scale, real-world SR training datasets. Comprehensive quantitative and qualitative assessments verify that our synthetic LR images accurately replicate real-world degradations. Furthermore, both traditional and arbitrary-scale SR models trained using our datasets consistently yield much better HR outcomes.

* AAAI 2026

InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

Dec 19, 2025

Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo

Abstract: Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.

* 16 pages, project page: https://myyzzzoooo.github.io/InsertAnywhere/

Long-range electrostatics for machine learning interatomic potentials is easier than we thought

Dec 19, 2025

Dongjin Kim, Bingqing Cheng

Abstract: The lack of long-range electrostatics is a key limitation of modern machine learning interatomic potentials (MLIPs), hindering reliable applications to interfaces, charge-transfer reactions, polar and ionic materials, and biomolecules. In this Perspective, we distill two design principles behind the Latent Ewald Summation (LES) framework, which can capture long-range interactions, charges, and electrical response just by learning from standard energy and force training data: (i) use a Coulomb functional form with environment-dependent charges to capture electrostatic interactions, and (ii) avoid explicit training on ambiguous density functional theory (DFT) partial charges. When both principles are satisfied, substantial flexibility remains: essentially any short-range MLIP can be augmented; charge equilibration schemes can be added when desired; dipoles and Born effective charges can be inferred or finetuned; and charge/spin-state embeddings or tensorial targets can be further incorporated. We also discuss current limitations and open challenges. Together, these minimal, physics-guided design rules suggest that incorporating long-range electrostatics into MLIPs is simpler and perhaps more broadly applicable than is commonly assumed.
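
As a rough illustration of principle (i), the sketch below evaluates a Coulomb functional form from per-atom charges. It is not the LES implementation: it assumes an open (non-periodic) cluster, unit prefactor, and charges passed in by hand, whereas an MLIP would predict them from each atom's local environment and a periodic system would need an Ewald-type sum. The function name `coulomb_energy` is hypothetical.

```python
import numpy as np


def coulomb_energy(positions, charges):
    """Pairwise Coulomb energy sum_{i<j} q_i * q_j / r_ij (unit prefactor).

    positions: (N, 3) array of atomic coordinates.
    charges:   (N,) array of per-atom charges. In an MLIP these would be
               predicted from local environments; here they are inputs.
    """
    positions = np.asarray(positions, dtype=float)
    charges = np.asarray(charges, dtype=float)
    # Indices of all unique pairs i < j.
    i, j = np.triu_indices(len(charges), k=1)
    r = np.linalg.norm(positions[i] - positions[j], axis=1)
    return float(np.sum(charges[i] * charges[j] / r))


if __name__ == "__main__":
    # Opposite unit charges separated by a distance of 2.
    pos = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    print(coulomb_energy(pos, [1.0, -1.0]))
```

Because the energy is a smooth function of both positions and charges, fitting only energies and forces can in principle constrain the charges indirectly, which is the spirit of principle (ii).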

IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising

Aug 27, 2025

Dongjin Kim, Jaekyun Ko, Muhammad Kashif Ali, Tae Hyun Kim

Abstract: Image denoising is a fundamental challenge in computer vision, with applications in photography and medical imaging. While deep learning-based methods have shown remarkable success, their reliance on specific noise distributions limits generalization to unseen noise types and levels. Existing approaches attempt to address this with extensive training data and high computational resources, but they still suffer from overfitting. To address these issues, we conduct image denoising by utilizing dynamically generated kernels via efficient operations. This approach helps prevent overfitting and improves resilience to unseen noise. Specifically, our method leverages a Feature Extraction Module for robust noise-invariant features, and Global Statistics and Local Correlation Modules to capture comprehensive noise characteristics and structural correlations. The Kernel Prediction Module then employs these cues to produce pixel-wise varying kernels adapted to local structures, which are then applied iteratively for denoising. This ensures both efficiency and superior restoration quality. Despite being trained on single-level Gaussian noise, our compact model (~0.04 M) excels across diverse noise types and levels, demonstrating the promise of iterative dynamic filtering for practical image denoising.

* ICCV 2025. Project Page: https://dongjinkim9.github.io/projects/idf/
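
The idea of applying pixel-wise varying kernels iteratively can be sketched as follows. This is a minimal illustration, not the IDF network: the learned Kernel Prediction Module is replaced by a hand-crafted, bilateral-style weighting (`predict_kernels`), and all function names and the `sigma` parameter are assumptions made for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def extract_patches(img, k):
    """Return a (H, W, k, k) view holding the k x k neighborhood of each pixel."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    return sliding_window_view(padded, (k, k))


def predict_kernels(img, k=3, sigma=0.1):
    """Hypothetical stand-in for a learned kernel predictor.

    Weights each neighbor by its intensity similarity to the center pixel,
    then normalizes so that every per-pixel kernel sums to 1.
    """
    patches = extract_patches(img, k)
    diff = patches - img[..., None, None]
    weights = np.exp(-(diff ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum(axis=(-2, -1), keepdims=True)


def apply_kernels(img, kernels):
    """Filter each pixel with its own k x k kernel."""
    patches = extract_patches(img, kernels.shape[-1])
    return (patches * kernels).sum(axis=(-2, -1))


def iterative_dynamic_filter(img, steps=3, k=3, sigma=0.1):
    """Re-predict kernels from the current estimate and apply them, `steps` times."""
    out = np.asarray(img, dtype=float)
    for _ in range(steps):
        out = apply_kernels(out, predict_kernels(out, k, sigma))
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = 0.5 + 0.1 * rng.standard_normal((32, 32))
    denoised = iterative_dynamic_filter(noisy, steps=3, k=3, sigma=0.5)
    print(noisy.std(), denoised.std())
```

Since each kernel is non-negative and normalized, every output pixel is a weighted average of its neighborhood, so repeated application smooths noise while the similarity weighting limits blurring across edges.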

Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization

Aug 26, 2025

Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta, Haonan Luo, Tianrui Li

Abstract: Video stabilization remains a fundamental problem in computer vision, and pixel-level synthesis solutions, which synthesize full-frame outputs, add to the complexity of this task. These methods aim to enhance stability while synthesizing full-frame videos, but the inherent diversity in motion profiles and visual content present in each video sequence makes robust generalization with fixed parameters difficult. To address this, we present a novel method that improves pixel-level synthesis video stabilization methods by rapidly adapting models to each input video at test time. The proposed approach takes advantage of low-level visual cues available during inference to improve both the stability and visual quality of the output. Notably, the proposed rapid adaptation achieves significant performance gains even with a single adaptation pass. We further propose a jerk localization module and a targeted adaptation strategy, which focuses the adaptation on high-jerk segments to maximize stability with fewer adaptation steps. The proposed methodology enables modern stabilizers to surpass the longstanding SOTA approaches while maintaining the full-frame nature of the modern methods and offering users control mechanisms akin to classical approaches. Extensive experiments on diverse real-world datasets demonstrate the versatility of the proposed method. Our approach consistently improves the performance of various full-frame synthesis models in both qualitative and quantitative terms, including results on downstream applications.
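
One simple way to picture "high-jerk segments" is to take jerk as the third time derivative of a per-frame camera trajectory and threshold its magnitude. The sketch below does exactly that with finite differences. It is an assumption-laden illustration, not the paper's jerk localization module: the trajectory input, the `threshold` parameter, and both function names are hypothetical.

```python
import numpy as np


def jerk_magnitude(trajectory):
    """Third finite difference of a (T, D) trajectory, as a per-sample norm.

    Sample i of the result is computed from frames i .. i+3, so the output
    has T - 3 entries.
    """
    traj = np.asarray(trajectory, dtype=float)
    jerk = np.diff(traj, n=3, axis=0)
    return np.linalg.norm(jerk, axis=1)


def high_jerk_segments(trajectory, threshold):
    """Return inclusive (start, end) index ranges where jerk exceeds threshold."""
    mask = jerk_magnitude(trajectory) > threshold
    segments = []
    start = None
    for idx, flag in enumerate(mask):
        if flag and start is None:
            start = idx  # a new high-jerk run begins
        elif not flag and start is not None:
            segments.append((start, idx - 1))  # the run ended at the previous sample
            start = None
    if start is not None:
        segments.append((start, len(mask) - 1))
    return segments


if __name__ == "__main__":
    # Smooth horizontal pan with a single vertical jolt at frame 10.
    frames = np.arange(20, dtype=float)
    traj = np.stack([frames, np.zeros(20)], axis=1)
    traj[10, 1] += 5.0
    print(high_jerk_segments(traj, threshold=1.0))
```

A targeted adaptation loop could then spend its update steps only on the frames covered by the returned ranges instead of on the whole video.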

TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation

Jun 08, 2025

Min-Jung Kim, Dongjin Kim, Seokju Yun, Jaegul Choo

Abstract: Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model guided by the layer vitality. For object addition, we further identify prominent layers to extract the mask regions corresponding to the newly added target prompt. We found that the extracted masks from the prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: https://emjay73.github.io/TV_LiVE/

Machine learning interatomic potential can infer electrical response

Apr 07, 2025

Peichen Zhong, Dongjin Kim, Daniel S. King, Bingqing Cheng

Abstract: Modeling the response of material and chemical systems to electric fields remains a longstanding challenge. Machine learning interatomic potentials (MLIPs) offer an efficient and scalable alternative to quantum mechanical methods but do not by themselves incorporate electrical response. Here, we show that polarization and Born effective charge (BEC) tensors can be directly extracted from long-range MLIPs within the Latent Ewald Summation (LES) framework, solely by learning from energy and force data. Using this approach, we predict the infrared spectra of bulk water under zero or finite external electric fields, ionic conductivities of high-pressure superionic ice, and the phase transition and hysteresis in ferroelectric PbTiO$_3$ perovskite. This work thus extends the capability of MLIPs to predict electrical response--without training on charges or polarization or BECs--and enables accurate modeling of electric-field-driven processes in diverse systems at scale.

Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection

Jan 05, 2025

Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim

Abstract: The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR

* Accepted at AAAI 2025

Learning charges and long-range interactions from energies and forces

Dec 19, 2024

Dongjin Kim, Daniel S. King, Peichen Zhong, Bingqing Cheng

Abstract: Accurate modeling of long-range forces is critical in atomistic simulations, as they play a central role in determining the properties of materials and chemical systems. However, standard machine learning interatomic potentials (MLIPs) often rely on short-range approximations, limiting their applicability to systems with significant electrostatics and dispersion forces. We recently introduced the Latent Ewald Summation (LES) method, which captures long-range electrostatics without explicitly learning atomic charges or charge equilibration. Extending LES, we incorporate the ability to learn physical partial charges, encode charge states, and the option to impose charge neutrality constraints. We benchmark LES on diverse and challenging systems, including charged molecules, ionic liquid, electrolyte solution, polar dipeptides, surface adsorption, electrolyte/solid interfaces, and solid-solid interfaces. Our results show that LES can effectively infer physical partial charges, dipole and quadrupole moments, as well as achieve better accuracy compared to methods that explicitly learn charges. LES thus provides an efficient, interpretable, and generalizable MLIP framework for simulating complex systems with intricate charge transfer and long-range

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Mar 26, 2024

Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Abstract: The goal of the multi-sound source localization task is to localize each sound source in a mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise an object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object while also distinguishing between different objects and backgrounds. This enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show significant performance improvements of the proposed method over existing methods for both single- and multi-source localization. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL

* Accepted at CVPR 2024
