Dahye Kim

Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection

Jun 09, 2026

Dahye Kim, Jaehyun Choi, Hyun Seok Seong, Seongho Kim, Donghun Lee, Sungwon Yi, Jang-Ho Choi

Abstract: While existing AI-generated image detectors report high performance, we identify that this is largely driven by a critical prediction asymmetry: a bias toward the real class that severely limits sensitivity to generated content, especially under standard post-processing operations such as compression and resizing. We hypothesize that this stems from the model's reliance on spurious features, distracting signals that obscure true generative artifacts. To address this, we propose DEAR (Dissect and Prune), which leverages inpainted images to identify and prune these interfering components. Specifically, we find that features strongly aligned with either inpainted or non-inpainted regions are less robust to post-processing. By measuring the alignment between channel activations and inpaint masks, DEAR removes features at both extremes, retaining only those that capture genuine generative artifacts. Experimental results demonstrate that our approach significantly enhances robustness against unseen generators and post-processing, effectively mitigating the prediction asymmetry. Our code is available at https://github.com/dahyedahye/dear.

* 25 pages, 9 figures, 9 tables, Accepted to ICML 2026; includes appendix
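
The channel-pruning idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the alignment score (the share of a channel's activation mass that falls inside the inpaint mask), the `prune_frac` parameter, and both function names are assumptions chosen for illustration.

```python
def mask_alignment(channel, mask):
    """Share of a channel's absolute activation that lies inside the mask.

    `channel` and `mask` are flat lists of equal length; mask entries are 0 or 1.
    This score is a hypothetical stand-in for the alignment measure in the paper.
    """
    total = sum(abs(a) for a in channel)
    if total == 0:
        return 0.5  # an inactive channel is treated as neutral
    inside = sum(abs(a) for a, m in zip(channel, mask) if m)
    return inside / total


def keep_middle_channels(activations, mask, prune_frac=0.25):
    """Drop the channels with the lowest and highest alignment scores.

    Returns the indices of retained channels and the per-channel scores.
    """
    scores = [mask_alignment(c, mask) for c in activations]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = int(len(order) * prune_frac)  # number pruned at each extreme
    kept = order[k:len(order) - k]
    return sorted(kept), scores


mask = [1, 1, 0, 0]
activations = [
    [1, 1, 0, 0],  # fires only inside the inpainted region
    [0, 0, 1, 1],  # fires only outside it
    [1, 0, 1, 0],  # mixed
    [1, 1, 1, 0],  # mostly inside
]
kept, scores = keep_middle_channels(activations, mask)
print(kept, scores)
```

In this toy input, the two channels that fire exclusively inside or exclusively outside the mask are the ones removed, which mirrors the "both extremes" pruning described in the abstract.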

Swift Sampling: Selecting Temporal Surprises via Taylor Series

May 21, 2026

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

Abstract: While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight, adding only 0.02x computational cost over the baseline, which makes its overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
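
The Taylor-based selection described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: velocity and acceleration are approximated with finite differences over the three preceding frames, surprise is the Euclidean distance between the actual and predicted feature, and the function names are hypothetical.

```python
import math


def predict_next(f1, f2, f3):
    """Second-order Taylor step from the three most recent frames (f1 is newest)."""
    pred = []
    for x1, x2, x3 in zip(f1, f2, f3):
        velocity = x1 - x2               # first finite difference
        acceleration = x1 - 2 * x2 + x3  # second finite difference
        pred.append(x1 + velocity + 0.5 * acceleration)
    return pred


def surprise_scores(features):
    """Distance between each frame's feature and its Taylor prediction.

    The first three frames have no full history and get a score of 0.0.
    """
    scores = [0.0] * len(features)
    for t in range(3, len(features)):
        pred = predict_next(features[t - 1], features[t - 2], features[t - 3])
        scores[t] = math.dist(features[t], pred)
    return scores


def select_frames(features, budget):
    """Indices of the `budget` most surprising frames, in temporal order."""
    scores = surprise_scores(features)
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])


# Toy trajectory: steady motion along one axis with a single jump at frame 6.
features = [[float(i), 0.0] for i in range(10)]
features[6] = [6.0, 5.0]
print(select_frames(features, 3))
```

With finite differences, the frames immediately after the jump also score highly, because the jump contaminates their velocity and acceleration estimates. A real implementation would need to decide how to handle that.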

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Feb 19, 2026

Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde

Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan $2.1$, respectively, without compromising generation quality or prompt adherence.
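
The coarse-to-fine intuition in the abstract above can be made concrete with a small token-count sketch. It is a simplification under stated assumptions: the schedule depends only on the denoising step (the abstract also mentions content complexity, which is ignored here), and the patch sizes `(4, 2, 1)`, the equal-length phases, and the function names are hypothetical.

```python
def patch_size_for_step(step, num_steps, sizes=(4, 2, 1)):
    """Pick a patch size for a denoising step, coarse first and fine last.

    The steps are split into equal phases, one per entry in `sizes`.
    """
    phase = min(step * len(sizes) // num_steps, len(sizes) - 1)
    return sizes[phase]


def token_count(height, width, patch):
    """Number of tokens when a height x width latent is cut into square patches."""
    return (height // patch) * (width // patch)


def total_tokens(height, width, num_steps, sizes=(4, 2, 1)):
    """Tokens processed over the whole denoising run under the schedule."""
    return sum(
        token_count(height, width, patch_size_for_step(s, num_steps, sizes))
        for s in range(num_steps)
    )


schedule = [patch_size_for_step(s, 6) for s in range(6)]
dynamic = total_tokens(64, 64, 6)
fixed = total_tokens(64, 64, 6, sizes=(1,))
print(schedule, dynamic, fixed)
```

Token counts are only a proxy for cost. Attention scales faster than linearly in the number of tokens, so this sketch does not predict the speedups reported in the abstract.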

Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations

Jan 31, 2025

Dahye Kim, Deepti Ghadiyaram

Abstract: Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style). Through extensive experiments, we demonstrate that our approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim5}$x faster than current state-of-the-art.

* 15 pages, 16 figures
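
Two building blocks named in the abstract above, the k-sparse activation and steering along a concept direction, can be sketched in a few lines. This is an illustration only: the additive steering rule, the `strength` parameter, and the function names are assumptions, not the paper's formulation.

```python
def top_k_sparsify(latent, k):
    """Keep the k largest activations and zero out the rest (k-sparse constraint)."""
    ranked = sorted(range(len(latent)), key=lambda i: latent[i], reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(latent)]


def steer(embedding, concept_direction, strength):
    """Shift an embedding along a concept direction.

    A negative strength moves away from the concept, a positive one towards it.
    """
    return [e + strength * d for e, d in zip(embedding, concept_direction)]


sparse = top_k_sparsify([0.1, 2.0, -1.0, 0.7], k=2)
steered = steer([1.0, 2.0], [0.5, -1.0], strength=-2.0)
print(sparse, steered)
```

In a full pipeline the concept direction would come from a trained autoencoder's decoder weights for the chosen feature. Here it is just a hand-written vector.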

$\textit{Revelio}$: Interpreting and leveraging semantic information in diffusion models

Nov 23, 2024

Dahye Kim, Xavier Thomas, Deepti Ghadiyaram

Abstract: We study $\textit{how}$ rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAE). We substantiate our mechanistic interpretations via transfer learning using lightweight classifiers on off-the-shelf diffusion models' features. On $4$ datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impact visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening interpretability of black-box diffusion models. Code and visualizations are available at: https://github.com/revelio-diffusion/revelio

* 14 pages, 14 figures

Language-free Training for Zero-shot Video Grounding

Oct 24, 2022

Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn

Abstract: Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is the extremely time- and cost-consuming collection of annotations, including video captions in natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network with only video data without any annotation. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without requiring fake (pseudo) text queries to be generated in natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct answer and considering the visual feature selected by our method in the interval as a language feature, with the help of the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the prominence of our language-free training framework, outperforming the existing zero-shot video grounding method and even several weakly-supervised approaches by large margins on two standard datasets.

* Accepted to WACV 2023
