Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nojun Kwak

Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization

Mar 18, 2025

Dongkwan Lee, Kyomin Hwang, Nojun Kwak

Abstract:We address the problem of semi-supervised domain generalization (SSDG), where the distributions of train and test data differ, and only a small amount of labeled data along with a larger amount of unlabeled data are available during training. Existing SSDG methods that leverage only the unlabeled samples for which the model's predictions are highly confident (confident-unlabeled samples), limit the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in SSDG setting. To this end, we propose UPCSC to utilize these unconfident-unlabeled samples in SSDG that consists of two modules: 1) Unlabeled Proxy-based Contrastive learning (UPC) module, treating unconfident-unlabeled samples as additional negative pairs and 2) Surrogate Class learning (SC) module, generating positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play and do not require any domain labels, which can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability and mitigates domain gaps. The code is available at https://github.com/dongkwani/UPCSC.

* CVPR 2025

Via

Access Paper or Ask Questions

DivCon-NeRF: Generating Augmented Rays with Diversity and Consistency for Few-shot View Synthesis

Mar 17, 2025

Ingyun Lee, Jae Won Jang, Seunghyeon Seo, Nojun Kwak

Figure 1 for DivCon-NeRF: Generating Augmented Rays with Diversity and Consistency for Few-shot View Synthesis

Figure 2 for DivCon-NeRF: Generating Augmented Rays with Diversity and Consistency for Few-shot View Synthesis

Figure 3 for DivCon-NeRF: Generating Augmented Rays with Diversity and Consistency for Few-shot View Synthesis

Figure 4 for DivCon-NeRF: Generating Augmented Rays with Diversity and Consistency for Few-shot View Synthesis

Abstract:Neural Radiance Field (NeRF) has shown remarkable performance in novel view synthesis but requires many multiview images, making it impractical for few-shot scenarios. Ray augmentation was proposed to prevent overfitting for sparse training data by generating additional rays. However, existing methods, which generate augmented rays only near the original rays, produce severe floaters and appearance distortion due to limited viewpoints and inconsistent rays obstructed by nearby obstacles and complex surfaces. To address these problems, we propose DivCon-NeRF, which significantly enhances both diversity and consistency. It employs surface-sphere augmentation, which preserves the distance between the original camera and the predicted surface point. This allows the model to compare the order of high-probability surface points and filter out inconsistent rays easily without requiring the exact depth. By introducing inner-sphere augmentation, DivCon-NeRF randomizes angles and distances for diverse viewpoints, further increasing diversity. Consequently, our method significantly reduces floaters and visual distortions, achieving state-of-the-art performance on the Blender, LLFF, and DTU datasets. Our code will be publicly available.

* 11 pages, 6 figures

Via

Access Paper or Ask Questions

ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Mar 13, 2025

Yeonjin Chang, Erqun Dong, Seunghyeon Seo, Nojun Kwak, Kwang Moo Yi

Figure 1 for ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Figure 2 for ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Figure 3 for ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Figure 4 for ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Abstract:While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) being object-centric by pruning irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we analyze the local structure of primitives using K-nearest neighbors, and retain only relevant ones. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.

* Project page: https://yeonjin-chang.github.io/ROODI/

Via

Access Paper or Ask Questions

Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings

Nov 26, 2024

Jinyung Hong, Yearim Kim, Keun Hee Park, Sangyu Han, Nojun Kwak, Theodore P. Pavlic

Figure 1 for Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings

Abstract:Inner interpretability is a promising field focused on uncovering the internal mechanisms of AI systems and developing scalable, automated methods to understand these systems at a mechanistic level. While significant research has explored top-down approaches starting from high-level problems or algorithmic hypotheses and bottom-up approaches building higher-level abstractions from low-level or circuit-level descriptions, most efforts have concentrated on analyzing large language models. Moreover, limited attention has been given to applying inner interpretability to large-scale image tasks, primarily focusing on architectural and functional levels to visualize learned concepts. In this paper, we first present a conceptual framework that supports inner interpretability and multilevel analysis for large-scale image classification tasks. We introduce the Bi-directional Interaction between Concept and Input Embeddings (Bi-ICE) module, which facilitates interpretability across the computational, algorithmic, and implementation levels. This module enhances transparency by generating predictions based on human-understandable concepts, quantifying their contributions, and localizing them within the inputs. Finally, we showcase enhanced transparency in image classification, measuring concept contributions and pinpointing their locations within the inputs. Our approach highlights algorithmic interpretability by demonstrating the process of concept learning and its convergence.

* The first two authors equally contributed to this work, 27 pages, 19 figures, 9 tables

Via

Access Paper or Ask Questions

Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

Sep 03, 2024

Yearim Kim, Sangyu Han, Sangbum Han, Nojun Kwak

Figure 1 for Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

Figure 2 for Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

Figure 3 for Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

Figure 4 for Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

Abstract:In the field of eXplainable AI (XAI) in language models, the progression from local explanations of individual decisions to global explanations with high-level concepts has laid the groundwork for mechanistic interpretability, which aims to decode the exact operations. However, this paradigm has not been adequately explored in image models, where existing methods have primarily focused on class-specific interpretations. This paper introduces a novel approach to systematically trace the entire pathway from input through all intermediate layers to the final output within the whole dataset. We utilize Pointwise Feature Vectors (PFVs) and Effective Receptive Fields (ERFs) to decompose model embeddings into interpretable Concept Vectors. Then, we calculate the relevance between concept vectors with our Generalized Integrated Gradients (GIG), enabling a comprehensive, dataset-wide analysis of model behavior. We validate our method of concept extraction and concept attribution in both qualitative and quantitative evaluations. Our approach advances the understanding of semantic significance within image models, offering a holistic view of their operational mechanics.

Via

Access Paper or Ask Questions

MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

Jul 17, 2024

Donghoon Han, Eunhwan Park, Gisang Lee, Adam Lee, Nojun Kwak

Figure 1 for MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

Figure 2 for MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

Figure 3 for MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

Figure 4 for MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

Abstract:The rapid expansion of multimedia content has made accurately retrieving relevant videos from large collections increasingly challenging. Recent advancements in text-video retrieval have focused on cross-modal interactions, large-scale foundation model training, and probabilistic modeling, yet often neglect the crucial user perspective, leading to discrepancies between user queries and the content retrieved. To address this, we introduce MERLIN (Multimodal Embedding Refinement via LLM-based Iterative Navigation), a novel, training-free pipeline that leverages Large Language Models (LLMs) for iterative feedback learning. MERLIN refines query embeddings from a user perspective, enhancing alignment between queries and video content through a dynamic question answering process. Experimental results on datasets like MSR-VTT, MSVD, and ActivityNet demonstrate that MERLIN substantially improves Recall@1, outperforming existing systems and confirming the benefits of integrating LLMs into multimodal retrieval systems for more responsive and context-aware multimedia retrieval.

* Work in progress

Via

Access Paper or Ask Questions

Practical Dataset Distillation Based on Deep Support Vectors

May 01, 2024

Hyunho Lee, Junhoo Lee, Nojun Kwak

Figure 1 for Practical Dataset Distillation Based on Deep Support Vectors

Figure 2 for Practical Dataset Distillation Based on Deep Support Vectors

Figure 3 for Practical Dataset Distillation Based on Deep Support Vectors

Figure 4 for Practical Dataset Distillation Based on Deep Support Vectors

Abstract:Conventional dataset distillation requires significant computational resources and assumes access to the entire dataset, an assumption impractical as it presumes all data resides on a central server. In this paper, we focus on dataset distillation in practical scenarios with access to only a fraction of the entire dataset. We introduce a novel distillation method that augments the conventional process by incorporating general model knowledge via the addition of Deep KKT (DKKT) loss. In practical settings, our approach showed improved performance compared to the baseline distribution matching distillation method on the CIFAR-10 dataset. Additionally, we present experimental evidence that Deep Support Vectors (DSVs) offer unique information to the original distillation, and their integration results in enhanced performance.

Via

Access Paper or Ask Questions

Do not think pink elephant!

Apr 22, 2024

Kyomin Hwang, Suyoung Kim, JunHoo Lee, Nojun Kwak

Abstract:Large Models (LMs) have heightened expectations for the potential of general AI as they are akin to human intelligence. This paper shows that recent large models such as Stable Diffusion and DALL-E3 also share the vulnerability of human intelligence, namely the "white bear phenomenon". We investigate the causes of the white bear phenomenon by analyzing their representation space. Based on this analysis, we propose a simple prompt-based attack method, which generates figures prohibited by the LM provider's policy. To counter these attacks, we introduce prompt-based defense strategies inspired by cognitive therapy techniques, successfully mitigating attacks by up to 48.22\%.

* This paper is accepted in CVPRW

Via

Access Paper or Ask Questions

Coreset Selection for Object Detection

Apr 14, 2024

Hojun Lee, Suyoung Kim, Junhoo Lee, Jaeyoung Yoo, Nojun Kwak

Figure 1 for Coreset Selection for Object Detection

Figure 2 for Coreset Selection for Object Detection

Figure 3 for Coreset Selection for Object Detection

Figure 4 for Coreset Selection for Object Detection

Abstract:Coreset selection is a method for selecting a small, representative subset of an entire dataset. It has been primarily researched in image classification, assuming there is only one object per image. However, coreset selection for object detection is more challenging as an image can contain multiple objects. As a result, much research has yet to be done on this topic. Therefore, we introduce a new approach, Coreset Selection for Object Detection (CSOD). CSOD generates imagewise and classwise representative feature vectors for multiple objects of the same class within each image. Subsequently, we adopt submodular optimization for considering both representativeness and diversity and utilize the representative vectors in the submodular optimization process to select a subset. When we evaluated CSOD on the Pascal VOC dataset, CSOD outperformed random selection by +6.4%p in AP$_{50}$ when selecting 200 images.

* Accepted by CVPR 2024: 1st Workshop on Dataset Distillation for Computer Vision

Via

Access Paper or Ask Questions

Unleash the Potential of CLIP for Video Highlight Detection

Apr 02, 2024

Donghoon Han, Seunghyeon Seo, Eunhwan Park, Seong-Uk Nam, Nojun Kwak

Figure 1 for Unleash the Potential of CLIP for Video Highlight Detection

Figure 2 for Unleash the Potential of CLIP for Video Highlight Detection

Figure 3 for Unleash the Potential of CLIP for Video Highlight Detection

Abstract:Multimodal and large language models (LLMs) have revolutionized the utilization of open-world knowledge, unlocking novel potentials across various tasks and applications. Among these domains, the video domain has notably benefited from their capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel in the video highlight detection task by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our innovative saliency pooling technique, we have achieved the state-of-the-art performance in the highlight detection task, the QVHighlight Benchmark, to the best of our knowledge.

Via

Access Paper or Ask Questions