Jinkyu Kim

Bridging the Domain Gap by Clustering-based Image-Text Graph Matching

Oct 04, 2023
Nokyung Park, Daewon Chae, Jeongyong Shim, Sangpil Kim, Eun-Sol Kim, Jinkyu Kim

Learning domain-invariant representations is important for training a model that generalizes well to unseen target task domains. Text descriptions inherently contain the semantic structure of concepts, and such auxiliary semantic cues can serve as effective pivot embeddings for domain generalization. Here, we use multimodal graph representations that fuse images and text to obtain domain-invariant pivot embeddings, taking into account the inherent semantic structure between local image and text descriptors. Specifically, we aim to learn domain-invariant features by (i) representing images and text descriptions with graphs and (ii) simultaneously clustering the graph-based image node features and matching them to the textual graph. We experiment with large-scale public datasets, such as CUB-DG and DomainBed, where our model matches or surpasses state-of-the-art performance. Our code will be publicly available upon publication.
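
To make the clustering-and-matching idea concrete, below is a minimal sketch (not the authors' implementation) of how local image patch features might be clustered and soft-matched to text node embeddings by cosine similarity; all tensor names, shapes, and the naive k-means routine are illustrative assumptions.

```python
# Minimal sketch: cluster local image patch features, then soft-match the
# cluster centroids to text node embeddings via cosine similarity.
import torch
import torch.nn.functional as F

def cluster_patches(patch_feats, k=8, iters=10):
    """Naive k-means over patch features (N, D) -> centroids (k, D)."""
    idx = torch.randperm(patch_feats.size(0))[:k]
    centroids = patch_feats[idx].clone()
    for _ in range(iters):
        assign = torch.cdist(patch_feats, centroids).argmin(dim=1)  # (N,)
        for c in range(k):
            members = patch_feats[assign == c]
            if members.numel() > 0:
                centroids[c] = members.mean(dim=0)
    return centroids

def match_image_to_text(centroids, text_nodes):
    """Soft-match image clusters to text graph nodes by cosine similarity."""
    sim = F.normalize(centroids, dim=-1) @ F.normalize(text_nodes, dim=-1).T
    return sim.softmax(dim=-1)  # (k, num_text_nodes) soft assignment

patch_feats = torch.randn(196, 512)  # e.g., ViT patch embeddings (assumed)
text_nodes = torch.randn(12, 512)    # e.g., per-word text embeddings (assumed)
assignments = match_image_to_text(cluster_patches(patch_feats), text_nodes)
print(assignments.shape)             # torch.Size([8, 12])
```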

The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

Sep 08, 2023
Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon, Sangpil Kim, Jinkyu Kim

In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, audio-to-video generation has received little consideration, even though audio carries unique qualities such as temporal semantics and magnitude. Hence, we propose The Power of Sound (TPoS), a model that incorporates audio input with both time-varying semantics and magnitude. To generate video frames, TPoS utilizes a latent stable diffusion model with textual semantic information, which is then guided by the sequential audio embedding from our pretrained Audio Encoder. As a result, the method produces audio-reactive video content. We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in audio-to-video generation. More examples are available at https://ku-vai.github.io/TPoS/
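
The following is a minimal, hypothetical sketch of the kind of sequential audio encoder the abstract describes: a recurrent network mapping a mel-spectrogram to per-frame conditioning vectors in a text-embedding-sized space. It is not the TPoS code; the module name, dimensions, and the GRU choice are assumptions.

```python
# Sketch of a sequential audio encoder that could condition a latent
# diffusion model frame by frame (illustrative, not the authors' code).
import torch
import torch.nn as nn

class SequentialAudioEncoder(nn.Module):
    """Maps a mel-spectrogram (B, T, n_mels) to per-frame conditioning
    vectors; the output dimension matching a text embedding is an assumption."""
    def __init__(self, n_mels=80, hidden=512, cond_dim=768):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, cond_dim)

    def forward(self, mel):
        h, _ = self.rnn(mel)   # (B, T, hidden), preserves temporal order
        return self.proj(h)    # (B, T, cond_dim) sequential embedding

enc = SequentialAudioEncoder()
mel = torch.randn(1, 64, 80)   # 64 audio frames (illustrative)
cond = enc(mel)                # each frame could guide one video frame
print(cond.shape)              # torch.Size([1, 64, 768])
```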

* ICCV2023 

Soundini: Sound-Guided Diffusion for Natural Video Editing

Apr 13, 2023
Seung Hyun Lee, Sieun Kim, Innfarn Yoo, Feng Yang, Donghyeon Cho, Youngseo Kim, Huiwen Chang, Jinkyu Kim, Sangpil Kim

We propose a method for adding sound-guided visual effects to specific regions of videos in a zero-shot setting. Animating the appearance of a visual effect is challenging because each frame of the edited video should exhibit visual changes while maintaining temporal consistency. Moreover, existing video editing solutions focus on temporal consistency across frames while ignoring visual style variations over time, e.g., a thunderstorm, waves, or crackling fire. To overcome this limitation, we utilize temporal sound features to drive the dynamic style. Specifically, we guide denoising diffusion probabilistic models with an audio latent representation in the audio-visual latent space. To the best of our knowledge, our work is the first to explore sound-guided natural video editing from various sound sources with sound-specific properties, such as intensity, timbre, and volume. Additionally, we design optical flow-based guidance to generate temporally consistent video frames, capturing the pixel-wise relationship between adjacent frames. Experimental results show that our method outperforms existing video editing techniques, producing more realistic visual effects that reflect the properties of sound. Please visit our page: https://kuai-lab.github.io/soundini-gallery/.
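
As an illustration of the optical flow-based guidance mentioned above, here is a small sketch (assumptions, not the paper's implementation) of a temporal consistency term that warps the previous frame toward the current one with a given flow field and penalizes the remaining difference.

```python
# Optical-flow-based temporal consistency term (generic illustration).
import torch
import torch.nn.functional as F

def warp_with_flow(img, flow):
    """img: (B, C, H, W); flow: (B, 2, H, W) in pixel offsets (x, y)."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(img)  # (1,2,H,W)
    new = grid + flow
    # normalize sample locations to [-1, 1] for grid_sample
    new_x = 2.0 * new[:, 0] / (W - 1) - 1.0
    new_y = 2.0 * new[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((new_x, new_y), dim=-1)                 # (B,H,W,2)
    return F.grid_sample(img, sample_grid, align_corners=True)

def temporal_consistency_loss(frame_prev, frame_cur, flow_prev_to_cur):
    warped = warp_with_flow(frame_prev, flow_prev_to_cur)
    return F.l1_loss(warped, frame_cur)

prev = torch.rand(1, 3, 64, 64)
cur = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # zero flow: pure photometric difference
print(float(temporal_consistency_loss(prev, cur, flow)))
```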

FPANet: Frequency-based Video Demoireing using Frame-level Post Alignment

Jan 18, 2023
Gyeongrok Oh, Heon Gu, Sangpil Kim, Jinkyu Kim

Interference between overlapping grid patterns creates moire patterns, degrading the visual quality of images of a digital display screen captured with an ordinary digital camera. Removing such moire patterns is challenging due to their complex structure, diverse sizes, and color distortions. Existing approaches mainly focus on filtering in the spatial domain and fail to remove large-scale moire patterns. In this paper, we propose a novel model called FPANet that learns filters in both the frequency and spatial domains, improving restoration quality by removing moire patterns of various sizes. To further improve quality, our model takes multiple consecutive frames, learning to extract frame-invariant content features and output temporally consistent images of higher quality. We demonstrate the effectiveness of our proposed method on a publicly available large-scale dataset, observing that it outperforms state-of-the-art approaches, including ESDNet, VDmoire, MBCNN, WDNet, UNet, and DMCNN, in terms of image and video quality metrics such as PSNR, SSIM, LPIPS, FVD, and FSIM.
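
The frequency-domain filtering idea can be sketched as follows; this is a generic illustration rather than FPANet itself, and the layer name, shapes, and learnable complex mask are assumptions.

```python
# Learnable filtering in the frequency domain, where large-scale periodic
# (moire-like) patterns can be attenuated globally (illustrative sketch).
import torch
import torch.nn as nn

class FrequencyFilterBlock(nn.Module):
    """Applies a learnable complex-valued mask to the 2D FFT of a feature map."""
    def __init__(self, channels, height, width):
        super().__init__()
        # one learnable complex weight per channel/frequency bin
        self.weight = nn.Parameter(torch.randn(channels, height, width, 2) * 0.02)

    def forward(self, x):                       # x: (B, C, H, W), real-valued
        freq = torch.fft.fft2(x, norm="ortho")  # complex (B, C, H, W)
        w = torch.view_as_complex(self.weight)  # (C, H, W) complex mask
        return torch.fft.ifft2(freq * w, norm="ortho").real

block = FrequencyFilterBlock(channels=16, height=64, width=64)
out = block(torch.randn(2, 16, 64, 64))
print(out.shape)  # torch.Size([2, 16, 64, 64])
```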

Judge, Localize, and Edit: Ensuring Visual Commonsense Morality for Text-to-Image Generation

Dec 09, 2022
Seongbeom Park, Suhong Moon, Jinkyu Kim

Text-to-image generation methods produce high-resolution, high-quality images, but they should not produce immoral images that may contain content inappropriate from a commonsense morality perspective. Conventional approaches often neglect these ethical concerns, and existing solutions are limited in avoiding immoral image generation. In this paper, we aim to automatically judge the immorality of synthesized images and manipulate them into moral alternatives. To this end, we build a model with three main primitives: (1) it recognizes the visual commonsense immorality of a given image, (2) it localizes or highlights the immoral visual (and textual) attributes that make the image immoral, and (3) it manipulates a given immoral image into a morally qualifying alternative. We experiment with the state-of-the-art Stable Diffusion text-to-image generation model and show the effectiveness of our ethical image manipulation. Our human study confirms that our method can indeed generate morally satisfying images from immoral ones. Our implementation will be publicly available upon publication, so that it can be widely used as a new safety checker for text-to-image generation models.
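
A schematic, hypothetical sketch of the judge-localize-edit control flow described above is given below; the functions, embeddings, and the "immoral prototype" direction are stand-ins with illustrative logic, not the authors' model.

```python
# Judge -> localize -> edit control flow (toy embeddings, illustrative only).
import torch
import torch.nn.functional as F

def judge_immorality(img_embed, immoral_prototype):
    """Return an immorality score in (0, 1) from embedding similarity."""
    return torch.sigmoid(F.cosine_similarity(img_embed, immoral_prototype, dim=0))

def localize(patch_embeds, immoral_prototype):
    """Highlight patches most similar to the immoral prototype."""
    sim = F.normalize(patch_embeds, dim=-1) @ F.normalize(immoral_prototype, dim=0)
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)  # saliency in [0, 1]

def edit_if_immoral(img_embed, patch_embeds, immoral_prototype, threshold=0.5):
    score = judge_immorality(img_embed, immoral_prototype)
    if score > threshold:
        mask = localize(patch_embeds, immoral_prototype)
        return {"action": "edit", "score": float(score), "mask": mask}
    return {"action": "keep", "score": float(score), "mask": None}

img_embed = torch.randn(512)          # pooled image embedding (assumed)
patch_embeds = torch.randn(196, 512)  # per-patch embeddings (assumed)
prototype = torch.randn(512)          # learned "immoral" direction (assumed)
print(edit_if_immoral(img_embed, patch_embeds, prototype)["action"])
```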

LISA: Localized Image Stylization with Audio via Implicit Neural Representation

Nov 21, 2022
Seung Hyun Lee, Chanyoung Kim, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim

We present a novel framework, Localized Image Stylization with Audio (LISA), which performs audio-driven localized image stylization. Sound often provides information about the specific context of a scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a particular part of the image based on audio input is natural but challenging. In this work, we propose a framework in which a user provides an audio input that is used both to localize the sound source in the input image and to locally stylize the target object or scene. LISA first produces a delicate localization map with an audio-visual localization network by leveraging the CLIP embedding space. We then utilize an implicit neural representation (INR), along with the predicted localization map, to stylize the target object or scene based on the sound information. The proposed INR manipulates the localized pixel values to be semantically consistent with the provided audio input. Through a series of experiments, we show that the proposed framework outperforms other audio-guided stylization methods. Moreover, LISA constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
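
Below is a minimal sketch (hypothetical, not LISA's code) of how an implicit neural representation over pixel coordinates could be blended into an image only where a localization map is active; the network, image, and mask are placeholders.

```python
# Coordinate MLP as an implicit neural representation, applied through a mask.
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Maps normalized (x, y) coordinates to an RGB offset."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh(),
        )

    def forward(self, coords):  # (N, 2) in [-1, 1]
        return self.net(coords)  # (N, 3)

H = W = 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                        indexing="ij")
coords = torch.stack((xs, ys), dim=-1).reshape(-1, 2)

inr = CoordMLP()
style_offset = inr(coords).reshape(H, W, 3)  # stylization field
image = torch.rand(H, W, 3)                  # input image (placeholder)
mask = torch.rand(H, W, 1)                   # localization map (placeholder)
stylized = image + mask * style_offset       # only the masked region changes
```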

Zero-shot Visual Commonsense Immorality Prediction

Nov 10, 2022
Yujin Jeong, Seongbeom Park, Suhong Moon, Jinkyu Kim

Artificial intelligence currently powers diverse real-world applications. These applications have shown promising performance, but they raise complicated ethical issues, i.e., how to embed ethics so that AI applications behave morally. One path toward moral AI systems is to imitate human prosocial behavior and encourage some form of good behavior in systems. However, learning such normative ethics (especially from images) is challenging, mainly due to a lack of data and labeling complexity. Here, we propose a model that predicts visual commonsense immorality in a zero-shot manner. We train our model on the ETHICS dataset (pairs of text and morality annotations) via a CLIP-based image-text joint embedding. At test time, the immorality of an unseen image is predicted. We evaluate our model on existing moral/immoral image datasets and show fair prediction performance consistent with human intuition. Furthermore, we create a visual commonsense immorality benchmark with more general and extensive immoral visual content. Code and the dataset are available at https://github.com/ku-vai/Zero-shot-Visual-Commonsense-Immorality-Prediction. Note that this paper may contain images and descriptions that are offensive in nature.
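
A minimal sketch of the zero-shot idea follows (assumptions, not the released code): a small immorality classifier is trained on CLIP text embeddings of ETHICS-style sentences and then applied unchanged to CLIP image embeddings at test time, relying on the shared image-text embedding space. The embeddings below are random placeholders.

```python
# Train on text embeddings, test on image embeddings (zero-shot transfer sketch).
import torch
import torch.nn as nn

embed_dim = 512  # CLIP ViT-B/32 embedding size
classifier = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                           nn.Linear(128, 1))

# --- training phase (text only; embeddings here are random placeholders) ---
text_embeds = torch.randn(256, embed_dim)      # CLIP text embeddings (assumed)
labels = torch.randint(0, 2, (256, 1)).float() # morality annotations (assumed)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(10):
    opt.zero_grad()
    loss = loss_fn(classifier(text_embeds), labels)
    loss.backward()
    opt.step()

# --- zero-shot test phase (image embeddings, never seen during training) ---
image_embed = torch.randn(1, embed_dim)        # CLIP image embedding (assumed)
immorality_prob = torch.sigmoid(classifier(image_embed))
print(float(immorality_prob))
```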

* BMVC2022 

Resolving Class Imbalance for LiDAR-based Object Detector by Dynamic Weight Average and Contextual Ground Truth Sampling

Oct 07, 2022
Daeun Lee, Jongwon Park, Jinkyu Kim

An autonomous driving system requires a 3D object detector that reliably perceives all road agents present in order to navigate an environment safely. However, real-world driving datasets often suffer from data imbalance, which makes it difficult to train a model that works well across all classes and results in undesirably imbalanced, sub-optimal performance. In this work, we propose a method to address this data imbalance problem. Our method consists of two main components: (i) a LiDAR-based 3D object detector with per-class detection heads, where the loss from each head is re-weighted by dynamic weight average to keep training balanced, and (ii) contextual ground truth (GT) sampling, where we improve conventional GT sampling techniques by leveraging semantic information to augment the point cloud with sampled GT objects. Our experiments on the KITTI and nuScenes datasets confirm the effectiveness of our proposed method in dealing with the data imbalance problem, producing better detection accuracy than existing approaches.
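
For component (i), a minimal sketch of dynamic weight average (DWA) for balancing per-class head losses is shown below; the temperature, loss history handling, and class names are illustrative assumptions rather than the paper's exact configuration.

```python
# Dynamic weight average: classes whose loss decreases slowly get larger weights.
import math

def dynamic_weight_average(loss_history, temperature=2.0):
    """loss_history: {class_name: [loss_t-2, loss_t-1]} -> {class_name: weight}."""
    ratios = {k: v[-1] / (v[-2] + 1e-8) for k, v in loss_history.items()}
    exps = {k: math.exp(r / temperature) for k, r in ratios.items()}
    norm = sum(exps.values())
    num_classes = len(loss_history)
    return {k: num_classes * e / norm for k, e in exps.items()}

history = {"car": [0.90, 0.60], "pedestrian": [0.80, 0.78], "cyclist": [0.70, 0.69]}
weights = dynamic_weight_average(history)
print(weights)  # multiply each head's loss by its weight; slow learners get > 1
```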

* 10 pages 

Robust Sound-Guided Image Manipulation

Aug 31, 2022
Seung Hyun Lee, Chanyoung Kim, Wonmin Byeon, Gyeongrok Oh, Jooyoung Lee, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim

Recent successes suggest that an image can be manipulated by a text prompt, e.g., a landscape scene on a sunny day is manipulated into the same scene on a rainy day, driven by the text input "raining". These approaches often utilize a StyleCLIP-based image generator, which leverages a multi-modal (text and image) embedding space. However, we observe that such text inputs are often a bottleneck for providing and synthesizing rich semantic cues, e.g., differentiating heavy rain from rain with thunderstorms. To address this issue, we advocate leveraging an additional modality, sound, which has notable advantages for image manipulation as it can convey more diverse semantic cues (vivid emotions or dynamic expressions of the natural world) than text. In this paper, we propose a novel approach that first extends the image-text joint embedding space with sound and then applies a direct latent optimization method to manipulate a given image based on audio input, e.g., the sound of rain. Our extensive experiments show that our sound-guided image manipulation approach produces semantically and visually more plausible manipulation results than state-of-the-art text- and sound-guided image manipulation methods, which is further confirmed by our human evaluations. Our downstream task evaluations also show that the learned image-text-sound joint embedding space effectively encodes sound inputs.
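
The direct latent optimization step can be sketched as follows (placeholders, not the paper's code): a latent code is nudged so the embedding of the generated output moves toward an audio embedding in a shared space; the generator and encoder here are toy stand-ins.

```python
# Direct latent optimization toward an audio embedding (toy stand-ins).
import torch
import torch.nn.functional as F

generator = torch.nn.Sequential(torch.nn.Linear(128, 512), torch.nn.Tanh())
image_encoder = torch.nn.Linear(512, 256)
audio_embed = F.normalize(torch.randn(1, 256), dim=-1)  # "sound of rain" (assumed)

latent = torch.randn(1, 128, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.05)

for step in range(50):
    opt.zero_grad()
    fake_image_feat = generator(latent)  # proxy for a generated image
    img_embed = F.normalize(image_encoder(fake_image_feat), dim=-1)
    # pull the generated embedding toward the audio embedding
    loss = 1.0 - F.cosine_similarity(img_embed, audio_embed).mean()
    loss.backward()
    opt.step()
```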

* arXiv admin note: text overlap with arXiv:2112.00007 

Grounding Visual Representations with Texts for Domain Generalization

Jul 21, 2022
Seonwoo Min, Nokyung Park, Siwon Kim, Seunghyun Park, Jinkyu Kim

Reducing the representational discrepancy between source and target domains is key to maximizing model generalization. In this work, we advocate leveraging natural language supervision for the domain generalization task. We introduce two modules to ground visual representations with texts containing typical human reasoning: (1) a Visual and Textual Joint Embedder and (2) a Textual Explanation Generator. The former learns an image-text joint embedding space in which high-level class-discriminative information can be grounded into the model. The latter leverages an explainable model and generates explanations justifying the rationale behind its decisions. To the best of our knowledge, this is the first work to leverage a vision-and-language cross-modality approach for the domain generalization task. Our experiments with the newly created CUB-DG benchmark dataset demonstrate that cross-modality supervision can successfully ground domain-invariant visual representations and improve model generalization. Furthermore, on the large-scale DomainBed benchmark, our proposed method achieves state-of-the-art results and ranks first in average performance across five multi-domain datasets. The dataset and code are available at https://github.com/mswzeus/GVRT.
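
A minimal sketch of a joint image-text embedder with a grounding loss appears below; it is a generic InfoNCE-style alignment under assumed feature dimensions, not the GVRT release.

```python
# Ground image features in a joint image-text space with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feat, txt_feat):
        return (F.normalize(self.img_proj(img_feat), dim=-1),
                F.normalize(self.txt_proj(txt_feat), dim=-1))

def grounding_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over matched image/text pairs."""
    logits = img_emb @ txt_emb.T / temperature
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

model = JointEmbedder()
img = torch.randn(8, 2048)  # e.g., ResNet-50 features (assumed)
txt = torch.randn(8, 768)   # e.g., text-encoder features (assumed)
loss = grounding_loss(*model(img, txt))
```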

* 25 pages (including Supplementary Materials), ECCV 2022 camera ready version 