Kyungmin Lee

Collaborative Score Distillation for Consistent Visual Synthesis

Jul 04, 2023
Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin

Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on Stein Variational Gradient Descent (SVGD): we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates the seamless integration of information across 2D images, leading to consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
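
The SVGD-style update underlying CSD can be illustrated with a toy sketch: a batch of samples is treated as particles, and each particle is moved by a kernel-weighted combination of all particles' score functions. Below is a minimal, self-contained NumPy illustration in which a known Gaussian score stands in for the diffusion model's score network; the RBF kernel, step size, and toy target are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(x, bandwidth=1.0):
    # Pairwise RBF kernel values and gradients: grad_k[a, b] = d k(x_a, x_b) / d x_a.
    diff = x[:, None, :] - x[None, :, :]              # (n, n, d)
    k = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))
    grad_k = -diff / bandwidth ** 2 * k[..., None]    # (n, n, d)
    return k, grad_k

def svgd_step(particles, score_fn, step_size=0.5, bandwidth=1.0):
    # One SVGD update: each particle follows a kernel-weighted average of all
    # particles' scores (attraction) plus a repulsive kernel-gradient term.
    n = particles.shape[0]
    k, grad_k = rbf_kernel(particles, bandwidth)
    scores = score_fn(particles)                      # (n, d)
    repulsion = grad_k.sum(axis=0)                    # sum_j d k(x_j, x_i) / d x_j
    return particles + step_size * (k @ scores + repulsion) / n

# Toy run: a known Gaussian score (-x) stands in for a diffusion model's score network.
rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, size=(64, 2))
for _ in range(1000):
    particles = svgd_step(particles, score_fn=lambda p: -p)
print(particles.mean(axis=0))  # drifts toward the origin while the particles stay spread out
```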

* Project page with visuals: https://subin-kim-cv.github.io/CSD/ 

S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions

May 23, 2023
Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin

Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and adapting to such domains is challenging due to the limited number of image-text pairs available for training. To address this, we propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images. S-CLIP employs two pseudo-labeling strategies specifically designed for contrastive learning and the language modality. The caption-level pseudo-label is given by a combination of captions of paired images, obtained by solving an optimal transport problem between unpaired and paired images. The keyword-level pseudo-label is given by a keyword in the caption of the nearest paired image, trained through partial label learning that assumes a candidate set of labels for supervision instead of the exact one. By combining these objectives, S-CLIP significantly enhances the training of CLIP using only a few image-text pairs, as demonstrated in various specialist domains, including remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark, matching the performance of supervised CLIP while using three times fewer image-text pairs.
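
As a rough illustration of the caption-level pseudo-labeling idea, i.e., softly assigning each unpaired image to the captions of paired images via optimal transport, the sketch below runs a plain Sinkhorn iteration on a cosine-distance cost between image embeddings. The embeddings, regularization strength, and uniform marginals are placeholder assumptions, not the authors' implementation.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=500):
    # Entropy-regularized optimal transport with uniform marginals (plain Sinkhorn).
    n, m = cost.shape
    K = np.exp(-cost / epsilon)                     # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]              # transport plan, shape (n, m)

def caption_pseudo_labels(unpaired_emb, paired_emb):
    # Cosine-distance cost between L2-normalized image embeddings.
    unpaired_emb = unpaired_emb / np.linalg.norm(unpaired_emb, axis=1, keepdims=True)
    paired_emb = paired_emb / np.linalg.norm(paired_emb, axis=1, keepdims=True)
    plan = sinkhorn(1.0 - unpaired_emb @ paired_emb.T)
    # Each row becomes a distribution over paired images (and hence their captions).
    return plan / plan.sum(axis=1, keepdims=True)

# Toy usage with random vectors standing in for a frozen image encoder's embeddings.
rng = np.random.default_rng(0)
pseudo = caption_pseudo_labels(rng.normal(size=(8, 512)), rng.normal(size=(32, 512)))
print(pseudo.shape, pseudo.sum(axis=1))             # (8, 32), rows sum to 1
```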


STUNT: Few-shot Tabular Learning with Self-generated Tasks from Unlabeled Tables

Mar 02, 2023
Jaehyun Nam, Jihoon Tack, Kyungmin Lee, Hankook Lee, Jinwoo Shin

Learning with few labeled tabular samples is often an essential requirement for industrial machine learning applications, as many varieties of tabular data suffer from high annotation costs or difficulties in collecting new samples for novel tasks. Despite its importance, this problem remains under-explored in the field of tabular learning, and existing few-shot learning schemes from other domains are not straightforward to apply, mainly due to the heterogeneous characteristics of tabular data. In this paper, we propose a simple yet effective framework for few-shot semi-supervised tabular learning, coined Self-generated Tasks from UNlabeled Tables (STUNT). Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label. We then employ a meta-learning scheme to learn generalizable knowledge with the constructed tasks. Moreover, we introduce an unsupervised validation scheme for hyperparameter search (and early stopping) by generating a pseudo-validation set using STUNT from unlabeled data. Our experimental results demonstrate that our simple framework brings significant performance gains on various tabular few-shot learning benchmarks, compared to prior semi- and self-supervised baselines. Code is available at https://github.com/jaehyun513/STUNT.
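
The task self-generation step described above (treat a randomly chosen column as the target label, then sample a few-shot episode from it) can be sketched as follows. The quantile-based discretization, episode sizes, and column-selection rule are simplifications for illustration; the linked repository contains the actual procedure.

```python
import numpy as np

def generate_task(table, n_classes=3, n_support=5, n_query=15, rng=None):
    # Self-generate one few-shot episode from an unlabeled numeric table (sketch).
    rng = rng or np.random.default_rng()
    n_rows, n_cols = table.shape

    # 1) Pick a random column and turn its values into pseudo-classes by quantile binning.
    target_col = rng.integers(n_cols)
    values = table[:, target_col]
    bins = np.quantile(values, np.linspace(0, 1, n_classes + 1)[1:-1])
    pseudo_labels = np.digitize(values, bins)

    # 2) The remaining columns form the input features.
    features = np.delete(table, target_col, axis=1)

    # 3) Sample a support/query split per pseudo-class.
    support, query = [], []
    for c in range(n_classes):
        idx = rng.permutation(np.where(pseudo_labels == c)[0])
        support += [(features[i], c) for i in idx[:n_support]]
        query += [(features[i], c) for i in idx[n_support:n_support + n_query]]
    return support, query

# Toy usage: a random numeric table standing in for real unlabeled tabular data.
rng = np.random.default_rng(0)
support, query = generate_task(rng.normal(size=(500, 10)), rng=rng)
print(len(support), len(query))  # 15 support and up to 45 query examples
```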

* ICLR 2023 (Spotlight) 

Explaining Visual Biases as Words by Generating Captions

Jan 26, 2023
Younghyun Kim, Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jaeho Lee, Jinwoo Shin

We aim to diagnose the potential biases in image classifiers. To this end, prior works manually labeled biased attributes or visualized biased features, which require high annotation costs or are often ambiguous to interpret. Instead, we leverage two types (generative and discriminative) of pre-trained vision-language models to describe the visual bias as a word. Specifically, we propose bias-to-text (B2T), which generates captions of the mispredicted images using a pre-trained captioning model to extract the common keywords that may describe visual biases. Then, we categorize the bias type as spurious correlation or majority bias by checking whether it is specific or agnostic to the class, based on the similarity between class-wise mispredicted images and the keyword in a pre-trained vision-language joint embedding space, e.g., CLIP. We demonstrate that the proposed simple and intuitive scheme can recover well-known gender and background biases, and discover novel ones in real-world datasets. Moreover, we utilize B2T to compare classifiers that use different architectures or training methods. Finally, we show that one can obtain debiased classifiers using the B2T bias keywords and CLIP, in both zero-shot and full-shot manners, without using any human annotation on the bias.
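
A rough sketch of the pipeline described above: caption the mispredicted images, count common keywords, and rank each keyword by a CLIP-style similarity against the mispredicted images. The `caption_image`, `clip_image`, and `clip_text` callables are hypothetical placeholders for pre-trained models, and the keyword extraction is reduced to word counting, so this is only a structural illustration.

```python
from collections import Counter
import numpy as np

STOPWORDS = {"a", "an", "the", "of", "in", "on", "with", "and", "is", "photo"}

def extract_bias_keywords(mispredicted_images, caption_image, clip_image, clip_text, top_k=10):
    # 1) Caption mispredicted images with a pre-trained captioning model.
    captions = [caption_image(img) for img in mispredicted_images]

    # 2) Extract common keywords by simple frequency counting over the captions.
    counts = Counter(
        word for cap in captions for word in cap.lower().split() if word not in STOPWORDS
    )
    keywords = [word for word, _ in counts.most_common(top_k)]

    # 3) Score each keyword by its similarity to the mispredicted images in a joint
    #    vision-language embedding space (higher score -> more likely a bias keyword).
    img_emb = np.stack([clip_image(img) for img in mispredicted_images])
    img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
    scores = {}
    for kw in keywords:
        txt = clip_text(f"a photo of a {kw}")
        txt = txt / np.linalg.norm(txt)
        scores[kw] = float((img_emb @ txt).mean())
    return sorted(scores.items(), key=lambda kv: -kv[1])
```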

* First two authors contributed equally 

GCISG: Guided Causal Invariant Learning for Improved Syn-to-real Generalization

Aug 22, 2022
Gilhyun Nam, Gyeongjae Choi, Kyungmin Lee

Training a deep learning model with artificially generated data can be an alternative when training data are scarce, yet it suffers from poor generalization performance due to a large domain gap. In this paper, we characterize the domain gap by using a causal framework for data generation. We assume that the real and synthetic data have common content variables but different style variables; thus, a model trained on a synthetic dataset may generalize poorly because it learns the nuisance style variables. To address this, we propose causal invariance learning, which encourages the model to learn a style-invariant representation that enhances syn-to-real generalization. Furthermore, we propose a simple yet effective feature distillation method that prevents catastrophic forgetting of the semantic knowledge of the real domain. In sum, we refer to our method as Guided Causal Invariant Syn-to-real Generalization (GCISG), which effectively improves syn-to-real generalization performance. We empirically verify the validity of the proposed methods; in particular, our method achieves state-of-the-art results on visual syn-to-real domain generalization tasks such as image classification and semantic segmentation.
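
The two objectives above, a causal invariance loss that aligns the representation of a synthetic image with a style-perturbed version of it, and a feature distillation loss against a teacher pre-trained on real images, can be sketched roughly in PyTorch as below. The encoder choices, the additive-noise "style" perturbation, the specific loss forms, and the weights are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gcisg_losses(student, teacher, syn_images, style_augment, w_inv=1.0, w_distill=1.0):
    # Causal invariance: the representation should not change under style perturbations.
    feat = student(syn_images)
    feat_aug = student(style_augment(syn_images))
    inv_loss = 1.0 - F.cosine_similarity(feat, feat_aug, dim=-1).mean()

    # Feature distillation: stay close to a frozen teacher pre-trained on real images,
    # which guards against catastrophic forgetting of real-domain semantics.
    with torch.no_grad():
        teacher_feat = teacher(syn_images)
    distill_loss = F.mse_loss(feat, teacher_feat)

    return w_inv * inv_loss + w_distill * distill_loss

# Toy usage with linear "encoders" standing in for real backbones.
student = torch.nn.Linear(3 * 32 * 32, 128)
teacher = torch.nn.Linear(3 * 32 * 32, 128).requires_grad_(False)
x = torch.randn(8, 3 * 32 * 32)
loss = gcisg_losses(student, teacher, x, style_augment=lambda t: t + 0.1 * torch.randn_like(t))
print(loss.item())
```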

* Accepted to ECCV 2022 

RényiCL: Contrastive Representation Learning with Skew Rényi Divergence

Aug 12, 2022
Kyungmin Lee, Jinwoo Shin

Contrastive representation learning seeks to acquire useful representations by estimating the shared information between multiple views of data. Here, the quality of learned representations is sensitive to the choice of data augmentation: the harder the applied augmentations, the more task-relevant information the views share, but also more task-irrelevant information that can hinder the generalization capability of the representations. Motivated by this, we present a new robust contrastive learning scheme, coined RényiCL, which can effectively manage harder augmentations by utilizing Rényi divergence. Our method is built upon the variational lower bound of Rényi divergence, but a naïve usage of a variational method is impractical due to the large variance. To tackle this challenge, we propose a novel contrastive objective that conducts variational estimation of a skew Rényi divergence and provide a theoretical guarantee on how variational estimation of the skew divergence leads to stable training. We show that the Rényi contrastive learning objective performs innate hard negative sampling and easy positive sampling simultaneously, so that it can selectively learn useful features and ignore nuisance features. Through experiments on ImageNet, we show that Rényi contrastive learning with stronger augmentations outperforms other self-supervised methods without extra regularization or computational overhead. Moreover, we also validate our method on other domains such as graph and tabular data, showing empirical gains over other contrastive methods.
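
For reference, the order-α Rényi divergence that the method builds on, and one common way of defining a skew variant by mixing the two distributions, are shown below; the exact skew-mixing convention used in the paper may differ from this sketch.

```latex
% Rényi divergence of order \alpha between distributions P and Q:
D_\alpha(P \,\|\, Q)
  = \frac{1}{\alpha - 1}
    \log \mathbb{E}_{x \sim Q}\!\left[\left(\frac{dP}{dQ}(x)\right)^{\alpha}\right],
  \qquad \alpha > 0,\ \alpha \neq 1 .

% A skewed variant compares P against a mixture of P and Q, which bounds the
% density ratio and helps stabilize variational estimation (convention assumed here):
D_\alpha^{(\gamma)}(P \,\|\, Q)
  = D_\alpha\!\left(P \,\middle\|\, \gamma P + (1 - \gamma)\, Q\right),
  \qquad \gamma \in [0, 1).
```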

* 28 pages 

Applying GPGPU to Recurrent Neural Network Language Model based Fast Network Search in the Real-Time LVCSR

Jul 23, 2020
Kyungmin Lee, Chiyoun Park, Ilhwan Kim, Namhoon Kim, Jaewon Lee

Recurrent Neural Network Language Models (RNNLMs) have started to be used in various fields of speech recognition due to their outstanding performance. However, the high computational complexity of RNNLMs has been a hurdle in applying them to real-time Large Vocabulary Continuous Speech Recognition (LVCSR). To accelerate RNNLM-based network searches during decoding, we apply General Purpose Graphics Processing Units (GPGPUs). This paper proposes a novel method of applying GPGPUs to RNNLM-based graph traversals. We achieve this by reducing redundant computations on CPUs and the amount of data transferred between GPGPUs and CPUs. The proposed approach was evaluated on both the WSJ corpus and in-house data. Experiments show that the proposed approach achieves real-time speed in various circumstances while keeping the Word Error Rate (WER) relatively 10% lower than that of n-gram models.
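
Two of the levers mentioned above, avoiding redundant RNNLM evaluations and cutting CPU-GPU round trips by batching queries, can be sketched generically as below. The cache key, batching policy, and `rnnlm_forward_gpu` callable are illustrative placeholders, not the paper's actual decoder.

```python
class BatchedRNNLMScorer:
    # Memoize (history, word) scores and evaluate new queries in one batched device call.
    def __init__(self, rnnlm_forward_gpu):
        self.forward = rnnlm_forward_gpu   # callable: list of (history, word) -> scores
        self.cache = {}

    def score(self, queries):
        # Split queries into cached hits and misses.
        misses = [q for q in queries if q not in self.cache]
        if misses:
            # One batched call instead of per-hypothesis CPU-GPU round trips.
            for q, s in zip(misses, self.forward(misses)):
                self.cache[q] = s
        return [self.cache[q] for q in queries]

# Toy usage: a fake "GPU" scorer that just hashes its inputs.
fake_gpu = lambda batch: [float(hash(q) % 100) / 100 for q in batch]
scorer = BatchedRNNLMScorer(fake_gpu)
print(scorer.score([("<s> hello", "world"), ("<s> hello", "world"), ("<s>", "hi")]))
```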

* 4 pages, 2 figures, Interspeech 2015 (accepted) 

Sequential Routing Framework: Fully Capsule Network-based Speech Recognition

Jul 23, 2020
Kyungmin Lee, Hyunwhan Joe, Hyeontaek Lim, Kwangyoun Kim, Sungsoo Kim, Chang Woo Han, Hong-Gee Kim

Capsule networks (CapsNets) have recently attracted attention as alternatives to convolutional neural networks (CNNs) owing to their greater hierarchical representation capabilities. In this paper, we introduce the sequential routing framework (SRF), which we believe is the first method to adapt a CapsNet-only structure to sequence-to-sequence recognition. In SRF, input sequences are capsulized and then sliced by a window size. Each sliced window is classified into a label at the corresponding time step through iterative routing mechanisms. Afterwards, training losses are computed using connectionist temporal classification (CTC). During routing, two kinds of information, learnable weights and iteration outputs, are shared across the slices. By sharing this information, the number of required parameters is controlled by the window size regardless of sequence length. Moreover, the method minimizes decoding speed degradation caused by the routing iterations, since it can operate in a non-iterative manner at inference time without dropping accuracy. We empirically validated our method on phoneme sequence recognition tasks using the TIMIT corpus. The proposed method attains an 82.6% phoneme recognition rate, which is 0.8% more accurate than CNN-based CTC networks and on par with recurrent neural network transducers (RNN-Ts). Furthermore, the method requires less than half the parameters of the two architectures.
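
To make the windowed classification plus CTC training concrete, here is a rough PyTorch sketch: frame-level features are sliced into fixed-size windows, a single shared classifier maps each window to label posteriors, and the sequence of posteriors is trained with a CTC loss. The capsule layers and routing iterations are collapsed into an ordinary shared classifier, so this shows only the overall structure, not the paper's SRF.

```python
import torch
import torch.nn as nn

class WindowedCTCModel(nn.Module):
    # Slice the input into fixed-size windows; one shared (weight-tied) classifier
    # emits label log-probabilities per window.
    def __init__(self, feat_dim=40, window=8, n_labels=40):   # label 0 is the CTC blank
        super().__init__()
        self.window = window
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim * window, 256), nn.ReLU(), nn.Linear(256, n_labels)
        )

    def forward(self, x):                              # x: (batch, time, feat_dim)
        b, t, d = x.shape
        t = (t // self.window) * self.window           # drop the ragged tail for simplicity
        windows = x[:, :t].reshape(b, t // self.window, self.window * d)
        return self.classifier(windows).log_softmax(-1)

model = WindowedCTCModel()
ctc = nn.CTCLoss(blank=0)

features = torch.randn(4, 128, 40)                     # 4 utterances, 128 frames each
log_probs = model(features).transpose(0, 1)            # CTC expects (T, batch, labels)
targets = torch.randint(1, 40, (4, 6))                 # phoneme indices (0 reserved for blank)
loss = ctc(log_probs, targets,
           torch.full((4,), log_probs.size(0), dtype=torch.long),   # input lengths
           torch.full((4,), 6, dtype=torch.long))                   # target lengths
loss.backward()
```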

* 40 pages, 7 figures (10 figures in total), submitted to Computer Speech and Language (only line numbers were removed from the submitted version) 

Attention based on-device streaming speech recognition with large speech corpus

Jan 02, 2020
Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained on a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation methods. In addition, we compressed our models to be more than 3.4 times smaller using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization to bring the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved an average relative improvement of 36% in word error rate (WER) on target domains, including the general domain.
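
The compression idea can be illustrated with a generic low-rank approximation that factorizes one large weight matrix into two thin factors via truncated SVD; the paper's iterative "hyper" LRA is more involved, so treat this only as a sketch of the basic parameter-count arithmetic.

```python
import numpy as np

def low_rank_factorize(weight, rank):
    # Replace one (out, in) weight matrix with two factors of the given rank via SVD.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (out, rank)
    b = vt[:rank, :]                  # (rank, in)
    return a, b

# Toy example: a 1024x1024 layer compressed to rank 128 (4x fewer parameters).
rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024))
a, b = low_rank_factorize(w, rank=128)
rel_error = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
print(f"params: {w.size} -> {a.size + b.size}, relative error {rel_error:.2f}")
# Note: a random matrix is not low-rank, so the error here is large; trained weight
# matrices typically compress far more gracefully.
```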

* Accepted and presented at the ASRU 2019 conference 

End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System

Dec 22, 2019
Chanwoo Kim, Sungsoo Kim, Kwangyoun Kim, Mehul Kumar, Jiyeon Kim, Kyungmin Lee, Changwoo Han, Abhinav Garg, Eunhyang Kim, Minkyoo Shin, Shatrughan Singh, Larry Heck, Dhananjaya Gowda

In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The entire pipeline, including data reading, large-scale data augmentation, and neural network parameter updates, is performed "on-the-fly". We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train the neural network parameters. We evaluated the effectiveness of our system on the standard LibriSpeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on the LibriSpeech test-clean set after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92% using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunkwise attention (MoChA) based approach is employed for streaming speech recognition, we obtained a WER of 9.95% on the same Bixby open domain test set.
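
The shallow fusion used above simply adds a weighted language-model log-probability to the end-to-end model's score when ranking beam-search hypotheses. A minimal sketch of that scoring rule, with made-up scores and an assumed interpolation weight:

```python
def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight=0.3):
    # Combined hypothesis score: log p_ASR(y | x) + lambda * log p_LM(y).
    return asr_log_prob + lm_weight * lm_log_prob

# Toy re-ranking of beam-search hypotheses (all scores are made-up log-probabilities).
hypotheses = [
    ("turn on the lights",  -4.1,  -7.2),
    ("turn on the light",   -4.3,  -6.1),
    ("turn own the lights", -4.0, -12.5),
]
best = max(hypotheses, key=lambda h: shallow_fusion_score(h[1], h[2]))
print(best[0])  # "turn on the light" once the LM score is taken into account
```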

* Accepted and presented at the ASRU 2019 conference 