Jin-Hwa Kim

Semi-Parametric Video-Grounded Text Generation

Jan 27, 2023

Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, Minjoon Seo

Abstract:Efficient video-language modeling should consider the computational cost because of a large, sometimes intractable, number of video frames. Parametric approaches such as the attention mechanism may not be ideal since their computational cost increases quadratically with the video length. Rather, previous studies have relied on offline feature extraction or frame sampling to represent the video efficiently, focusing on cross-modal modeling in short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, a novel perspective on scalable video-language modeling toward long untrimmed videos. Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames from the data store for a given query and a parametric generator to effectively aggregate the frames with the query via late fusion methods. Experimental results demonstrate that our method has a significant advantage in longer videos and causal video understanding. Moreover, our model achieves a new state of the art on four video-language datasets: iVQA (+4.8), Next-QA (+6.9), and Activitynet-QA (+4.8) in accuracy, and MSRVTT-Caption (+3.6) in CIDEr.

* Preprint (16 pages, 5 figures)
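
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the query and each frame are already embedded as vectors and that relevance is measured by cosine similarity; the abstract states neither, so `cosine`, `retrieve_frames`, and the toy vectors are hypothetical. The parametric generator and late fusion are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_frames(query_vec, frame_vecs, k):
    """Return the indices of the k most query-relevant frames, in temporal order.

    Assumption: relevance is cosine similarity over precomputed embeddings.
    No parameters are learned here, which is what makes the step non-parametric.
    """
    scored = [(cosine(query_vec, f), i) for i, f in enumerate(frame_vecs)]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    return sorted(i for _, i in top)

# Toy data store: five frame embeddings (hypothetical values).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.0]
selected = retrieve_frames(query, frames, k=2)
```

Only the `k` selected frames would be passed on to the generator, so the cost of the generator depends on `k` rather than on the video length.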

SelecMix: Debiased Learning by Contradicting-pair Sampling

Nov 04, 2022

Inwoo Hwang, Sangjun Lee, Yunhyeok Kwak, Seong Joon Oh, Damien Teney, Jin-Hwa Kim, Byoung-Tak Zhang

Abstract:Neural networks trained with ERM (empirical risk minimization) sometimes learn unintended decision rules, in particular when their training data is biased, i.e., when training labels are strongly correlated with undesirable features. To prevent a network from learning such features, recent methods augment training data such that examples displaying spurious correlations (i.e., bias-aligned examples) become a minority, whereas the other, bias-conflicting examples become prevalent. However, these approaches are sometimes difficult to train and scale to real-world data because they rely on generative models or disentangled representations. We propose an alternative based on mixup, a popular augmentation that creates convex combinations of training examples. Our method, coined SelecMix, applies mixup to contradicting pairs of examples, defined as showing either (i) the same label but dissimilar biased features, or (ii) different labels but similar biased features. Identifying such pairs requires comparing examples with respect to unknown biased features. For this, we utilize an auxiliary contrastive model with the popular heuristic that biased features are learned preferentially during training. Experiments on standard benchmarks demonstrate the effectiveness of the method, in particular when label noise complicates the identification of bias-conflicting examples.

* NeurIPS 2022
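
The two kinds of contradicting pairs and the mixup step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the biased features are given here as plain vectors (the paper obtains them from an auxiliary contrastive model), Euclidean distance stands in for the similarity measure, and all function names are hypothetical.

```python
import math

def distance(u, v):
    """Euclidean distance between two biased-feature vectors (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_contradicting_partner(i, labels, bias_feats, same_label=True):
    """Pick a partner for example i that forms a contradicting pair.

    same_label=True : same label, most dissimilar biased features (case i).
    same_label=False: different label, most similar biased features (case ii).
    Raises ValueError if no candidate exists.
    """
    if same_label:
        cands = [j for j in range(len(labels)) if j != i and labels[j] == labels[i]]
        return max(cands, key=lambda j: distance(bias_feats[i], bias_feats[j]))
    cands = [j for j in range(len(labels)) if labels[j] != labels[i]]
    return min(cands, key=lambda j: distance(bias_feats[i], bias_feats[j]))

def mixup(x_i, x_j, lam):
    """Convex combination lam * x_i + (1 - lam) * x_j of two inputs."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]

# Toy batch (hypothetical values).
inputs = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 3.0]]
labels = [0, 0, 0, 1]
bias_feats = [[0.0], [0.1], [5.0], [0.2]]

partner_same = pick_contradicting_partner(0, labels, bias_feats, same_label=True)
partner_diff = pick_contradicting_partner(0, labels, bias_feats, same_label=False)
mixed = mixup(inputs[0], inputs[partner_same], lam=0.5)
```

How the label of the mixed example is assigned is not specified in the abstract, so the sketch leaves it out.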

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Oct 23, 2022

Minjoon Jung, Seongho Choi, Joochan Kim, Jin-Hwa Kim, Byoung-Tak Zhang

Abstract:Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query. For narrative videos, e.g., dramas or movies, a holistic understanding of temporal dynamics and multimodal reasoning is crucial. Previous works have shown promising results; however, they relied on expensive query annotations for VCMR, i.e., the corresponding moment intervals. To overcome this problem, we propose a self-supervised learning framework: Modal-specific Pseudo Query Generation Network (MPGN). First, MPGN selects candidate temporal moments via subtitle-based moment sampling. Then, it generates pseudo queries exploiting both visual and textual information from the selected temporal moments. Through the multimodal information in the pseudo queries, we show that MPGN successfully learns to localize the video corpus moment without any explicit annotation. We validate the effectiveness of MPGN on the TVR dataset, showing competitive results compared with both supervised models and unsupervised setting models.

* Accepted by EMNLP 2022 main conference

AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

Oct 08, 2022

Se Jung Kwon, Jeonghoon Kim, Jeongin Bae, Kang Min Yoo, Jin-Hwa Kim, Baeseong Park, Byeongwook Kim, Jung-Woo Ha, Nako Sung, Dongsoo Lee

Abstract:There is growing interest in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning of only some parts of the quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10x compression ratio under 4-bit quantization and >1,000x reduction in the number of trainable parameters.

* Findings of EMNLP 2022
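
The factorization into binary codes and scaling factors can be sketched with a greedy approximation. This is one common way to compute a binary-coding quantization and is an assumption here; the abstract does not say which solver or which scaling granularity the authors use, and `bcq_quantize` and `dequantize` are hypothetical names.

```python
def bcq_quantize(weights, num_bits):
    """Greedy binary-coding quantization of one weight vector.

    Approximates w by sum_k alpha_k * b_k with b_k in {-1, +1}^n.
    Each round takes the sign of the residual as the code and the mean
    absolute residual as the scale (the least-squares scale for that code).
    """
    residual = list(weights)
    alphas, codes = [], []
    for _ in range(num_bits):
        code = [1.0 if r >= 0 else -1.0 for r in residual]
        alpha = sum(abs(r) for r in residual) / len(residual)
        alphas.append(alpha)
        codes.append(code)
        residual = [r - alpha * c for r, c in zip(residual, code)]
    return alphas, codes

def dequantize(alphas, codes):
    """Rebuild the approximate weights from scales and frozen binary codes."""
    n = len(codes[0])
    return [sum(a * c[k] for a, c in zip(alphas, codes)) for k in range(n)]

w = [0.5, -1.5, 1.0, -1.0]
alphas, codes = bcq_quantize(w, num_bits=2)
w_hat = dequantize(alphas, codes)

# Adaptation in the spirit of the abstract: the codes stay frozen and only
# the scaling factors change (here rescaled by hand instead of by training).
tuned_alphas = [1.1 * a for a in alphas]
w_tuned = dequantize(tuned_alphas, codes)
```

Because only the `alphas` would be trained, the number of trainable values is one per bit per quantized vector in this sketch, far fewer than the number of weights.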

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

May 25, 2022

Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee

Abstract:Text-to-image generation and image captioning have recently emerged as a new experimental paradigm to assess machine intelligence. They predict continuous quantities accompanied by sampling techniques in the generation, which complicates evaluation and makes the marginal distributions intractable to obtain. Based on a recent trend that multimodal generative evaluations exploit a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, coined Mutual Information Divergence (MID). To validate the metric, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. We look forward to seeing the underrepresented implications of the Gaussian cross-mutual information in multimodal representation learning and future work based on this novel proposition.
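
For background, the standard closed form of mutual information between two jointly Gaussian variables is given below. It is the textbook quantity that a Gaussian assumption makes tractable, not the paper's exact definition of MID; the symbols are introduced here for illustration, with $\Sigma_x$ and $\Sigma_y$ the marginal covariances and $\Sigma$ the joint covariance of the stacked vector $(x, y)$.

```latex
I(X; Y) = \frac{1}{2} \log \frac{\det(\Sigma_x)\,\det(\Sigma_y)}{\det(\Sigma)},
\qquad
\Sigma = \begin{pmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{pmatrix}
```

When $x$ and $y$ are uncorrelated, $\Sigma_{xy} = 0$ gives $\det(\Sigma) = \det(\Sigma_x)\det(\Sigma_y)$ and therefore $I(X; Y) = 0$.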

The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training

May 25, 2022

Gi-Cheon Kang, Sungdong Kim, Jin-Hwa Kim, Donghyun Kwak, Byoung-Tak Zhang

Abstract:Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs regarding the images via multimodal conditional text generation. GST then trains a dialog agent on the synthetic and the original VisDial data. As a result, GST scales the amount of training data by up to an order of magnitude relative to VisDial (from 1.2M to 12.9M QA data). For robust training on the generated dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both datasets. We further observe strong performance gains in the low-data regime (up to 9.35 absolute points on NDCG).

* 16 pages, 4 figures

ReFine: Re-randomization before Fine-tuning for Cross-domain Few-shot Learning

May 11, 2022

Jaehoon Oh, Sungnyun Kim, Namgyu Ho, Jin-Hwa Kim, Hwanjun Song, Se-Young Yun

Abstract:Cross-domain few-shot learning (CD-FSL), where there are few target samples under extreme differences between source and target domains, has recently attracted huge attention. For CD-FSL, recent studies generally have developed transfer learning based approaches that pre-train a neural network on popular labeled source domain datasets and then transfer it to target domain data. Although the labeled datasets may provide suitable initial parameters for the target data, the domain difference between the source and target might hinder the fine-tuning on the target domain. This paper proposes a simple yet powerful method that re-randomizes the parameters fitted on the source domain before adapting to the target data. The re-randomization resets source-specific parameters of the source pre-trained model and thus facilitates fine-tuning on the target domain, improving few-shot performance.

* 8 pages, 3 figures, and 7 tables

Understanding Cross-Domain Few-Shot Learning: An Experimental Study

Feb 08, 2022

Jaehoon Oh, Sungnyun Kim, Namgyu Ho, Jin-Hwa Kim, Hwanjun Song, Se-Young Yun

Abstract:Cross-domain few-shot learning has drawn increasing attention for handling large differences between the source and target domains--an important concern in real-world scenarios. To overcome these large differences, recent works have considered exploiting small-scale unlabeled data from the target domain during the pre-training stage. This data enables self-supervised pre-training on the target domain, in addition to supervised pre-training on the source domain. In this paper, we empirically investigate scenarios under which it is advantageous to use each pre-training scheme, based on domain similarity and few-shot difficulty: performance gain of self-supervised pre-training over supervised pre-training increases when domain similarity is smaller or few-shot difficulty is lower. We further design two pre-training schemes, mixed-supervised and two-stage learning, that improve performance. In this light, we present seven findings for CD-FSL which are supported by extensive experiments and analyses on three source and eight target benchmark datasets with varying levels of domain similarity and few-shot difficulty. Our code is available at https://anonymous.4open.science/r/understandingCDFSL.

* 25 pages, 13 figures, and 15 tables

Semi-orthogonal Embedding for Efficient Unsupervised Anomaly Segmentation

May 31, 2021

Jin-Hwa Kim, Do-Hyeong Kim, Saehoon Yi, Taehoon Lee

Abstract:We present the efficiency of semi-orthogonal embedding for unsupervised anomaly segmentation. Multi-scale features from pre-trained CNNs have recently been used for localized Mahalanobis distances with significant performance. However, the increased feature size makes it problematic to scale up to bigger CNNs, since it requires the batch-inverse of a multi-dimensional covariance tensor. Here, we generalize an ad-hoc method, random feature selection, into semi-orthogonal embedding for robust approximation, cubically reducing the computational cost of inverting the multi-dimensional covariance tensor. With the scrutiny of ablation studies, the proposed method achieves a new state of the art with significant margins on the MVTec AD, KolektorSDD, KolektorSDD2, and mSTC datasets. The theoretical and empirical analyses offer insights and verification of our straightforward yet cost-effective approach.
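
The cost argument in the abstract can be made concrete with a short worked formulation. This is a plausible reading of the abstract rather than a transcription of the paper; the symbols $F$ (feature size), $k$ (embedding size), $\mu$, $\Sigma$, and $W$ are introduced here for illustration.

```latex
% Mahalanobis distance on the full F-dimensional feature x at one location:
d(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)},
\qquad \Sigma \in \mathbb{R}^{F \times F}
\quad\Rightarrow\quad \text{inversion cost } O(F^{3})

% With a semi-orthogonal embedding W (W^T W = I_k, k < F):
d_W(x) = \sqrt{(x - \mu)^{\top} W \left(W^{\top} \Sigma W\right)^{-1} W^{\top} (x - \mu)},
\qquad W \in \mathbb{R}^{F \times k}
\quad\Rightarrow\quad \text{inversion cost } O(k^{3})
```

Random feature selection corresponds to the special case in which the columns of $W$ are $k$ distinct standard basis vectors, which is also a semi-orthogonal matrix.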

Multi-step Estimation for Gradient-based Meta-learning

Jun 08, 2020

Jin-Hwa Kim, Junyoung Park, Yongseok Choi

Abstract:Gradient-based meta-learning approaches have been successful in few-shot learning, transfer learning, and a wide range of other domains. Despite their efficacy and simplicity, the burden of calculating the Hessian matrix with large memory footprints is the critical challenge in large-scale applications. To tackle this issue, we propose a simple and straightforward method to reduce the cost by reusing the same gradient in a window of inner steps. We describe the dynamics of the multi-step estimation in the Lagrangian formalism and discuss how to reduce the evaluation of second-order derivatives when estimating the dynamics. To validate our method, we experiment on meta-transfer learning and few-shot learning tasks for multiple settings. The experiment on meta-transfer emphasizes the applicability of training meta-networks, where other approximations are limited. For few-shot learning, we evaluate time and memory complexities compared with popular baselines. We show that our method significantly reduces training time and memory usage, maintaining competitive accuracies, or even outperforming in some cases.

* 17 pages, 5 figures
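
The idea of reusing one gradient across a window of inner steps can be sketched on a toy quadratic loss. This only illustrates the inner-loop reuse and counts gradient evaluations; the outer meta-update and the second-order terms discussed in the abstract are not shown, and `inner_loop`, `window`, and the toy loss are hypothetical.

```python
def inner_loop(theta, grad_fn, lr, num_steps, window):
    """Run gradient descent, recomputing the gradient only every `window` steps.

    Within a window the same gradient is reused for each update.
    Returns the final parameters and the number of gradient evaluations.
    """
    evals = 0
    grad = None
    for step in range(num_steps):
        if step % window == 0:
            grad = grad_fn(theta)  # fresh gradient at the start of each window
            evals += 1
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta, evals

# Toy loss 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = [1.0, -2.0]

def grad_fn(theta):
    return [t - c for t, c in zip(theta, target)]

theta_reuse, evals_reuse = inner_loop([0.0, 0.0], grad_fn, lr=0.1, num_steps=6, window=3)
theta_exact, evals_exact = inner_loop([0.0, 0.0], grad_fn, lr=0.1, num_steps=6, window=1)
```

With `window=3` the six updates need two gradient evaluations instead of six, at the price of following a slightly stale gradient inside each window.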
