
Gunhee Kim


Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

Sep 20, 2023
Heeseung Yun, Joonil Na, Gunhee Kim


Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.
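
As a rough illustration of the alignment idea described above, the sketch below adds learnable spatial embeddings to an audio student's feature map before matching it against a frozen visual teacher's features with a simple distillation loss. All module names, shapes, and the choice of an L2 matching loss are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAlignmentDistill(nn.Module):
    """Toy feature-distillation head: audio features plus learnable spatial
    embeddings are projected and matched to a frozen visual teacher's features."""

    def __init__(self, audio_dim, visual_dim, height, width):
        super().__init__()
        # learnable spatial embeddings laid out on the teacher's spatial grid (assumption)
        self.spatial_embed = nn.Parameter(0.02 * torch.randn(1, audio_dim, height, width))
        self.proj = nn.Conv2d(audio_dim, visual_dim, kernel_size=1)

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, C_a, H, W) student features; visual_feat: (B, C_v, H, W) teacher features
        aligned = self.proj(audio_feat + self.spatial_embed)
        # plain L2 feature-matching loss; the paper's actual objective may differ
        return F.mse_loss(aligned, visual_feat.detach())

# usage: one such head per student layer, with the losses summed into the training objective
head = SpatialAlignmentDistill(audio_dim=64, visual_dim=256, height=16, width=32)
loss = head(torch.randn(2, 64, 16, 32), torch.randn(2, 256, 16, 32))
```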

* Published at ICCV 2023 

EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization

Sep 14, 2023
Minjung Kim, Junseo Koo, Gunhee Kim


Visual localization is the task of estimating the 6-DoF camera pose of a query image within a provided 3D reference map. Thanks to recent advances in 3D sensors, 3D point clouds are becoming a more accurate and affordable option for building the reference map, but matching the points of a 3D point cloud with pixels in 2D images for visual localization remains challenging. Existing approaches that jointly learn 2D-3D feature matching suffer from a low inlier ratio due to representational differences between the two modalities, while methods that sidestep this problem by recasting matching as classification suffer from poor refinement. In this work, we propose EP2P-Loc, a novel large-scale visual localization method that mitigates this appearance discrepancy and enables end-to-end training for pose estimation. To increase the number of inliers, we propose a simple algorithm that removes 3D points invisible in the image and finds all 2D-3D correspondences without keypoint detection. To reduce memory usage and search complexity, we take a coarse-to-fine approach: we extract patch-level features from 2D images, perform 2D patch classification for each 3D point, and recover the exact corresponding 2D pixel coordinates through positional encoding. Finally, for the first time in this task, we employ a differentiable PnP solver for end-to-end training. In experiments on newly curated large-scale indoor and outdoor benchmarks based on 2D-3D-S and KITTI, our method achieves state-of-the-art performance compared to existing visual localization and image-to-point-cloud registration methods.
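
A minimal sketch of the coarse-to-fine matching step described above: each 3D point is classified over 2D patch features by similarity, and a soft-argmax over patch centers yields differentiable 2D coordinates that a differentiable PnP solver could consume downstream. The function name, shapes, temperature, and the soft-argmax refinement are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def match_points_to_pixels(point_feat, patch_feat, patch_centers, temperature=0.07):
    """point_feat: (N, D) descriptors of visible 3D points
    patch_feat: (P, D) patch-level descriptors from the query image
    patch_centers: (P, 2) pixel coordinates of the patch centers
    returns: (N, 2) differentiable 2D correspondences."""
    sim = F.normalize(point_feat, dim=-1) @ F.normalize(patch_feat, dim=-1).T  # (N, P)
    prob = F.softmax(sim / temperature, dim=-1)   # soft patch classification per 3D point
    coords_2d = prob @ patch_centers              # expected pixel location (soft-argmax)
    return coords_2d

# usage: feed (3D points, coords_2d) pairs to a differentiable PnP layer for end-to-end training
coords = match_points_to_pixels(torch.randn(100, 128), torch.randn(196, 128), torch.rand(196, 2) * 224)
```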

* Accepted to ICCV 2023 

Recursion of Thought: A Divide-and-Conquer Approach to Multi-Context Reasoning with Language Models

Jun 12, 2023
Soochan Lee, Gunhee Kim


Generating intermediate steps, or Chain of Thought (CoT), is an effective way to significantly improve language models' (LMs') multi-step reasoning capability. However, CoT lengths can grow rapidly with problem complexity, easily exceeding the maximum context size. Instead of increasing the context limit, which has already been heavily investigated, we explore an orthogonal direction: making LMs divide a problem into multiple contexts. We propose a new inference framework, called Recursion of Thought (RoT), which introduces several special tokens that the models can output to trigger context-related operations. Extensive experiments with multiple architectures, including GPT-3, show that RoT dramatically improves LMs' ability to solve problems whose solutions consist of hundreds of thousands of tokens.
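
The divide-and-conquer inference loop could look roughly like the sketch below, where the model emits a special token to mark a subproblem that is then solved in a fresh context and spliced back. The token names and the `generate` interface are placeholders, not the exact token protocol from the paper.

```python
THINK, STOP = "<THINK>", "<STOP>"   # placeholder special tokens

def recursive_solve(model, problem, max_depth=16):
    """Solve `problem` in its own context, recursing whenever the model marks a subproblem."""
    if max_depth == 0:
        raise RecursionError("maximum recursion depth reached")
    context = problem
    while True:
        step = model.generate(context)              # produce the next segment (assumed API)
        if step.startswith(THINK):
            # spawn a fresh, smaller context for the marked subproblem
            sub_answer = recursive_solve(model, step[len(THINK):], max_depth - 1)
            context += sub_answer                   # splice the sub-answer back in
        elif step.endswith(STOP):
            return step[:-len(STOP)]                # final answer for this context
        else:
            context += step
```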

* ACL 2023 (short, findings) 

KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application

May 30, 2023
Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, Gunhee Kim, Jung-Woo Ha


Large language models (LLMs) learn not only natural text generation abilities but also social biases against different demographic groups from real-world data. This poses a critical risk when deploying LLM-based applications. Existing research and resources are not readily applicable in South Korea due to differences in language and culture, both of which significantly affect the biases and targeted demographic groups. This limitation requires localized social bias datasets to ensure the safe and effective deployment of LLMs. To this end, we present KoSBi, a new social bias dataset of 34k pairs of contexts and sentences in Korean covering 72 demographic groups in 15 categories. We find that, through filtering-based moderation, social biases in generated content can be reduced by 16.47%p on average for HyperCLOVA (30B and 82B) and GPT-3.
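
Filtering-based moderation of the kind reported above can be sketched as scoring each generated sentence with a bias classifier (for example, one trained on this dataset) and discarding sentences above a safety threshold; the classifier interface and threshold here are hypothetical.

```python
def moderate(generations, bias_classifier, threshold=0.5):
    """generations: list of generated sentences
    bias_classifier: callable returning the probability that a sentence is biased/unsafe
    returns only the sentences judged safe enough to surface."""
    return [text for text in generations if bias_classifier(text) < threshold]
```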

* 17 pages, 8 figures, 12 tables, ACL 2023 

SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created Through Human-Machine Collaboration

May 28, 2023
Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, Meeyoung Cha, Yejin Choi, Byoung Pil Kim, Gunhee Kim, Eun-Ju Lee, Yong Lim, Alice Oh, Sangchul Park, Jung-Woo Ha


The potential social harms that large language models pose, such as generating offensive content and reinforcing biases, are steeply rising. Existing works focus on coping with this concern while interacting with ill-intentioned users, such as those who explicitly produce hate speech or elicit harmful responses. However, discussions on sensitive issues can become toxic even if the users are well-intentioned. For safer models in such scenarios, we present the Sensitive Questions and Acceptable Responses (SQuARe) dataset, a large-scale Korean dataset of 49k sensitive questions with 42k acceptable and 46k non-acceptable responses. The dataset was constructed by leveraging HyperCLOVA in a human-in-the-loop manner, starting from real news headlines. Experiments show that acceptable response generation improves significantly for HyperCLOVA and GPT-3, demonstrating the efficacy of this dataset.

* 19 pages, 10 figures, ACL 2023 

MPCHAT: Towards Multimodal Persona-Grounded Conversation

May 27, 2023
Jaewoo Ahn, Yeda Song, Sangdoo Yun, Gunhee Kim


In order to build self-consistent personalized dialogue agents, previous research has mostly focused on textual persona that delivers personal facts or personalities. However, to fully describe the multi-faceted nature of persona, image modality can help better reveal the speaker's personal characteristics and experiences in episodic memory (Rubin et al., 2003; Conway, 2009). In this work, we extend persona-based dialogue to the multimodal domain and make two main contributions. First, we present the first multimodal persona-based dialogue dataset named MPCHAT, which extends persona with both text and images to contain episodic memories. Second, we empirically show that incorporating multimodal persona, as measured by three proposed multimodal persona-grounded dialogue tasks (i.e., next response prediction, grounding persona prediction, and speaker identification), leads to statistically significant performance improvements across all tasks. Thus, our work highlights that multimodal persona is crucial for improving multimodal dialogue comprehension, and our MPCHAT serves as a high-quality resource for this research.

* Accepted at ACL 2023 

Who Wrote this Code? Watermarking for Code Generation

May 24, 2023
Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, Gunhee Kim


Large language models for code have recently shown remarkable performance in generating executable code. However, this rapid advancement has been accompanied by many legal and ethical concerns, such as code licensing issues, code plagiarism, and malware generation, making watermarking machine-generated code a very timely problem. Despite this urgent need, we discover that existing watermarking and machine-generated text detection methods for LLMs fail to function properly on code generation tasks. Hence, in this work, we propose a new watermarking method, SWEET, that significantly improves upon previous approaches for watermarking machine-generated code. Our method selectively applies watermarking only to tokens whose entropy surpasses a defined threshold. Experiments on code generation benchmarks show that our watermarked code has superior quality compared to code produced by the previous state-of-the-art LLM watermarking method. Furthermore, our watermarking method also outperforms DetectGPT on the task of machine-generated code detection.
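
A hedged sketch of the selective watermarking idea: bias a "green list" of tokens only at positions where the next-token entropy exceeds a threshold, leaving low-entropy code tokens untouched. The greenlist construction, threshold, and bias strength below are simplified assumptions, not the exact SWEET implementation.

```python
import torch
import torch.nn.functional as F

def watermark_logits(logits, greenlist_mask, entropy_threshold=1.2, delta=2.0):
    """logits: (vocab_size,) next-token logits
    greenlist_mask: (vocab_size,) boolean mask of watermark-favored tokens
    (in practice derived from a keyed hash of preceding tokens)."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum()
    if entropy > entropy_threshold:             # only watermark where the model has real choice
        logits = logits + delta * greenlist_mask.float()
    return logits

# usage: applied to each decoding step; detection later checks the green-token rate
# only at the same high-entropy positions.
```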


SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

Dec 20, 2022
Hyunwoo Kim, Jack Hessel, Liwei Jiang, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, Yejin Choi


We present SODA: the first publicly available, million-scale high-quality social dialogue dataset. Using SODA, we train COSMO: a generalizable conversation agent outperforming previous best-performing agents on both in- and out-of-domain datasets. In contrast to most existing crowdsourced, small-scale dialogue corpora, we distill 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al., 2022). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x; West et al., 2022). Human evaluation shows that dialogues in SODA are more consistent, specific, and (surprisingly) natural than prior human-authored datasets - e.g., DailyDialog (Li et al., 2017), BlendedSkillTalk (Smith et al., 2020). In addition, extensive evaluations show that COSMO is significantly more natural and consistent on unseen datasets than best-performing dialogue models - e.g., GODEL (Peng et al., 2022), BlenderBot (Roller et al., 2021), DialoGPT (Zhang et al., 2020). Furthermore, it is sometimes even preferred to the original human-written gold responses. We make our data, models, and code public.
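
The distillation pipeline described above can be sketched as a two-step prompting loop: verbalize a commonsense triple into a short narrative, then ask the LLM to write the conversation grounded in that narrative. The prompt wording and the `complete` call below are placeholders, not the authors' exact pipeline.

```python
def distill_dialogue(llm, triple):
    """triple: (head, relation, tail) commonsense knowledge, e.g.
    ("PersonX moves to a new city", "xWant", "to make friends")."""
    head, relation, tail = triple
    # step 1: contextualize the triple into a short social narrative (assumed prompt)
    narrative = llm.complete(
        f"Rewrite this commonsense fact as a two-sentence story: {head} {relation} {tail}"
    )
    # step 2: ground a full conversation in that narrative (assumed prompt)
    dialogue = llm.complete(
        f"{narrative}\nWrite the full conversation between the two people in this story:"
    )
    return narrative, dialogue
```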

* Dataset, models, and code can be found at https://hyunw.kim/sodaverse 

Variational Laplace Autoencoders

Nov 30, 2022
Yookoon Park, Chris Dongjoo Kim, Gunhee Kim


Variational autoencoders employ an amortized inference model to approximate the posterior of latent variables. However, such amortized variational inference faces two challenges: (1) the limited posterior expressiveness of the fully-factorized Gaussian assumption and (2) the amortization error of the inference model. We present a novel approach that addresses both challenges. First, we focus on ReLU networks with Gaussian output and illustrate their connection to probabilistic PCA. Building on this observation, we derive an iterative algorithm that finds the mode of the posterior and apply a full-covariance Gaussian posterior approximation centered at the mode. Subsequently, we present a general framework named Variational Laplace Autoencoders (VLAEs) for training deep generative models. Based on the Laplace approximation of the latent variable posterior, VLAEs enhance the expressiveness of the posterior while reducing the amortization error. Empirical results on MNIST, Omniglot, Fashion-MNIST, SVHN, and CIFAR10 show that the proposed approach significantly outperforms other recent amortized or iterative methods on ReLU networks.
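
A rough sketch of the Laplace-approximation step: find the posterior mode of z for a given x, then use the inverse negative Hessian of the log-joint at that mode as a full-covariance Gaussian posterior. The paper exploits the local linearity of ReLU decoders for an efficient update, whereas this sketch falls back on generic gradient ascent and automatic differentiation; the interface and step counts are assumptions.

```python
import torch

def laplace_posterior(log_joint, z_init, steps=20, lr=0.1):
    """log_joint: callable z -> scalar log p(x, z) for a fixed observation x
    z_init: (D,) initial latent, e.g. from the amortized encoder
    returns: posterior mode and full covariance of the Laplace approximation."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):                      # iterative mode finding (gradient ascent on log p(x, z))
        opt.zero_grad()
        loss = -log_joint(z)
        loss.backward()
        opt.step()
    mode = z.detach()
    hess = torch.autograd.functional.hessian(log_joint, mode)   # (D, D) Hessian at the mode
    cov = torch.linalg.inv(-hess)               # full-covariance Gaussian around the mode
    return mode, cov
```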

* Published in ICML 2019 

Panoramic Vision Transformer for Saliency Detection in 360° Videos

Sep 19, 2022
Heeseung Yun, Sehun Lee, Gunhee Kim


360$^\circ$ video saliency detection is one of the challenging benchmarks for 360$^\circ$ video understanding, since non-negligible distortion and discontinuity occur in the projection of any format of 360$^\circ$ video, and the capture-worthy viewpoint on the omnidirectional sphere is inherently ambiguous. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using a Vision Transformer with deformable convolution, which enables us not only to plug pretrained models from normal videos into our architecture without additional modules or finetuning, but also to perform geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn saliency from three simple relative relations among local patch features, outperforming state-of-the-art models on the Wild360 benchmark by large margins without supervision or auxiliary information like class activation. We demonstrate the utility of our saliency prediction model on the omnidirectional video quality assessment task in VQA-ODV, where it consistently improves performance without any form of supervision, including head movement.
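
The distortion-aware patch embedding can be sketched with a deformable convolution whose sampling offsets would be precomputed once from the equirectangular geometry; here the offsets are zeros as a placeholder, which reduces the operation to a standard patch embedding. Names and shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
from torchvision.ops import deform_conv2d

def deformable_patch_embed(erp_frame, weight, offsets, patch=16):
    """erp_frame: (B, 3, H, W) equirectangular frame
    weight: (D, 3, patch, patch) pretrained ViT patch-embedding kernel
    offsets: (B, 2*patch*patch, H//patch, W//patch) geometric sampling offsets."""
    return deform_conv2d(erp_frame, offsets, weight, stride=patch)

# example: offsets would normally be derived once from the sphere-to-plane projection
# and cached; zero offsets fall back to an ordinary non-overlapping patch embedding.
B, H, W, D, patch = 1, 224, 448, 768, 16
frame = torch.randn(B, 3, H, W)
kernel = torch.randn(D, 3, patch, patch)
offsets = torch.zeros(B, 2 * patch * patch, H // patch, W // patch)
emb = deformable_patch_embed(frame, kernel, offsets, patch)   # (B, D, H/16, W/16)
```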

* Published at ECCV 2022 