Jiyoung Lee

KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge

Feb 22, 2024

Jiyoung Lee, Minwoo Kim, Seungho Kim, Junghwan Kim, Seunghyun Won, Hwaran Lee, Edward Choi

Abstract: For Large Language Models (LLMs) to be effectively deployed in a specific country, they must possess an understanding of the nation's culture and basic knowledge. To this end, we introduce National Alignment, which measures the alignment between an LLM and a targeted country from two aspects: social value alignment and common knowledge alignment. Social value alignment evaluates how well the model understands nation-specific social values, while common knowledge alignment examines how well the model captures basic knowledge related to the nation. We constructed KorNAT, the first benchmark that measures national alignment with South Korea. For the social value dataset, we obtained ground truth labels from a large-scale survey involving 6,174 unique Korean participants. For the common knowledge dataset, we constructed samples based on Korean textbooks and GED reference materials. KorNAT contains 4K and 6K multiple-choice questions for social value and common knowledge, respectively. Our dataset creation process is meticulously designed, based on statistical sampling theory, and refined through multiple rounds of human review. Experimental results on seven LLMs reveal that only a few models met our reference score, indicating potential for further enhancement. KorNAT has received government approval after passing an assessment conducted by a government-affiliated organization dedicated to evaluating dataset quality. Samples and detailed evaluation protocols of our dataset can be found at https://selectstar.ai/ko/papers-national-alignment

* 35 pages, 7 figures, 16 tables

Dense Text-to-Image Generation with Attention Modulation

Aug 24, 2023

Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu

Abstract: Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions in terms of both automatic and human evaluation scores. In addition, we achieve visual results of similar quality to those of models specifically trained with layout conditions.

* Accepted by ICCV2023. Code and data are available at https://github.com/naver-ai/DenseDiffusion

Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning

Aug 08, 2023

Hanjae Kim, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn

Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen compositions with prior knowledge of known primitives (attribute and object). Previous works on CZSL often struggle to grasp the contextuality between attribute and object, and they also suffer from limited discriminability of visual features and the long-tailed distribution of real-world compositional data. We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. CoT employs object and attribute experts in distinct manners to generate representative embeddings, using the visual network hierarchically. The object expert extracts representative object embeddings from the final layer in a bottom-up manner, while the attribute expert makes attribute embeddings in a top-down manner with a proposed object-guided attention module that models contextuality explicitly. To remedy biased prediction caused by imbalanced data distribution, we develop a simple minority attribute augmentation (MAA) that synthesizes virtual samples by mixing two images and oversampling minority attribute classes. Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL. We also demonstrate the effectiveness of CoT in improving visual discrimination and addressing the model bias from the imbalanced data distribution. The code is available at https://github.com/HanjaeKim98/CoT.

* ICCV 2023

VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception

Aug 03, 2023

Jiyoung Lee, Seungho Kim, Seunghyun Won, Joonseok Lee, Marzyeh Ghassemi, James Thorne, Jaeseok Choi, O-Kil Kwon, Edward Choi

Abstract: AI alignment refers to models acting towards human-intended goals, preferences, or ethical principles. Given that most large-scale deep learning models act as black boxes and cannot be manually controlled, analyzing the similarity between models and humans can be a proxy measure for ensuring AI safety. In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment. Specifically, we propose a new dataset for measuring AI-human visual alignment in terms of image classification, a fundamental task in machine perception. In order to evaluate AI-human visual alignment, a dataset should encompass samples with various scenarios that may arise in the real world and have gold human perception labels. Our dataset consists of three groups of samples, namely Must-Act (i.e., Must-Classify), Must-Abstain, and Uncertain, based on the quantity and clarity of visual information in an image and further divided into eight categories. All samples have a gold human perception label; even Uncertain (severely blurry) sample labels were obtained via crowd-sourcing. The validity of our dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using our dataset, we analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods. Our code and data are available at https://github.com/jiyounglee-0523/VisAlign.

Panoramic Image-to-Image Translation

Apr 11, 2023

Soohyun Kim, Junho Kim, Taekyung Kim, Hwan Heo, Seungryong Kim, Jiyoung Lee, Jin-Hwa Kim

Abstract: In this paper, we tackle the challenging task of Panoramic Image-to-Image translation (Pano-I2I) for the first time. This task is difficult due to the geometric distortion of panoramic images and the lack of a panoramic image dataset with diverse conditions, such as weather or time of day. To address these challenges, we propose a panoramic distortion-aware I2I model that preserves the structure of the panoramic images while consistently translating their global style referenced from a pinhole image. To mitigate the distortion issue in naive 360 panorama translation, we add spherical positional embedding to our transformer encoders, introduce a distortion-free discriminator, and apply sphere-based rotation for augmentation and its ensemble. We also design a content encoder and a style encoder to be deformation-aware to deal with the large domain gap between panoramas and pinhole images, enabling us to work on diverse conditions of pinhole images. In addition, considering the large discrepancy between panoramas and pinhole images, our framework decouples the learning procedure of the panoramic reconstruction stage from the translation stage. We show distinct improvements over existing I2I models in translating the daytime StreetLearn dataset into diverse conditions. The code will be made publicly available.

Three Recipes for Better 3D Pseudo-GTs of 3D Human Mesh Estimation in the Wild

Apr 10, 2023

Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, Sangdoo Yun

Abstract: Recovering 3D human meshes in the wild is highly challenging because in-the-wild (ITW) datasets provide only 2D pose ground truths (GTs). Recently, 3D pseudo-GTs have been widely used to train 3D human mesh estimation networks, as they enable 3D mesh supervision when training the networks on ITW datasets. However, despite the great potential of 3D pseudo-GTs, there has been no extensive analysis of which factors are important for making them more beneficial. In this paper, we provide three recipes to obtain highly beneficial 3D pseudo-GTs of ITW datasets. The main challenge is that only 2D-based weak supervision is available when obtaining the 3D pseudo-GTs. Each of our three recipes addresses one aspect of this challenge: depth ambiguity, sub-optimality of weak supervision, and implausible articulation. Experimental results show that simply re-training state-of-the-art networks with our new 3D pseudo-GTs elevates their performance to the next level without bells and whistles. The 3D pseudo-GTs are publicly available at https://github.com/mks0601/NeuralAnnot_RELEASE.

* Published at CVPRW 2023

Dual-path Adaptation from Image to Video Transformers

Mar 17, 2023

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

Abstract: In this paper, we efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters. Previous adaptation methods have simultaneously considered spatial and temporal modeling with a unified learnable module but have still struggled to fully leverage the representative capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. For temporal dynamic modeling in particular, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' capability of extrapolating relationships between tokens. In addition, we extensively investigate multiple baselines from a unified perspective in video understanding and compare them with DualPath. Experimental results on four action recognition benchmarks show that pretrained image transformers with DualPath can be effectively generalized beyond the data domain.

* CVPR 2023. Code is available at https://github.com/park-jungin/DualPath

Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Mar 16, 2023

Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong Kim

Abstract: Text-to-3D generation has recently shown rapid progress with the advent of score distillation, a methodology of using pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in the zero-shot setting. However, the lack of 3D awareness in the 2D diffusion models destabilizes score distillation-based methods and prevents them from reconstructing a plausible 3D scene. To address this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness into pretrained 2D diffusion models, enhancing the robustness and 3D consistency of score distillation-based methods. We realize this by first constructing a coarse 3D structure of a given text prompt and then utilizing a projected, view-specific depth map as a condition for the diffusion model. Additionally, we introduce a training strategy that enables the 2D diffusion model to learn to handle the errors and sparsity within the coarse 3D structure for robust generation, as well as a method for ensuring semantic consistency throughout all viewpoints of the scene. Our framework surpasses the limitations of prior art and has significant implications for 3D-consistent generation with 2D diffusion models.

* Project page https://ku-cvlab.github.io/3DFuse/

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

Feb 27, 2023

Jiyoung Lee, Joon Son Chung, Soo-Whan Chung

Abstract: The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the natural fact that people can imagine the voice of someone when they look at his or her face, we introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes, called Face-TTS. This is the first time that face images are used as a condition to train a TTS model. We jointly train cross-model biometrics and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss to enforce the similarity of the generated and the ground truth speech segments in speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles. The project page is https://facetts.github.io.

* ICASSP 2023. Project page: https://facetts.github.io

Robust Camera Pose Refinement for Multi-Resolution Hash Encoding

Feb 03, 2023

Hwan Heo, Taekyung Kim, Jiyoung Lee, Jaewon Lee, Soohyun Kim, Hyunwoo J. Kim, Jin-Hwa Kim

Abstract: Multi-resolution hash encoding has recently been proposed to reduce the computational cost of neural rendering methods such as NeRF. This method requires accurate camera poses for the neural rendering of given scenes. However, in contrast to previous methods that jointly optimize camera poses and 3D scenes, naive gradient-based camera pose refinement with multi-resolution hash encoding severely deteriorates performance. We propose a joint optimization algorithm to calibrate the camera pose and learn a geometric representation using efficient multi-resolution hash encoding. We show that the oscillating gradient flows of hash encoding interfere with the registration of camera poses, and our method addresses the issue by utilizing smooth interpolation weighting to stabilize the gradient oscillation for the ray samplings across hash grids. Moreover, a curriculum training procedure helps to learn the level-wise hash encoding, further improving pose refinement. Experiments on novel-view synthesis datasets validate that our learning framework achieves state-of-the-art performance and rapid convergence of neural rendering, even when initial camera poses are unknown.
