Yong Jae Lee

Contrastive Learning for Diverse Disentangled Foreground Generation

Nov 04, 2022

Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh

Abstract:We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input. Specifically, we define two sets of latent codes, where one controls a pre-defined factor ("known"), and the other controls the remaining factors ("unknown"). The sampled latent codes from the two sets jointly bi-modulate the convolution kernels to guide the generator to synthesize diverse results. Experiments demonstrate the superiority of our method over state-of-the-art methods in result diversity and generation controllability.

* ECCV 2022

EnergyMatch: Energy-based Pseudo-Labeling for Semi-Supervised Learning

Jun 13, 2022

Zhuoran Yu, Yin Li, Yong Jae Lee

Abstract:Recent state-of-the-art methods in semi-supervised learning (SSL) combine consistency regularization with confidence-based pseudo-labeling. To obtain high-quality pseudo-labels, a high confidence threshold is typically adopted. However, it has been shown that softmax-based confidence scores in deep networks can be arbitrarily high for samples far from the training data, and thus, the pseudo-labels for even high-confidence unlabeled samples may still be unreliable. In this work, we present a new perspective on pseudo-labeling: instead of relying on model confidence, we measure whether an unlabeled sample is likely to be "in-distribution", i.e., close to the current training data. To classify whether an unlabeled sample is "in-distribution" or "out-of-distribution", we adopt the energy score from the out-of-distribution detection literature. As training progresses and more unlabeled samples become in-distribution and contribute to training, the combined labeled and pseudo-labeled data can better approximate the true distribution to improve the model. Experiments demonstrate that our energy-based pseudo-labeling method, albeit conceptually simple, significantly outperforms confidence-based methods on imbalanced SSL benchmarks, and achieves competitive performance on class-balanced data. For example, it produces a 4-6% absolute accuracy improvement on CIFAR10-LT when the imbalance ratio is higher than 50. When combined with state-of-the-art long-tailed SSL methods, further improvements are attained.
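
The energy-based selection described in this abstract can be illustrated with a minimal sketch. It assumes the energy score commonly used in the out-of-distribution detection literature, E(x) = -T * logsumexp(f(x) / T) over the classifier logits f(x). The function names, the threshold of -5.0, and the rule "accept the argmax class when the energy is below the threshold" are illustrative assumptions, not the authors' implementation.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy of a logit vector: -T * logsumexp(logits / T).

    Lower (more negative) energy is usually read as "more in-distribution".
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def pseudo_label(logits, energy_threshold=-5.0):
    """Return the argmax class if the sample looks in-distribution, else None.

    The threshold value is a hypothetical placeholder, not a tuned setting.
    """
    if energy_score(logits) < energy_threshold:
        return max(range(len(logits)), key=lambda k: logits[k])
    return None


if __name__ == "__main__":
    print(pseudo_label([10.0, 0.0, 0.0]))  # low energy: a label is assigned
    print(pseudo_label([0.1, 0.0, 0.2]))   # high energy: sample is skipped
```

Unlike a softmax confidence threshold, this rule depends on the overall magnitude of the logits rather than only on their relative values, which is the property the abstract relies on.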

What Knowledge Gets Distilled in Knowledge Distillation?

May 31, 2022

Utkarsh Ojha, Yuheng Li, Yong Jae Lee

Abstract:Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions and more. Our results, using image classification as a case study and three state-of-the-art knowledge distillation techniques, show that knowledge distillation methods can indeed indirectly distill other kinds of properties beyond improving task performance. By exploring these questions, we hope for our work to provide a clearer picture of what happens during knowledge distillation.

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Apr 20, 2022

Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Yong Jae Lee, Houdong Hu, Zicheng Liu(+1 more)

Abstract:Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets/tasks. However, it remains a challenge to evaluate the transferability of these foundation models due to the lack of easy-to-use toolkits for fair benchmarking. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark to compare and evaluate pre-trained language-augmented visual models. Several highlights include: (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to ensure fairness in model adaptation. To leverage the full power of language-augmented visual models, novel language-aware initialization methods are proposed to significantly improve the adaptation performance. (iii) Metrics. A variety of evaluation metrics are used, including sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). We will release our toolkit and evaluation platforms for the research community.

* Preprint. The first two authors contribute equally

The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization

Apr 09, 2022

Zeyi Huang, Haohan Wang, Dong Huang, Yong Jae Lee, Eric P. Xing

Abstract:Training with an emphasis on "hard-to-learn" components of the data has proven to be an effective method to improve the generalization of machine learning models, especially in settings where robustness (e.g., generalization across distributions) is valued. Existing literature discussing this "hard-to-learn" concept is mainly expanded either along the dimension of the samples or the dimension of the features. In this paper, we aim to introduce a simple view merging these two dimensions, leading to a new, simple yet effective, heuristic to train machine learning models by emphasizing the worst cases on both the sample and the feature dimensions. We name our method W2D following the concept of "Worst-case along Two Dimensions". We validate the idea and demonstrate its empirical strength over standard benchmarks.

* To appear at CVPR 2022

End-to-End Instance Edge Detection

Apr 06, 2022

Xueyan Zou, Haotian Liu, Yong Jae Lee

Abstract:Edge detection has long been an important problem in the field of computer vision. Previous works have explored category-agnostic or category-aware edge detection. In this paper, we explore edge detection in the context of object instances. Although object boundaries could be easily derived from segmentation masks, in practice, instance segmentation models are trained to maximize IoU to the ground-truth mask, which means that segmentation boundaries are not enforced to precisely align with ground-truth edge boundaries. Thus, the task of instance edge detection itself is different and critical. Since precise edge detection requires high-resolution feature maps, we design a novel transformer architecture that efficiently combines an FPN and a transformer decoder to enable cross attention on multi-scale high-resolution feature maps within a reasonable computation budget. Further, we propose a lightweight dense prediction head that is applicable to both instance edge and mask detection. Finally, we use a penalty-reduced focal loss to effectively train the model with point supervision on instance edges, which can reduce annotation costs. We demonstrate highly competitive instance edge detection performance compared to state-of-the-art baselines, and also show that the proposed task and loss are complementary to instance segmentation and object detection.

GIRAFFE HD: A High-Resolution 3D-aware Generative Model

Mar 28, 2022

Yang Xue, Yuheng Li, Krishna Kumar Singh, Yong Jae Lee

Abstract:3D-aware generative models have shown that the introduction of 3D information can lead to more controllable image generation. In particular, the current state-of-the-art model GIRAFFE can control each object's rotation, translation, scale, and scene camera pose without corresponding supervision. However, GIRAFFE only operates well when the image resolution is low. We propose GIRAFFE HD, a high-resolution 3D-aware generative model that inherits all of GIRAFFE's controllable features while generating high-quality, high-resolution images ($512^2$ resolution and above). The key idea is to leverage a style-based neural renderer, and to independently generate the foreground and background to force their disentanglement while imposing consistency constraints to stitch them together to composite a coherent final image. We demonstrate state-of-the-art 3D controllable high-resolution image generation on multiple natural image datasets.

* CVPR 2022

Masked Discrimination for Self-Supervised Learning on Point Clouds

Mar 21, 2022

Haotian Liu, Mu Cai, Yong Jae Lee

Abstract:Masked autoencoding has achieved great success for self-supervised learning in the image and language domains. However, mask-based pretraining has yet to show benefits for point cloud understanding, likely due to standard backbones like PointNet being unable to properly handle the training versus testing distribution mismatch introduced by masking during training. In this paper, we bridge this gap by proposing a discriminative mask pretraining Transformer framework, MaskPoint, for point clouds. Our key idea is to represent the point cloud as discrete occupancy values (1 if part of the point cloud; 0 if not), and perform simple binary classification between masked object points and sampled noise points as the proxy task. In this way, our approach is robust to the point sampling variance in point clouds, and facilitates learning rich representations. We evaluate our pretrained models across several downstream tasks, including 3D shape classification, segmentation, and real-world object detection, and demonstrate state-of-the-art results while achieving a significant pretraining speedup (e.g., 4.1x on ScanNet) compared to the prior state-of-the-art Transformer baseline. Code will be publicly available at https://github.com/haotian-liu/MaskPoint.
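
The occupancy-style proxy task described in this abstract can be sketched as follows. This is only an illustration of how the binary targets might be built: masked real points get label 1 and sampled noise points get label 0. The function name, the mask ratio, and the choice to draw noise uniformly from the cloud's bounding box are assumptions made for the example; the released MaskPoint code may differ in all of them.

```python
import random


def build_discrimination_queries(points, mask_ratio=0.5, seed=0):
    """Split a point cloud into visible points and labeled query points.

    points: list of (x, y, z) tuples.
    Returns (visible, queries), where queries is a list of (point, label)
    with label 1 for masked real points and 0 for sampled noise points.
    """
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)

    n_masked = int(len(shuffled) * mask_ratio)
    masked, visible = shuffled[:n_masked], shuffled[n_masked:]

    # Assumption: noise is drawn uniformly inside the axis-aligned bounding box.
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    noise = [
        tuple(rng.uniform(lo[d], hi[d]) for d in range(3))
        for _ in range(n_masked)
    ]

    queries = [(p, 1) for p in masked] + [(q, 0) for q in noise]
    return visible, queries


if __name__ == "__main__":
    cloud = [(float(i), float(i % 3), float(i % 2)) for i in range(10)]
    visible, queries = build_discrimination_queries(cloud)
    print(len(visible), len(queries))
```

A model would then encode the visible points and be trained with a binary classification loss to predict each query's label.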

Collaging Class-specific GANs for Semantic Image Synthesis

Oct 08, 2021

Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh

Abstract:We propose a new approach for high resolution semantic image synthesis. It consists of one base image generator and multiple class-specific generators. The base generator generates high quality images based on a segmentation map. To further improve the quality of different objects, we create a bank of Generative Adversarial Networks (GANs) by separately training class-specific models. This has several benefits including -- dedicated weights for each class; centrally aligned data for each model; additional training data from other sources, potential of higher resolution and quality; and easy manipulation of a specific object in the scene. Experiments show that our approach can generate high quality images in high resolution while having flexibility of object-level control by using class-specific generators.

* ICCV 2021

Equine Pain Behavior Classification via Self-Supervised Disentangled Pose Representation

Aug 30, 2021

Maheen Rashid, Sofia Broomé, Katrina Ask, Elin Hernlund, Pia Haubro Andersen, Hedvig Kjellström, Yong Jae Lee

Abstract:Timely detection of horse pain is important for equine welfare. Horses express pain through their facial and body behavior, but may hide signs of pain from unfamiliar human observers. In addition, collecting visual data with detailed annotation of horse behavior and pain state is both cumbersome and not scalable. Consequently, a pragmatic equine pain classification system would use video of the unobserved horse and weak labels. This paper proposes such a method for equine pain classification by using multi-view surveillance video footage of unobserved horses with induced orthopaedic pain, with temporally sparse video level pain labels. To ensure that pain is learned from horse body language alone, we first train a self-supervised generative model to disentangle horse pose from its appearance and background before using the disentangled horse pose latent representation for pain classification. To make best use of the pain labels, we develop a novel loss that formulates pain classification as a multi-instance learning problem. Our method achieves pain classification accuracy better than human expert performance with 60% accuracy. The learned latent horse pose representation is shown to be viewpoint covariant, and disentangled from horse appearance. Qualitative analysis of pain classified segments shows correspondence between the pain symptoms identified by our model, and equine pain scales used in veterinary practice.
