Yong Jae Lee

Few-shot Image Generation via Cross-domain Correspondence

Apr 13, 2021

Utkarsh Ojha, Yijun Li, Jingwan Lu, Alexei A. Efros, Yong Jae Lee, Eli Shechtman, Richard Zhang

Abstract: Training generative models, such as GANs, on a target domain containing limited examples (e.g., 10) can easily result in overfitting. In this work, we seek to utilize a large source domain for pretraining and transfer the diversity information from source to target. We propose to preserve the relative similarities and differences between instances in the source via a novel cross-domain distance consistency loss. To further reduce overfitting, we present an anchor-based strategy to encourage different levels of realism over different regions in the latent space. With extensive results in both photorealistic and non-photorealistic domains, we demonstrate qualitatively and quantitatively that our few-shot model automatically discovers correspondences between source and target domains and generates more diverse and realistic images than previous methods.

* CVPR 2021
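
The abstract above describes preserving relative similarities between instances across domains. The following is a minimal sketch of one way such a distance consistency term could be computed, not the authors' implementation. It assumes cosine similarity between feature vectors, a softmax to turn similarities into a distribution, and a KL divergence to compare the two domains; the function names and the direction of the KL term are hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity with a small epsilon to avoid division by zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(feats, i):
    # Distribution over "how similar is instance i to every other instance".
    return softmax([cosine(feats[i], f) for j, f in enumerate(feats) if j != i])

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distance_consistency_loss(source_feats, target_feats):
    # source_feats[i] and target_feats[i] are assumed to come from the same
    # latent code, passed through the source and the adapted generator.
    n = len(source_feats)
    total = 0.0
    for i in range(n):
        p_src = similarity_distribution(source_feats, i)
        p_tgt = similarity_distribution(target_feats, i)
        total += kl_divergence(p_tgt, p_src)
    return total / n
```

Under these assumptions the loss is zero when the target features keep the same pairwise similarity structure as the source, and positive when the target collapses every instance onto the same output.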

Progressive Temporal Feature Alignment Network for Video Inpainting

Apr 08, 2021

Xueyan Zou, Linjie Yang, Ding Liu, Yong Jae Lee

Abstract: Video inpainting aims to fill spatio-temporal "corrupted" regions with plausible content. To achieve this goal, it is necessary to find correspondences from neighbouring frames to faithfully hallucinate the unknown content. Current methods achieve this goal through attention, flow-based warping, or 3D temporal convolution. However, flow-based warping can create artifacts when optical flow is not accurate, while temporal convolution may suffer from spatial misalignment. We propose 'Progressive Temporal Feature Alignment Network', which progressively enriches features extracted from the current frame with the feature warped from neighbouring frames using optical flow. Our approach corrects the spatial misalignment in the temporal feature propagation stage, greatly improving visual quality and temporal consistency of the inpainted videos. Using the proposed architecture, we achieve state-of-the-art performance on the DAVIS and FVI datasets compared to existing deep learning approaches. Code is available at https://github.com/MaureenZOU/TSAM.

* Accepted to CVPR 2021

Generating Furry Cars: Disentangling Object Shape & Appearance across Multiple Domains

Apr 05, 2021

Utkarsh Ojha, Krishna Kumar Singh, Yong Jae Lee

Abstract: We consider the novel task of learning disentangled representations of object shape and appearance across multiple domains (e.g., dogs and cars). The goal is to learn a generative model that learns an intermediate distribution, which borrows a subset of properties from each domain, enabling the generation of images that did not exist in any domain exclusively. This challenging problem requires an accurate disentanglement of object shape, appearance, and background from each domain, so that the appearance and shape factors from the two domains can be interchanged. We augment an existing approach that can disentangle factors within a single domain but struggles to do so across domains. Our key technical contribution is to represent object appearance with a differentiable histogram of visual features, and to optimize the generator so that two images with the same latent appearance factor but different latent shape factors produce similar histograms. On multiple multi-domain datasets, we demonstrate our method leads to accurate and consistent appearance and shape transfer across domains.

* Camera ready version for ICLR 2021
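
The abstract above relies on a differentiable histogram of visual features. As a rough illustration of the general idea, and not the paper's formulation, the sketch below uses triangular (linear interpolation) soft binning over scalar values, so each value contributes fractionally to its neighbouring bins instead of falling into exactly one. The function name, the scalar inputs, and the choice of kernel are all assumptions.

```python
def soft_histogram(values, centers, width):
    # Each value adds a weight to every bin that decays linearly with its
    # distance from the bin center; the weight is piecewise linear in the
    # value, so gradients can flow through it almost everywhere.
    hist = [0.0] * len(centers)
    for v in values:
        for b, c in enumerate(centers):
            hist[b] += max(0.0, 1.0 - abs(v - c) / width)
    total = sum(hist)
    # Normalize so histograms of differently sized inputs are comparable.
    return [h / total for h in hist] if total > 0 else hist
```

For example, with bin centers `[0.0, 0.5, 1.0]` and `width=0.5`, a single value of `0.25` is split evenly between the first two bins. Two images with the same appearance but different shapes would then be encouraged to produce similar histograms of this kind.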

YolactEdge: Real-time Instance Segmentation on the Edge (Jetson AGX Xavier: 30 FPS, RTX 2080 Ti: 170 FPS)

Dec 22, 2020

Haotian Liu, Rafael A. Rivera Soto, Fanyi Xiao, Yong Jae Lee

Abstract: We propose YolactEdge, the first competitive instance segmentation approach that runs on small edge devices at real-time speeds. Specifically, YolactEdge runs at up to 30.8 FPS on a Jetson AGX Xavier (and 172.7 FPS on an RTX 2080 Ti) with a ResNet-101 backbone on 550x550 resolution images. To achieve this, we make two improvements to the state-of-the-art image-based real-time method YOLACT: (1) TensorRT optimization while carefully trading off speed and accuracy, and (2) a novel feature warping module to exploit temporal redundancy in videos. Experiments on the YouTube VIS and MS COCO datasets demonstrate that YolactEdge produces a 3-5x speedup over existing real-time methods while producing competitive mask and box detection accuracy. We also conduct ablation studies to dissect our design choices and modules. Code and models are available at https://github.com/haotian-liu/yolact_edge.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Delving Deeper into Anti-aliasing in ConvNets

Aug 21, 2020

Xueyan Zou, Fanyi Xiao, Zhiding Yu, Yong Jae Lee

Abstract: Aliasing refers to the phenomenon that high frequency signals degenerate into completely different ones after sampling. It arises as a problem in the context of deep learning as downsampling layers are widely adopted in deep architectures to reduce parameters and computation. The standard solution is to apply a low-pass filter (e.g., Gaussian blur) before downsampling. However, it can be suboptimal to apply the same filter across the entire content, as the frequency of feature maps can vary across both spatial locations and feature channels. To tackle this, we propose an adaptive content-aware low-pass filtering layer, which predicts separate filter weights for each spatial location and channel group of the input feature maps. We investigate the effectiveness and generalization of the proposed method across multiple tasks including ImageNet classification, COCO instance segmentation, and Cityscapes semantic segmentation. Qualitative and quantitative results demonstrate that our approach effectively adapts to the different feature frequencies to avoid aliasing while preserving useful information for recognition. Code is available at https://maureenzou.github.io/ddac/.

* Accepted to BMVC 2020. Code: https://maureenzou.github.io/ddac/
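
To make the aliasing argument in the abstract above concrete, here is a small 1D sketch, not taken from the paper's code. It contrasts a fixed low-pass filter applied before stride-2 downsampling with a variant that takes a separate filter per output position. In the paper those per-position weights are predicted by a network; here they are simply passed in, and the function names, the `[0.25, 0.5, 0.25]` kernel, and the replicate padding are assumptions.

```python
def blur_downsample(signal, kernel=(0.25, 0.5, 0.25), stride=2):
    # Fixed low-pass filter followed by subsampling (replicate padding).
    n, half = len(signal), len(kernel) // 2
    out = []
    for center in range(0, n, stride):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(center + k - half, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def adaptive_blur_downsample(signal, kernels, stride=2):
    # Same as above, but kernels[i] is the filter used for the i-th output,
    # standing in for weights that a content-aware layer would predict.
    n = len(signal)
    out = []
    for i, center in enumerate(range(0, n, stride)):
        kernel = kernels[i]
        half = len(kernel) // 2
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(center + k - half, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out
```

On the alternating signal `[0, 1, 0, 1, 0, 1, 0, 1]`, plain stride-2 subsampling keeps only the zeros, while `blur_downsample` returns values near the true local mean of 0.5. Passing an identity kernel `(0, 1, 0)` at every position to `adaptive_blur_downsample` reproduces the plain subsampling, which shows how the per-position weights control how much smoothing is applied.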

Instance-aware, Context-focused, and Memory-efficient Weakly Supervised Object Detection

Apr 09, 2020

Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Yong Jae Lee, Alexander G. Schwing, Jan Kautz

Abstract: Weakly supervised learning has emerged as a compelling tool for object detection by reducing the need for strong supervision during training. However, major challenges remain: (1) differentiation of object instances can be ambiguous; (2) detectors tend to focus on discriminative parts rather than entire objects; (3) without ground truth, object proposals have to be redundant for high recall, causing significant memory consumption. Addressing these challenges is difficult, as it often requires eliminating uncertainties and trivial solutions. To target these issues we develop an instance-aware and context-focused unified framework. It employs an instance-aware self-training algorithm and a learnable Concrete DropBlock while devising a memory-efficient sequential batch back-propagation. Our proposed method achieves state-of-the-art results on COCO (12.1% AP, 24.8% AP50), VOC 2007 (54.9% AP), and VOC 2012 (52.1% AP), improving baselines by great margins. In addition, the proposed method is the first to benchmark ResNet-based models and weakly supervised video object detection. Refer to our project page for code, models, and more details: https://github.com/NVlabs/wetectron.

* Accepted to CVPR 2020

Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks

Feb 04, 2020

Maheen Rashid, Hedvig Kjellström, Yong Jae Lee

Abstract: We present a method for weakly-supervised action localization based on graph convolutions. In order to find and classify video time segments that correspond to relevant action classes, a system must be able to both identify discriminative time segments in each video, and identify the full extent of each action. Achieving this with weak video level labels requires the system to use similarity and dissimilarity between moments across videos in the training data to understand both how an action appears and the sub-actions that comprise the action's full extent. However, current methods do not make explicit use of similarity between video moments to inform the localization and classification predictions. We present a novel method that uses graph convolutions to explicitly model similarity between video moments. Our method utilizes similarity graphs that encode appearance and motion, and pushes the state of the art on THUMOS '14, ActivityNet 1.2, and Charades for weakly supervised action localization.

* Accepted at WACV 2020

Audiovisual SlowFast Networks for Video Recognition

Jan 23, 2020

Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer

Abstract: We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast extends SlowFast Networks with a Faster Audio pathway that is deeply integrated with its visual counterparts. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we employ DropPathway, which randomly drops the Audio pathway during training as a simple and effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization and show that it leads to better audiovisual features. We report state-of-the-art results on four video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to self-supervised tasks, where it improves over prior work. Code will be made available at: https://github.com/facebookresearch/SlowFast.

* Technical report
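
The DropPathway idea described in the abstract above can be illustrated with a short sketch. This is a simplified stand-in, not the released SlowFast code: it assumes the audio and visual features are fused by element-wise addition and that dropping the Audio pathway means skipping that addition for a training step. The function name and the `drop_prob` parameter are hypothetical.

```python
import random

def fuse_with_drop_pathway(visual, audio, drop_prob, training, rng=random):
    # During training, skip the audio contribution with probability
    # drop_prob so the visual pathway cannot over-rely on audio.
    if training and rng.random() < drop_prob:
        return list(visual)
    # Otherwise fuse by element-wise sum (an assumed fusion operator).
    return [v + a for v, a in zip(visual, audio)]
```

At inference time (`training=False`) the audio features are always fused, so the regularization only affects the training dynamics.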

Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias

Jan 09, 2020

Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, Deepti Ghadiyaram

Abstract: Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy. However, strongly relying on context risks a model's generalizability, especially when typical co-occurrence patterns are absent. This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations. Our goal is to accurately recognize a category in the absence of its context, without compromising on performance when it co-occurs with context. Our key idea is to decorrelate feature representations of a category from its co-occurring context. We achieve this by learning a feature subspace that explicitly represents categories occurring in the absence of context alongside a joint feature subspace that represents both categories and context. Our very simple yet effective method is extensible to two multi-label tasks -- object and attribute classification. On 4 challenging datasets, we demonstrate the effectiveness of our method in reducing contextual bias.

YOLACT++: Better Real-time Instance Segmentation

Dec 03, 2019

Daniel Bolya, Chong Zhou, Fanyi Xiao, Yong Jae Lee

Abstract: We present a simple, fully-convolutional model for real-time (>30 fps) instance segmentation that achieves competitive results on MS COCO evaluated on a single Titan Xp, which is significantly faster than any previous state-of-the-art approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation-variant manner, despite being fully-convolutional. We also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty. Finally, by incorporating deformable convolutions into the backbone network, optimizing the prediction head with better anchor scales and aspect ratios, and adding a novel fast mask re-scoring branch, our YOLACT++ model can achieve 34.1 mAP on MS COCO at 33.5 fps, which is fairly close to the state-of-the-art approaches while still running at real-time.

* Journal extension of our previous conference paper arXiv:1904.02689
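
The mask assembly step described in the abstract above, linearly combining prototypes with per-instance coefficients, can be sketched in a few lines. This is an illustrative toy on nested lists rather than the YOLACT implementation; the sigmoid and the 0.5 threshold used to binarize the result are assumptions not stated in the abstract, and the function name is hypothetical.

```python
import math

def assemble_mask(prototypes, coefficients, threshold=0.5):
    # prototypes: k masks, each a height x width grid of floats.
    # coefficients: k per-instance weights, one for each prototype.
    height, width = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Linear combination of the prototypes at this pixel.
            logit = sum(c * p[y][x] for c, p in zip(coefficients, prototypes))
            prob = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask
```

With two prototypes that activate on opposite diagonals of a 2x2 grid, coefficients of `[2.0, -2.0]` select the first diagonal and suppress the second, which is the sense in which a coefficient vector picks out one instance from shared prototypes.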
