Abstract:Aggregating information from features across different layers is an essential operation for dense prediction models. Despite its limited expressiveness, feature concatenation dominates the choice of aggregation operations. In this paper, we introduce Attentive Feature Aggregation (AFA) to fuse different network layers with more expressive non-linear operations. AFA exploits both spatial and channel attention to compute a weighted average of the layer activations. Inspired by neural volume rendering, we extend AFA with Scale-Space Rendering (SSR) to perform late fusion of multi-scale predictions. AFA is applicable to a wide range of existing network designs. Our experiments show consistent and significant improvements on challenging semantic segmentation benchmarks, including Cityscapes, BDD100K, and Mapillary Vistas, at negligible computational and parameter overhead. In particular, AFA improves the performance of the Deep Layer Aggregation (DLA) model by nearly 6% mIoU on Cityscapes. Our experimental analyses show that AFA learns to progressively refine segmentation maps and to improve boundary details, leading to new state-of-the-art results on the BSDS500 and NYUDv2 boundary detection benchmarks. Code and video resources are available at http://vis.xyz/pub/dla-afa.
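The fusion idea can be illustrated with a minimal sketch. The module below is an assumed, simplified PyTorch illustration of attention-weighted averaging of two feature maps via spatial and channel attention; it is not the official AFA implementation, and all layer choices are placeholders.

```python
# Minimal sketch (assumed, not the official AFA module): fuse two equally shaped
# feature maps with learned channel and spatial attention weights.
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Channel attention from globally pooled descriptors of both inputs.
        self.channel_att = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid())
        # Spatial attention from the concatenated feature maps.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low, high: (B, C, H, W) features from different layers (already resized).
        cat = torch.cat([low, high], dim=1)
        c = self.channel_att(cat.mean(dim=(2, 3)))        # (B, C) channel weights
        s = self.spatial_att(cat)                         # (B, 1, H, W) spatial weights
        w = c[:, :, None, None] * s                       # joint attention weight in [0, 1]
        return w * low + (1 - w) * high                   # attention-weighted average

x1 = torch.randn(2, 64, 32, 32)
x2 = torch.randn(2, 64, 32, 32)
print(AttentiveFusion(64)(x1, x2).shape)                  # torch.Size([2, 64, 32, 32])
```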
Abstract:Traditional domain adaptation addresses the task of adapting a model to a novel target domain under limited or no additional supervision. While tackling the input domain gap, the standard domain adaptation settings assume no domain change in the output space. In semantic prediction tasks, different datasets are often labeled according to different semantic taxonomies. In many real-world settings, the target domain task requires a different taxonomy than the one imposed by the source domain. We therefore introduce the more general taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent taxonomies between the two domains. We further propose an approach that jointly addresses image-level and label-level domain adaptation. On the label level, we employ a bilateral mixed sampling strategy to augment the target domain, and a relabelling method to unify and align the label spaces. We address the image-level domain gap by proposing an uncertainty-rectified contrastive learning method, leading to more domain-invariant and class-discriminative features. We extensively evaluate the effectiveness of our framework under different TADA settings: open taxonomy, coarse-to-fine taxonomy, and partially-overlapping taxonomy. Our framework outperforms the previous state of the art by a large margin while being capable of adapting to new target-domain taxonomies.
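As a rough illustration of cross-domain mixed sampling, the sketch below shows a class-mix style augmentation applied in both directions between domains. The details (class selection, pseudo-labels, number of classes) are assumptions for illustration, not the paper's exact bilateral mixed sampling procedure.

```python
# Illustrative sketch (assumed details): pixels of a random subset of classes are
# pasted from one domain's image/label onto the other's, in both directions.
import torch

def class_mix(img_a, lbl_a, img_b, lbl_b, num_classes=19):
    """Paste half of the classes present in (img_a, lbl_a) onto (img_b, lbl_b)."""
    classes = torch.unique(lbl_a)
    classes = classes[classes < num_classes]               # drop ignore labels
    chosen = classes[torch.randperm(len(classes))[: max(1, len(classes) // 2)]]
    mask = torch.isin(lbl_a, chosen)                        # (H, W) paste mask
    mixed_img = torch.where(mask[None], img_a, img_b)       # (3, H, W)
    mixed_lbl = torch.where(mask, lbl_a, lbl_b)             # (H, W)
    return mixed_img, mixed_lbl

# Bilateral use: source -> target and target -> source (target labels are pseudo-labels).
src_img, src_lbl = torch.rand(3, 64, 64), torch.randint(0, 19, (64, 64))
tgt_img, tgt_plbl = torch.rand(3, 64, 64), torch.randint(0, 19, (64, 64))
s2t = class_mix(src_img, src_lbl, tgt_img, tgt_plbl)
t2s = class_mix(tgt_img, tgt_plbl, src_img, src_lbl)
```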
Abstract:End-to-end approaches to autonomous driving commonly rely on expert demonstrations. Although humans are good drivers, they are not good coaches for end-to-end algorithms that demand dense on-policy supervision. On the contrary, automated experts that leverage privileged information can efficiently generate large-scale on-policy and off-policy demonstrations. However, existing automated experts for urban driving make heavy use of hand-crafted rules and perform suboptimally even on driving simulators, where ground-truth information is available. To address these issues, we train a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions. While setting a new performance upper-bound on CARLA, our expert is also a better coach that provides informative supervision signals for imitation learning agents to learn from. Supervised by our reinforcement learning coach, a baseline end-to-end agent with monocular camera input achieves expert-level performance. Our end-to-end agent achieves a 78% success rate while generalizing to a new town and new weather on the NoCrash-dense benchmark, as well as state-of-the-art performance on the more challenging CARLA Leaderboard.
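The coach-student supervision scheme can be sketched as follows. Everything here is a dummy stand-in (random observations, untrained toy networks, no simulator); it only illustrates the idea of a privileged bird's-eye-view coach labeling the on-policy states of a camera-only student, not the actual CARLA setup or training recipe.

```python
# Conceptual sketch with dummy stubs: the coach (privileged BEV input) labels
# states, and the camera-only student is trained by imitation on those labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

coach = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16 * 3, 2), nn.Tanh())    # BEV -> (steer, throttle)
student = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 2), nn.Tanh())  # camera -> (steer, throttle)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

buffer = []
for step in range(100):                          # stand-in for an on-policy simulator rollout
    bev = torch.rand(1, 3, 16, 16)               # privileged bird's-eye-view observation
    camera = torch.rand(1, 3, 64, 64)            # the student's monocular observation
    with torch.no_grad():
        action_label = coach(bev)                # the coach labels the visited state
    buffer.append((camera, action_label))

for camera, action_label in buffer:              # imitation update on the coach's labels
    loss = F.mse_loss(student(camera), action_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```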
Abstract:We propose a deep reparametrization of the maximum a posteriori (MAP) formulation commonly employed in multi-frame image restoration tasks. Our approach is derived by introducing a learned error metric and a latent representation of the target image, which transforms the MAP objective to a deep feature space. The deep reparametrization allows us to directly model the image formation process in the latent space, and to integrate learned image priors into the prediction. Our approach thereby leverages the advantages of deep learning, while also benefiting from the principled multi-frame fusion provided by the classical MAP formulation. We validate our approach through comprehensive experiments on burst denoising and burst super-resolution datasets. Our approach sets a new state of the art for both tasks, demonstrating the generality and effectiveness of the proposed formulation.
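As a rough schematic of the reparametrization, with notation assumed purely for illustration (burst frames x_i, image formation/warping operators A_i, robust error rho, prior R, and learned encoder, error metric, and decoder networks), the objective moves from pixel space to a latent feature space:

```latex
% Schematic only; notation assumed for illustration, not the paper's exact formulation.
% Classical MAP over burst frames x_i with formation operators A_i and prior R:
\hat{y} \;=\; \arg\min_{y} \; \sum_i \rho\big(A_i\, y - x_i\big) \;+\; R(y)
% Deep reparametrization: a learned encoder E_\phi, learned error metric \rho_\theta,
% and latent z replace the pixel-space objective; a decoder D_\psi produces the image:
\hat{z} \;=\; \arg\min_{z} \; \sum_i \rho_\theta\big(A_i\, z - E_\phi(x_i)\big),
\qquad \hat{y} \;=\; D_\psi(\hat{z})
```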
Abstract:A set of novel approaches for estimating epistemic uncertainty in deep neural networks with a single forward pass has recently emerged as a valid alternative to Bayesian Neural Networks. On the premise of informative representations, these deterministic uncertainty methods (DUMs) achieve strong performance on detecting out-of-distribution (OOD) data while adding negligible computational costs at inference time. However, it remains unclear whether DUMs are well calibrated and can seamlessly scale to real-world applications, both prerequisites for their practical deployment. To this end, we first provide a taxonomy of DUMs and evaluate their calibration under continuous distributional shifts, as well as their performance on OOD detection for image classification tasks. Then, we extend the most promising approaches to semantic segmentation. We find that, while DUMs scale to realistic vision tasks and perform well on OOD detection, the practicality of current methods is undermined by poor calibration under realistic distributional shifts.
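For reference, one common way to quantify the calibration evaluated above is the expected calibration error (ECE); the choice of this particular metric and its binning scheme is an assumption made here for illustration, not a claim about the paper's protocol.

```python
# Minimal sketch of expected calibration error (ECE) with equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """confidences: (N,) max softmax scores; correct: (N,) 0/1 prediction hits."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap        # weight by the fraction of samples in the bin
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```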
Abstract:Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single-frame predictions for the segmentation mask itself. We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from the past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both the YouTube-VIS and BDD100K datasets, and shows efficacy with both one-stage and two-stage segmentation frameworks. Code will be available at http://vis.xyz/pub/pcan.
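A simplified sketch of the two-step idea (prototype distillation followed by cross-attention readout) is given below. The soft k-means distillation and the single-head attention are assumed design choices for illustration; the official PCAN modules are more elaborate.

```python
# Simplified sketch (assumed design): condense a space-time feature memory into K
# prototypes with a few soft k-means steps, then read them out via cross-attention.
import torch
import torch.nn.functional as F

def distill_prototypes(memory, k=8, iters=3):
    """memory: (N, C) features pooled over past frames -> (k, C) prototypes."""
    protos = memory[torch.randperm(memory.size(0))[:k]]        # random initialization
    for _ in range(iters):
        assign = F.softmax(memory @ protos.t(), dim=1)          # (N, k) soft assignment
        protos = (assign.t() @ memory) / (assign.sum(0, keepdim=True).t() + 1e-6)
    return protos

def cross_attention_readout(queries, protos):
    """queries: (M, C) current-frame features; returns (M, C) retrieved context."""
    attn = F.softmax(queries @ protos.t() / protos.size(1) ** 0.5, dim=1)
    return attn @ protos

memory = torch.randn(4096, 256)        # features accumulated from past frames
queries = torch.randn(1024, 256)       # features from the current frame
context = cross_attention_readout(queries, distill_prototypes(memory))
print(context.shape)                    # torch.Size([1024, 256])
```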
Abstract:Building reliable object detectors that are robust to domain shifts, such as various changes in context, viewpoint, and object appearances, is critical for real-world applications. In this work, we study the effectiveness of auxiliary self-supervised tasks to improve the out-of-distribution generalization of object detectors. Inspired by the principle of maximum entropy, we introduce a novel self-supervised task, instance-level temporal cycle confusion (CycConf), which operates on the region features of the object detectors. For each object, the task is to find the most different object proposals in the adjacent frame of a video and then cycle back to itself for self-supervision. CycConf encourages the object detector to explore invariant structures across instances under various motions, which leads to improved model robustness in unseen domains at test time. We observe consistent out-of-domain performance improvements when training object detectors in tandem with self-supervised tasks on large-scale video datasets (BDD100K and the Waymo Open Dataset). The joint training framework also establishes a new state-of-the-art on standard unsupervised domain adaptive detection benchmarks (Cityscapes, Foggy Cityscapes, and Sim10K). The project page is available at https://xinw.ai/cyc-conf.
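The cycle described above can be sketched in a few lines: hop from each region feature to its least similar proposal in the next frame, hop back by highest similarity, and supervise the cycle to return to the starting instance. The loss form below is an assumed, simplified rendering of that idea, not the paper's exact objective.

```python
# Illustrative sketch (assumed formulation) of instance-level temporal cycle confusion.
import torch
import torch.nn.functional as F

def cycle_confusion_loss(feat_t, feat_t1):
    """feat_t: (N, C) region features at frame t; feat_t1: (M, C) at frame t+1."""
    a = F.normalize(feat_t, dim=1)
    b = F.normalize(feat_t1, dim=1)
    sim = a @ b.t()                                    # (N, M) cosine similarities
    hop = sim.argmin(dim=1)                            # forward hop: most different proposal
    back_logits = b[hop] @ a.t()                       # backward hop similarities to frame t
    target = torch.arange(feat_t.size(0))              # the cycle should return to itself
    return F.cross_entropy(back_logits, target)

loss = cycle_confusion_loss(torch.randn(16, 128), torch.randn(20, 128))
print(loss)
```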
Abstract:The key challenge in learning dense correspondences lies in the lack of ground-truth matches for real image pairs. While photometric consistency losses provide unsupervised alternatives, they struggle with large appearance changes, which are ubiquitous in geometric and semantic matching tasks. Moreover, methods relying on synthetic training pairs often suffer from poor generalisation to real data. We propose Warp Consistency, an unsupervised learning objective for dense correspondence regression. Our objective is effective even in settings with large appearance and viewpoint changes. Given a pair of real images, we first construct an image triplet by applying a randomly sampled warp to one of the original images. We derive and analyze all flow-consistency constraints arising within the triplet. From our observations and empirical results, we design a general unsupervised objective employing two of the derived constraints. We validate our warp consistency loss by training three recent dense correspondence networks for the geometric and semantic matching tasks. Our approach sets a new state-of-the-art on several challenging benchmarks, including MegaDepth, RobotCar and TSS. Code and models will be released at https://github.com/PruneTruong/DenseMatching.
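The triplet construction and two of the resulting constraints can be sketched as below, under assumed conventions (flows are (B, 2, H, W) pixel displacements; net(a, b) predicts the flow from image a to image b). This is only a schematic of a bipath-composition constraint plus direct supervision from the known warp, not the paper's exact losses.

```python
# Sketch under assumed conventions; not the paper's exact objective.
import torch
import torch.nn.functional as F

def warp_with_flow(src, flow):
    """Backward-warp `src` (B, C, H, W) by `flow` (B, 2, H, W)."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float()[None]               # (1, 2, H, W)
    coords = base + flow
    gx = 2 * coords[:, 0] / (w - 1) - 1                              # normalize to [-1, 1]
    gy = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                             # (B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)

def compose(flow_ab, flow_bc):
    """Chain A->B and B->C into a flow from A to C."""
    return flow_ab + warp_with_flow(flow_bc, flow_ab)

def warp_consistency_losses(net, img_i, img_j, w_known):
    img_i_prime = warp_with_flow(img_i, w_known)      # synthetic image from the known warp
    bipath = compose(net(img_i, img_j), net(img_j, img_i_prime))     # I -> J -> I'
    loss_bipath = (bipath - w_known).abs().mean()                    # composition matches the warp
    loss_warp_sup = (net(img_i, img_i_prime) - w_known).abs().mean() # direct warp supervision
    return loss_bipath + loss_warp_sup

dummy_net = lambda a, b: torch.zeros(a.size(0), 2, a.size(2), a.size(3))
loss = warp_consistency_losses(dummy_net, torch.rand(1, 3, 32, 32),
                               torch.rand(1, 3, 32, 32), torch.zeros(1, 2, 32, 32))
```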
Abstract:A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving. We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform. The object association leverages quasi-dense similarity learning to identify objects in various poses and viewpoints with appearance cues only. After initial 2D association, we further utilize 3D bounding box depth-ordering heuristics for robust instance association and motion-based 3D trajectory prediction for re-identification of occluded vehicles. Finally, an LSTM-based object velocity learning module aggregates the long-term trajectory information for more accurate motion extrapolation. Experiments on our proposed simulation data and real-world benchmarks, including the KITTI, nuScenes, and Waymo datasets, show that our tracking framework offers robust object association and tracking in urban driving scenarios. On the Waymo Open benchmark, we establish the first camera-only baseline in the 3D tracking and 3D detection challenges. Our quasi-dense 3D tracking pipeline achieves impressive improvements on the nuScenes 3D tracking benchmark, with nearly five times the tracking accuracy of the best vision-only submission among all published methods. Our code, data, and trained models are available at https://github.com/SysCV/qd-3dt.
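One simple way to realize the appearance-only association step is bi-directional softmax matching over embedding similarities; the sketch below assumes that matching rule and a mutual-best-match check purely for illustration, and omits the depth-ordering and motion cues described above.

```python
# Simplified sketch (assumed matching rule): associate detections across frames by
# bi-directional softmax over appearance-embedding similarities with mutual best match.
import torch

def associate(emb_prev, emb_curr, threshold=0.5):
    """emb_prev: (N, C) track embeddings; emb_curr: (M, C) detection embeddings."""
    sim = emb_prev @ emb_curr.t()                                    # (N, M) dot-product similarities
    score = 0.5 * (sim.softmax(dim=1) + sim.softmax(dim=0))          # bi-directional softmax
    matches = []
    for i in range(score.size(0)):
        j = int(score[i].argmax())
        if score[i, j] > threshold and int(score[:, j].argmax()) == i:   # mutual best match
            matches.append((i, j))
    return matches

prev = torch.randn(5, 128)
curr = prev + 0.1 * torch.randn(5, 128)        # slightly perturbed re-detections
print(associate(prev, curr))                    # typically [(0, 0), (1, 1), ..., (4, 4)]
```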
Abstract:Current semantic segmentation methods focus only on mining "local" context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore the "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. This yields a pixel-wise metric learning paradigm for semantic segmentation that explicitly explores the structure of labeled pixels, which has rarely been exploited before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with well-known segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HRNet), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.
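The core idea maps naturally to a supervised InfoNCE-style loss over sampled pixel embeddings, as in the minimal sketch below. The sampling, temperature, and loss form are assumptions for illustration; the paper's pixel sampling and memory design are more elaborate.

```python
# Minimal sketch (assumed loss form): supervised pixel-wise contrastive loss where
# pixels of the same class are positives and pixels of other classes are negatives.
import torch
import torch.nn.functional as F

def pixel_contrast_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, C) sampled pixel embeddings; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                           # (N, N) scaled similarities
    same = labels[:, None].eq(labels[None, :]).float()
    eye = torch.eye(len(labels))
    pos_mask = same - same * eye                            # positives, excluding self
    logits = sim - 1e9 * eye                                # mask out self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(1) / pos_count        # mean log-prob over positives
    return loss[pos_mask.sum(1) > 0].mean()                 # skip anchors without positives

emb = torch.randn(256, 64)                                   # pixel embeddings sampled from a batch
lbl = torch.randint(0, 19, (256,))
print(pixel_contrast_loss(emb, lbl))
```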