Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Fisher Yu

and Other Contributors

How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

May 26, 2023

Junting Chen, Guohao Li, Suryansh Kumar, Bernard Ghanem, Fisher Yu

Figure 1 for How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

Figure 2 for How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

Figure 3 for How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

Figure 4 for How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

Abstract:Object goal navigation is an important problem in Embodied AI that involves guiding the agent to navigate to an instance of the object category in an unknown environment -- typically an indoor scene. Unfortunately, current state-of-the-art methods for this problem rely heavily on data-driven approaches, \eg, end-to-end reinforcement learning, imitation learning, and others. Moreover, such methods are typically costly to train and difficult to debug, leading to a lack of transferability and explainability. Inspired by recent successes in combining classical and learning methods, we present a modular and training-free solution, which embraces more classic approaches, to tackle the object goal navigation problem. Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework. We then inject semantics into geometric-based frontier exploration to reason about promising areas to search for a goal object. Our structured scene representation comprises a 2D occupancy map, semantic point cloud, and spatial scene graph. Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers. With injected semantic priors, the agent can reason about the most promising frontier to explore. The proposed pipeline shows strong experimental performance for object goal navigation on the Gibson benchmark dataset, outperforming the previous state-of-the-art. We also perform comprehensive ablation studies to identify the current bottleneck in the object navigation task.

* Accepted by/To be published in Robotics: Science and Systems (RSS) 2023; 11 pages, 5 figures

Via

Access Paper or Ask Questions

Maskomaly:Zero-Shot Mask Anomaly Segmentation

May 26, 2023

Jan Ackermann, Christos Sakaridis, Fisher Yu

Figure 1 for Maskomaly:Zero-Shot Mask Anomaly Segmentation

Figure 2 for Maskomaly:Zero-Shot Mask Anomaly Segmentation

Figure 3 for Maskomaly:Zero-Shot Mask Anomaly Segmentation

Figure 4 for Maskomaly:Zero-Shot Mask Anomaly Segmentation

Abstract:We present a simple and practical framework for anomaly segmentation called Maskomaly. It builds upon mask-based standard semantic segmentation networks by adding a simple inference-time post-processing step which leverages the raw mask outputs of such networks. Maskomaly does not require additional training and only adds a small computational overhead to inference. Most importantly, it does not require anomalous data at training. We show top results for our method on SMIYC, RoadAnomaly, and StreetHazards. On the most central benchmark, SMIYC, Maskomaly outperforms all directly comparable approaches. Further, we introduce a novel metric that benefits the development of robust anomaly segmentation methods and demonstrate its informativeness on RoadAnomaly.

Via

Access Paper or Ask Questions

OVTrack: Open-Vocabulary Multiple Object Tracking

Apr 17, 2023

Siyuan Li, Tobias Fischer, Lei Ke, Henghui Ding, Martin Danelljan, Fisher Yu

Abstract:The ability to recognize, localize and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet, traditional multiple object tracking (MOT) benchmarks rely only on a few object categories that hardly represent the multitude of possible objects that are encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, that aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients: First, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy for robust appearance feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images. Project page: https://www.vis.xyz/pub/ovtrack/

* CVPR 2023

Via

Access Paper or Ask Questions

iDisc: Internal Discretization for Monocular Depth Estimation

Apr 13, 2023

Luigi Piccinelli, Christos Sakaridis, Fisher Yu

Figure 1 for iDisc: Internal Discretization for Monocular Depth Estimation

Figure 2 for iDisc: Internal Discretization for Monocular Depth Estimation

Figure 3 for iDisc: Internal Discretization for Monocular Depth Estimation

Figure 4 for iDisc: Internal Discretization for Monocular Depth Estimation

Abstract:Monocular depth estimation is fundamental for 3D scene understanding and downstream applications. However, even under the supervised setup, it is still challenging and ill-posed due to the lack of full geometric constraints. Although a scene can consist of millions of pixels, there are fewer high-level patterns. We propose iDisc to learn those patterns with internal discretized representations. The method implicitly partitions the scene into a set of high-level patterns. In particular, our new module, Internal Discretization (ID), implements a continuous-discrete-continuous bottleneck to learn those concepts without supervision. In contrast to state-of-the-art methods, the proposed model does not enforce any explicit constraints or priors on the depth output. The whole network with the ID module can be trained end-to-end, thanks to the bottleneck module based on attention. Our method sets the new state of the art with significant improvements on NYU-Depth v2 and KITTI, outperforming all published methods on the official KITTI benchmark. iDisc can also achieve state-of-the-art results on surface normal estimation. Further, we explore the model generalization capability via zero-shot testing. We observe the compelling need to promote diversification in the outdoor scenario. Hence, we introduce splits of two autonomous driving datasets, DDAD and Argoverse. Code is available at http://vis.xyz/pub/idisc .

* Accepted at CVPR 2023

Via

Access Paper or Ask Questions

Mask-Free Video Instance Segmentation

Mar 28, 2023

Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu

Abstract:The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at https://github.com/SysCV/MaskFreeVis.

* Accepted in CVPR 2023; Code: https://github.com/SysCV/MaskFreeVis; Project page: http://vis.xyz/pub/maskfreevis

Via

Access Paper or Ask Questions

A Multiplicative Value Function for Safe and Efficient Reinforcement Learning

Mar 07, 2023

Nick Bührer, Zhejun Zhang, Alexander Liniger, Fisher Yu, Luc Van Gool

Abstract:An emerging field of sequential decision problems is safe Reinforcement Learning (RL), where the objective is to maximize the reward while obeying safety constraints. Being able to handle constraints is essential for deploying RL agents in real-world environments, where constraint violations can harm the agent and the environment. To this end, we propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic. The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns. By splitting responsibilities, we facilitate the learning task leading to increased sample efficiency. We integrate our approach into two popular RL algorithms, Proximal Policy Optimization and Soft Actor-Critic, and evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations. Finally, we make the zero-shot sim-to-real transfer where a differential drive robot has to navigate through a cluttered room. Our code can be found at https://github.com/nikeke19/Safe-Mult-RL.

* Repository available at https://github.com/nikeke19/Safe-Mult-RL

Via

Access Paper or Ask Questions

TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction

Mar 07, 2023

Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, Luc Van Gool

Figure 1 for TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction

Figure 2 for TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction

Figure 3 for TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction

Figure 4 for TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction

Abstract:Data-driven simulation has become a favorable way to train and test autonomous driving algorithms. The idea of replacing the actual environment with a learned simulator has also been explored in model-based reinforcement learning in the context of world models. In this work, we show data-driven traffic simulation can be formulated as a world model. We present TrafficBots, a multi-agent policy built upon motion prediction and end-to-end driving, and based on TrafficBots we obtain a world model tailored for the planning module of autonomous vehicles. Existing data-driven traffic simulators are lacking configurability and scalability. To generate configurable behaviors, for each agent we introduce a destination as navigational information, and a time-invariant latent personality that specifies the behavioral style. To improve the scalability, we present a new scheme of positional encoding for angles, allowing all agents to share the same vectorized context and the use of an architecture based on dot-product attention. As a result, we can simulate all traffic participants seen in dense urban scenarios. Experiments on the Waymo open motion dataset show TrafficBots can simulate realistic multi-agent behaviors and achieve good performance on the motion prediction task.

* Accepted at ICRA 2023. The repository is available at https://github.com/SysCV/TrafficBots

Via

Access Paper or Ask Questions

Uncertainty-Driven Dense Two-View Structure from Motion

Feb 12, 2023

Weirong Chen, Suryansh Kumar, Fisher Yu

Abstract:This work introduces an effective and practical solution to the dense two-view structure from motion (SfM) problem. One vital question addressed is how to mindfully use per-pixel optical flow correspondence between two frames for accurate pose estimation -- as perfect per-pixel correspondence between two images is difficult, if not impossible, to establish. With the carefully estimated camera pose and predicted per-pixel optical flow correspondences, a dense depth of the scene is computed. Later, an iterative refinement procedure is introduced to further improve optical flow matching confidence, camera pose, and depth, exploiting their inherent dependency in rigid SfM. The fundamental idea presented is to benefit from per-pixel uncertainty in the optical flow estimation and provide robustness to the dense SfM system via an online refinement. Concretely, we introduce our uncertainty-driven Dense Two-View SfM pipeline (DTV-SfM), consisting of an uncertainty-aware dense optical flow estimation approach that provides per-pixel correspondence with their confidence score of matching; a weighted dense bundle adjustment formulation that depends on optical flow uncertainty and bidirectional optical flow consistency to refine both pose and depth; a depth estimation network that considers its consistency with the estimated poses and optical flow respecting epipolar constraint. Extensive experiments show that the proposed approach achieves remarkable depth accuracy and state-of-the-art camera pose results superseding SuperPoint and SuperGlue accuracy when tested on benchmark datasets such as DeMoN, YFCC100M, and ScanNet. Code and more materials are available at http://vis.xyz/pub/dtv-sfm.

* Accepted for publication at IEEE Robotics and Automation Letters (RA-L) 2023

Via

Access Paper or Ask Questions

Scaling Vision Transformers to 22 Billion Parameters

Feb 10, 2023

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin(+32 more)

Figure 1 for Scaling Vision Transformers to 22 Billion Parameters

Figure 2 for Scaling Vision Transformers to 22 Billion Parameters

Figure 3 for Scaling Vision Transformers to 22 Billion Parameters

Figure 4 for Scaling Vision Transformers to 22 Billion Parameters

Abstract:The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

Via

Access Paper or Ask Questions

BiBench: Benchmarking and Analyzing Network Binarization

Jan 26, 2023

Haotong Qin, Mingyuan Zhang, Yifu Ding, Aoyu Li, Zhongang Cai, Ziwei Liu, Fisher Yu, Xianglong Liu

Figure 1 for BiBench: Benchmarking and Analyzing Network Binarization

Figure 2 for BiBench: Benchmarking and Analyzing Network Binarization

Figure 3 for BiBench: Benchmarking and Analyzing Network Binarization

Figure 4 for BiBench: Benchmarking and Analyzing Network Binarization

Abstract:Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research.

Via

Access Paper or Ask Questions