Siyuan Qiao

CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation

Jun 17, 2022
Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

We propose Clustering Mask Transformer (CMT-DeepLab), a transformer-based framework for panoptic segmentation designed around clustering. It rethinks the existing transformer architectures used in segmentation and detection; CMT-DeepLab considers the object queries as cluster centers, which fill the role of grouping the pixels when applied to segmentation. The clustering is computed with an alternating procedure, by first assigning pixels to the clusters by their feature affinity, and then updating the cluster centers and pixel features. Together, these operations comprise the Clustering Mask Transformer (CMT) layer, which produces cross-attention that is denser and more consistent with the final segmentation task. CMT-DeepLab improves the performance over prior art significantly by 4.4% PQ, achieving a new state-of-the-art of 55.7% PQ on the COCO test-dev set.

* CVPR 2022 Oral
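
The alternating procedure described in the abstract resembles a k-means-style loop. The following is a minimal sketch in plain Python, assuming dot-product affinity, hard argmax assignment, and mean updates of the centers. These choices and all function names are illustrative assumptions; the actual CMT layer is learned, also updates pixel features, and is not reproduced here.

```python
def dot(a, b):
    """Dot product used here as the (assumed) pixel-to-center affinity."""
    return sum(x * y for x, y in zip(a, b))


def assign_pixels(pixels, centers):
    """Step 1: assign each pixel feature to the center with the highest affinity."""
    return [max(range(len(centers)), key=lambda k: dot(p, centers[k]))
            for p in pixels]


def update_centers(pixels, assignment, centers):
    """Step 2: move each center to the mean of the pixels assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, a in zip(pixels, assignment) if a == k]
        if not members:
            # Keep an empty cluster where it was.
            new_centers.append(list(old))
            continue
        new_centers.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(old))])
    return new_centers


def cluster(pixels, centers, num_iters=3):
    """Alternate assignment and center update for a fixed number of iterations."""
    for _ in range(num_iters):
        assignment = assign_pixels(pixels, centers)
        centers = update_centers(pixels, assignment, centers)
    return assign_pixels(pixels, centers), centers


# Four toy pixel features and two initial cluster centers (object queries).
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers = [[1.0, 0.2], [0.2, 1.0]]
masks, centers = cluster(pixels, centers)
```

In this toy run the first two pixels end up in cluster 0 and the last two in cluster 1, so each cluster index plays the role of one predicted mask.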

Waymo Open Dataset: Panoramic Video Panoptic Segmentation

Jun 15, 2022
Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar, Dragomir Anguelov

Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community thereby relies on publicly available benchmark datasets to advance the state-of-the-art in computer vision. Due to the high costs of densely labeling the images, however, there is a shortage of publicly available ground truth labels that are suitable for panoptic segmentation. The high labeling costs also make it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset using the publicly available Waymo Open Dataset, leveraging the diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we offer labels for 28 semantic categories and 2,860 temporal sequences that were captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets that offer video panoptic segmentation labels. We further propose a new benchmark for Panoramic Video Panoptic Segmentation and establish a number of strong baselines based on the DeepLab family of models. We will make the benchmark and the code publicly available. Find the dataset at https://waymo.com/open.

TubeFormer-DeepLab: Video Mask Transformer

May 30, 2022
Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen

We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner. Different video segmentation tasks (e.g., video semantic/instance/panoptic segmentation) are usually considered as distinct problems. State-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate in each task. By contrast, we make a crucial observation that video segmentation tasks could be generally formulated as the problem of assigning different predicted labels to video tubes (where a tube is obtained by linking segmentation masks along the time axis) and the labels may encode different values depending on the target task. The observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks. TubeFormer-DeepLab directly predicts video tubes with task-specific labels (either pure semantic categories, or both semantic categories and instance identities), which not only significantly simplifies video segmentation models, but also advances state-of-the-art results on multiple video segmentation benchmarks.

* CVPR 2022
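
The notion of a tube as masks linked along the time axis, with a task-dependent label attached, can be illustrated with a small data-structure sketch. It assumes each frame is already given as a grid of tube ids; the function name, the example ids, and the label dictionaries are hypothetical and only show the idea, not the model's prediction head.

```python
def extract_tube(video_ids, tube_id):
    """Return one binary mask per frame for the given tube id.

    video_ids is a list of frames; each frame is a 2D grid of tube ids.
    The resulting list of masks, ordered by time, is the tube.
    """
    return [[[1 if v == tube_id else 0 for v in row] for row in frame]
            for frame in video_ids]


# Two tiny frames (2 x 3 pixels) in which every pixel carries a tube id.
video_ids = [
    [[0, 0, 1], [0, 1, 1]],  # frame 0
    [[0, 1, 1], [0, 0, 1]],  # frame 1
]

# Hypothetical task-specific labels for the same tubes:
# semantic segmentation keeps only a class, panoptic adds an instance id.
tube_labels_semantic = {0: "road", 1: "car"}
tube_labels_panoptic = {0: ("road", None), 1: ("car", 7)}

tube = extract_tube(video_ids, 1)
```

The same tube masks serve both tasks; only the label attached to each tube changes.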

Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms

Jul 12, 2021
Chenglin Yang, Siyuan Qiao, Adam Kortylewski, Alan Yuille

Self-Attention has become prevalent in computer vision models. Inspired by fully connected Conditional Random Fields (CRFs), we decompose it into local and context terms. They correspond to the unary and binary terms in CRF and are implemented by attention mechanisms with projection matrices. We observe that the unary terms only make small contributions to the outputs, and meanwhile standard CNNs that rely solely on the unary terms achieve great performances on a variety of tasks. Therefore, we propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating it with convolutions, and utilizes a fusion module to dynamically couple the unary and binary operations. In our experiments, we replace the self-attention modules with LESA. The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation. The code is made publicly available.

DeepLab2: A TensorFlow Library for Deep Labeling

Jun 17, 2021
Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixe, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision. DeepLab2 includes all our recently developed DeepLab model variants with pretrained checkpoints as well as model training and evaluation code, allowing the community to reproduce and further improve upon the state-of-the-art systems. To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0% PQ or 83.5% mIoU on the Cityscapes validation set, with only single-scale inference and ImageNet-1K pretrained checkpoints. We hope that publicly sharing our library could facilitate future research on dense pixel labeling tasks and envision new applications of this technology. Code is made publicly available at https://github.com/google-research/deeplab2.

* 4-page technical report. The first three authors contributed equally to this work

ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation

Dec 09, 2020
Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

In this paper, we present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision, which we model as restoring the point clouds from perspective image sequences while providing each point with instance-level semantic interpretations. Solving this problem requires the vision models to predict the spatial location, semantic class, and temporally consistent instance label for each 3D point. ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation. We name this joint task Depth-aware Video Panoptic Segmentation, and propose a new evaluation metric along with two derived datasets for it, which will be made available to the public. On the individual sub-tasks, ViP-DeepLab also achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian. The datasets and the evaluation codes are made publicly available.

* Video: https://youtu.be/XR4HFiwwao0 GitHub: https://github.com/joe-siyuan-qiao/ViP-DeepLab
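
To make the inverse projection idea concrete, the sketch below lifts per-pixel depth and panoptic labels into labeled 3D points. It assumes a standard pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`); the intrinsics, the function name, and the toy inputs are assumptions for illustration and do not come from the paper.

```python
def back_project(depth, semantic, instance, fx, fy, cx, cy):
    """Turn per-pixel depth plus panoptic labels into labeled 3D points.

    Assumes a pinhole camera: a pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, and keeps z as is.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, semantic[v][u], instance[v][u]))
    return points


# A 1 x 2 toy image: predicted depth, semantic class, and instance id per pixel.
depth = [[2.0, 4.0]]
semantic = [["road", "car"]]
instance = [[0, 3]]

points = back_project(depth, semantic, instance, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

Each output tuple combines a spatial location with a semantic class and an instance label, which is the per-point information the abstract says the task requires.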

Batch Normalization with Enhanced Linear Transformation

Nov 28, 2020
Yuhui Xu, Lingxi Xie, Cihang Xie, Jieru Mei, Siyuan Qiao, Wei Shen, Hongkai Xiong, Alan Yuille

Batch normalization (BN) is a fundamental unit in modern deep networks, in which a linear transformation module was designed for improving BN's flexibility of fitting complex data distributions. In this paper, we demonstrate that properly enhancing this linear transformation module can effectively improve the ability of BN. Specifically, rather than using a single neuron, we propose to additionally consider each neuron's neighborhood for calculating the outputs of the linear transformation. Our method, named BNET, can be implemented with 2-3 lines of code in most deep learning libraries. Despite its simplicity, BNET brings consistent performance gains over a wide range of backbones and visual benchmarks. Moreover, we verify that BNET accelerates the convergence of network training and enhances spatial information by assigning the important neurons with larger weights accordingly. The code is available at https://github.com/yuhuixu1993/BNET.

* 12 pages. The code is available at https://github.com/yuhuixu1993/BNET
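
The difference between the standard per-neuron linear transformation and a neighborhood-based one can be sketched on a single 1D channel. This sketch assumes the neighborhood transformation is a small convolution with zero padding and an odd kernel size, and it computes the statistics over one toy sequence rather than a batch; see the linked repository for the authors' actual implementation.

```python
import math


def normalize(xs, eps=1e-5):
    """Normalize one channel to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def bn_affine(x_hat, gamma, beta):
    """Standard BN linear transformation: each output uses a single neuron."""
    return [gamma * x + beta for x in x_hat]


def neighborhood_affine(x_hat, weights, beta):
    """Assumed enhanced transformation: each output mixes a neuron's neighbors.

    weights is an odd-length kernel applied with zero padding.
    """
    pad = len(weights) // 2
    padded = [0.0] * pad + list(x_hat) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(weights)) + beta
            for i in range(len(x_hat))]


x_hat = normalize([1.0, 2.0, 3.0, 4.0])
standard = bn_affine(x_hat, gamma=2.0, beta=0.5)
# A kernel that is zero except at its center reduces to the standard transform.
same = neighborhood_affine(x_hat, [0.0, 2.0, 0.0], beta=0.5)
# Non-zero side weights let neighboring neurons contribute to each output.
enhanced = neighborhood_affine(x_hat, [0.1, 2.0, 0.1], beta=0.5)
```

Initializing the kernel with only a center weight means the enhanced layer starts out identical to ordinary BN, which is one plausible way to drop it into an existing network.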

Scaling Wide Residual Networks for Panoptic Segmentation

Nov 23, 2020
Liang-Chieh Chen, Huiyu Wang, Siyuan Qiao

The Wide Residual Networks (Wide-ResNets), a shallow but wide model variant of the Residual Networks (ResNets) by stacking a small number of residual blocks with large channel sizes, have demonstrated outstanding performance on multiple dense prediction tasks. However, since it was proposed, the Wide-ResNet architecture has barely evolved over the years. In this work, we revisit its architecture design for the recent challenging panoptic segmentation task, which aims to unify semantic segmentation and instance segmentation. A baseline model is obtained by incorporating the simple and effective Squeeze-and-Excitation and Switchable Atrous Convolution into the Wide-ResNets. Its network capacity is further scaled up or down by adjusting the width (i.e., channel size) and depth (i.e., number of layers), resulting in a family of SWideRNets (short for Scaling Wide Residual Networks). We demonstrate that such a simple scaling scheme, coupled with grid search, identifies several SWideRNets that significantly advance state-of-the-art performance on panoptic segmentation datasets in both the fast model regime and strong model regime.

* 12 pages including reference
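
The width and depth scaling with grid search described above can be sketched as a small configuration generator. The base channel sizes, base layer counts, and multiplier values below are hypothetical placeholders, not the paper's settings, and the rounding rule is an assumption.

```python
from itertools import product


def scale_network(base_channels, base_layers, width_mult, depth_mult):
    """Scale per-stage channel sizes (width) and layer counts (depth)."""
    channels = [max(1, round(c * width_mult)) for c in base_channels]
    layers = [max(1, round(n * depth_mult)) for n in base_layers]
    return channels, layers


def grid(base_channels, base_layers, width_mults, depth_mults):
    """Enumerate every (width, depth) combination as a candidate network."""
    return {(w, d): scale_network(base_channels, base_layers, w, d)
            for w, d in product(width_mults, depth_mults)}


# Hypothetical baseline with three stages.
base_channels = [64, 128, 256]
base_layers = [1, 2, 4]

candidates = grid(base_channels, base_layers,
                  width_mults=[0.5, 1.0], depth_mults=[1.0, 2.0])
```

Each entry in `candidates` would then be trained and evaluated, and the grid search keeps the configurations with the best trade-off for a given speed or accuracy regime.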

DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution

Jun 03, 2020
Siyuan Qiao, Liang-Chieh Chen, Alan Yuille

Many modern object detectors demonstrate outstanding performances by using the mechanism of looking and thinking twice. In this paper, we explore this mechanism in the backbone design for object detection. At the macro level, we propose Recursive Feature Pyramid, which incorporates extra feedback connections from Feature Pyramid Networks into the bottom-up backbone layers. At the micro level, we propose Switchable Atrous Convolution, which convolves the features with different atrous rates and gathers the results using switch functions. Combining them results in DetectoRS, which significantly improves the performances of object detection. On COCO test-dev, DetectoRS achieves state-of-the-art 54.7% box AP for object detection, 47.1% mask AP for instance segmentation, and 49.6% PQ for panoptic segmentation. The code is made publicly available.
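
The micro-level idea of convolving with different atrous rates and gathering the results with a switch can be sketched in 1D. This sketch assumes two rates (1 and 3), shared weights for both branches, zero padding, and a per-position switch value in [0, 1] that is passed in directly; in a real network the switch would be computed from the input features, and the function names here are illustrative.

```python
def atrous_conv1d(x, weights, rate):
    """1D convolution whose taps are spaced `rate` positions apart (zero padded)."""
    half = len(weights) // 2
    out = []
    for i in range(len(x)):
        total = 0.0
        for j, w in enumerate(weights):
            idx = i + (j - half) * rate
            if 0 <= idx < len(x):
                total += w * x[idx]
        out.append(total)
    return out


def switchable_atrous_conv1d(x, weights, switch, rates=(1, 3)):
    """Blend a small-rate and a large-rate convolution with a per-position switch."""
    small = atrous_conv1d(x, weights, rates[0])
    large = atrous_conv1d(x, weights, rates[1])
    return [s * a + (1.0 - s) * b for s, a, b in zip(switch, small, large)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
weights = [1.0, 0.0, -1.0]

only_small = switchable_atrous_conv1d(x, weights, [1.0] * len(x))
only_large = switchable_atrous_conv1d(x, weights, [0.0] * len(x))
mixed = switchable_atrous_conv1d(x, weights, [0.5] * len(x))
```

A switch of 1 selects the small receptive field, 0 selects the large one, and intermediate values interpolate between them at each position.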

Shape-aware Feature Extraction for Instance Segmentation

Nov 25, 2019
Hao Ding, Siyuan Qiao, Wei Shen, Alan Yuille

Modern instance segmentation approaches mainly adopt a sequential paradigm, "detect then segment", as popularized by Mask R-CNN, which has achieved considerable progress. However, they usually struggle to segment huddled instances, i.e., instances which are crowded together. The essential reason is that the detection step is only learned under box-level supervision. Without the guidance from the mask-level supervision, the features extracted from the regions containing huddled instances are noisy and ambiguous, which makes the detection problem ill-posed. To address this issue, we propose a new region-of-interest (RoI) feature extraction strategy, named Shape-aware RoIAlign, which focuses feature extraction within a region aligned well with the shape of the instance-of-interest rather than a rectangular RoI. We instantiate Shape-aware RoIAlign by introducing a novel refining module built upon Mask R-CNN, which takes the mask predicted by Mask R-CNN as the region to guide the computation of Shape-aware RoIAlign. Based on the RoI features re-computed by Shape-aware RoIAlign, the refining module updates the bounding box as well as the mask predicted by Mask R-CNN. Experimental results show that the refining module equipped with Shape-aware RoIAlign achieves consistent and remarkable improvements over Mask R-CNN models with different backbones on the challenging COCO dataset. The code will be released.
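
The contrast between extracting features from a rectangular RoI and from a region aligned with the instance shape can be sketched with plain average pooling. This is a simplification under stated assumptions: real RoIAlign uses bilinear sampling on a fixed output grid, and the function names and toy values below are hypothetical rather than the paper's Shape-aware RoIAlign.

```python
def box_average(features, box):
    """Average all feature values inside a rectangular box (y0, x0, y1, x1)."""
    y0, x0, y1, x1 = box
    values = [features[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)


def mask_average(features, box, mask):
    """Average only the feature values inside the box that the mask covers."""
    y0, x0, y1, x1 = box
    values = [features[y][x]
              for y in range(y0, y1) for x in range(x0, x1) if mask[y][x]]
    return sum(values) / len(values) if values else 0.0


# The instance of interest occupies the left two columns (value 1.0);
# the right column belongs to a neighboring, huddled instance (value 9.0).
features = [[1.0, 1.0, 9.0],
            [1.0, 1.0, 9.0]]
mask = [[1, 1, 0],
        [1, 1, 0]]
box = (0, 0, 2, 3)

box_feature = box_average(features, box)
shape_feature = mask_average(features, box, mask)
```

The box average is contaminated by the neighboring instance, while the mask-guided average reflects only the instance of interest.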
