Alex Zihao Zhu

Instance Segmentation with Cross-Modal Consistency

Oct 14, 2022
Alex Zihao Zhu, Vincent Casser, Reza Mahjourian, Henrik Kretzschmar, Sören Pirk

Segmenting object instances is a key task in machine perception, with safety-critical applications in robotics and autonomous driving. We introduce a novel approach to instance segmentation that jointly leverages measurements from multiple sensor modalities, such as cameras and LiDAR. Our method learns to predict embeddings for each pixel or point that give rise to a dense segmentation of the scene. Specifically, our technique applies contrastive learning to points in the scene both across sensor modalities and across time. We demonstrate that this formulation encourages the models to learn embeddings that are invariant to viewpoint variations and consistent across sensor modalities. We further demonstrate that the embeddings are stable over time as objects move around the scene. This not only provides stable instance masks, but can also provide valuable signals to downstream tasks, such as object tracking. We evaluate our method on the Cityscapes and KITTI-360 datasets. We further conduct a number of ablation studies, demonstrating benefits when applying additional inputs for the contrastive loss.

* 8 pages, 9 figures, 5 tables. Presented at IROS 2022 
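
To make the cross-modal contrastive idea concrete, below is a minimal sketch, not the authors' implementation, of a contrastive loss that pulls together camera-pixel and LiDAR-point embeddings of the same instance and pushes apart different instances. The pixel-to-point association, temperature, and tensor shapes are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(cam_emb, lidar_emb, cam_ids, lidar_ids, temperature=0.1):
        """cam_emb: (N, D) pixel embeddings; lidar_emb: (M, D) point embeddings.
        cam_ids / lidar_ids: integer instance IDs for each embedding (hypothetical inputs)."""
        cam_emb = F.normalize(cam_emb, dim=1)
        lidar_emb = F.normalize(lidar_emb, dim=1)
        logits = cam_emb @ lidar_emb.t() / temperature                # (N, M) similarities
        positives = (cam_ids[:, None] == lidar_ids[None, :]).float()  # same instance across modalities
        log_prob = F.log_softmax(logits, dim=1)
        pos_count = positives.sum(dim=1)
        loss = -(positives * log_prob).sum(dim=1) / pos_count.clamp(min=1.0)
        return loss[pos_count > 0].mean()

    # Toy example: two instances observed by both sensors.
    cam = torch.randn(6, 16)
    lidar = torch.randn(4, 16)
    print(cross_modal_contrastive_loss(cam, lidar,
                                       torch.tensor([0, 0, 0, 1, 1, 1]),
                                       torch.tensor([0, 0, 1, 1])))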

Waymo Open Dataset: Panoramic Video Panoptic Segmentation

Jun 15, 2022
Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar, Dragomir Anguelov

Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community therefore relies on publicly available benchmark datasets to advance the state of the art in computer vision. Due to the high costs of densely labeling the images, however, there is a shortage of publicly available ground truth labels that are suitable for panoptic segmentation. The high labeling costs also make it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset using the publicly available Waymo Open Dataset, leveraging its diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we offer labels for 28 semantic categories and 2,860 temporal sequences that were captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets that offer video panoptic segmentation labels. We further propose a new benchmark for Panoramic Video Panoptic Segmentation and establish a number of strong baselines based on the DeepLab family of models. We will make the benchmark and the code publicly available. Find the dataset at https://waymo.com/open.

* Our dataset can be found at https://waymo.com/open 
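
As a small illustration of what a panoptic label carries, the sketch below packs a semantic class and an instance ID into a single integer map and decodes it back. The divisor-based encoding is a common convention and an assumption here; it is not necessarily the exact format used by the Waymo Open Dataset.

    import numpy as np

    LABEL_DIVISOR = 1000  # assumed convention: panoptic = semantic_class * divisor + instance_id

    def encode_panoptic(semantic, instance, divisor=LABEL_DIVISOR):
        return semantic.astype(np.int64) * divisor + instance.astype(np.int64)

    def decode_panoptic(panoptic, divisor=LABEL_DIVISOR):
        return panoptic // divisor, panoptic % divisor

    # Toy 2x3 label maps: class 11 with two instances, class 0 as background.
    semantic = np.array([[11, 11, 0], [11, 11, 0]])
    instance = np.array([[1, 1, 0], [2, 2, 0]])
    panoptic = encode_panoptic(semantic, instance)
    print(panoptic)
    print(decode_panoptic(panoptic))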

Spike-FlowNet: Event-based Optical Flow Estimation with Energy-Efficient Hybrid Neural Networks

Mar 14, 2020
Chankyu Lee, Adarsh Kosta, Alex Zihao Zhu, Kenneth Chaney, Kostas Daniilidis, Kaushik Roy

Event-based cameras show great potential in conditions such as high-speed motion detection and navigation in low-light environments, where conventional frame-based cameras suffer critically. This is attributed to their high temporal resolution, high dynamic range, and low power consumption. However, conventional computer vision methods as well as deep Analog Neural Networks (ANNs) are not suited to work well with the asynchronous and discrete nature of event camera outputs. Spiking Neural Networks (SNNs) serve as ideal paradigms for handling event camera outputs, but deep SNNs suffer in terms of performance due to the spike vanishing phenomenon. To overcome these issues, we present Spike-FlowNet, a deep hybrid neural network architecture integrating SNNs and ANNs for efficiently estimating optical flow from sparse event camera outputs without sacrificing performance. The network is trained end-to-end with self-supervised learning on the Multi-Vehicle Stereo Event Camera (MVSEC) dataset. Spike-FlowNet outperforms its corresponding ANN-based method in optical flow prediction while providing significant computational efficiency.
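
As a rough illustration of the spiking half of such a hybrid network, the sketch below implements a single leaky integrate-and-fire (LIF) layer update. The leak and threshold values are placeholders, not the ones used in Spike-FlowNet.

    import torch

    def lif_step(membrane, input_current, leak=0.8, threshold=1.0):
        """One timestep of a LIF neuron layer: integrate input, spike, then hard reset."""
        membrane = leak * membrane + input_current
        spikes = (membrane >= threshold).float()
        membrane = membrane * (1.0 - spikes)  # reset neurons that fired
        return membrane, spikes

    # Feed a short sequence of random input currents through four neurons.
    mem = torch.zeros(4)
    for t in range(5):
        mem, out = lif_step(mem, torch.rand(4))
        print(f"t={t} spikes={out.tolist()}")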


EventGAN: Leveraging Large Scale Image Datasets for Event Cameras

Dec 19, 2019
Alex Zihao Zhu, Ziyun Wang, Kaung Khant, Kostas Daniilidis

Event cameras provide a number of benefits over traditional cameras, such as the ability to track incredibly fast motions, high dynamic range, and low power consumption. However, their application to computer vision problems, many of which are dominated by deep learning solutions, has been limited by the lack of labeled training data for events. In this work, we propose a method which leverages the existing labeled data for images by simulating events from a pair of temporal image frames, using a convolutional neural network. We train this network on pairs of images and events, using an adversarial discriminator loss and a pair of cycle consistency losses. The cycle consistency losses utilize a pair of pre-trained self-supervised networks which perform optical flow estimation and image reconstruction from events, and constrain our network to generate events which result in accurate outputs from both of these networks. Trained fully end-to-end, our network learns a generative model for events from images without the need for the accurate motion modeling required by model-based simulators, while also implicitly modeling event noise. Using this simulator, we train a pair of downstream networks on object detection and 2D human pose estimation from events, using simulated data from large-scale image datasets, and demonstrate the networks' ability to generalize to datasets with real events.

* 10 pages, 5 figures, 2 tables, Code: https://github.com/alexzzhu/EventGAN, Video: https://www.youtube.com/watch?v=Vcm4Iox4H2w 
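
The sketch below illustrates the general shape of such a generator objective, combining an adversarial term with cycle-consistency terms computed by frozen downstream networks. The stand-in convolutions, targets, and shapes are assumptions for illustration only; this is not the released EventGAN code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Stand-in frozen networks: a "reconstruction" net mapping a 5-bin event volume to an
    # image, and a "flow" net mapping it to a 2-channel flow field. The paper uses
    # pre-trained self-supervised networks here.
    recon_net = nn.Conv2d(5, 1, 3, padding=1).requires_grad_(False)
    flow_net = nn.Conv2d(5, 2, 3, padding=1).requires_grad_(False)
    discriminator = nn.Conv2d(5, 1, 3, padding=1)

    def generator_loss(gen_events, frame_next, flow_target):
        adv = -discriminator(gen_events).mean()                     # adversarial term
        recon_loss = F.l1_loss(recon_net(gen_events), frame_next)   # image cycle consistency
        flow_loss = F.l1_loss(flow_net(gen_events), flow_target)    # flow cycle consistency
        return adv + recon_loss + flow_loss

    gen_events = torch.rand(1, 5, 32, 32, requires_grad=True)  # placeholder generator output
    print(generator_loss(gen_events, torch.rand(1, 1, 32, 32), torch.rand(1, 2, 32, 32)))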

Motion Equivariant Networks for Event Cameras with the Temporal Normalization Transform

Feb 18, 2019
Alex Zihao Zhu, Ziyun Wang, Kostas Daniilidis

In this work, we propose a novel transformation for events from an event camera that is equivariant to optical flow under convolutions in the 3-D spatiotemporal domain. Events are generated by changes in the image, which are typically due to motion, either of the camera or of the scene. As a result, different motions result in a different set of events. For learning-based tasks on a static scene, such as classification, which directly use the events, we must either rely on the learning method to learn the underlying object separately from the motion, or memorize all possible motions for each object with extensive data augmentation. Instead, we propose a novel transformation of the input event data which normalizes the $x$ and $y$ positions by the timestamp of each event. We show that this transformation generates a representation of the events that is equivariant to this motion when the optical flow is constant, allowing a deep neural network to learn the classification task without the need for expensive data augmentation. We test our method on the event-based N-MNIST dataset, as well as a novel dataset, N-MOVING-MNIST, with significantly more variety in motion compared to the standard N-MNIST dataset. In all sequences, we demonstrate that our transformed network is able to achieve similar or better performance compared to a network with a standard volumetric event input, and performs significantly better when the test set has a larger set of motions than seen at training.

* 8 pages, 2 figures, 1 table 
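
Read literally, the transformation described above divides each event's x and y coordinates by its timestamp. The sketch below applies exactly that operation to a small event array; how the origin of the time axis is anchored and how zero timestamps are handled are assumptions here.

    import numpy as np

    def temporal_normalization(events, eps=1e-6):
        """events: (N, 4) array of (x, y, t, polarity); t measured from the window start."""
        x, y, t, p = events.T
        t = t - t.min() + eps  # avoid division by zero for the first event (assumption)
        return np.stack([x / t, y / t, t, p], axis=1)

    events = np.array([[10.0, 20.0, 0.00,  1.0],
                       [12.0, 21.0, 0.02,  1.0],
                       [14.0, 22.0, 0.04, -1.0]])
    print(temporal_normalization(events))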

Robustness Meets Deep Learning: An End-to-End Hybrid Pipeline for Unsupervised Learning of Egomotion

Dec 21, 2018
Alex Zihao Zhu, Wenxin Liu, Ziyun Wang, Vijay Kumar, Kostas Daniilidis

In this work, we propose a method that combines unsupervised deep learning predictions for optical flow and monocular disparity with a model-based optimization procedure for camera pose. Given the flow and disparity predictions from the network, we apply a RANSAC outlier rejection scheme to find an inlier set of flows and disparities, which we use to solve for the camera pose in a least-squares fashion. We show that this pipeline is fully differentiable, allowing us to combine the pose with the network outputs as an additional unsupervised training loss to further refine the predicted flows and disparities. This method not only allows us to directly regress pose from the network outputs, but also automatically segments away pixels that do not fit the rigid-scene assumption that many unsupervised structure-from-motion methods make, such as pixels on independently moving objects. We evaluate our method on the KITTI dataset and demonstrate state-of-the-art results, even in the presence of challenging independently moving objects.

* 10 pages, 6 figures, 6 tables 
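
The sketch below shows a generic RANSAC skeleton of the kind referenced above: fit a model to random minimal subsets, count inliers, keep the best hypothesis, and refit on its inliers. The toy line-fitting example stands in for the least-squares pose solver and reprojection residuals, which are not reproduced here.

    import numpy as np

    def ransac(data, fit_fn, residual_fn, sample_size, threshold, iters=100, rng=None):
        rng = rng or np.random.default_rng(0)
        best_inliers = np.zeros(len(data), dtype=bool)
        for _ in range(iters):
            idx = rng.choice(len(data), size=sample_size, replace=False)
            model = fit_fn(data[idx])                      # fit to a minimal random subset
            inliers = residual_fn(model, data) < threshold
            if inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        return fit_fn(data[best_inliers]), best_inliers    # final least-squares refit on inliers

    # Toy example: robust line fit y = 2x + 1 with 20% gross outliers.
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * x + 1.0
    y[::5] += 5.0
    data = np.stack([x, y], axis=1)
    fit = lambda d: np.polyfit(d[:, 0], d[:, 1], 1)
    res = lambda m, d: np.abs(np.polyval(m, d[:, 0]) - d[:, 1])
    print(ransac(data, fit, res, sample_size=2, threshold=0.1))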

Unsupervised Event-based Learning of Optical Flow, Depth, and Egomotion

Dec 19, 2018
Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, Kostas Daniilidis

In this work, we propose a novel framework for unsupervised learning for event cameras that learns motion information from only the event stream. In particular, we propose an input representation of the events in the form of a discretized volume that maintains the temporal distribution of the events, which we pass through a neural network to predict the motion of the events. This motion is then used to attempt to remove motion blur from the event image. We then propose a loss function, applied to the motion-compensated event image, that measures the remaining motion blur. We train two networks with this framework, one to predict optical flow and one to predict egomotion and depths, and evaluate these networks on the Multi Vehicle Stereo Event Camera dataset, along with qualitative results from a variety of different scenes.

* 9 pages, 7 figures 
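
A minimal sketch of a discretized event volume is shown below: each event's polarity is split linearly between the two nearest temporal bins so the volume retains when events occurred. The number of bins and the normalization are assumptions, not the paper's exact settings.

    import numpy as np

    def event_volume(x, y, t, p, H, W, B=5):
        """Accumulate events into B temporal bins with linear interpolation in time."""
        vol = np.zeros((B, H, W))
        t_scaled = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (B - 1)
        for xi, yi, ti, pi in zip(x.astype(int), y.astype(int), t_scaled, p):
            b0 = int(np.floor(ti))
            w1 = ti - b0
            vol[b0, yi, xi] += pi * (1.0 - w1)
            if b0 + 1 < B:
                vol[b0 + 1, yi, xi] += pi * w1
        return vol

    x = np.array([2, 3, 4]); y = np.array([1, 1, 2])
    t = np.array([0.0, 0.5, 1.0]); p = np.array([1.0, -1.0, 1.0])
    print(event_volume(x, y, t, p, H=4, W=6).shape)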

Realtime Time Synchronized Event-based Stereo

Oct 18, 2018
Alex Zihao Zhu, Yibo Chen, Kostas Daniilidis

In this work, we propose a novel event-based stereo method which addresses the problem of motion blur for a moving event camera. Our method uses the velocity of the camera and a range of disparities to synchronize the positions of the events, as if they were captured at a single point in time. We represent these events using a pair of novel time-synchronized event disparity volumes, which we show remove motion blur for pixels at the correct disparity in the volume, while further blurring pixels at the wrong disparity. We then apply a novel matching cost over these time-synchronized event disparity volumes, which rewards similarity between the volumes while penalizing blurriness. We show that our method outperforms more expensive, smoothing-based event stereo methods by evaluating on the Multi Vehicle Stereo Event Camera dataset.

* European Conference on Computer Vision 2018  
* 13 pages, 3 figures, 1 table. Video: https://youtu.be/4oa7e4hsrYo. Updated with final version with additional experiments 
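
The sketch below illustrates the time-synchronization step in its simplest form: given an estimated flow (assumed constant here for brevity), each event is shifted to where it would have appeared at a single reference time, so events from the same scene point collapse onto one location. The full method additionally sweeps a range of disparities, which is omitted here.

    import numpy as np

    def synchronize_events(x, y, t, t_ref, flow):
        """Shift events to a reference time; flow is (fx, fy) in pixels per second."""
        dt = t_ref - t
        return x + flow[0] * dt, y + flow[1] * dt

    t = np.array([0.00, 0.01, 0.02])
    x = np.array([10.0, 10.5, 11.0])  # a point moving at 50 px/s in x
    y = np.array([5.0, 5.0, 5.0])
    xs, ys = synchronize_events(x, y, t, t_ref=0.0, flow=(50.0, 0.0))
    print(xs, ys)  # x positions collapse onto 10.0 when the assumed flow is correct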

EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras

Aug 13, 2018
Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, Kostas Daniilidis

Event-based cameras have shown great promise in a variety of situations where frame-based cameras suffer, such as high-speed motions and high dynamic range scenes. However, developing algorithms for event measurements requires a new class of hand-crafted techniques. Deep learning has shown great success in providing model-free solutions to many problems in the vision community, but existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events that there is for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images, captured from the same camera at the same time as the events, are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events alone in a variety of different scenes, with performance competitive to image-based networks. This method not only allows for accurate estimation of dense optical flow, but also provides a framework for the transfer of other self-supervised methods to the event-based domain.

* 9 pages, 5 figures, 1 table. Accompanying video: https://youtu.be/eMHZBSoq0sE. Dataset: https://daniilidis-group.github.io/mvsec/, Robotics: Science and Systems 2018 
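
As an illustration of this kind of photometric supervision, the sketch below warps the second grayscale frame toward the first using a predicted flow field and measures the remaining difference. The grid construction and loss choice are assumptions for illustration, not the released EV-FlowNet training code.

    import torch
    import torch.nn.functional as F

    def photometric_loss(flow, img0, img1):
        """flow: (B, 2, H, W) in pixels; img0, img1: (B, 1, H, W) grayscale frames."""
        B, _, H, W = flow.shape
        ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                                torch.arange(W, dtype=torch.float32), indexing="ij")
        target = torch.stack([xs, ys], dim=0).unsqueeze(0) + flow  # where each pixel maps to
        gx = 2.0 * target[:, 0] / (W - 1) - 1.0                    # normalize for grid_sample
        gy = 2.0 * target[:, 1] / (H - 1) - 1.0
        warped = F.grid_sample(img1, torch.stack([gx, gy], dim=3), align_corners=True)
        return F.l1_loss(warped, img0)

    flow = torch.zeros(1, 2, 16, 16, requires_grad=True)
    img = torch.rand(1, 1, 16, 16)
    print(photometric_loss(flow, img, img))  # zero flow on identical frames gives ~0 loss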