Alireza Fathi

DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes

Apr 07, 2020

Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S. Davis, Alireza Fathi

Abstract: We propose DOPS, a fast single-stage 3D object detection method for LIDAR data. Previous methods often make domain-specific design decisions, for example projecting points into a bird's-eye view image in autonomous driving scenarios. In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes. The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes. 3D bounding box parameters are estimated in one pass for every point, aggregated through graph convolutions, and fed into a branch of the network that predicts latent codes representing the shape of each detected object. The latent shape space and shape decoder are learned on a synthetic dataset and then used as supervision for the end-to-end training of the 3D object detection pipeline. Thus our model is able to extract shapes without access to ground-truth shape information in the target dataset. In our experiments, the proposed method improves on the state of the art by ~5% on object detection in ScanNet scenes and by 3.4% on the Waymo Open Dataset, while reproducing the shapes of detected cars.

* To appear in CVPR 2020

3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation

Mar 30, 2020

Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, Matthias Nießner

Abstract: We present 3D-MPA, a method for instance segmentation on 3D point clouds. Given an input point cloud, we propose an object-centric approach where each point votes for its object center. We sample object proposals from the predicted object centers. Then, we learn proposal features from grouped point features that voted for the same object center. A graph convolutional network introduces inter-proposal relations, providing higher-level feature learning in addition to the lower-level point features. Each proposal comprises a semantic label, a set of associated points over which we define a foreground-background mask, an objectness score and aggregation features. Previous works usually perform non-maximum suppression (NMS) over proposals to obtain the final object detections or semantic instances. However, NMS can discard potentially correct predictions. Instead, our approach keeps all proposals and groups them together based on the learned aggregation features. We show that grouping proposals improves over NMS and outperforms previous state-of-the-art methods on the tasks of 3D object detection and semantic instance segmentation on the ScanNetV2 benchmark and the S3DIS dataset.

* CVPR 2020. Video: https://youtu.be/ifL8yTbRFDk Project page: https://www.vision.rwth-aachen.de/3d_instance_segmentation/

Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction

Jun 16, 2019

Steven Hickson, Karthik Raveendran, Alireza Fathi, Kevin Murphy, Irfan Essa

Abstract: We propose four insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and finetuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
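
Insight (3) above, backpropagating errors only on pixels with valid labels, can be illustrated with a masked loss. The following is a minimal sketch, not the authors' implementation: the function name `masked_l2_loss`, the squared-error loss, and the flat list-of-pixels layout are all assumptions made for illustration.

```python
def masked_l2_loss(pred, target, valid):
    """Mean squared error over valid pixels only (hypothetical helper).

    pred, target: lists of per-pixel vectors (e.g. 3D surface normals).
    valid: list of booleans; pixels marked False contribute nothing,
    so no error signal would flow back from them during training.
    """
    total, count = 0.0, 0
    for p, t, v in zip(pred, target, valid):
        if v:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += 1
    # With no valid pixels the loss is defined as zero (an assumption).
    return total / count if count else 0.0


pred = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
target = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

# The second pixel has no valid label, so its large error is ignored.
loss_masked = masked_l2_loss(pred, target, [True, False])
loss_all = masked_l2_loss(pred, target, [True, True])
print(loss_masked, loss_all)
```

Normalizing by the number of valid pixels rather than the total pixel count keeps the loss scale comparable between densely and sparsely labeled images.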

Tracking Emerges by Colorizing Videos

Jul 27, 2018

Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, Kevin Murphy

Abstract: We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform the latest methods based on optical flow. Moreover, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.

* ECCV 2018. Blog post: https://ai.googleblog.com/2018/06/self-supervised-tracking-via-video.html
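
The idea of colorizing by copying colors from a reference frame can be sketched as a soft pointer: each target pixel attends over reference pixels and takes a weighted average of their colors. This is a minimal sketch under stated assumptions; the dot-product similarity, the softmax weighting, the `temperature` parameter, and the function names are illustrative choices that the abstract does not specify.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_colors(target_emb, ref_embs, ref_colors, temperature=1.0):
    """Predict one target pixel's color from reference-frame colors.

    target_emb: embedding of the gray-scale target pixel (assumed given).
    ref_embs:   embeddings of the reference-frame pixels.
    ref_colors: known colors of the reference-frame pixels.
    Returns the predicted color and the attention weights.
    """
    scores = [
        sum(a * b for a, b in zip(target_emb, r)) / temperature
        for r in ref_embs
    ]
    weights = softmax(scores)
    n_channels = len(ref_colors[0])
    color = [
        sum(w * c[k] for w, c in zip(weights, ref_colors))
        for k in range(n_channels)
    ]
    return color, weights


ref_embs = [[1.0, 0.0], [0.0, 1.0]]
ref_colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # red, blue
color, weights = copy_colors([5.0, 0.0], ref_embs, ref_colors)

# The reference pixel with the largest weight is where the target
# pixel "came from"; the same pointer can propagate a tracking label.
best = max(range(len(weights)), key=weights.__getitem__)
print(best, color)
```

In this toy setup the pointer that copies color is also what would carry a segmentation mask or keypoint forward from the reference frame, which is how a colorization objective can yield tracking behavior.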

Instance Embedding Transfer to Unsupervised Video Object Segmentation

Feb 27, 2018

Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, C. -C. Jay Kuo

Abstract: We propose a method for unsupervised video object segmentation by transferring the knowledge encapsulated in image-based instance embedding networks. The instance embedding network produces an embedding vector for each pixel that enables identifying all pixels belonging to the same object. Though trained on static images, the instance embeddings are stable over consecutive video frames, which allows us to link objects together over time. Thus, we adapt the instance networks trained on static images to video object segmentation and incorporate the embeddings with objectness and optical flow features, without model retraining or online fine-tuning. The proposed method outperforms state-of-the-art unsupervised segmentation methods on the DAVIS dataset and the FBMS dataset.

* To appear in CVPR 2018

The Devil is in the Decoder

Aug 12, 2017

Zbigniew Wojna, Vittorio Ferrari, Sergio Guadarrama, Nathan Silberman, Liang-Chieh Chen, Alireza Fathi, Jasper Uijlings

Abstract: Many machine vision applications require predictions for every pixel of the input image (for example semantic segmentation, boundary detection). Models for such problems usually consist of encoders, which decrease spatial resolution while learning a high-dimensional representation, followed by decoders, which recover the original input resolution and produce low-dimensional predictions. While encoders have been studied rigorously, relatively few studies address the decoder side. Therefore this paper presents an extensive comparison of a variety of decoders for a variety of pixel-wise prediction tasks. Our contributions are: (1) Decoders matter: we observe significant variance in results between different types of decoders on various problems. (2) We introduce a novel decoder: bilinear additive upsampling. (3) We introduce new residual-like connections for decoders. (4) We identify two decoder types which give a consistently high performance.

Speed/accuracy trade-offs for modern convolutional object detectors

Apr 25, 2017

Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama(+1 more)

Abstract: The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016] and SSD [Liu et al., 2015] systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum, where speed and memory are critical, we present a detector that achieves real-time speeds and can be deployed on a mobile device. On the opposite end, in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.

* Accepted to CVPR 2017

Semantic Instance Segmentation via Deep Metric Learning

Mar 30, 2017

Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun Oh Song, Sergio Guadarrama, Kevin P. Murphy

Abstract: We propose a new method for semantic instance segmentation, by first computing how likely two pixels are to belong to the same object, and then by grouping similar pixels together. Our similarity metric is based on a deep, fully convolutional embedding model. Our grouping method is based on selecting all points that are sufficiently similar to a set of "seed points", chosen from a deep, fully convolutional scoring model. We show competitive results on the Pascal VOC instance segmentation benchmark.
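
The two steps described above (pairwise similarity from embeddings, then grouping around seed points) can be sketched in a few lines. This is an illustrative sketch only: the similarity function, the `threshold` value, and the hand-picked seed indices are assumptions, and in the described method both the embeddings and the seeds come from learned convolutional models.

```python
import math


def similarity(e_p, e_q):
    """Map squared embedding distance to a score in (0, 1].

    Identical embeddings give 1.0; distant ones approach 0.
    This particular form is an assumption chosen for illustration.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(e_p, e_q))
    # Clamp the exponent so math.exp cannot overflow on large distances.
    return 2.0 / (1.0 + math.exp(min(d2, 700.0)))


def group_by_seeds(embeddings, seed_indices, threshold=0.5):
    """Return one mask (list of pixel indices) per seed point.

    A pixel joins a seed's mask when its similarity to the seed
    is at least `threshold` (a hypothetical parameter).
    """
    masks = []
    for s in seed_indices:
        mask = [
            i for i, e in enumerate(embeddings)
            if similarity(embeddings[s], e) >= threshold
        ]
        masks.append(mask)
    return masks


# Four "pixels" forming two well-separated clusters in embedding space.
embeddings = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 2.9]]
masks = group_by_seeds(embeddings, seed_indices=[0, 2])
print(masks)
```

Because each mask is built independently from its seed, the quality of the result depends on choosing seeds that lie well inside distinct objects, which is the role the abstract assigns to the scoring model.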

VideoSET: Video Summary Evaluation through Text

Jun 23, 2014

Serena Yeung, Alireza Fathi, Li Fei-Fei

Abstract: In this paper we present VideoSET, a method for Video Summary Evaluation through Text that can evaluate how well a video summary is able to retain the semantic information contained in its original video. We observe that semantics is most easily expressed in words, and develop a text-based approach for the evaluation. Given a video summary, a text representation of the video summary is first generated, and an NLP-based metric is then used to measure its semantic distance to ground-truth text summaries written by humans. We show that our technique has higher agreement with human judgment than pixel-based distance metrics. We also release text annotations and ground-truth text summaries for a number of publicly available video datasets, for use by the computer vision community.

An introduction to synchronous self-learning Pareto strategy

Dec 15, 2013

Ahmad Mozaffari, Alireza Fathi

Abstract: In recent decades, the optimization and control of complex systems with several conflicting objectives has attracted growing interest from scientists. This is because such systems have wide application across real-life engineering problems, which are generally multimodal, non-convex, and multi-criterion. Hence, many researchers have used versatile intelligent models such as Pareto-based techniques, game theory (cooperative and non-cooperative games), neuro-evolutionary systems, fuzzy logic, and advanced neural networks to handle these types of problems. In this paper, a novel method called Synchronous Self Learning Pareto Strategy Algorithm (SSLPSA) is presented, which combines Evolutionary Computing (EC), Swarm Intelligence (SI) techniques, and an adaptive Classical Self Organizing Map (CSOM) with a data-shuffling behavior. Evolutionary Algorithms (EA), which attempt to simulate natural evolution, are powerful numerical optimization algorithms that reach an approximate global maximum of a complex multivariable function over a wide search space, and swarm-based techniques can improve the intensity and robustness of EA. CSOM is a neural network capable of learning and can improve the quality of the obtained optimal Pareto front. To demonstrate the performance of the proposed algorithm, the authors use some well-known benchmark test functions. The results indicate that the method is best suited to vector optimization.

* 17 pages, 7 figures, 3 tables
