Hamid Rezatofighi

PiFeNet: Pillar-Feature Network for Real-Time 3D Pedestrian Detection from Point Cloud

Dec 31, 2021
Duy-Tho Le, Hengcan Shi, Hamid Rezatofighi, Jianfei Cai

We present PiFeNet, an efficient and accurate real-time 3D detector for pedestrian detection from point clouds. We address two challenges that 3D object detection frameworks encounter when detecting pedestrians: low expressiveness of pillar features and small occupation areas of pedestrians in point clouds. First, we introduce a stackable Pillar Aware Attention (PAA) module that enhances pillar feature extraction while suppressing noise in the point clouds. Integrating multi-point-aware-pooling, point-wise, channel-wise, and task-aware attention into a simple module boosts the representation capabilities while requiring little additional computing resources. We also present Mini-BiFPN, a small yet effective feature network that creates bidirectional information flow and multi-level cross-scale feature fusion to better integrate multi-resolution features. Our approach is ranked 1st on the KITTI pedestrian BEV and 3D leaderboards while running at 26 frames per second (FPS), and achieves state-of-the-art performance on the nuScenes detection benchmark.

* Submitted to IEEE International Conference on Multimedia and Expo (ICME) 2022

GMFlow: Learning Optical Flow via Global Matching

Nov 26, 2021
Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Dacheng Tao

Learning-based optical flow estimation has been dominated by the pipeline of cost volume with convolutions for flow regression, which is inherently limited to local correlations and thus struggles to address the long-standing challenge of large displacements. To alleviate this, the state-of-the-art method, RAFT, gradually improves the quality of its predictions by producing a sequence of flow updates via a large number of iterative refinements, achieving remarkable performance but slowing down inference. To enable optical flow estimation that is both accurate and efficient, we completely revamp the dominant flow regression pipeline by reformulating optical flow as a global matching problem. Specifically, we propose the GMFlow framework, which consists of three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for global feature matching, and a self-attention layer for flow propagation. Moreover, we introduce a refinement step that reuses GMFlow at higher resolutions for residual flow prediction. Our new framework outperforms 32-iteration RAFT on the challenging Sintel benchmark while using only one refinement and running faster, offering new possibilities for efficient and accurate optical flow estimation. Code will be available at https://github.com/haofeixu/gmflow.
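The "correlation and softmax layer for global feature matching" can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flattened feature layout, and the scaling of the correlation by the square root of the feature dimension are assumptions made for illustration. Each pixel in the first frame is correlated with every pixel in the second frame, the softmax turns the correlations into a matching distribution, and the flow is the expected matched coordinate minus the pixel's own coordinate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_matching_flow(feat1, feat2, coords):
    """Hypothetical helper: dense flow from global feature matching.

    feat1, feat2: lists of D-dimensional feature vectors, one per pixel,
                  flattened in the same order as `coords`.
    coords:       list of (x, y) pixel coordinates.
    Returns a list of (dx, dy) flow vectors, one per pixel of frame 1.
    """
    dim = len(feat1[0])
    flow = []
    for f1, (x, y) in zip(feat1, coords):
        # Correlate this pixel with every pixel of frame 2 (scaling is an assumption).
        corr = [sum(a * b for a, b in zip(f1, f2)) / math.sqrt(dim) for f2 in feat2]
        prob = softmax(corr)
        # Expected matching coordinate under the softmax distribution.
        match_x = sum(p * cx for p, (cx, _) in zip(prob, coords))
        match_y = sum(p * cy for p, (_, cy) in zip(prob, coords))
        flow.append((match_x - x, match_y - y))
    return flow


# Toy 1x3 image whose content moves one pixel to the right (with wrap-around).
coords = [(0, 0), (1, 0), (2, 0)]
feat1 = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
feat2 = [[0.0, 0.0, 10.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
print(global_matching_flow(feat1, feat2, coords))
```

Because every pixel is compared against all pixels of the other frame, a large displacement costs no more to match than a small one, which is the motivation the abstract gives for replacing local cost volumes.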

* Tech report

Guided-GAN: Adversarial Representation Learning for Activity Recognition with Wearables

Oct 12, 2021
Alireza Abedin, Hamid Rezatofighi, Damith C. Ranasinghe

Human activity recognition (HAR) is an important research field in ubiquitous computing where the acquisition of large-scale labeled sensor data is tedious, labor-intensive and time-consuming. State-of-the-art unsupervised remedies investigated to alleviate the burden of data annotation in HAR mainly explore training autoencoder frameworks. In this paper, we (i) explore generative adversarial network (GAN) paradigms to learn unsupervised feature representations from wearable sensor data, and (ii) design a new GAN framework, Geometrically-Guided GAN (Guided-GAN), for the task. To demonstrate the effectiveness of our formulation, we evaluate the features learned by Guided-GAN in an unsupervised manner on three downstream classification benchmarks. Our results show that Guided-GAN outperforms existing unsupervised approaches while closely approaching the performance of fully supervised learned representations. The proposed approach paves the way to bridging the gap between unsupervised and supervised human activity recognition while helping to reduce the cost of human data annotation tasks.

ODAM: Object Detection, Association, and Mapping using Posed RGB Video

Aug 23, 2021
Kejie Li, Daniel DeTone, Steven Chen, Minh Vo, Ian Reid, Hamid Rezatofighi, Chris Sweeney, Julian Straub, Richard Newcombe

Localizing objects and estimating their extent in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics. We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos. The proposed system relies on a deep learning front-end to detect 3D objects from a given RGB frame and associate them to a global object-based map using a graph neural network (GNN). Based on these frame-to-model associations, our back-end optimizes object bounding volumes, represented as super-quadrics, under multi-view geometry constraints and the object scale prior. We validate the proposed system on ScanNet where we show a significant improvement over existing RGB-only methods.

* Accepted in ICCV 2021 as oral

Unsupervised Image Segmentation by Mutual Information Maximization and Adversarial Regularization

Jul 01, 2021
S. Ehsan Mirsadeghi, Ali Royat, Hamid Rezatofighi

Semantic segmentation is one of the basic yet essential scene understanding tasks for an autonomous agent. Recent developments in supervised machine learning and neural networks have enjoyed great success in enhancing the performance of state-of-the-art techniques for this task. However, their superior performance is highly reliant on the availability of a large-scale annotated dataset. In this paper, we propose a novel fully unsupervised semantic segmentation method called Information Maximization and Adversarial Regularization Segmentation (InMARS). Inspired by human perception, which parses a scene into perceptual groups rather than analyzing each pixel individually, our proposed approach first partitions an input image into meaningful regions (also known as superpixels). Next, it utilizes Mutual-Information-Maximization followed by an adversarial training strategy to cluster these regions into semantically meaningful classes. To customize an adversarial training scheme for the problem, we incorporate adversarial pixel noise along with spatial perturbations to impose photometrical and geometrical invariance on the deep neural network. Our experiments demonstrate that our method achieves state-of-the-art performance on two commonly used unsupervised semantic segmentation datasets, COCO-Stuff and Potsdam.
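To make the mutual-information clustering idea concrete, here is a minimal sketch of one common way such an objective is computed. The abstract does not specify the exact formulation, so this is an assumption: the mutual information between the soft cluster assignments of each region and those of a perturbed copy of the same region, estimated from their symmetrized joint distribution. The function name and input layout are hypothetical.

```python
import math


def mutual_information(assign_a, assign_b, eps=1e-12):
    """Hypothetical helper: mutual information between paired soft assignments.

    assign_a[i] and assign_b[i] are probability vectors over K clusters for
    region i and for a perturbed view of the same region.
    """
    n = len(assign_a)
    k = len(assign_a[0])
    # Joint distribution over cluster pairs, averaged over all regions.
    joint = [[0.0] * k for _ in range(k)]
    for p, q in zip(assign_a, assign_b):
        for i in range(k):
            for j in range(k):
                joint[i][j] += p[i] * q[j] / n
    # Symmetrize, since the order of the two views is arbitrary.
    joint = [[(joint[i][j] + joint[j][i]) / 2 for j in range(k)] for i in range(k)]
    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]
    mi = 0.0
    for i in range(k):
        for j in range(k):
            if joint[i][j] > eps:  # skip empty cells to avoid log(0)
                mi += joint[i][j] * math.log(joint[i][j] / (row[i] * col[j]))
    return mi


# Consistent, balanced assignments give high mutual information ...
confident = [[1.0, 0.0], [0.0, 1.0]]
print(mutual_information(confident, confident))
# ... while uninformative assignments give none.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(mutual_information(uniform, uniform))
```

Maximizing this quantity rewards assignments that agree across the two views while still using all clusters, which is why it is a natural fit for unsupervised clustering of regions.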

* IEEE Robotics and Automation Letters (RA-L 2021) & IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021)

JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection

Jun 16, 2021
Mahsa Ehsanpour, Fatemeh Saleh, Silvio Savarese, Ian Reid, Hamid Rezatofighi

The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognize human activities in an unconstrained real-world environment, with potentially highly unbalanced and long-tailed distributed data, remains a significant challenge, not least owing to the lack of a reflective large-scale dataset. Most existing large-scale datasets are collected either from a specific or constrained environment, e.g. kitchens or rooms, or from video sharing platforms such as YouTube. In this paper, we introduce JRDB-Act, a multi-modal dataset, as an extension of the existing JRDB, which is captured by a social mobile manipulator and reflects a real distribution of human daily life actions in a university campus environment. JRDB-Act has been densely annotated with atomic actions and comprises over 2.8M action labels, constituting a large-scale spatio-temporal action detection dataset. Each human bounding box is labelled with one pose-based action label and multiple (optional) interaction-based action labels. Moreover, JRDB-Act comes with social group identification annotations conducive to the task of grouping individuals based on their interactions in the scene to infer their social activities (common activities in each social group).

TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild

Apr 08, 2021
Vida Adeli, Mahsa Ehsanpour, Ian Reid, Juan Carlos Niebles, Silvio Savarese, Ehsan Adeli, Hamid Rezatofighi

Joint forecasting of human trajectory and pose dynamics is a fundamental building block of various applications ranging from robotics and autonomous driving to surveillance systems. Predicting body dynamics requires capturing subtle information embedded in the humans' interactions with each other and with the objects present in the scene. In this paper, we propose a novel TRajectory and POse Dynamics (nicknamed TRiPOD) method based on graph attentional networks to model the human-human and human-object interactions both in the input space and the output space (decoded future output). The model is supplemented by a message passing interface over the graphs to fuse these different levels of interactions efficiently. Furthermore, to incorporate a real-world challenge, we propose learning an indicator representing whether an estimated body joint is visible or invisible at each frame, e.g. due to occlusion or being outside the sensor field of view. Finally, we introduce a new benchmark for this joint task based on two challenging datasets (PoseTrack and 3DPW) and propose evaluation metrics to measure the effectiveness of predictions in the global space, even when some joints are invisible. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art methods specifically designed for each of the trajectory and pose forecasting tasks.

Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers

Mar 27, 2021
Tianyu Zhu, Markus Hiller, Mahsa Ehsanpour, Rongkai Ma, Tom Drummond, Hamid Rezatofighi

Tracking a time-varying, indefinite number of objects in a video sequence remains a challenge despite recent advances in the field. Because they ignore long-term temporal information, most existing approaches are not able to properly handle multi-object tracking challenges such as occlusion. To address these shortcomings, we present MO3TR: a truly end-to-end Transformer-based online multi-object tracking (MOT) framework that learns to handle occlusions, track initiation and termination without the need for an explicit data association module or any heuristics/post-processing. MO3TR encodes object interactions into long-term temporal embeddings using a combination of spatial and temporal Transformers, and recursively uses the information jointly with the input data to estimate the states of all tracked objects over time. The spatial attention mechanism enables our framework to learn implicit representations between all the objects and between the objects and the measurements, while the temporal attention mechanism focuses on specific parts of past information, allowing our approach to resolve occlusions over multiple frames. Our experiments demonstrate the potential of this new approach, reaching new state-of-the-art results on multiple MOT metrics for two popular multi-object tracking benchmarks. Our code will be made publicly available.

Distributed Multi-object Tracking under Limited Field of View Sensors

Dec 23, 2020
Hoa Van Nguyen, Hamid Rezatofighi, Ba-Ngu Vo, Damith C. Ranasinghe

We consider the challenging problem of tracking multiple objects using a distributed network of sensors. In the pragmatic setting of sensors with a limited field of view (FoV) and nodes with limited computing and communication resources, we develop a novel distributed multi-target algorithm that fuses local multi-object states instead of local multi-object densities. This algorithm uses a novel label consensus approach that reduces label inconsistency caused by movements of objects from one node's limited FoV to another. To accomplish this, we formalise the concept of label consistency and determine a sufficient condition to achieve it. The proposed algorithm i) is fast, requiring significantly less processing time than fusion methods using multi-object filtering densities, and ii) achieves better tracking accuracy by considering tracking errors measured by the Optimal Sub-Pattern Assignment (OSPA) metric over several scans rather than a single scan. Numerical experiments demonstrate the real-time capability of our proposed solution, as well as its computational efficiency and accuracy compared to state-of-the-art solutions in challenging scenarios.
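For readers unfamiliar with the OSPA metric mentioned above, the following is a minimal sketch of the standard single-scan OSPA distance between two finite sets of points. It is not the multi-scan variant the abstract refers to, and it is not the authors' code: the function name, the default cutoff `c` and order `p`, and the brute-force search over assignments (practical only for small sets) are choices made for illustration.

```python
import math
from itertools import permutations


def ospa(estimates, truths, c=100.0, p=1):
    """Single-scan OSPA distance between two lists of coordinate tuples.

    c: cutoff that caps each per-object localisation error and sets the
       penalty for each unmatched object.
    p: order of the metric.
    """
    if not estimates and not truths:
        return 0.0
    # Let `small` be the smaller set so each of its points gets assigned.
    small, large = sorted((estimates, truths), key=len)
    m, n = len(small), len(large)
    # Brute-force optimal assignment of the m points into the n points.
    best = min(
        sum(min(c, math.dist(x, large[j])) ** p for x, j in zip(small, perm))
        for perm in permutations(range(n), m)
    )
    # Cardinality mismatch is penalised with c^p per unmatched object.
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


# One estimate 5 units away from the single true object.
print(ospa([(0.0, 0.0)], [(3.0, 4.0)], c=10.0, p=1))
# One correct estimate, one missed object.
print(ospa([(0.0, 0.0)], [(0.0, 0.0), (50.0, 50.0)], c=10.0, p=1))
```

The metric combines localisation error and cardinality error into a single number, which is what makes it convenient for comparing multi-object trackers.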

* 13 pages, 10 figures. Submitted to the IEEE Transactions on Signal Processing (TSP)

Probabilistic Tracklet Scoring and Inpainting for Multiple Object Tracking

Dec 10, 2020
Fatemeh Saleh, Sadegh Aliakbarian, Hamid Rezatofighi, Mathieu Salzmann, Stephen Gould

Despite the recent advances in multiple object tracking (MOT), achieved by joint detection and tracking, dealing with long occlusions remains a challenge. This is because such techniques tend to ignore long-term motion information. In this paper, we introduce a probabilistic autoregressive motion model to score tracklet proposals by directly measuring their likelihood. This is achieved by training our model to learn the underlying distribution of natural tracklets. As such, our model allows us not only to assign new detections to existing tracklets, but also to inpaint a tracklet when an object has been lost for a long time, e.g., due to occlusion, by sampling tracklets so as to fill the gap caused by misdetections. Our experiments demonstrate the superiority of our approach at tracking objects in challenging sequences; it outperforms the state of the art in most standard MOT metrics on multiple MOT benchmark datasets, including MOT16, MOT17, and MOT20.
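The idea of scoring a tracklet by its likelihood under an autoregressive model can be sketched as follows. The paper's model is learned from data; here a hand-written constant-velocity predictor with a fixed isotropic Gaussian noise level stands in for it, purely as an assumption for illustration. All function names and the `sigma` parameter are hypothetical.

```python
import math


def predict_next(history):
    """Stand-in for a learned model: constant-velocity extrapolation."""
    if len(history) < 2:
        return history[-1]
    (x1, y1), (x2, y2) = history[-2], history[-1]
    return (2 * x2 - x1, 2 * y2 - y1)


def step_log_likelihood(history, point, sigma=1.0):
    """Log-density of `point` under a 2D isotropic Gaussian centred on the prediction."""
    mx, my = predict_next(history)
    sq_dist = (point[0] - mx) ** 2 + (point[1] - my) ** 2
    return -sq_dist / (2 * sigma ** 2) - math.log(2 * math.pi * sigma ** 2)


def tracklet_log_likelihood(tracklet, sigma=1.0):
    """Autoregressive factorisation: sum of log p(x_t | x_1..x_{t-1})."""
    return sum(
        step_log_likelihood(tracklet[:t], tracklet[t], sigma)
        for t in range(1, len(tracklet))
    )


def best_detection(tracklet, detections, sigma=1.0):
    """Assign the detection that is most likely to continue the tracklet."""
    return max(detections, key=lambda d: step_log_likelihood(tracklet, d, sigma))


smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
jumpy = [(0.0, 0.0), (1.0, 0.0), (5.0, 4.0), (3.0, 0.0)]
print(tracklet_log_likelihood(smooth) > tracklet_log_likelihood(jumpy))
print(best_detection(smooth[:2], [(2.0, 0.0), (9.0, 9.0)]))
```

The same conditional distribution could in principle be sampled step by step to fill a gap left by missed detections, which corresponds to the inpainting use described in the abstract.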
