Silvio Giancola

SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation

Jun 02, 2021

Bing Li, Cheng Zheng, Silvio Giancola, Bernard Ghanem

Abstract: We propose a novel scene flow estimation approach to capture and infer 3D motions from point clouds. Estimating 3D motions for point clouds is challenging, since a point cloud is unordered and its density is significantly non-uniform. Such unstructured data poses difficulties in matching corresponding points between point clouds, leading to inaccurate flow estimation. We propose a novel architecture named Sparse Convolution-Transformer Network (SCTN) that equips the sparse convolution with the transformer. Specifically, by leveraging the sparse convolution, SCTN transforms irregular point clouds into locally consistent flow features for estimating continuous and consistent motions within an object or local object part. We further propose to explicitly learn point relations using a point transformer module, unlike existing methods. We show that the learned relation-based contextual information is rich and helpful for matching corresponding points, benefiting scene flow estimation. In addition, a novel loss function is proposed to adaptively encourage flow consistency according to feature similarity. Extensive experiments demonstrate that our proposed approach achieves a new state of the art in scene flow estimation. Our approach achieves an error of 0.038 and 0.037 (EPE3D) on FlyingThings3D and KITTI Scene Flow respectively, outperforming previous methods by large margins.
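
The EPE3D figures quoted above are end-point errors: the mean Euclidean distance between predicted and ground-truth 3D flow vectors. A minimal sketch of that metric follows; the function name `epe3d` and the toy inputs are illustrative assumptions, not code from the paper.

```python
import math


def epe3d(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors.

    Both arguments are sequences of (dx, dy, dz) tuples, one per point.
    """
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    total = 0.0
    for pred, gt in zip(pred_flow, gt_flow):
        total += math.dist(pred, gt)  # per-point end-point error
    return total / len(pred_flow)


# Toy example (made-up values): one point is off by 5 units, the other is exact.
pred = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(epe3d(pred, gt))  # 2.5
```

Lower is better: a perfect prediction gives an EPE3D of zero.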

Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting

Apr 19, 2021

Anthony Cioppa, Adrien Deliège, Floriane Magera, Silvio Giancola, Olivier Barnich, Bernard Ghanem, Marc Van Droogenbroeck

Abstract: Soccer broadcast video understanding has been drawing a lot of attention in recent years among data scientists and industrial companies. This is mainly due to the lucrative potential unlocked by effective deep learning techniques developed in the field of computer vision. In this work, we focus on the topic of camera calibration and on its current limitations for the scientific community. More precisely, we tackle the absence of a large-scale calibration dataset and of a public calibration network trained on such a dataset. Specifically, we distill a powerful commercial calibration tool into a recent neural network architecture on the large-scale SoccerNet dataset, composed of untrimmed broadcast videos of 500 soccer games. We further release our distilled network, and leverage it to provide 3 ways of representing the calibration results along with player localization. Finally, we exploit those representations within the current best architecture for the action spotting task of SoccerNet-v2, and achieve new state-of-the-art performance.

* Paper accepted at the CVsports workshop at CVPR2021

Temporally-Aware Feature Pooling for Action Spotting in Soccer Broadcasts

Apr 14, 2021

Silvio Giancola, Bernard Ghanem

Abstract: Toward the goal of automatic production for sports broadcasts, a paramount task consists in understanding the high-level semantic information of the game in play. For instance, recognizing and localizing the main actions of the game would allow producers to adapt and automate the broadcast production, focusing on the important details of the game and maximizing spectator engagement. In this paper, we focus our analysis on action spotting in soccer broadcasts, which consists in temporally localizing the main actions in a soccer game. To that end, we propose a novel feature pooling method based on NetVLAD, dubbed NetVLAD++, that embeds temporally-aware knowledge. Different from previous pooling methods that consider the temporal context as a single set to pool from, we split the context before and after an action occurs. We argue that considering the contextual information around the action spot as a single entity leads to sub-optimal learning for the pooling module. With NetVLAD++, we disentangle the context from the past and future frames and learn specific vocabularies of semantics for each subset, avoiding blending and blurring such vocabularies in time. Injecting such prior knowledge creates more informative pooling modules and more discriminative pooled features, leading to a better understanding of the actions. We train and evaluate our methodology on the recent large-scale dataset SoccerNet-v2, reaching 53.4% Average-mAP for action spotting, a +12.7% improvement w.r.t. the current state-of-the-art.

* 8 pages, Camera-Ready for CVSports 2021 (CVPRW)
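
The core idea in the abstract above, pooling the frames before and after an action separately instead of as one set, can be sketched in a few lines. This sketch swaps in plain mean pooling as a stand-in for NetVLAD, so it only illustrates the context split; the function names and toy features are assumptions made for illustration.

```python
def mean_pool(frames):
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def split_context_pool(frames, action_index):
    """Pool past and future context separately, then concatenate.

    frames[:action_index] is treated as the context before the action and
    frames[action_index:] as the context after it. Mean pooling stands in
    for the learned NetVLAD pooling described in the abstract.
    """
    past, future = frames[:action_index], frames[action_index:]
    if not past or not future:
        raise ValueError("both past and future context must be non-empty")
    return mean_pool(past) + mean_pool(future)


# Toy window: the first feature is active before the action, the second after.
window = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(mean_pool(window))              # [1.0, 1.5]
print(split_context_pool(window, 2))  # [2.0, 0.0, 0.0, 3.0]
```

Pooling the whole window at once blends the two halves into a single vector, while the split version keeps the "before" and "after" statistics in separate slots of the output.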

SALA: Soft Assignment Local Aggregation for 3D Semantic Segmentation

Dec 29, 2020

Hani Itani, Silvio Giancola, Ali Thabet, Bernard Ghanem

Abstract: We introduce the idea of using learnable neighbor-to-grid soft assignment in grid-based aggregation functions for the task of 3D semantic segmentation. Previous methods in the literature operate on a predefined geometric grid such as local volume partitions or irregular kernel points. These methods use geometric functions to assign local neighbors to their corresponding grid. Such geometric heuristics are potentially sub-optimal for the end task of semantic segmentation. Furthermore, they are applied uniformly throughout the depth of the network. A more general alternative would allow the network to learn its own neighbor-to-grid assignment function that best suits the end task. Since it is learnable, this mapping has the flexibility to be different per layer. This paper leverages learned neighbor-to-grid soft assignment to define an aggregation function that balances efficiency and performance. We demonstrate the efficacy of our method by reaching state-of-the-art (SOTA) performance on S3DIS with almost 10× fewer parameters than the current reigning method. We also demonstrate competitive performance on ScanNet and PartNet as compared with much larger SOTA models.
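
To make "neighbor-to-grid soft assignment" concrete, the sketch below spreads each neighbor's feature over all grid cells using softmax weights instead of assigning it to a single cell by a geometric rule. In a trained network the assignment logits would come from a learned function; here they are passed in as plain inputs. The softmax normalization and all names are assumptions for illustration, not the paper's exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_assign_aggregate(neighbor_feats, assign_logits):
    """Aggregate K neighbor features into G grid cells with soft weights.

    neighbor_feats: K feature vectors of dimension C.
    assign_logits:  K lists of G logits (hypothetical output of a learned
                    assignment function; supplied directly in this sketch).
    Returns G aggregated feature vectors of dimension C.
    """
    num_cells = len(assign_logits[0])
    dim = len(neighbor_feats[0])
    cells = [[0.0] * dim for _ in range(num_cells)]
    for feat, logits in zip(neighbor_feats, assign_logits):
        weights = softmax(logits)  # one weight per grid cell, summing to 1
        for g, w in enumerate(weights):
            for c in range(dim):
                cells[g][c] += w * feat[c]
    return cells


# Equal logits split every neighbor evenly across the two cells.
feats = [[2.0], [4.0]]
logits = [[0.0, 0.0], [0.0, 0.0]]
print(soft_assign_aggregate(feats, logits))  # [[3.0], [3.0]]
```

Strongly peaked logits approach a hard, one-cell-per-neighbor assignment, so a fixed geometric assignment can be seen as a limiting case of this scheme.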

SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos

Nov 26, 2020

Adrien Deliège, Anthony Cioppa, Silvio Giancola, Meisam J. Seikavandi, Jacob V. Dueholm, Kamal Nasrollahi, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck

Abstract: Understanding broadcast videos is a challenging task in computer vision, as it requires generic reasoning capabilities to appreciate the content offered by the video editing. In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production. Specifically, we release around 300k annotations within SoccerNet's 500 untrimmed broadcast soccer videos. We extend current tasks in the realm of soccer to include action spotting, camera shot segmentation with boundary detection, and we define a novel replay grounding task. For each task, we provide and discuss benchmark results, reproducible with our open-source adapted implementations of the most relevant works in the field. SoccerNet-v2 is presented to the broader research community to help push computer vision closer to automatic solutions for more general video understanding and production purposes.

* This document contains 8 pages + references + supplementary material

MVTN: Multi-View Transformation Network for 3D Shape Recognition

Nov 26, 2020

Abdullah Hamdi, Silvio Giancola, Bing Li, Ali Thabet, Bernard Ghanem

Abstract: Multi-view projection methods have shown the capability to reach state-of-the-art performance on 3D shape recognition. Most advances in multi-view representation focus on pooling techniques that learn to aggregate information from the different views, which tend to be heuristically set and fixed for all shapes. To circumvent the lack of dynamism of current multi-view methods, we propose to learn those viewpoints. In particular, we introduce a Multi-View Transformation Network (MVTN) that regresses optimal viewpoints for 3D shape recognition. By leveraging advances in differentiable rendering, our MVTN is trained end-to-end with any multi-view network and optimized for 3D shape classification. We show that MVTN can be seamlessly integrated into various multi-view approaches to exhibit clear performance gains in the tasks of 3D shape classification and shape retrieval without any extra training supervision. Furthermore, our MVTN improves multi-view networks to achieve state-of-the-art performance in rotation robustness and in object shape retrieval on ModelNet40.

* preprint

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Nov 23, 2020

Humam Alwassel, Silvio Giancola, Bernard Ghanem

Abstract: Understanding videos is challenging in computer vision. In particular, the large memory footprint of an untrimmed video makes most tasks infeasible to train end-to-end without dropping part of the input data. To cope with the memory limitation of commodity GPUs, current video localization models encode videos in an offline fashion. Even though these encoders are learned, they are typically trained for action classification tasks at the frame- or clip-level. Since it is difficult to finetune these encoders for other video tasks, they might be sub-optimal for temporal localization tasks. In this work, we propose a novel, supervised pretraining paradigm for clip-level video representation that does not only train to classify activities, but also considers background clips and global video information to gain temporal sensitivity. Extensive experiments show that features extracted by clip-level encoders trained with our novel pretraining task are more discriminative for several temporal localization tasks. Specifically, we show that using our newly trained features with state-of-the-art methods significantly improves performance on three tasks: Temporal Action Localization (+1.72% in average mAP on ActivityNet and +4.4% in mAP@0.5 on THUMOS14), Action Proposal Generation (+1.94% in AUC on ActivityNet), and Dense Video Captioning (+0.31% in average METEOR on ActivityNet Captions). We believe video feature encoding is an important building block for many video algorithms, and extracting meaningful features should be of paramount importance in the effort to build more accurate models.

LC-NAS: Latency Constrained Neural Architecture Search for Point Cloud Networks

Aug 24, 2020

Guohao Li, Mengmeng Xu, Silvio Giancola, Ali Thabet, Bernard Ghanem

Abstract: Point cloud architecture design has become a crucial problem for 3D deep learning. Several efforts exist to manually design architectures with high accuracy in point cloud tasks such as classification, segmentation, and detection. Recent progress in automatic Neural Architecture Search (NAS) minimizes the human effort in network design and optimizes high-performing architectures. However, these efforts fail to consider important factors such as latency during inference. Latency is of high importance in time-critical applications like self-driving cars, robot navigation, and mobile applications, which are generally bound by the available hardware. In this paper, we introduce a new NAS framework, dubbed LC-NAS, where we search for point cloud architectures that are constrained to a target latency. We implement a novel latency constraint formulation to trade off between accuracy and latency in our architecture search. Contrary to previous works, our latency loss guarantees that the final network achieves latency under a specified target value. This is crucial when the end task is to be deployed in a limited hardware setting. Extensive experiments show that LC-NAS is able to find state-of-the-art architectures for point cloud classification in ModelNet40 with minimal computational cost. We also show how our searched architectures achieve any desired latency with a reasonably low drop in accuracy. Finally, we show how our searched architectures easily transfer to a different task, part segmentation on PartNet, where we achieve state-of-the-art results while lowering latency by a factor of 10.

* Originally submitted to ECCV'2020 but rejected. This work was filed with the United States Patent and Trademark Office (USPTO) on May 19, 2020 and assigned Serial No. 63/027,241
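
As a rough illustration of trading accuracy against latency during architecture search, the sketch below adds a hinge penalty to a task loss whenever the expected latency of a softmax-weighted mix of candidate operations exceeds a target. This is a generic soft-penalty form chosen for illustration, not the LC-NAS formulation, and on its own it does not guarantee that the final network meets the target. All names, latencies, and weights are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_logits, op_latencies_ms):
    """Latency of a softmax-weighted mixture of candidate operations."""
    weights = softmax(arch_logits)
    return sum(w * lat for w, lat in zip(weights, op_latencies_ms))


def latency_penalized_loss(task_loss, arch_logits, op_latencies_ms,
                           target_ms, penalty_weight=1.0):
    """Task loss plus a hinge penalty on latency above the target."""
    excess = max(0.0, expected_latency(arch_logits, op_latencies_ms) - target_ms)
    return task_loss + penalty_weight * excess


# Two candidate operations with made-up latencies of 2 ms and 6 ms.
logits = [0.0, 0.0]        # equal preference, so expected latency is 4 ms
latencies = [2.0, 6.0]
print(latency_penalized_loss(0.5, logits, latencies, target_ms=3.0))  # 1.5
print(latency_penalized_loss(0.5, logits, latencies, target_ms=5.0))  # 0.5
```

The penalty vanishes once the expected latency is at or below the target, so under this sketch the search is only pushed toward faster operations while the budget is exceeded.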

A Context-Aware Loss Function for Action Spotting in Soccer Videos

Dec 03, 2019

Anthony Cioppa, Adrien Deliège, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck, Rikke Gade, Thomas B. Moeslund

Abstract: Action spotting is an important element of general activity understanding. It consists of detecting human-induced events annotated with single timestamps. In this paper, we propose a novel loss function for action spotting. Our loss aims at dealing specifically with the temporal context naturally present around an action. Rather than focusing on the single annotated frame of the action to spot, we consider different temporal segments surrounding it and shape our loss function accordingly. We test our loss on SoccerNet, a large dataset of soccer videos, showing an improvement of 12.8% on the current baseline. We also show the generalization capability of our loss function on ActivityNet for activity proposals and detection, by spotting the beginning and the end of each activity. Furthermore, we provide an extended ablation study and identify challenging cases for action spotting in soccer videos. Finally, we qualitatively illustrate how our loss induces a precise temporal understanding of actions, and how such semantic knowledge can be leveraged to design a highlights generator.
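
Because each action is annotated with a single timestamp, a predicted spot has to be compared with the annotation under some temporal tolerance. The sketch below counts predictions that land within a tolerance of a not-yet-matched ground-truth timestamp. The greedy one-to-one matching rule and the tolerance values are assumptions for illustration; the abstract does not specify the evaluation protocol.

```python
def count_correct_spots(predicted_times, true_times, tolerance_s):
    """Count predictions within tolerance_s of an unmatched ground-truth time.

    Greedy matching: predictions are visited in temporal order and each takes
    the closest still-unmatched annotation inside the tolerance window, so an
    annotation can be credited at most once.
    """
    unmatched = sorted(true_times)
    hits = 0
    for t in sorted(predicted_times):
        best = None
        for gt in unmatched:
            within = abs(gt - t) <= tolerance_s
            if within and (best is None or abs(gt - t) < abs(best - t)):
                best = gt
        if best is not None:
            unmatched.remove(best)
            hits += 1
    return hits


# Made-up timestamps in seconds.
predictions = [10.0, 12.0, 50.0]
annotations = [11.0, 48.0]
print(count_correct_spots(predictions, annotations, tolerance_s=2.5))  # 2
print(count_correct_spots(predictions, annotations, tolerance_s=1.5))  # 1
```

The second prediction at 12.0 s is never credited, because the only nearby annotation has already been matched; duplicate detections of one action therefore do not inflate the count.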

PointRGCN: Graph Convolution Networks for 3D Vehicles Detection Refinement

Nov 27, 2019

Jesus Zarzar, Silvio Giancola, Bernard Ghanem

Abstract: In autonomous driving pipelines, perception modules provide a visual understanding of the surrounding road scene. Among the perception tasks, vehicle detection is of paramount importance for safe driving as it identifies the position of other agents sharing the road. In our work, we propose PointRGCN: a graph-based 3D object detection pipeline based on graph convolutional networks (GCNs) which operates exclusively on 3D LiDAR point clouds. To perform more accurate 3D object detection, we leverage a graph representation that performs proposal feature and context aggregation. We integrate residual GCNs in a two-stage 3D object detection pipeline, where 3D object proposals are refined using a novel graph representation. In particular, R-GCN is a residual GCN that classifies and regresses 3D proposals, and C-GCN is a contextual GCN that further refines proposals by sharing contextual information between multiple proposals. We integrate our refinement modules into a novel 3D detection pipeline, PointRGCN, and achieve state-of-the-art performance on the easy difficulty for the bird's-eye-view detection task.
