Title: Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition

Jun 27, 2018

Dongliang He, Fu Li, Qijie Zhao, Xiang Long, Yi Fu, Shilei Wen


Abstract: This report describes in detail our approach to the ActivityNet 2018 Kinetics-600 challenge. Although existing state-of-the-art work on this task has proposed spatial-temporal modelling methods, which adopt either end-to-end frameworks such as I3D \cite{i3d} or two-stage frameworks (i.e., CNN+RNN), video modelling is far from being well solved. In this challenge, we propose a spatial-temporal network (StNet) for better joint spatial-temporal modelling and more comprehensive video understanding. In addition, because a video source contains multi-modal information, we integrate both early-fusion and late-fusion strategies for multi-modal information via our proposed improved temporal Xception network (iTXN). Our StNet RGB single model achieves 78.99% top-1 precision on the Kinetics-600 validation set, and our improved temporal Xception network, which integrates the RGB, flow and audio modalities, reaches 82.35%. After model ensembling, we achieve top-1 precision as high as 85.0% on the validation set and rank No. 1 among all submissions.
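As a rough illustration of the late-fusion and ensembling idea mentioned in the abstract, the sketch below averages per-modality class probabilities and scores the result with top-1 accuracy. It is not the authors' iTXN or their ensembling procedure, whose details the abstract does not give; the function names, the weighted-average rule, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(per_modality_logits, weights=None):
    """Fuse modalities by a weighted average of their class probabilities.

    per_modality_logits: dict mapping a modality name (hypothetical, e.g.
    "rgb", "flow", "audio") to a list of class scores for one video.
    weights: optional dict of non-negative weights; defaults to uniform.
    """
    names = list(per_modality_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(per_modality_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(per_modality_logits[name])
        for cls, prob in enumerate(probs):
            fused[cls] += weights[name] / total_weight * prob
    return fused


def top1_accuracy(predictions, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for probs, label in zip(predictions, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy scores for one video over three classes (made-up numbers).
video_logits = {
    "rgb": [2.0, 1.0, 0.0],
    "flow": [0.0, 3.0, 0.0],
    "audio": [0.0, 0.0, 0.0],
}
fused_probs = late_fusion(video_logits)
```

Averaging probabilities rather than raw scores keeps each modality on the same scale, so a single over-confident model cannot dominate purely through the magnitude of its outputs. Early fusion, by contrast, would combine the modalities' features before classification and is not shown here.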
