Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Li-Wei Wang

ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

May 25, 2021

Meng-Jiun Chiou, Chun-Yu Liao, Li-Wei Wang, Roger Zimmermann, Jiashi Feng

Figure 1 for ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Figure 2 for ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Figure 3 for ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Figure 4 for ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Abstract:Detecting human-object interactions (HOI) is an important step toward a comprehensive visual understanding of machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame, where the neighboring frames play an essential role. However, conventional HOI methods operating on only static images have been used to predict temporal-related interactions, which is essentially guessing without temporal contexts and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI) utilizing temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.

* Accepted at ACM ICMR'21 Workshop on Intelligent Cross-Data Analysis and Retrieval. The dataset and source code are available at https://github.com/coldmanck/VidHOI

Via

Access Paper or Ask Questions

eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

Oct 13, 2019

Chao-Tsung Huang, Yu-Chun Ding, Huan-Ching Wang, Chi-Wen Weng, Kai-Ping Lin, Li-Wei Wang, Li-De Chen

Figure 1 for eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

Figure 2 for eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

Figure 3 for eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

Figure 4 for eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

Abstract:Convolutional neural networks (CNNs) have recently demonstrated superior quality for computational imaging applications. Therefore, they have great potential to revolutionize the image pipelines on cameras and displays. However, it is difficult for conventional CNN accelerators to support ultra-high-resolution videos at the edge due to their considerable DRAM bandwidth and power consumption. Therefore, finding a further memory- and computation-efficient microarchitecture is crucial to speed up this coming revolution. In this paper, we approach this goal by considering the inference flow, network model, instruction set, and processor design jointly to optimize hardware performance and image quality. We apply a block-based inference flow which can eliminate all the DRAM bandwidth for feature maps and accordingly propose a hardware-oriented network model, ERNet, to optimize image quality based on hardware constraints. Then we devise a coarse-grained instruction set architecture, FBISA, to support power-hungry convolution by massive parallelism. Finally,we implement an embedded processor---eCNN---which accommodates to ERNet and FBISA with a flexible processing architecture. Layout results show that it can support high-quality ERNets for super-resolution and denoising at up to 4K Ultra-HD 30 fps while using only DDR-400 and consuming 6.94W on average. By comparison, the state-of-the-art Diffy uses dual-channel DDR3-2133 and consumes 54.3W to support lower-quality VDSR at Full HD 30 fps. Lastly, we will also present application examples of high-performance style transfer and object recognition to demonstrate the flexibility of eCNN.

* 14 pages; appearing in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019

Via

Access Paper or Ask Questions