Wei Mao

TransMUSIC: A Transformer-Aided Subspace Method for DOA Estimation with Low-Resolution ADCs

Sep 15, 2023
Junkai Ji, Wei Mao, Feng Xi, Shengyao Chen

Direction of arrival (DOA) estimation with low-resolution analog-to-digital converters (ADCs) has emerged as a challenging and intriguing problem, particularly with the rise in popularity of large-scale arrays. The substantial quantization distortion complicates the extraction of signal and noise subspaces from the quantized data. To address this issue, this paper introduces a novel approach that leverages the Transformer model to aid subspace estimation. In this model, multiple snapshots are processed in parallel, enabling the capture of global correlations that span them. The learned subspace empowers us to construct the MUSIC spectrum and perform gridless DOA estimation using a neural network-based peak finder. Additionally, the acquired subspace encodes vital information about the model order, allowing us to determine the exact number of sources. These integrated components form a unified algorithmic framework referred to as TransMUSIC. Numerical results demonstrate the superiority of the TransMUSIC algorithm even when dealing with one-bit quantized data, highlighting the potential of Transformer-based techniques in DOA estimation.

* 5 pages, 5 figures 
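
For context, here is a minimal NumPy sketch of the classical MUSIC pseudospectrum that the learned subspace feeds into; in the paper's pipeline, the Transformer's subspace estimate would stand in for the eigendecomposition of a sample covariance matrix. The half-wavelength ULA steering model, the grid, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def music_spectrum(En, n_sensors, grid_deg):
    """MUSIC pseudospectrum for a half-wavelength uniform linear array.

    En: (n_sensors, n_sensors - n_sources) noise-subspace basis; here it
    would come from the Transformer's subspace estimate rather than an
    eigendecomposition of the sample covariance.
    """
    theta = np.deg2rad(np.asarray(grid_deg))           # candidate angles
    k = np.arange(n_sensors)[:, None]                  # sensor indices
    A = np.exp(1j * np.pi * k * np.sin(theta)[None])   # steering matrix (M, G)
    proj = En.conj().T @ A                             # project onto noise subspace
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=0)     # peaks at the source DOAs

# Toy usage: 8-element ULA, 2 sources, random orthonormal noise subspace.
M, D = 8, 2
Q, _ = np.linalg.qr(np.random.randn(M, M) + 1j * np.random.randn(M, M))
spectrum = music_spectrum(Q[:, D:], M, np.linspace(-90, 90, 361))
```

A neural peak finder, as described in the abstract, would then locate off-grid maxima of this spectrum instead of taking a simple argmax over the grid.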

VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos

Apr 21, 2023
Huiyu Gao, Wei Mao, Miaomiao Liu

We propose VisFusion, a visibility-aware online 3D scene reconstruction approach for posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods, which aggregate features for each voxel from input views without considering its visibility, we improve feature fusion by explicitly inferring each voxel's visibility from a similarity matrix computed from its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline that includes a volume sparsification process. Unlike these works, which sparsify voxels globally with a fixed occupancy threshold, we perform sparsification on a local feature volume along each visual ray, preserving at least one voxel per ray to retain finer details. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict the TSDF in a coarse-to-fine manner by learning its residuals across scales, leading to better TSDF predictions. Experimental results on benchmarks show that our method achieves superior performance with more scene details. Code is available at: https://github.com/huiyu-gao/VisFusion

* CVPR 2023 
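
As a rough illustration of the ray-wise sparsification idea, the following PyTorch sketch keeps the top-scoring voxels along each visual ray rather than applying a single global threshold, which guarantees at least one voxel per ray. The shapes, keep ratio, and function names are assumptions, not the paper's code.

```python
import torch

def sparsify_along_rays(occupancy, keep_ratio=0.25):
    """Ray-wise sparsification sketch. `occupancy` holds per-voxel scores
    grouped by visual ray, shape (n_rays, voxels_per_ray). Instead of one
    global occupancy threshold, keep the top-k voxels on every ray, so
    each ray always retains at least one voxel."""
    n_rays, per_ray = occupancy.shape
    k = max(1, int(per_ray * keep_ratio))            # never drop a whole ray
    topk = occupancy.topk(k, dim=1).indices          # best-scoring voxels per ray
    mask = torch.zeros_like(occupancy, dtype=torch.bool)
    mask.scatter_(1, topk, True)                     # True = voxel kept
    return mask

mask = sparsify_along_rays(torch.rand(1024, 48))     # e.g. 1024 rays, 48 voxels each
```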

Interacting Hand-Object Pose Estimation via Dense Mutual Attention

Nov 16, 2022
Rong Wang, Wei Mao, Hongdong Li

3D hand-object pose estimation is key to the success of many computer vision applications. The main focus of this task is to effectively model the interaction between the hand and the object. To this end, existing works either rely on interaction constraints in a computationally expensive iterative optimization, or consider only a sparse correlation between sampled hand and object keypoints. In contrast, we propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object. Specifically, we first construct the hand and object graphs according to their mesh structures. For each hand node, we aggregate features from every object node via learned attention, and vice versa for each object node. Thanks to such dense mutual attention, our method is able to produce physically plausible poses with high quality at real-time inference speed. Extensive quantitative and qualitative experiments on large benchmark datasets show that our method outperforms state-of-the-art methods. The code is available at https://github.com/rongakowang/DenseMutualAttention.git.
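
The sketch below conveys the dense mutual attention idea using standard multi-head cross-attention between the two node sets, with each set attending to every node of the other. It does not reproduce the paper's graph construction or exact layer design; dimensions (e.g., 778 hand vertices as in the MANO mesh) are illustrative.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Dense mutual attention sketch: every hand node attends to every
    object node and vice versa, with residual feature updates."""
    def __init__(self, dim=128):
        super().__init__()
        self.o2h = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.h2o = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, hand, obj):                 # (B, Nh, C), (B, No, C)
        hand_out, _ = self.o2h(hand, obj, obj)    # hand queries, object keys/values
        obj_out, _ = self.h2o(obj, hand, hand)    # object queries, hand keys/values
        return hand + hand_out, obj + obj_out     # residual updates

hand, obj = torch.randn(2, 778, 128), torch.randn(2, 1000, 128)
hand_feat, obj_feat = MutualAttention()(hand, obj)
```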

Contact-aware Human Motion Forecasting

Oct 08, 2022
Wei Mao, Miaomiao Liu, Richard Hartley, Mathieu Salzmann

In this paper, we tackle the task of scene-aware 3D human motion forecasting, which consists of predicting future human poses given a 3D scene and a past human motion. A key challenge of this task is to ensure consistency between the human and the scene, accounting for human-scene interactions. Previous attempts to do so model such interactions only implicitly, and thus tend to produce artifacts such as "ghost motion" because of the lack of explicit constraints between the local poses and the global motion. Here, by contrast, we propose to explicitly model the human-scene contacts. To this end, we introduce distance-based contact maps that capture the contact relationships between every joint and every 3D scene point at each time instant. We then develop a two-stage pipeline that first predicts the future contact maps from the past ones and the scene point cloud, and then forecasts the future human poses by conditioning them on the predicted contact maps. During training, we explicitly encourage consistency between the global motion and the local poses via a prior defined using the contact maps and future poses. Our approach outperforms the state-of-the-art human motion forecasting and human synthesis methods on both synthetic and real datasets. Our code is available at https://github.com/wei-mao-2019/ContAwareMotionPred.

* Accepted to NeurIPS 2022 
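
A minimal sketch of a distance-based contact map follows, assuming a Gaussian mapping from joint-to-point distance to a (0, 1] contact score (1 = touching); the paper's exact formulation and the sigma value are not reproduced here.

```python
import torch

def contact_maps(joints, scene, sigma=0.1):
    """Distance-based contact map sketch: for every joint and every scene
    point at each time step, convert the Euclidean distance into a soft
    contact score. joints: (T, J, 3) poses over time; scene: (P, 3)."""
    T = joints.shape[0]
    dists = torch.cdist(joints, scene.expand(T, -1, -1))   # (T, J, P) distances
    return torch.exp(-dists ** 2 / (2 * sigma ** 2))       # (T, J, P) scores

# Toy usage: 30 frames, 22 joints, 4096 scene points.
maps = contact_maps(torch.randn(30, 22, 3), torch.randn(4096, 3))
```

In the two-stage pipeline described above, such maps would first be predicted for future frames and then used to condition the pose forecast.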

Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction

May 31, 2022
Wei Mao, Miaomiao Liu, Mathieu Salzmann

We introduce the task of action-driven stochastic human motion prediction, which aims to predict multiple plausible future motions given a sequence of action labels and a short motion history. This differs from existing works, which predict motions that either do not respect any specific action category or follow a single action label. In particular, addressing this task requires tackling two challenges: the transitions between different actions must be smooth, and the length of the predicted motion depends on the action sequence and varies significantly across samples. As we cannot realistically expect training data to cover sufficiently diverse action transitions and motion lengths, we propose an effective training strategy that combines multiple motions from different actions and introduces a weak form of supervision to encourage smooth transitions. We then design a VAE-based model conditioned on both the observed motion and the action label sequence, allowing us to generate multiple plausible future motions of varying length. We illustrate the generality of our approach by exploring its use with two different temporal encoding models, namely RNNs and Transformers. Our approach outperforms baseline models constructed by adapting state-of-the-art single-action-conditioned motion generation methods and stochastic human motion prediction approaches to our new task. Our code is available at https://github.com/wei-mao-2019/WAT.

* CVPR 2022 (Oral) 
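
As a toy illustration of the generative side, the sketch below decodes a future sequence from a sampled latent code conditioned on the last observed pose and a one-hot action label. The GRU decoder, all dimensions, and the fixed output length are assumptions made for brevity; the paper's model handles variable-length motions and full action label sequences.

```python
import torch
import torch.nn as nn

class ActionCVAE(nn.Module):
    """Toy conditional-VAE decoder: sample a latent code, condition on the
    observed pose and action label, and decode future poses autoregressively."""
    def __init__(self, pose_dim=48, n_actions=15, latent=64, hidden=256):
        super().__init__()
        self.latent = latent
        self.embed = nn.Linear(latent + pose_dim + n_actions, hidden)
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, last_pose, action_onehot, n_frames):
        z = torch.randn(last_pose.shape[0], self.latent)   # stochastic sample
        h = torch.tanh(self.embed(torch.cat([z, last_pose, action_onehot], -1)))
        h, x, poses = h.unsqueeze(0), last_pose.unsqueeze(1), []
        for _ in range(n_frames):                          # autoregressive decoding
            y, h = self.gru(x, h)
            x = self.out(y)                                # next predicted pose
            poses.append(x)
        return torch.cat(poses, dim=1)                     # (B, n_frames, pose_dim)

# Toy usage: 4 sequences, 60 future frames, action labels 2, 2, 5, 7.
future = ActionCVAE()(torch.randn(4, 48), torch.eye(15)[[2, 2, 5, 7]], 60)
```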

Generating Smooth Pose Sequences for Diverse Human Motion Prediction

Aug 21, 2021
Wei Mao, Miaomiao Liu, Mathieu Salzmann

Recent progress in stochastic motion prediction, i.e., predicting multiple possible future human motions given a single past pose sequence, has made it possible to produce truly diverse future motions and even to provide control over the motion of some body parts. However, to achieve this, the state-of-the-art method requires learning several mappings for diversity and a dedicated model for controllable motion prediction. In this paper, we introduce a unified deep generative network for both diverse and controllable motion prediction. To this end, we leverage the intuition that realistic human motions consist of smooth sequences of valid poses, and that, given limited data, learning a pose prior is much more tractable than learning a motion prior. We therefore design a generator that predicts the motion of different body parts sequentially, and introduce a normalizing-flow-based pose prior, together with a joint angle loss, to achieve motion realism. Our experiments on two standard benchmark datasets, Human3.6M and HumanEva-I, demonstrate that our approach outperforms the state-of-the-art baselines in terms of both sample diversity and accuracy. The code is available at https://github.com/wei-mao-2019/gsps

* ICCV 2021 (Oral) 
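
To make the pose-prior idea concrete, here is a toy RealNVP-style coupling layer showing how a normalizing flow can score a pose via the change-of-variables formula: log p(pose) is the base log-density of the transformed pose plus the log-determinant of the transform. This is a generic stand-in, not the paper's prior.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer of a normalizing flow over pose vectors."""
    def __init__(self, dim=48):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 128), nn.ReLU(),
                                 nn.Linear(128, dim))      # predicts scale & shift

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        z2 = x2 * torch.exp(s) + t                         # invertible transform
        return torch.cat([x1, z2], -1), s.sum(-1)          # output, log|det J|

flow = AffineCoupling()
z, logdet = flow(torch.randn(8, 48))                       # batch of 8 poses
base = torch.distributions.Normal(0.0, 1.0)
log_prob = base.log_prob(z).sum(-1) + logdet               # pose prior score
```

Maximizing this log-probability over generated poses is one way such a prior can be used as a realism loss during training.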

Multi-level Motion Attention for Human Motion Prediction

Jun 17, 2021
Wei Mao, Miaomiao Liu, Mathieu Salzmann, Hongdong Li

Human motion prediction aims to forecast future human poses given a historical motion. Whether based on recurrent or feed-forward neural networks, existing learning-based methods fail to model the observation that human motion tends to repeat itself, even for complex sports actions and cooking activities. Here, we introduce an attention-based feed-forward network that explicitly leverages this observation. In particular, instead of modeling frame-wise attention via pose similarity, we propose to extract motion attention to capture the similarity between the current motion context and the historical motion sub-sequences. In this context, we study the use of different types of attention, computed at the joint, body part, and full-pose levels. Aggregating the relevant past motions and processing the result with a graph convolutional network allows us to effectively exploit motion patterns from the long-term history to predict future poses. Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodic and non-periodic actions. Thanks to our attention model, our approach yields state-of-the-art results on all three datasets. Our code is available at https://github.com/wei-mao-2019/HisRepItself.

* Accepted by IJCV. arXiv admin note: substantial text overlap with arXiv:2007.11755 
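
The sketch below conveys the core motion-attention computation: the latest motion context is matched against all historical sub-sequences of the same length, and the frames following each match are aggregated. Cosine similarity replaces the paper's learned key/query mappings, the DCT representation used in the paper is omitted, and all shapes are illustrative; feeding joint- or part-level features instead of full poses would give the joint- and part-level variants.

```python
import torch
import torch.nn.functional as F

def motion_attention(history, context_len=10, sub_len=10):
    """Motion attention sketch. history: (B, T, D) past poses; returns an
    aggregation of the sub_len frames that follow each matched context."""
    B, T, D = history.shape
    query = history[:, -context_len:].reshape(B, -1)       # current motion context
    keys, values = [], []
    for s in range(T - context_len - sub_len + 1):         # slide over the history
        keys.append(history[:, s:s + context_len].reshape(B, -1))
        values.append(history[:, s + context_len:s + context_len + sub_len])
    keys = torch.stack(keys, 1)                            # (B, N, context_len * D)
    values = torch.stack(values, 1)                        # (B, N, sub_len, D)
    att = F.softmax(F.cosine_similarity(query[:, None], keys, dim=-1), dim=1)
    return (att[..., None, None] * values).sum(1)          # (B, sub_len, D)

agg = motion_attention(torch.randn(2, 50, 66))             # 50 past frames, 66-D poses
```

The aggregated motion would then be processed together with the context by a graph convolutional network to predict the future poses.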

Panoptic Lintention Network: Towards Efficient Navigational Perception for the Visually Impaired

Mar 06, 2021
Wei Mao, Jiaming Zhang, Kailun Yang, Rainer Stiefelhagen

Classic computer vision algorithms for instance segmentation and semantic segmentation cannot provide a holistic understanding of the surroundings for the visually impaired. In this paper, we utilize panoptic segmentation to assist the navigation of visually impaired people by efficiently offering awareness of both things and stuff in their proximity. To this end, we propose an efficient attention module, Lintention, which can model long-range interactions in linear time using linear space. Based on Lintention, we then devise a novel panoptic segmentation model, which we term Panoptic Lintention Net. Experiments on the COCO dataset indicate that Panoptic Lintention Net raises the Panoptic Quality (PQ) from 39.39 to 41.42, a 4.6% performance gain, while requiring 10% fewer GFLOPs and 25% fewer parameters in the semantic branch. Furthermore, a real-world test with our compact wearable panoptic segmentation system indicates that the system achieves stable and remarkably good panoptic segmentation in real-world scenes.

* 6 pages, 4 figures, 2 tables 
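
The following is a generic kernel-feature-map linear attention sketch, not the paper's exact Lintention module; it shows the standard trick such modules build on: a positive feature map plus reordered matrix products reduce attention from quadratic to linear complexity in the number of tokens.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention: computing (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V)
    avoids the N x N attention matrix, giving O(N) time and memory in the
    sequence length N. q, k, v: (B, N, C)."""
    q = F.elu(q) + 1                             # positive feature map
    k = F.elu(k) + 1
    kv = torch.einsum('bnc,bnd->bcd', k, v)      # (B, C, C): key/value summary
    z = 1.0 / (torch.einsum('bnc,bc->bn', q, k.sum(1)) + eps)  # normalizer
    return torch.einsum('bnc,bcd,bn->bnd', q, kv, z)

out = linear_attention(*(torch.randn(2, 4096, 64) for _ in range(3)))
```

Reordering the products is the whole point: the N x N similarity matrix of softmax attention never gets materialized, which is what makes long-range interactions affordable on dense feature maps.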

History Repeats Itself: Human Motion Prediction via Motion Attention

Jul 23, 2020
Wei Mao, Miaomiao Liu, Mathieu Salzmann

Human motion prediction aims to forecast future human poses given a past motion. Whether based on recurrent or feed-forward neural networks, existing methods fail to model the observation that human motion tends to repeat itself, even for complex sports actions and cooking activities. Here, we introduce an attention-based feed-forward network that explicitly leverages this observation. In particular, instead of modeling frame-wise attention via pose similarity, we propose to extract motion attention to capture the similarity between the current motion context and the historical motion sub-sequences. Aggregating the relevant past motions and processing the result with a graph convolutional network allows us to effectively exploit motion patterns from the long-term history to predict future poses. Our experiments on Human3.6M, AMASS and 3DPW demonstrate the benefits of our approach for both periodic and non-periodic actions. Thanks to our attention model, our approach yields state-of-the-art results on all three datasets. Our code is available at https://github.com/wei-mao-2019/HisRepItself.

* Accepted by ECCV 2020 