Xi Peng

Rutgers University

Automatic Health Problem Detection from Gait Videos Using Deep Neural Networks

Jun 04, 2019
Rahil Mehrizi, Xi Peng, Shaoting Zhang, Ruisong Liao, Kang Li

The aim of this study is to develop an automatic system for detecting gait-related health problems using Deep Neural Networks (DNNs). The proposed system takes a video of a patient as input and estimates their 3D body pose using a DNN-based method. Our code is publicly available at https://github.com/rmehrizi/multi-view-pose-estimation. The resulting 3D body pose time series are then analyzed by a classifier, which assigns each input gait video to one of four groups: healthy, Parkinson's disease, post-stroke, and orthopedic problems. The proposed system removes the need for complex and heavy equipment and large laboratory space, making it practical for home use. Moreover, it requires no domain knowledge for feature engineering, since it is capable of extracting semantic and high-level features from the input data. The experimental results showed classification accuracies of 56% to 96% across the different groups. Furthermore, only 1 out of 25 healthy subjects was misclassified (false positive), and only 1 out of 70 patients was classified as a healthy subject (false negative). This study presents a starting point toward a powerful tool for automatic classification of gait disorders and can serve as a basis for future applications of Deep Learning in clinical gait analysis. Since the system requires only digital cameras, it can be employed in the domestic environments of patients and elderly people for consistent gait monitoring and early detection of gait alterations.
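
As a rough illustration of the classification stage, the sketch below maps a 3D pose time series to the four groups with a small 1D-CNN. The architecture, joint count, and all names are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

# Hypothetical classifier over 3D pose time series.
# Input: (batch, frames, joints * 3 coordinates); the four output
# classes mirror the groups named in the abstract.
CLASSES = ["healthy", "parkinsons", "post_stroke", "orthopedic"]

class GaitClassifier(nn.Module):
    def __init__(self, n_joints=17, n_classes=len(CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_joints * 3, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size descriptor
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, poses):       # poses: (batch, frames, joints * 3)
        x = poses.transpose(1, 2)   # -> (batch, channels, frames) for Conv1d
        return self.fc(self.features(x).squeeze(-1))

logits = GaitClassifier()(torch.randn(8, 120, 17 * 3))  # 8 clips, 120 frames each
```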

Semantic Graph Convolutional Networks for 3D Human Pose Regression

Apr 06, 2019
Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, Dimitris N. Metaxas

In this paper, we study the problem of learning Graph Convolutional Networks (GCNs) for regression. Current GCN architectures are limited by the small receptive field of their convolution filters and by the transformation matrix shared across all nodes. To address these limitations, we propose Semantic Graph Convolutional Networks (SemGCN), a novel neural network architecture that operates on regression tasks with graph-structured data. SemGCN learns to capture semantic information, such as local and global node relationships, that is not explicitly represented in the graph. These semantic relationships can be learned through end-to-end training from the ground truth without additional supervision or hand-crafted rules. We further investigate applying SemGCN to 3D human pose regression. Our formulation is intuitive and sufficient, since both 2D and 3D human poses can be represented as a structured graph encoding the relationships between joints in the skeleton of a human body. We carry out comprehensive studies to validate our method. The results prove that SemGCN outperforms the state of the art while using 90% fewer parameters.
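
To make the core idea concrete, below is a minimal sketch of a graph convolution whose edge weights are learned rather than fixed, in the spirit of SemGCN. The masked-softmax parameterization and all names are assumptions for illustration, not the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemGraphConv(nn.Module):
    """Graph conv with learned edge weights (a sketch of the SemGCN idea)."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.mask = adjacency > 0                 # fixed skeleton connectivity
        # One learnable score per node pair; the softmax below keeps only
        # real edges, so relative neighbor importance is learned end to end.
        self.scores = nn.Parameter(torch.zeros_like(adjacency))

    def forward(self, x):                         # x: (batch, nodes, in_dim)
        neg_inf = torch.full_like(self.scores, float("-inf"))
        att = torch.where(self.mask, self.scores, neg_inf)
        att = F.softmax(att, dim=-1)              # learned weights over neighbors
        return att @ self.W(x)                    # aggregate transformed features

adj = torch.eye(16) + torch.rand(16, 16).round()  # toy skeleton with self-loops
out = SemGraphConv(2, 64, adj)(torch.randn(4, 16, 2))  # e.g., 16 joints, 2D input
```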

* In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. (13 pages including supplementary material) 

Learning where to look: Semantic-Guided Multi-Attention Localization for Zero-Shot Learning

Mar 01, 2019
Yizhe Zhu, Jianwen Xie, Zhiqiang Tang, Xi Peng, Ahmed Elgammal

Zero-shot learning extends conventional object classification to unseen-class recognition by introducing semantic representations of classes. Existing approaches predominantly focus on learning the proper mapping function for visual-semantic embedding while neglecting the effect of learning discriminative visual features. In this paper, we study the significance of discriminative region localization. We propose a semantic-guided multi-attention localization model that automatically discovers the most discriminative parts of objects for zero-shot learning without any human annotations. Our model jointly learns cooperative global and local features from the whole object as well as the detected parts to categorize objects based on semantic descriptions. Moreover, with the joint supervision of an embedding softmax loss and a class-center triplet loss, the model is encouraged to learn features with high inter-class dispersion and intra-class compactness. Through comprehensive experiments on three widely used zero-shot learning benchmarks, we show the efficacy of multi-attention localization; our proposed approach improves state-of-the-art results by a considerable margin.
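
A hedged sketch of one ingredient, the class-center triplet loss: each feature is pulled toward its own class center and pushed away from the nearest other center by a margin. The exact formulation in the paper may differ; the function name, margin value, and shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

def class_center_triplet_loss(feats, labels, centers, margin=1.0):
    """Pull each feature toward its class center, push it away from the
    nearest wrong-class center by at least `margin` (hypothetical form)."""
    d = torch.cdist(feats, centers)                    # (batch, n_classes)
    pos = d.gather(1, labels.unsqueeze(1)).squeeze(1)  # distance to own center
    d_other = d.scatter(1, labels.unsqueeze(1), float("inf"))
    neg = d_other.min(dim=1).values                    # closest wrong center
    return F.relu(pos - neg + margin).mean()

feats = torch.randn(16, 128)
centers = torch.randn(10, 128)          # in practice a learnable nn.Parameter
labels = torch.randint(0, 10, (16,))
loss = class_center_triplet_loss(feats, labels, centers)
```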

* under review 

k-meansNet: When k-means Meets Differentiable Programming

Aug 22, 2018
Xi Peng, Joey Tianyi Zhou, Hongyuan Zhu

In this paper, we study how clustering can benefit from differentiable programming, whose basic idea is to treat the neural network as a language rather than a machine learning method. To this end, we recast vanilla $k$-means as a novel feedforward neural network in an elegant way. Our contribution is two-fold. On the one hand, the proposed \textit{k}-meansNet is a neural network implementation of vanilla \textit{k}-means that enjoys four highly desired advantages: robustness to initialization, fast inference speed, the capability of handling newly arriving data, and provable convergence. On the other hand, this work may provide novel insights into differentiable programming. More specifically, most existing differentiable programming works unroll an \textbf{optimizer} as a \textbf{recurrent neural network}; that is, the neural network is employed to solve an existing optimization problem. In contrast, we reformulate the \textbf{objective function} of \textit{k}-means as a \textbf{feedforward neural network}; that is, we employ the neural network to describe a problem. In this way, we advance the boundary of differentiable programming by shifting the role of the neural network from an alternative optimization approach to the problem formulation itself. Extensive experimental studies show that our method achieves promising performance compared with 12 clustering methods on several challenging datasets.
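
To give a flavor of recasting the $k$-means objective as a feedforward network, here is a minimal sketch in which soft cluster assignments come from a softmax over negative squared distances to learnable centroids. This is a generic construction for illustration, not the paper's exact derivation.

```python
import torch
import torch.nn as nn

class SoftKMeans(nn.Module):
    """Feedforward layer producing soft k-means assignments (a sketch)."""
    def __init__(self, n_clusters, dim, temperature=1.0):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))
        self.temperature = temperature

    def forward(self, x):                            # x: (batch, dim)
        d2 = torch.cdist(x, self.centroids) ** 2     # squared distances
        # Soft assignment: softmax over negative squared distances, so the
        # whole clustering objective is differentiable and trainable by SGD.
        return torch.softmax(-d2 / self.temperature, dim=1)

assign = SoftKMeans(n_clusters=5, dim=32)(torch.randn(64, 32))
hard_labels = assign.argmax(dim=1)                   # recover hard clustering
```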

* 10 pages 

CU-Net: Coupled U-Nets

Aug 20, 2018
Zhiqiang Tang, Xi Peng, Shijie Geng, Yizhe Zhu, Dimitris N. Metaxas

We design a new connectivity pattern for the U-Net architecture. Given several stacked U-Nets, we couple each U-Net pair through connections between their semantic blocks, resulting in coupled U-Nets (CU-Net). The coupling connections allow information to flow more efficiently across U-Nets, and the feature reuse across U-Nets makes each U-Net very parameter efficient. We evaluate the coupled U-Nets on two benchmark datasets of human pose estimation, comparing both accuracy and the number of model parameters. The CU-Net obtains accuracy comparable to state-of-the-art methods while using at least 60% fewer parameters than other approaches.
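
A minimal sketch of what coupling might look like: the block at a given resolution in the current U-Net also receives the same-resolution features of earlier U-Nets, fused by concatenation and a 1x1 convolution. The fusion operator and all shapes are illustrative assumptions, not the paper's exact wiring.

```python
import torch
import torch.nn as nn

def couple(current, previous_feats):
    """Fuse the same-resolution 'semantic block' outputs of earlier U-Nets
    into the current one via channel concatenation (a simplification)."""
    return torch.cat([current] + previous_feats, dim=1)

# Example: the block at one resolution in U-Net #3 also sees the
# corresponding blocks from U-Nets #1 and #2.
u1 = torch.randn(2, 32, 64, 64)
u2 = torch.randn(2, 32, 64, 64)
u3 = torch.randn(2, 32, 64, 64)
fused = couple(u3, [u1, u2])                 # (2, 96, 64, 64)
reduce = nn.Conv2d(96, 32, kernel_size=1)    # 1x1 conv keeps the width constant
out = reduce(fused)
```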

* BMVC 2018 (Oral) 

Quantized Densely Connected U-Nets for Efficient Landmark Localization

Aug 10, 2018
Zhiqiang Tang, Xi Peng, Shijie Geng, Lingfei Wu, Shaoting Zhang, Dimitris Metaxas

In this paper, we propose quantized densely connected U-Nets for efficient visual landmark localization. The idea is that features with the same semantic meaning are globally reused across the stacked U-Nets. This dense connectivity largely improves the information flow, yielding improved localization accuracy. However, a vanilla dense design would suffer from a critical efficiency issue in both training and testing. To solve this problem, we first propose order-K dense connectivity to trim off long-distance shortcuts; then, we use a memory-efficient implementation to significantly boost training efficiency and investigate an iterative refinement that may cut the model size in half. Finally, to reduce memory consumption and high-precision operations in both training and testing, we further quantize the weights, inputs, and gradients of our localization network to low bit-width numbers. We validate our approach on two tasks: human pose estimation and face alignment. The results show that our approach achieves state-of-the-art localization accuracy while using ~70% fewer parameters, ~98% less model size, and ~75% less training memory compared with other benchmark localizers. The code is available at https://github.com/zhiqiangdon/CU-Net.
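
Order-K connectivity can be read as: each block keeps shortcuts only from its K nearest predecessors, trimming the long-distance ones. A tiny sketch of that indexing rule, under the assumption that this is the intended trimming scheme:

```python
def order_k_inputs(block_index, k):
    """Indices of preceding blocks feeding block `block_index` under
    order-K dense connectivity (illustrative reading of the scheme).
    Vanilla dense connectivity would instead use all of range(block_index)."""
    return list(range(max(0, block_index - k), block_index))

# Order-2 keeps only the two closest predecessors:
for i in range(5):
    print(i, order_k_inputs(i, k=2))
# 0 []      1 [0]      2 [0, 1]      3 [1, 2]      4 [2, 3]
```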

* ECCV 2018 

Learning to Forecast and Refine Residual Motion for Image-to-Video Generation

Jul 26, 2018
Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, Dimitris Metaxas

We consider the problem of image-to-video translation, where an input image is translated into an output video containing motions of a single object. Recent methods for such problems typically train transformation networks to generate future frames conditioned on the structure sequence. Parallel work has shown that short high-quality motions can be generated by spatiotemporal generative networks that leverage temporal knowledge from the training data. We combine the benefits of both approaches and propose a two-stage generation framework where videos are generated from structures and then refined by temporal signals. To model motions more efficiently, we train networks to learn residual motion between the current and future frames, which avoids learning motion-irrelevant details. We conduct extensive experiments on two image-to-video translation tasks: facial expression retargeting and human pose forecasting. Superior results over the state-of-the-art methods on both tasks demonstrate the effectiveness of our approach.
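
A toy sketch of the residual-motion idea, assuming a simple convolutional predictor: the network outputs only the change between frames, and the next frame is reconstructed as the current frame plus that residual. Everything here is illustrative, not the paper's two-stage framework.

```python
import torch
import torch.nn as nn

class ResidualMotionPredictor(nn.Module):
    """Predict the *difference* between the current and next frame instead
    of the next frame itself, skipping motion-irrelevant details (a toy)."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, frame):
        residual = self.net(frame)     # motion-relevant change only
        return frame + residual        # next frame = current + residual

next_frame = ResidualMotionPredictor()(torch.randn(1, 3, 64, 64))
```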

* 17 pages, 8 figures, 4 tables, accepted by ECCV 2018 

CR-GAN: Learning Complete Representations for Multi-view Generation

Jun 28, 2018
Yu Tian, Xi Peng, Long Zhao, Shaoting Zhang, Dimitris N. Metaxas

Generating multi-view images from a single-view input is an essential yet challenging problem. It has broad applications in vision, graphics, and robotics. Our study indicates that the widely used generative adversarial network (GAN) may learn "incomplete" representations due to the single-pathway framework: an encoder-decoder network followed by a discriminator network. We propose CR-GAN to address this problem. In addition to the single reconstruction path, we introduce a generation sideway to maintain the completeness of the learned embedding space. The two learning pathways collaborate and compete in a parameter-sharing manner, yielding considerably improved generalization ability to "unseen" datasets. More importantly, the two-pathway framework makes it possible to combine both labeled and unlabeled data for self-supervised learning, which further enriches the embedding space for realistic generation. The experimental results prove that CR-GAN significantly outperforms state-of-the-art methods, especially when generating from "unseen" inputs in wild conditions.
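
A toy sketch of the two-pathway training idea with linear stand-ins for the encoder, shared generator, and discriminator; the losses and names are assumptions chosen for brevity, not CR-GAN's exact objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The generator G is shared by a reconstruction path (G(E(x))) and a
# generation sideway (G(z)), so it must cover the whole embedding space.
E = nn.Linear(784, 64)        # encoder (toy)
G = nn.Linear(64, 784)        # shared generator/decoder (toy)
D = nn.Linear(784, 1)         # discriminator (toy, left untrained here)
opt = torch.optim.Adam([*E.parameters(), *G.parameters()], lr=2e-4)

real = torch.randn(8, 784)
z = torch.randn(8, 64)                       # generation sideway input
loss = (
    F.softplus(-D(G(z))).mean()              # adversarial (non-saturating)
    + F.l1_loss(G(E(real)), real)            # reconstruction path
)
opt.zero_grad(); loss.backward(); opt.step()
```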

* 7 pages, 9 figures, accepted by IJCAI 2018 

Sketch-Based Face Editing in Videos Using Identity Deformation Transfer

May 31, 2018
Long Zhao, Fangda Han, Xi Peng, Xun Zhang, Mubbasir Kapadia, Vladimir Pavlovic, Dimitris N. Metaxas

We address the problem of using hand-drawn sketches to edit facial identity, such as enlarging the shape or modifying the position of the eyes or mouth, throughout an entire video. This task is formulated as a 3D face model reconstruction and deformation problem. We first introduce a two-stage real-time 3D face model fitting scheme to recover the facial identity and expressions from the video. The user's editing intention is recognized from the input sketches as a set of facial modifications. A novel identity deformation algorithm then transfers these facial deformations from 2D space directly to the 3D facial identity while preserving the facial expressions. After an optional stage for further refining the 3D face model, these changes are propagated to the whole video with the modified identity. Both the user study and experimental results demonstrate that our sketching framework helps users effectively edit facial identities in videos while ensuring high consistency and fidelity.
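
As a hedged illustration of why identity edits can survive expressions, consider a toy blendshape-style decomposition in which each frame's mesh is an identity component plus a per-frame expression term; sketch edits become per-vertex offsets on the identity component only. All numbers and names below are made up for the example.

```python
import numpy as np

# Toy decomposition: frame mesh = identity + per-frame expression delta.
# Identity edits from the sketch are applied as per-vertex offsets to the
# identity component only, so expressions in every frame are preserved.
n_vertices = 5000
identity = np.random.randn(n_vertices, 3)          # recovered neutral face
offsets = np.zeros((n_vertices, 3))
offsets[1000:1100, 1] += 0.02                      # e.g., raise the eye region

def deform_frame(expression_delta):
    return (identity + offsets) + expression_delta  # edit survives all frames

frame_mesh = deform_frame(np.random.randn(n_vertices, 3) * 0.01)
```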

Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation

May 24, 2018
Xi Peng, Zhiqiang Tang, Fei Yang, Rogerio Feris, Dimitris Metaxas

Random data augmentation is a critical technique for avoiding overfitting when training deep neural network models. However, data augmentation and network training are usually treated as two isolated processes, limiting the effectiveness of network training. Why not jointly optimize the two? We propose adversarial data augmentation to address this limitation. The main idea is to design an augmentation network (generator) that competes against a target network (discriminator) by generating `hard' augmentation operations online. The augmentation network explores the weaknesses of the target network, while the latter learns from `hard' augmentations to achieve better performance. We also design a reward/penalty strategy for effective joint training. We demonstrate our approach on the problem of human pose estimation and carry out a comprehensive experimental analysis, showing that our method can significantly improve state-of-the-art models without additional data effort.
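
A toy version of the joint loop, with a per-sample noise scale standing in for the paper's pose-specific augmentation operations: the augmenter is rewarded when the target's loss rises on its augmented samples, while the target trains on exactly those samples. The names, losses, and the augmentation itself are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

augmenter = nn.Linear(10, 1)               # proposes augmentation strength
target = nn.Linear(10, 2)                  # network being trained
opt_a = torch.optim.Adam(augmenter.parameters(), lr=1e-3)
opt_t = torch.optim.Adam(target.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
noise = torch.randn_like(x)

scale = torch.sigmoid(augmenter(x))        # 'hard' augmentation parameter
x_aug = x + scale * noise                  # apply the augmentation

# Target step: learn from the hard samples (augmenter held fixed via detach).
loss_t = F.cross_entropy(target(x_aug.detach()), y)
opt_t.zero_grad(); loss_t.backward(); opt_t.step()

# Augmenter step: rewarded when the target's loss goes up.
loss_a = -F.cross_entropy(target(x_aug), y)
opt_a.zero_grad(); loss_a.backward(); opt_a.step()
```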

* CVPR 2018 