Human motion prediction and understanding is a challenging problem. Due to the complex dynamic of human motion and the non-deterministic aspect of future prediction. We propose a novel sequence-to-sequence model for human motion prediction and feature learning, trained with a modified version of generative adversarial network, with a custom loss function that takes inspiration from human motion animation and can control the variation between multiple predicted motion from the same input poses. Our model learns to predict multiple future sequences of human poses from the same input sequence. We show that the discriminator learns general presentation of human motion by using the learned feature in action recognition task. Furthermore, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment network that learns the probability that a given sequence of poses is a real human motion or not. We test our model on two of the largest human pose datasets: NTURGB-D and Human3.6M. We train on both single and multiple action types. Its predictive power for motion estimation is demonstrated by generating multiple plausible futures from the same input and show the effect of each of the loss functions. Furthermore, we show that it takes less than half the number of epochs to train an activity recognition network by using the feature learned from the discriminator.
Stochastic gradient descent (SGD) is an inherently sequential training algorithm--computing the gradient at batch $i$ depends on the model parameters learned from batch $i-1$. Prior approaches that break this dependence do not honor them (e.g., sum the gradients for each batch, which is not what sequential SGD would do) and thus potentially suffer from poor convergence. This paper introduces a novel method to combine gradients called Adasum (for adaptive sum) that converges faster than prior work. Adasum is easy to implement, almost as efficient as simply summing gradients, and is integrated into the open-source toolkit Horovod. This paper first provides a formal justification for Adasum and then empirically demonstrates Adasum is more accurate than prior gradient accumulation methods. It then introduces a series of case-studies to show Adasum works with multiple frameworks, (TensorFlow and PyTorch), scales multiple optimizers (Momentum-SGD, Adam, and LAMB) to larger batch-sizes while still giving good downstream accuracy. Finally, it proves that Adasum converges. To summarize, Adasum scales Momentum-SGD on the MLPerf Resnet50 benchmark to 64K examples before communication (no MLPerf v0.5 entry converged with more than 16K), the Adam optimizer to 64K examples before communication on BERT-LARGE (prior work showed Adam stopped scaling at 16K), and the LAMB optimizer to 128K before communication on BERT-LARGE (prior work used 64K), all while maintaining downstream accuracy metrics. Finally, if a user does not need to scale, we show LAMB with Adasum on BERT-LARGE converges in 30% fewer steps than the baseline.
Quantization has emerged to be an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite having lower numerical precision, quantized DNNs are able to reduce both memory bandwidth and computation cycles with little losses of accuracy. Integer GEMM (General Matrix Multiplication) is critical to running quantized DNN models efficiently, as GEMM operations often dominate the computations in these models. Various approaches have been developed by leveraging techniques such as vectorization and memory layout to improve the performance of integer GEMM. However, these existing approaches are not fast enough in certain scenarios. We developed NGEMM, a compiler-based GEMM implementation for accelerating lower-precision training and inference. NGEMM has better use of the vector units by avoiding unnecessary vector computation that is introduced during tree reduction. We compared NGEMM's performance with the state-of-art BLAS libraries such as MKL. Our experimental results showed that NGEMM outperformed MKL non-pack and pack version by an average of 1.86x and 1.16x, respectively. We have applied NGEMM to a number of production services in Microsoft.
Inspired by CapsNet's routing-by-agreement mechanism, with its ability to learn object properties, and by center-of-mass calculations from physics, we propose a CapsNet architecture with object coordinate atoms and an LSTM network for evaluation. The first is based on CapsNet but uses a new routing algorithm to find the objects' approximate positions in the image coordinate system, and the second is a parameterized affine transformation network that can predict future positions from past positions by learning the translation transformation from 2D object coordinates generated from the first network. We demonstrate the learned translation transformation is transferable to another dataset without the need to train the transformation network again. Only the CapsNet needs training on the new dataset. As a result, our work shows that object recognition and motion prediction can be separated, and that motion prediction can be transferred to another dataset with different object types.
Predicting and understanding human motion dynamics has many applications, such as motion synthesis, augmented reality, security, and autonomous vehicles. Due to the recent success of generative adversarial networks (GAN), there has been much interest in probabilistic estimation and synthetic data generation using deep neural network architectures and learning algorithms. We propose a novel sequence-to-sequence model for probabilistic human motion prediction, trained with a modified version of improved Wasserstein generative adversarial networks (WGAN-GP), in which we use a custom loss function designed for human motion prediction. Our model, which we call HP-GAN, learns a probability density function of future human poses conditioned on previous poses. It predicts multiple sequences of possible future human poses, each from the same input sequence but a different vector z drawn from a random distribution. Furthermore, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment model that learns the probability that a given skeleton sequence is a real human motion. We test our algorithm on two of the largest skeleton datasets: NTURGB-D and Human3.6M. We train our model on both single and multiple action types. Its predictive power for long-term motion estimation is demonstrated by generating multiple plausible futures of more than 30 frames from just 10 frames of input. We show that most sequences generated from the same input have more than 50\% probabilities of being judged as a real human sequence. We will release all the code used in this paper to Github.
Crowd sourcing has become a widely adopted scheme to collect ground truth labels. However, it is a well-known problem that these labels can be very noisy. In this paper, we demonstrate how to learn a deep convolutional neural network (DCNN) from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers to label each input image, and compare four different approaches to utilizing the multiple labels: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. We show that the traditional majority voting scheme does not perform as well as the last two approaches that fully leverage the label distribution. An enhanced FER+ data set with multiple labels for each face image will also be shared with the research community.
With the increase number of companies focusing on commercializing Augmented Reality (AR), Virtual Reality (VR) and wearable devices, the need for a hand based input mechanism is becoming essential in order to make the experience natural, seamless and immersive. Hand pose estimation has progressed drastically in recent years due to the introduction of commodity depth cameras. Hand pose estimation based on vision is still a challenging problem due to its complexity from self-occlusion (between fingers), close similarity between fingers, dexterity of the hands, speed of the pose and the high dimension of the hand kinematic parameters. Articulated hand pose estimation is still an open problem and under intensive research from both academia and industry. The 2 approaches used for hand pose estimation are: discriminative and generative. Generative approach is a model based that tries to fit a hand model to the observed data. Discriminative approach is appearance based, usually implemented with machine learning (ML) and require a large amount of training data. Recent hand pose estimation uses hybrid approach by combining both discriminative and generative methods into a single hand pipeline. In this paper, we focus on reviewing recent progress of hand pose estimation from depth sensor. We will survey discriminative methods, generative methods and hybrid methods. This paper is not a comprehensive review of all hand pose estimation techniques, it is a subset of some of the recent state-of-the-art techniques.