Kyoung Mu Lee

PoseFix: Model-agnostic General Human Pose Refinement Network

Dec 10, 2018
Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee

Multi-person pose estimation from a 2D image is an essential technique for human behavior understanding. In this paper, we propose a human pose refinement network that estimates a refined pose from a tuple of an input image and an input pose. In previous methods, pose refinement was performed mainly through an end-to-end trainable multi-stage architecture; however, such architectures are highly dependent on the underlying pose estimation model and require careful model design. By contrast, we propose a model-agnostic pose refinement method. According to a recent study, state-of-the-art 2D human pose estimation methods have similar error distributions. We use these error statistics as prior information to generate synthetic poses and use the synthesized poses to train our model. At test time, the pose estimation results of any other method can be input to the proposed method. Moreover, the proposed model does not require code or knowledge about other methods, which allows it to be easily used as a post-processing step. We show that the proposed approach achieves better performance than conventional multi-stage refinement models and consistently improves the performance of various state-of-the-art pose estimation methods on the commonly used benchmark. We will release the code and pre-trained model for easy access.
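
A minimal sketch of how per-joint error statistics could drive synthetic-pose generation for training such a refinement network. The error categories, rates, noise scales, and the MIRROR index table are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

# Hypothetical left/right mirror indices for a COCO-style 17-joint skeleton.
MIRROR = [0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15]

def synthesize_input_pose(gt_pose, error_probs, rng=np.random):
    """Corrupt a ground-truth pose (17 x 2 array) with errors drawn from
    per-joint frequency statistics (jitter, left/right inversion, miss)."""
    pose = gt_pose.copy()
    for j in range(pose.shape[0]):
        err = rng.choice(["good", "jitter", "inversion", "miss"], p=error_probs[j])
        if err == "jitter":                      # small offset around the true location
            pose[j] += rng.normal(0.0, 5.0, size=2)
        elif err == "inversion":                 # confused with the mirrored joint
            pose[j] = gt_pose[MIRROR[j]] + rng.normal(0.0, 5.0, size=2)
        elif err == "miss":                      # far from any correct location
            pose[j] += rng.normal(0.0, 30.0, size=2)
    return pose
```

The corrupted pose and the image would then form the input tuple, with the clean ground-truth pose as the training target.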

SPNet: Deep 3D Object Classification and Retrieval using Stereographic Projection

Nov 05, 2018
Mohsen Yavartanoo, Eu Young Kim, Kyoung Mu Lee

We propose an efficient Stereographic Projection Neural Network (SPNet) for learning representations of 3D objects. We first transform a 3D input volume into a 2D planar image using stereographic projection. We then present a shallow 2D convolutional neural network (CNN) to estimate the object category, followed by view ensemble, which combines the responses from multiple views of the object to further enhance the predictions. Specifically, the proposed approach consists of four stages: (1) stereographic projection of a 3D object, (2) view-specific feature learning, (3) view selection, and (4) view ensemble. The proposed approach performs comparably to state-of-the-art methods while requiring substantially less GPU memory and fewer network parameters. Despite its light weight, experiments on 3D object classification and shape retrieval demonstrate the high performance of the proposed method.
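
A rough illustration of the stereographic projection step, assuming points are normalized to the unit sphere and projected from the north pole onto the z = 0 plane; the exact normalization and image rasterization used in the paper may differ:

```python
import numpy as np

def stereographic_project(points):
    """Project 3D points to the 2D plane via stereographic projection.

    Projection from the north pole (0, 0, 1):
        (x, y, z) -> (x / (1 - z), y / (1 - z))
    """
    points = points / np.linalg.norm(points, axis=1, keepdims=True)  # map onto the unit sphere
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    denom = np.clip(1.0 - z, 1e-6, None)  # avoid division by zero near the pole
    return np.stack([x / denom, y / denom], axis=1)
```

The resulting 2D coordinates can then be rasterized into a planar image that a shallow 2D CNN consumes.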

Real-time visual tracking by deep reinforced decision making

Aug 17, 2018
Janghoon Choi, Junseok Kwon, Kyoung Mu Lee

One of the major challenges of the model-free visual tracking problem is the difficulty arising from unpredictable and drastic changes in the appearance of the target objects. Existing methods tackle this problem by updating the appearance model online in order to adapt to these appearance changes. Despite the success of these methods, however, inaccurate and erroneous updates of the appearance model result in tracker drift. In this paper, we introduce a novel real-time visual tracking algorithm based on a template selection strategy constructed by deep reinforcement learning methods. The tracking algorithm utilizes this strategy to choose the appropriate template for tracking a given frame. The template selection strategy is self-learned with a simple policy gradient method on numerous training episodes randomly generated from a tracking benchmark dataset. Our proposed reinforcement learning framework is generally applicable to other confidence-map-based tracking algorithms. Experiments show that our tracking algorithm runs at a real-time speed of 43 fps and that the proposed policy network effectively decides the appropriate template for successful visual tracking.
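
A compact sketch of a policy-gradient (REINFORCE) update for learning a template-selection policy of this kind; the network shape, state encoding, and reward definition here are placeholders rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class TemplatePolicy(nn.Module):
    """Scores K stored templates given a state feature and defines a categorical distribution."""
    def __init__(self, feat_dim, num_templates):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_templates)

    def forward(self, state):                        # state: (feat_dim,) tensor
        return torch.distributions.Categorical(logits=self.fc(state))

policy = TemplatePolicy(feat_dim=256, num_templates=8)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reinforce_update(states, rewards):
    """One episode: per-frame state features and the rewards earned by the chosen templates."""
    log_probs = []
    for s in states:
        dist = policy(s)
        action = dist.sample()                       # index of the template to track with
        log_probs.append(dist.log_prob(action))
    returns = torch.tensor(rewards, dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum() # REINFORCE: maximize expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At test time the policy simply picks the highest-scoring template for each incoming frame.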

V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map

Aug 16, 2018
Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee

Most of the existing deep learning-based methods for 3D hand and human pose estimation from a single depth map are based on a common framework that takes a 2D depth map and directly regresses the 3D coordinates of keypoints, such as hand or human body joints, via 2D convolutional neural networks (CNNs). The first weakness of this approach is the presence of perspective distortion in the 2D depth map. While the depth map is intrinsically 3D data, many previous methods treat it as a 2D image, which can distort the shape of the actual object through the projection from 3D to 2D space. This compels the network to perform perspective distortion-invariant estimation. The second weakness of the conventional approach is that directly regressing 3D coordinates from a 2D image is a highly non-linear mapping, which causes difficulty in the learning procedure. To overcome these weaknesses, we first cast the 3D hand and human pose estimation problem from a single depth map as a voxel-to-voxel prediction that uses a 3D voxelized grid and estimates the per-voxel likelihood for each keypoint. We design our model as a 3D CNN that provides accurate estimates while running in real time. Our system outperforms previous methods on almost all publicly available 3D hand and human pose estimation datasets and placed first in the HANDS 2017 frame-based 3D hand pose estimation challenge. The code is available at https://github.com/mks0601/V2V-PoseNet_RELEASE.

* HANDS 2017 Challenge Frame-based 3D Hand Pose Estimation Winner (ICCV 2017), Published at CVPR 2018 
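
A simplified sketch of the voxel-to-voxel formulation: depth points are voxelized into an occupancy grid for the 3D CNN, and keypoint coordinates are read off as the argmax of each per-keypoint likelihood volume. The grid size, bounds, and decoding are illustrative assumptions, not the released implementation:

```python
import numpy as np

def voxelize(points, grid=88, lo=-1.0, hi=1.0):
    """Convert an (N, 3) point cloud from the depth map into a binary occupancy grid."""
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    idx = ((points - lo) / (hi - lo) * (grid - 1)).round().astype(int)
    idx = idx[((idx >= 0) & (idx < grid)).all(axis=1)]   # drop points outside the cube
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vol

def decode_keypoints(likelihood, lo=-1.0, hi=1.0):
    """likelihood: (K, D, D, D) per-voxel likelihood for each of K keypoints."""
    K, D = likelihood.shape[0], likelihood.shape[1]
    coords = np.zeros((K, 3), dtype=np.float32)
    for k in range(K):
        z = np.unravel_index(likelihood[k].argmax(), (D, D, D))   # most likely voxel
        coords[k] = lo + np.array(z) / (D - 1) * (hi - lo)        # back to metric space
    return coords
```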

Joint Blind Motion Deblurring and Depth Estimation of Light Field

Jun 14, 2018
Dongwoo Lee, Haesol Park, In Kyu Park, Kyoung Mu Lee

Removing camera motion blur from a single light field is a challenging task since it is a highly ill-posed inverse problem. The problem becomes even worse when the blur kernel varies spatially due to scene depth variation and high-order camera motion. In this paper, we propose a novel algorithm that jointly estimates all blur model variables, including the latent sub-aperture image, camera motion, and scene depth, from the blurred 4D light field. Exploiting the multi-view nature of a light field relieves the ill-posedness of the optimization by providing strong depth cues and multi-view blur observations. The proposed joint estimation achieves high-quality light field deblurring and depth estimation simultaneously under arbitrary 6-DOF camera motion and unconstrained scene depth. Extensive experiments on real and synthetic blurred light fields confirm that the proposed algorithm outperforms state-of-the-art light field deblurring and depth estimation methods.
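
Schematically, the joint estimation can be thought of as minimizing an energy over the latent sub-aperture image L, the camera motion parameters theta, and the depth map d; the form below is a generic formulation under those assumptions, not the paper's exact terms:

```latex
\min_{L,\;\theta,\;d}\;\sum_{s}\big\| B_s - \mathcal{A}_s(L,\theta,d) \big\|_2^2
\;+\;\lambda_L\,\Phi(L)\;+\;\lambda_d\,\Psi(d)
```

Here B_s denotes the s-th blurred sub-aperture view, A_s synthesizes its blurred counterpart by warping and averaging the latent image L along the 6-DOF camera trajectory theta using depth d, and Phi and Psi are image and depth regularizers weighted by lambda_L and lambda_d.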

Deep Vessel Segmentation By Learning Graphical Connectivity

Jun 06, 2018
Seung Yeon Shin, Soochahn Lee, Il Dong Yun, Kyoung Mu Lee

We propose a novel deep-learning-based system for vessel segmentation. Existing methods using CNNs have mostly relied on local appearances learned on the regular image grid, without considering the graphical structure of vessel shape. To address this, we incorporate a graph convolutional network into a unified CNN architecture, where the final segmentation is inferred by combining the different types of features. The proposed method can be applied to any CNN-based vessel segmentation method to enhance its performance. Experiments show that the proposed method outperforms current state-of-the-art methods on two retinal image datasets as well as a coronary artery X-ray angiography dataset.
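
A bare-bones sketch of how graph convolution over a vessel graph can refine node features sampled from a CNN feature map; the graph construction, adjacency, and fusion back into the segmentation head are simplified assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):      # h: (N, in_dim) node features, a_hat: (N, N) normalized adjacency
        return torch.relu(a_hat @ self.lin(h))

# Toy usage: node features sampled from a CNN feature map at vessel-graph vertices.
n_nodes, feat_dim = 64, 32
node_feats = torch.randn(n_nodes, feat_dim)
adj = torch.eye(n_nodes)              # placeholder adjacency (self-loops only)
gcn = GraphConv(feat_dim, feat_dim)
refined = gcn(node_feats, adj)        # connectivity-aware node features
```

The refined, connectivity-aware node features would then be combined with the pixel-wise CNN features to produce the final segmentation.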

Deep Multi-scale Convolutional Neural Network for Dynamic Scene Deblurring

May 07, 2018
Seungjun Nah, Tae Hyun Kim, Kyoung Mu Lee

Non-uniform blind deblurring for general dynamic scenes is a challenging computer vision problem, as blurs arise not only from multiple object motions but also from camera shake and scene depth variation. To remove these complicated motion blurs, conventional energy-optimization-based methods rely on simple assumptions, such as the blur kernel being partially uniform or locally linear. Moreover, recent machine-learning-based methods also depend on synthetic blur datasets generated under these assumptions. As a result, conventional deblurring methods fail to remove blurs whose kernels are difficult to approximate or parameterize (e.g., at object motion boundaries). In this work, we propose a multi-scale convolutional neural network that restores sharp images in an end-to-end manner where blur is caused by various sources. We also present a multi-scale loss function that mimics conventional coarse-to-fine approaches. Furthermore, we propose a new large-scale dataset that provides pairs of realistic blurry images and the corresponding ground-truth sharp images obtained with a high-speed camera. With the proposed model trained on this dataset, we demonstrate empirically that our method achieves state-of-the-art performance in dynamic scene deblurring, both qualitatively and quantitatively.
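
A hedged sketch of a multi-scale loss of the kind described, comparing the network's output at each scale against a correspondingly downsampled sharp image; the per-scale weighting and the interpolation mode are illustrative choices:

```python
import torch
import torch.nn.functional as F

def multiscale_loss(preds, sharp):
    """preds: list of predicted images from coarse to fine, each (B, C, H_k, W_k).
    sharp: full-resolution ground-truth sharp image (B, C, H, W)."""
    loss = 0.0
    for pred in preds:
        target = F.interpolate(sharp, size=pred.shape[-2:], mode="bilinear",
                               align_corners=False)   # match the ground truth to this scale
        loss = loss + F.mse_loss(pred, target)
    return loss / len(preds)
```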

Part-Aligned Bilinear Representations for Person Re-identification

Apr 19, 2018
Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, Kyoung Mu Lee

We propose a novel network that learns a part-aligned representation for person re-identification. It handles the body part misalignment problem, that is, the problem that body parts are misaligned across human detections due to pose/viewpoint changes and unreliable detections. Our model consists of a two-stream network (one stream for appearance map extraction and the other for body part map extraction) and a bilinear-pooling layer that generates and spatially pools a part-aligned map. Each local feature of the part-aligned map is obtained by a bilinear mapping of the corresponding local appearance and body part descriptors. Our new representation leads to a robust image matching similarity, which is equivalent to an aggregation of the local similarities of the corresponding body parts combined with the weighted appearance similarity. This part-aligned representation reduces the part misalignment problem significantly. Our approach is also advantageous over other pose-guided representations (e.g., extracting representations over the bounding box of each body part) in that it learns part descriptors optimal for person re-identification. For training the network, our approach does not require any part annotation on the person re-identification dataset. Instead, we simply initialize the part sub-stream using a pre-trained sub-network of an existing pose estimation network and train the whole network to minimize the re-identification loss. We validate the effectiveness of our approach by demonstrating its superiority over state-of-the-art methods on the standard benchmark datasets, including Market-1501, CUHK03, CUHK01, and DukeMTMC, and the standard video dataset MARS.
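
A minimal sketch of bilinear pooling of an appearance map and a body-part map: the outer product of the two local descriptors is taken at each spatial location and averaged over space. The tensor shapes are illustrative, and this omits the feature embeddings used in the full model:

```python
import torch

def part_aligned_bilinear(appearance, parts):
    """appearance: (B, Ca, H, W) appearance map; parts: (B, Cp, H, W) body-part map.
    Returns a (B, Ca * Cp) part-aligned descriptor."""
    B, Ca, H, W = appearance.shape
    Cp = parts.shape[1]
    a = appearance.reshape(B, Ca, H * W)
    p = parts.reshape(B, Cp, H * W)
    bilinear = torch.einsum("bax,bpx->bap", a, p) / (H * W)  # spatial average of outer products
    return bilinear.reshape(B, Ca * Cp)
```

Matching two images with the inner product of such descriptors aggregates local appearance similarities weighted by how strongly the same body parts respond at corresponding locations.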

Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals

Mar 29, 2018
Shanxin Yuan, Guillermo Garcia-Hernando, Bjorn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, Junsong Yuan, Xinghao Chen, Guijin Wang, Fan Yang, Kai Akiyama, Yang Wu, Qingfu Wan, Meysam Madadi, Sergio Escalera, Shile Li, Dongheui Lee, Iason Oikonomidis, Antonis Argyros, Tae-Kyun Kim

In this paper, we strive to answer two questions: What is the current state of 3D hand pose estimation from depth images? And what are the next challenges that need to be tackled? Following the successful Hands In the Million Challenge (HIM2017), we investigate the top 10 state-of-the-art methods on three tasks: single-frame 3D pose estimation, 3D hand tracking, and hand pose estimation during object interaction. We analyze the performance of different CNN structures with regard to hand shape, joint visibility, viewpoint, and articulation distributions. Our findings include: (1) isolated 3D hand pose estimation achieves low mean errors (10 mm) in the viewpoint range of [70, 120] degrees, but it is far from being solved for extreme viewpoints; (2) 3D volumetric representations outperform 2D CNNs, better capturing the spatial structure of the depth data; (3) discriminative methods still generalize poorly to unseen hand shapes; (4) while joint occlusions pose a challenge for most methods, explicit modeling of structure constraints can significantly narrow the gap between errors on visible and occluded joints.

2D-3D Pose Consistency-based Conditional Random Fields for 3D Human Pose Estimation

Dec 28, 2017
Ju Yong Chang, Kyoung Mu Lee

This study considers the 3D human pose estimation problem in a single RGB image by proposing a conditional random field (CRF) model over 2D poses, in which the 3D pose is obtained as a byproduct of the inference process. The unary term of the proposed CRF model is defined based on a powerful heat-map regression network, which has been proposed for 2D human pose estimation. This study also presents a regression network for lifting the 2D pose to the 3D pose and proposes a prior term based on the consistency between the estimated 3D pose and the 2D pose. To obtain an approximate solution of the proposed CRF model, the N-best strategy is adopted. The proposed inference algorithm can be viewed as a sequential process of bottom-up generation of 2D and 3D pose proposals from the input 2D image based on deep networks and top-down verification of those proposals by checking their consistency. To evaluate the proposed method, we use two large-scale datasets: Human3.6M and HumanEva. Experimental results show that the proposed method achieves state-of-the-art 3D human pose estimation performance.
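
A schematic sketch of the N-best inference idea: generate several candidate 2D poses, lift each to 3D, and keep the candidate whose reprojected 3D pose agrees best with its 2D pose. The scoring and the lifting/projection functions are placeholders, and the relative weighting of the unary and consistency terms is an assumption:

```python
import numpy as np

def select_best_pose(pose2d_candidates, unary_scores, lift_to_3d, project_to_2d):
    """pose2d_candidates: list of (J, 2) 2D pose proposals (N-best from the heat maps).
    unary_scores: per-candidate heat-map score. Returns the chosen (2D, 3D) pose pair."""
    best, best_score = None, -np.inf
    for pose2d, unary in zip(pose2d_candidates, unary_scores):
        pose3d = lift_to_3d(pose2d)                     # bottom-up 3D proposal
        consistency = -np.mean(np.linalg.norm(project_to_2d(pose3d) - pose2d, axis=1))
        score = unary + consistency                     # unary term + 2D-3D consistency prior
        if score > best_score:
            best, best_score = (pose2d, pose3d), score
    return best
```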
