Neural radiance fields (NeRFs) are promising 3D representations for scenes, objects, and humans. However, most existing methods require multi-view inputs and per-scene training, which limits their real-life applications. Moreover, current methods focus on single-subject cases, leaving scenes of interacting hands, which involve severe inter-hand occlusions and challenging view variations, unresolved. To tackle these issues, this paper proposes a generalizable visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, given an image of interacting hands as input, our VA-NeRF first obtains a mesh-based representation of the hands and extracts their corresponding geometric and textural features. Subsequently, a feature fusion module that exploits the visibility of query points and mesh vertices is introduced to adaptively merge features of both hands, enabling the recovery of features in unseen areas. Additionally, our VA-NeRF is optimized together with a novel discriminator within an adversarial learning paradigm. In contrast to conventional discriminators that predict a single real/fake label for the synthesized image, the proposed discriminator generates a pixel-wise visibility map, providing fine-grained supervision for unseen areas and encouraging the VA-NeRF to improve the visual quality of synthesized images. Experiments on the InterHand2.6M dataset demonstrate that our proposed VA-NeRF outperforms conventional NeRFs significantly. Project Page: \url{https://github.com/XuanHuang0/VANeRF}.
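To make the visibility-aware fusion idea concrete, below is a minimal sketch of blending per-point features from the two hands with weights conditioned on visibility. The module name, tensor shapes, and the small scoring MLP are illustrative assumptions, not the exact VA-NeRF implementation.

```python
# A minimal sketch of visibility-weighted feature fusion: per-point features
# from each hand are blended with weights conditioned on their visibility.
import torch
import torch.nn as nn

class VisibilityFusion(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # Small MLP that turns (feature, visibility) into a fusion logit.
        self.score = nn.Sequential(
            nn.Linear(feat_dim + 1, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, feats: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # feats: (N, H, D) features of N query points from H hands/sources.
        # vis:   (N, H) visibility of each point w.r.t. each source, in [0, 1].
        logits = self.score(torch.cat([feats, vis.unsqueeze(-1)], dim=-1))
        weights = torch.softmax(logits, dim=1)   # (N, H, 1) fusion weights
        return (weights * feats).sum(dim=1)      # (N, D) fused feature

fusion = VisibilityFusion(feat_dim=64)
fused = fusion(torch.randn(1024, 2, 64), torch.rand(1024, 2))
print(fused.shape)  # torch.Size([1024, 64])
```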
Current parametric models have made notable progress in 3D hand pose and shape estimation. However, due to the fixed hand topology and complex hand poses, it is hard for current models to generate meshes that align well with the image. To tackle this issue, we introduce a dual noise estimation method in this paper. Given a single-view image as input, we first adopt a baseline parametric regressor to obtain coarse hand meshes. We assume that the mesh vertices and their image-plane projections are noisy and can be associated in a unified probabilistic model. We then learn the distributions of the noise to refine the mesh vertices and their projections. The refined vertices are further utilized to refine the camera parameters in a closed-form manner. Consequently, our method obtains well-aligned and high-quality 3D hand meshes. Extensive experiments on the large-scale InterHand2.6M dataset demonstrate that the proposed method not only improves the performance of its baseline by more than 10$\%$ but also achieves state-of-the-art performance. Project page: \url{https://github.com/hanhuili/DNE4Hand}.
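As one concrete instance of refining camera parameters in closed form from vertices and their image-plane projections, the sketch below fits a weak-perspective camera (scale plus 2D translation) by least squares. The weak-perspective model and all names here are assumptions for illustration; the paper's exact camera model may differ.

```python
# A minimal sketch of closed-form weak-perspective camera fitting from
# 3D vertices and their 2D image-plane projections.
import numpy as np

def fit_weak_perspective(verts3d: np.ndarray, proj2d: np.ndarray):
    """Solve min_{s,t} || s * verts3d[:, :2] + t - proj2d ||^2 in closed form."""
    xy = verts3d[:, :2]
    xy_c = xy - xy.mean(axis=0)           # center the 3D xy-coordinates
    p_c = proj2d - proj2d.mean(axis=0)    # center the 2D projections
    s = (xy_c * p_c).sum() / (xy_c ** 2).sum()      # optimal scale
    t = proj2d.mean(axis=0) - s * xy.mean(axis=0)   # optimal translation
    return s, t

verts = np.random.randn(778, 3)           # e.g., a MANO mesh has 778 vertices
s_true, t_true = 5.0, np.array([120.0, 96.0])
proj = s_true * verts[:, :2] + t_true
s, t = fit_weak_perspective(verts, proj)
print(round(s, 3), t.round(3))            # recovers the scale and translation
```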
In this paper, we target image-based person-to-person virtual try-on in the presence of diverse poses and large viewpoint variations. Existing methods are restricted in this setting, as they estimate garment warping flows mainly based on 2D poses and appearance, which omits the geometric prior of the 3D human body shape. Moreover, current garment warping methods are confined to localized regions, which makes them ineffective at capturing long-range dependencies and results in inferior flows with artifacts. To tackle these issues, we present 3D-aware global correspondences, which are reliable flows that jointly encode global semantic correlations, local deformations, and geometric priors of 3D human bodies. Specifically, given an image pair depicting the source and target person, (a) we first obtain their pose-aware and high-level representations via two encoders, and introduce a coarse-to-fine decoder with multiple refinement modules to predict the pixel-wise global correspondence. (b) 3D parametric human models inferred from the images are incorporated as priors to regularize the correspondence refinement process, so that our flows are 3D-aware and better handle variations in pose and viewpoint. (c) Finally, an adversarial generator takes the garment warped by the 3D-aware flow and the image of the target person as inputs to synthesize the photo-realistic try-on result. Extensive experiments on public benchmarks and our HardPose test set demonstrate the superiority of our method over SOTA try-on approaches.
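For illustration, the sketch below shows the garment-warping step as bilinear sampling of a source image with a dense flow (correspondence) field. It is a generic warping routine under assumed shapes, not the paper's full 3D-aware pipeline.

```python
# A minimal sketch of warping a garment image with a dense flow field via
# bilinear sampling (the common implementation of flow-based warping).
import torch
import torch.nn.functional as F

def warp_with_flow(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # image: (B, C, H, W) source garment; flow: (B, 2, H, W) pixel offsets
    # telling each target pixel where to sample from in the source.
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(image.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalize sampling coordinates to [-1, 1] as grid_sample expects.
    coords[:, 0] = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = coords.permute(0, 2, 3, 1)                             # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

warped = warp_with_flow(torch.randn(1, 3, 256, 192), torch.zeros(1, 2, 256, 192))
print(warped.shape)  # torch.Size([1, 3, 256, 192]); zero flow is an identity warp
```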
There have been considerable advancements in medical health care in recent years, resulting in a rising older population. As the workforce for such a population is not keeping pace, there is an urgent need to address this problem. Having robots host stimulating recreational activities for older adults can reduce the workload of caretakers and give them time to address the emotional needs of the elderly. In this paper, we investigate the effects of the humanoid social robot Nadine as an activity host for the elderly. This study aims to analyse whether the elderly feel comfortable and enjoy playing a game/activity with the humanoid robot Nadine. We propose to evaluate this by placing the humanoid social robot Nadine in a nursing home as a caretaker, where she hosts a bingo game. We record sessions with and without Nadine to understand the differences between, and acceptance of, these two scenarios. We use computer vision methods to analyse the activities of the elderly, detecting their emotions and their involvement in the game. We envision that such humanoid robots will make recreational activities more readily available for the elderly. Our results show positive reinforcement during the recreational activity, bingo, in the presence of Nadine.
Most current anthropomorphic robotic hands can realize part of the functions of the human hand, particularly for object grasping. However, due to the complexity of the human hand, few current designs target daily object manipulation, even for simple actions like rotating a pen. To tackle this problem, we introduce a gesture-based framework, which adopts the 33 widely used grasping gestures of Feix as the bases for hand design and the implementation of manipulation. In the proposed framework, we first measure the motion ranges of human fingers for each gesture, and based on the results, we propose a simple yet dexterous robotic hand design with 13 degrees of freedom. Furthermore, we adopt a frame-interpolation-based method, in which we consider the base gestures as key frames representing a manipulation task, and use a simple linear interpolation strategy to accomplish the manipulation. To demonstrate the effectiveness of our framework, we define a three-level benchmark, which includes not only 62 test gestures from previous research but also multiple complex and continuous actions. Experimental results on this benchmark validate the dexterity of the proposed design, and our video is available at \url{https://entuedu-my.sharepoint.com/:v:/g/personal/hanhui_li_staff_main_ntu_edu_sg/Ean2GpnFo6JPjIqbKy1KHMEBftgCkcDhnSX-9uLZ6T0rUg?e=ppCGbC}.
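A minimal sketch of the frame-interpolation strategy follows: base gestures act as key frames of joint angles, and intermediate frames are obtained by linear blending. The joint count of 13 matches the stated degrees of freedom; the specific gestures and step counts are hypothetical.

```python
# A minimal sketch of gesture key-frame interpolation: linearly blend
# between consecutive joint-angle key frames to form a dense trajectory.
import numpy as np

def interpolate_gestures(key_frames: np.ndarray, steps_per_segment: int) -> np.ndarray:
    """key_frames: (K, 13) joint-angle vectors; returns the dense trajectory."""
    trajectory = []
    for a, b in zip(key_frames[:-1], key_frames[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            trajectory.append((1.0 - t) * a + t * b)   # linear blend of key frames
    trajectory.append(key_frames[-1])
    return np.stack(trajectory)

open_hand = np.zeros(13)                      # hypothetical rest gesture
pinch = np.deg2rad(np.full(13, 35.0))         # hypothetical target gesture
traj = interpolate_gestures(np.stack([open_hand, pinch, open_hand]), 30)
print(traj.shape)  # (61, 13): two segments of 30 steps plus the final frame
```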
Current anthropomorphic robotic hands mainly focus on improving their dexterity by devising new mechanical structures and actuation systems. However, most of them rely on a single structure/system (e.g., bone-only) and ignore the fact that the human hand is composed of multiple functional structures (e.g., skin, bones, muscles, and tendons). This not only increases the difficulty of the design process but also lowers the robustness and flexibility of the fabricated hand. Besides, other factors, such as customization, the time and cost of production, and the degree of resemblance between human hands and robotic hands, remain overlooked. To tackle these problems, this study proposes a 3D-printable multi-layer design that models the hand with layers of skin, tissues, and bones. The proposed design first obtains the 3D surface model of a target hand via 3D scanning, and then generates the 3D bone models from the surface model based on a fast template matching method. To overcome the limited deformability of the rigid bone layer, the tissue layer is introduced and represented by a concentric-tube-based structure, whose deformability can be explicitly controlled by a parameter. Besides, a low-cost yet effective underactuated system is adopted to drive the fabricated hand. The proposed design is tested with 33 widely used object grasping types, as well as special objects like fragile silken tofu, and outperforms previous designs remarkably. With the proposed design, anthropomorphic robotic hands can be produced quickly at low cost, while remaining customizable and deformable.
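Purely as an illustration of a single parameter controlling the deformability of a concentric-tube structure, the sketch below maps a scalar in [0, 1] to the wall thicknesses of concentric tubes, so that larger values produce thinner, more compliant walls. The geometry and parameter mapping are hypothetical, not the paper's actual design.

```python
# An illustrative parameterization of a concentric-tube tissue layer:
# one deformability parameter trades wall thickness for gap width.
def tube_walls(outer_radius: float, n_tubes: int, deformability: float):
    """Return (inner_r, outer_r) pairs for n_tubes concentric walls."""
    assert 0.0 <= deformability <= 1.0
    band = outer_radius / n_tubes                  # radial budget per tube
    wall = band * (1.0 - 0.8 * deformability)      # thinner walls deform more
    walls = []
    for i in range(1, n_tubes + 1):
        r_out = i * band
        walls.append((r_out - wall, r_out))
    return walls

for inner, outer in tube_walls(outer_radius=9.0, n_tubes=3, deformability=0.5):
    print(f"wall from r={inner:.2f} mm to r={outer:.2f} mm")
```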
Multi-Person Tracking (MPT) is often addressed within the detection-to-association paradigm. In such approaches, human detections are first extracted in every frame, and person trajectories are then recovered by a procedure of data association (usually offline). However, their performance usually degrades in the presence of detection errors, mutual interactions, and occlusions. In this paper, we present a deep learning based MPT approach that learns instance-aware representations of tracked persons and robustly infers the states of the tracked persons online. Specifically, we design a multi-branch neural network (MBN), which predicts the classification confidences and locations of all targets by taking a batch of candidate regions as input. In our MBN architecture, each branch (instance-subnet) corresponds to an individual to be tracked, and new branches can be dynamically created to handle newly appearing persons. Then, based on the output of the MBN, we construct a joint association matrix that represents meaningful states of tracked persons (e.g., being tracked or disappearing from the scene) and solve it efficiently with the Hungarian algorithm. Moreover, we allow the instance-subnets to be updated during tracking by online mining of hard examples, accounting for person appearance variations over time. We comprehensively evaluate our framework on a popular MPT benchmark, demonstrating its excellent performance in comparison with recent online MPT methods.
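To illustrate the association step, the sketch below turns branch confidences into a cost matrix and solves it with the Hungarian algorithm (via SciPy's linear_sum_assignment). The cost definition and threshold are placeholder assumptions, not the paper's exact joint association matrix.

```python
# A minimal sketch of track-to-detection association with the Hungarian
# algorithm: higher confidence means lower assignment cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(confidences: np.ndarray, threshold: float = 0.5):
    """confidences: (num_tracks, num_detections) branch outputs in [0, 1]."""
    cost = 1.0 - confidences                  # convert confidences to costs
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    matches = [(r, c) for r, c in zip(rows, cols) if confidences[r, c] >= threshold]
    unmatched = set(range(confidences.shape[0])) - {r for r, _ in matches}
    return matches, sorted(unmatched)         # unmatched tracks may be disappearing

conf = np.array([[0.9, 0.2, 0.1],
                 [0.3, 0.8, 0.2]])
print(associate(conf))  # ([(0, 0), (1, 1)], [])
```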
Tiny face detection aims to find faces with high degrees of variability in scale, resolution, and occlusion in cluttered scenes. Because very little information is available on tiny faces, it is not sufficient to detect them merely based on the information presented inside their tiny bounding boxes or their context. In this paper, we propose to exploit the semantic similarity among all predicted targets in each image to boost current face detectors. To this end, we present a novel framework that models semantic similarity as pairwise constraints within a metric learning scheme, and then refines the predictions using this semantic similarity via graph cut techniques. Experiments conducted on three widely used benchmark datasets demonstrate the improvement over the state of the art gained by applying this idea.
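As a simplified stand-in for the graph-cut refinement, the sketch below propagates detection scores over a pairwise semantic-similarity graph so that similar predictions receive consistent scores. It replaces the actual graph cut with iterative score smoothing, and all names and values are illustrative.

```python
# A minimal sketch of refining per-box confidence scores with pairwise
# semantic similarity, using score propagation instead of a full graph cut.
import numpy as np

def refine_scores(scores: np.ndarray, embeddings: np.ndarray,
                  alpha: float = 0.5, iters: int = 10) -> np.ndarray:
    # Cosine similarity between learned embeddings of the predicted boxes.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = np.clip(e @ e.T, 0.0, 1.0)
    np.fill_diagonal(sim, 0.0)
    w = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-8)  # row-normalize
    refined = scores.copy()
    for _ in range(iters):
        refined = (1 - alpha) * scores + alpha * (w @ refined)  # smooth over graph
    return refined

scores = np.array([0.9, 0.85, 0.2])
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.95, 0.2]])  # all mutually similar
print(refine_scores(scores, emb).round(3))  # the low outlier is pulled upward
```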
In this paper, we aim to tackle the problem of crowd counting in extremely high-density scenes, which contain hundreds or even thousands of people. We begin with a comprehensive analysis of the most widely used density-map-based methods, and demonstrate how easily existing methods are affected by the inhomogeneous density distribution problem, e.g., causing them to be sensitive to outliers or hard to optimize. We then present an extremely simple solution to the inhomogeneous density distribution problem, which can be intuitively summarized as extending the density map from 2D to 3D, with the extra dimension implicitly indicating the density level. This solution can be implemented by a single Density-Aware Network, which is not only easy to train but also achieves state-of-the-art performance on various challenging datasets.
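To make the 2D-to-3D extension concrete, the sketch below scatters each pixel of a 2D density map into one of several density-level bins along a new axis, so that the extra dimension encodes the local density level. The bin edges and shapes are assumptions for illustration.

```python
# A minimal sketch of extending a 2D density map to 3D by binning each
# pixel's density value into a density-level channel along a new axis.
import numpy as np

def density_2d_to_3d(density: np.ndarray, bin_edges) -> np.ndarray:
    """density: (H, W) map; returns (L, H, W) with L = len(bin_edges) + 1 levels."""
    levels = np.digitize(density, bin_edges)            # (H, W) level per pixel
    out = np.zeros((len(bin_edges) + 1, *density.shape), dtype=density.dtype)
    h, w = np.indices(density.shape)
    out[levels, h, w] = density                         # scatter density to its level
    return out

d2 = np.random.rand(4, 4) * 3.0
d3 = density_2d_to_3d(d2, bin_edges=[0.5, 1.5, 2.5])
print(d3.shape, np.allclose(d3.sum(axis=0), d2))        # (4, 4, 4) True
```

Summing the 3D map over the level axis recovers the original 2D density map, so the total count is preserved while the density level is made explicit.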