
Zehong Shen


Learning Human Mesh Recovery in 3D Scenes

Jun 06, 2023
Zehong Shen, Zhi Cen, Sida Peng, Qing Shuai, Hujun Bao, Xiaowei Zhou


We present a novel method for recovering the absolute pose and shape of a human in a pre-scanned scene given a single image. Unlike previous methods that perform scene-aware mesh optimization, we propose to first estimate the absolute position and dense scene contacts with a sparse 3D CNN, and then enhance a pretrained human mesh recovery network by cross-attention with the derived 3D scene cues. Joint learning on images and scene geometry enables our method to reduce the ambiguity caused by depth and occlusion, resulting in more reasonable global postures and contacts. Encoding scene-aware cues in the network also allows the proposed method to be optimization-free and opens up the opportunity for real-time applications. The experiments show that the proposed network recovers accurate and physically plausible meshes in a single forward pass and outperforms state-of-the-art methods in terms of both accuracy and speed.
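As a rough illustration of the scene-cue fusion described above, the sketch below shows image tokens attending to 3D scene-cue tokens with a standard cross-attention layer in PyTorch. It is not the SA-HMR implementation; the module name SceneCueCrossAttention, the token shapes, and the residual fusion are assumptions made for the example.

```python
# Minimal sketch (not the authors' code): features from a pretrained mesh-recovery
# backbone attend to 3D scene cues via cross-attention. All names and sizes are
# hypothetical.
import torch
import torch.nn as nn

class SceneCueCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_tokens: torch.Tensor, scene_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens:   (B, N_img, d)   image features from the mesh-recovery network
        # scene_tokens: (B, N_scene, d) features of estimated contacts / scene geometry
        fused, _ = self.attn(query=img_tokens, key=scene_tokens, value=scene_tokens)
        return self.norm(img_tokens + fused)  # residual fusion of scene-aware cues

# Example shapes only; no real data.
img_tokens = torch.randn(1, 196, 256)
scene_tokens = torch.randn(1, 512, 256)
fused = SceneCueCrossAttention()(img_tokens, scene_tokens)
print(fused.shape)  # torch.Size([1, 196, 256])
```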

* Accepted to CVPR 2023. Project page: https://zju3dv.github.io/sahmr/ 

Long-term Visual Localization with Mobile Sensors

Apr 16, 2023
Shen Yan, Yu Liu, Long Wang, Zehong Shen, Zhen Peng, Haomin Liu, Maojun Zhang, Guofeng Zhang, Xiaowei Zhou


Despite the remarkable advances in image matching and pose estimation, image-based localization of a camera in a temporally varying outdoor environment remains a challenging problem due to the huge appearance disparity between query and reference images caused by illumination, seasonal, and structural changes. In this work, we propose to leverage additional sensors on a mobile phone, mainly GPS, the compass, and the gravity sensor, to solve this challenging problem. We show that these mobile sensors provide decent initial poses and effective constraints that reduce the search space in image matching and final pose estimation. With the initial pose, we are also able to devise a direct 2D-3D matching network that efficiently establishes 2D-3D correspondences, instead of the tedious 2D-2D matching in existing systems. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of mobile sensor data and significant scene appearance variations, and we develop a system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate the effectiveness of the proposed approach. The code and dataset will be released publicly.
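The sketch below illustrates, under simplifying assumptions, how phone sensors can supply a pose prior and shrink the matching search space: GPS gives an approximate camera position, the compass gives a yaw rotation, and only map points near the prior are kept as candidates. It is not the paper's pipeline; the helper names, the planar ENU approximation, and the 30 m radius are illustrative.

```python
# Hedged sketch of sensor-based priors for localization; not the paper's code.
import numpy as np

EARTH_RADIUS = 6378137.0  # meters (WGS-84 equatorial radius)

def gps_to_local(lat, lon, ref_lat, ref_lon):
    """Approximate east/north offset in meters from a reference point."""
    d_lat = np.radians(lat - ref_lat)
    d_lon = np.radians(lon - ref_lon)
    north = d_lat * EARTH_RADIUS
    east = d_lon * EARTH_RADIUS * np.cos(np.radians(ref_lat))
    return np.array([east, north])

def yaw_from_compass(heading_deg):
    """Rotation about the gravity-aligned up axis from a compass heading."""
    yaw = np.radians(heading_deg)
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def prune_points(points_xyz, cam_xy, radius=30.0):
    """Keep only 3D map points within `radius` meters of the GPS prior."""
    d = np.linalg.norm(points_xyz[:, :2] - cam_xy, axis=1)
    return points_xyz[d < radius]

# Toy usage: a GPS fix near the map origin and a 90-degree heading.
cam_xy = gps_to_local(30.2631, 120.1212, 30.2630, 120.1210)
R_init = yaw_from_compass(90.0)          # initial orientation prior
points = np.random.uniform(-100, 100, size=(10000, 3))
candidates = prune_points(points, cam_xy)  # reduced 2D-3D matching candidates
```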


LoFTR: Detector-Free Local Feature Matching with Transformers

Apr 01, 2021
Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, Xiaowei Zhou


We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self- and cross-attention layers in a Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by the Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. Experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods.
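To make the coarse matching step concrete, here is a minimal sketch in the spirit of the abstract: interleaved self- and cross-attention layers condition the two images' coarse features on each other, and a dual-softmax over the similarity matrix yields matching confidences. This is not the official LoFTR code; the layer counts, dimensions, and weight sharing between the two images are assumptions.

```python
# Hedged sketch of detector-free coarse matching; not the released LoFTR code.
import torch
import torch.nn as nn

class CoarseMatcher(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers))
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers))

    def forward(self, featA, featB, temperature=0.1):
        # featA, featB: (B, N, d) flattened coarse feature maps of images A and B
        for sa, ca in zip(self.self_attn, self.cross_attn):
            featA = featA + sa(featA, featA, featA)[0]   # self-attention (shared weights here)
            featB = featB + sa(featB, featB, featB)[0]
            featA, featB = (featA + ca(featA, featB, featB)[0],   # cross-attention: each image
                            featB + ca(featB, featA, featA)[0])   # attends to the other
        sim = torch.einsum("bnd,bmd->bnm", featA, featB) / temperature
        # Dual-softmax: a pair is confident only if it wins in both directions.
        conf = sim.softmax(dim=1) * sim.softmax(dim=2)
        return conf  # (B, N, M) coarse matching confidences

conf = CoarseMatcher()(torch.randn(1, 400, 256), torch.randn(1, 400, 256))
```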

* Accepted to CVPR 2021. Project page: https://zju3dv.github.io/loftr/ 

GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs

Nov 14, 2019
Yuan Liu, Zehong Shen, Zhixuan Lin, Sida Peng, Hujun Bao, Xiaowei Zhou


Finding local correspondences between images with different viewpoints requires local descriptors that are robust against geometric transformations. One approach to transformation invariance is to integrate out the transformations by pooling the features extracted from transformed versions of an image. However, feature pooling may sacrifice the distinctiveness of the resulting descriptors. In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. The key idea is that the features extracted from the transformed versions of an image can be viewed as a function defined on the group of transformations. Instead of feature pooling, we use group convolutions to exploit the underlying structure of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations. Extensive experiments show that GIFT outperforms state-of-the-art methods on several benchmark datasets and practically improves the performance of relative pose estimation.
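A toy sketch of the idea, restricted to the 90-degree rotation group C4 for simplicity: features from rotated copies of the image are stacked along a group axis and convolved over that axis (with circular padding, since the group is cyclic) rather than pooled away. This is not the released GIFT code, and the spatial pooling inside the sketch is a shortcut; GIFT itself produces dense per-pixel descriptors.

```python
# Hedged illustration of group convolution over a transformation group; not GIFT's code.
import torch
import torch.nn as nn

class GroupDescriptor(nn.Module):
    """Toy GIFT-style descriptor over the 90-degree rotation group C4."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 32):
        super().__init__()
        self.backbone = backbone                 # any dense per-image feature extractor
        # Convolution along the group axis with circular padding (the group is cyclic).
        self.group_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size=3,
                                    padding=1, padding_mode="circular")

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, H, W)
        feats = []
        for k in range(4):                       # the four 90-degree rotations
            rotated = torch.rot90(img, k, dims=(2, 3))
            f = self.backbone(rotated)           # (B, C, h, w)
            feats.append(f.flatten(2).mean(-1))  # (B, C), pooled spatially for brevity
        g = torch.stack(feats, dim=-1)           # (B, C, 4): a function on the group
        # Convolve over the group instead of pooling it away, then flatten.
        return self.group_conv(g).flatten(1)     # (B, C * 4)

backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
desc = GroupDescriptor(backbone)(torch.randn(2, 3, 64, 64))
print(desc.shape)  # torch.Size([2, 128])
```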

* Accepted by NeurIPS 2019 