Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiwen Lu

Efficient Meshy Neural Fields for Animatable Human Avatars

Mar 23, 2023

Xiaoke Huang, Yiji Cheng, Yansong Tang, Xiu Li, Jie Zhou, Jiwen Lu

Figure 1 for Efficient Meshy Neural Fields for Animatable Human Avatars

Figure 2 for Efficient Meshy Neural Fields for Animatable Human Avatars

Figure 3 for Efficient Meshy Neural Fields for Animatable Human Avatars

Figure 4 for Efficient Meshy Neural Fields for Animatable Human Avatars

Abstract:Efficiently digitizing high-fidelity animatable human avatars from videos is a challenging and active research topic. Recent volume rendering-based neural representations open a new way for human digitization with their friendly usability and photo-realistic reconstruction quality. However, they are inefficient for long optimization times and slow inference speed; their implicit nature results in entangled geometry, materials, and dynamics of humans, which are hard to edit afterward. Such drawbacks prevent their direct applicability to downstream applications, especially the prominent rasterization-based graphic ones. We present EMA, a method that Efficiently learns Meshy neural fields to reconstruct animatable human Avatars. It jointly optimizes explicit triangular canonical mesh, spatial-varying material, and motion dynamics, via inverse rendering in an end-to-end fashion. Each above component is derived from separate neural fields, relaxing the requirement of a template, or rigging. The mesh representation is highly compatible with the efficient rasterization-based renderer, thus our method only takes about an hour of training and can render in real-time. Moreover, only minutes of optimization is enough for plausible reconstruction results. The disentanglement of meshes enables direct downstream applications. Extensive experiments illustrate the very competitive performance and significant speed boost against previous methods. We also showcase applications including novel pose synthesis, material editing, and relighting. The project page: https://xk-huang.github.io/ema/.

* 25 pages, 21 figures. Project page: https://xk-huang.github.io/ema/

Via

Access Paper or Ask Questions

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Mar 16, 2023

Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

Figure 1 for SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Figure 2 for SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Figure 3 for SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Figure 4 for SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Abstract:3D scene understanding plays a vital role in vision-based autonomous driving. While most existing methods focus on 3D object detection, they have difficulty describing real-world objects of arbitrary shapes and infinite classes. Towards a more comprehensive perception of a 3D scene, in this paper, we propose a SurroundOcc method to predict the 3D occupancy with multi-camera images. We first extract multi-scale features for each image and adopt spatial 2D-3D attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision on multiple levels. To obtain dense occupancy prediction, we design a pipeline to generate dense occupancy ground truth without expansive occupancy annotations. Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static scenes separately. Then we adopt Poisson Reconstruction to fill the holes and voxelize the mesh to get dense occupancy labels. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our method. Code and dataset are available at https://github.com/weiyithu/SurroundOcc

* Code is available at https://github.com/weiyithu/SurroundOcc

Via

Access Paper or Ask Questions

OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Mar 07, 2023

Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, Xingang Wang

Figure 1 for OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Figure 2 for OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Figure 3 for OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Figure 4 for OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Abstract:Semantic occupancy perception is essential for autonomous driving, as automated vehicles require a fine-grained perception of the 3D urban structures. However, existing relevant benchmarks lack diversity in urban scenes, and they only evaluate front-view predictions. Towards a comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, which is the first surrounding semantic occupancy perception benchmark. In the OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense semantic occupancy annotations. Previous annotations rely on LiDAR points superimposition, where some occupancy labels are missed due to sparse LiDAR channels. To mitigate the problem, we introduce the Augmenting And Purifying (AAP) pipeline to ~2x densify the annotations, where ~4000 human hours are involved in the labeling process. Besides, camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark. Furthermore, considering the complexity of surrounding occupancy perception lies in the computational burden of high-resolution 3D predictions, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction, which relatively enhances the performance by ~30% than the baseline. We hope the OpenOccupancy benchmark will boost the development of surrounding occupancy perception algorithms.

* project page: https://github.com/JeffWang987/OpenOccupancy

Via

Access Paper or Ask Questions

Unleashing Text-to-Image Diffusion Models for Visual Perception

Mar 03, 2023

Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu

Abstract:Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability of conditional synthesis. Among those, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable by customizable prompts. Unlike the unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to the vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and aim to study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to a better alignment to the pre-trained stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be faster adapted to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation and depth estimation demonstrates the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD

* project page: https://vpd.ivg-research.xyz

Via

Access Paper or Ask Questions

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Mar 02, 2023

Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu

Abstract:Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.

* Accepted to CVPR 2023. Code is available at https://github.com/wzzheng/TPVFormer

Via

Access Paper or Ask Questions

Category-level Shape Estimation for Densely Cluttered Objects

Feb 23, 2023

Zhenyu Wu, Ziwei Wang, Jiwen Lu, Haibin Yan

Figure 1 for Category-level Shape Estimation for Densely Cluttered Objects

Figure 2 for Category-level Shape Estimation for Densely Cluttered Objects

Figure 3 for Category-level Shape Estimation for Densely Cluttered Objects

Figure 4 for Category-level Shape Estimation for Densely Cluttered Objects

Abstract:Accurately estimating the shape of objects in dense clutters makes important contribution to robotic packing, because the optimal object arrangement requires the robot planner to acquire shape information of all existed objects. However, the objects for packing are usually piled in dense clutters with severe occlusion, and the object shape varies significantly across different instances for the same category. They respectively cause large object segmentation errors and inaccurate shape recovery on unseen instances, which both degrade the performance of shape estimation during deployment. In this paper, we propose a category-level shape estimation method for densely cluttered objects. Our framework partitions each object in the clutter via the multi-view visual information fusion to achieve high segmentation accuracy, and the instance shape is recovered by deforming the category templates with diverse geometric transformations to obtain strengthened generalization ability. Specifically, we first collect the multi-view RGB-D images of the object clutters for point cloud reconstruction. Then we fuse the feature maps representing the visual information of multi-view RGB images and the pixel affinity learned from the clutter point cloud, where the acquired instance segmentation masks of multi-view RGB images are projected to partition the clutter point cloud. Finally, the instance geometry information is obtained from the partially observed instance point cloud and the corresponding category template, and the deformation parameters regarding the template are predicted for shape estimation. Experiments in the simulated environment and real world show that our method achieves high shape estimation accuracy for densely cluttered everyday objects with various shapes.

* 7 pages, 7 figures, ICRA 2023

Via

Access Paper or Ask Questions

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Feb 12, 2023

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu

Figure 1 for UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Figure 2 for UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Figure 3 for UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Figure 4 for UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Abstract:Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM usually requires hundreds of model evaluations, which is computationally expensive. Despite recent progress in designing high-order solvers for DPMs, there still exists room for further speedup, especially in extremely few steps (e.g., 5~10 steps). Inspired by the predictor-corrector for ODE solvers, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC

* Project page: https://unipc.ivg-research.xyz

Via

Access Paper or Ask Questions

AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers

Jan 11, 2023

Xumin Yu, Yongming Rao, Ziyi Wang, Jiwen Lu, Jie Zhou

Figure 1 for AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers

Figure 2 for AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers

Figure 3 for AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers

Figure 4 for AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers

Abstract:In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the input data to a sequence of point proxies and employ the Transformers for generation. To facilitate Transformers to better leverage the inductive bias about 3D geometric structures of point clouds, we further devise a geometry-aware block that models the local geometric relationships explicitly. The migration of Transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Taking a step towards more complicated and diverse situations, we further propose AdaPoinTr by developing an adaptive query generation mechanism and designing a novel denoising task during completing a point cloud. Coupling these two techniques enables us to train the model efficiently and effectively: we reduce training time (by 15x or more) and improve completion performance (over 20%). We also show our method can be extended to the scene-level point cloud completion scenario by designing a new geometry-enhanced semantic scene completion framework. Extensive experiments on the existing and newly-proposed datasets demonstrate the effectiveness of our method, which attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI, surpassing other work by a large margin and establishing new state-of-the-arts on various benchmarks. Most notably, AdaPoinTr can achieve such promising performance with higher throughputs and fewer FLOPs compared with the previous best methods in practice. The code and datasets are available at https://github.com/yuxumin/PoinTr

* Extension of our ICCV 2021 work: arXiv:2108.08839 . Code is available at https://github.com/yuxumin/PoinTr

Via

Access Paper or Ask Questions

DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis

Jan 10, 2023

Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, Jiwen Lu

Abstract:Talking head synthesis is a promising approach for the video production industry. Recently, a lot of effort has been devoted in this research area to improve the generation quality or enhance the model generalization. However, there are few works able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn attention to the emerging powerful Latent Diffusion Models, and model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it can be naturally generalized across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to this demonstration \url{https://cloud.tsinghua.edu.cn/f/e13f5aad2f4c4f898ae7/}.

* Project page https://sstzal.github.io/DiffTalk/

Via

Access Paper or Ask Questions

Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint

Dec 18, 2022

Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

Figure 1 for Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint

Figure 2 for Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint

Figure 3 for Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint

Figure 4 for Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint

Abstract:Deep learning has revolutionized human society, yet the black-box nature of deep neural networks hinders further application to reliability-demanded industries. In the attempt to unpack them, many works observe or impact internal variables to improve the model's comprehensibility and transparency. However, existing methods rely on intuitive assumptions and lack mathematical guarantees. To bridge this gap, we introduce Bort, an optimizer for improving model explainability with boundedness and orthogonality constraints on model parameters, derived from the sufficient conditions of model comprehensibility and transparency. We perform reconstruction and backtracking on the model representations optimized by Bort and observe an evident improvement in model explainability. Based on Bort, we are able to synthesize explainable adversarial samples without additional parameters and training. Surprisingly, we find Bort constantly improves the classification accuracy of various architectures including ResNet and DeiT on MNIST, CIFAR-10, and ImageNet.

Via

Access Paper or Ask Questions