Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chao Ma

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

Apr 15, 2024

Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma

Figure 1 for SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

Figure 2 for SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

Figure 3 for SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

Figure 4 for SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

Abstract:Vision-based perception for autonomous driving requires an explicit modeling of a 3D space, where 2D latent representations are mapped and subsequent 3D operators are applied. However, operating on dense latent spaces introduces a cubic time and space complexity, which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections like Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature pyramid and sparse interpolation enhance scales with information from others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction on FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIOU, which in part can be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.

* IEEE Conference on Computer Vision and Pattern Recognition 2024 (CVPR 2024)
* 10 pages, 4 figures, accepted by CVPR 2024

Via

Access Paper or Ask Questions

FiP: a Fixed-Point Approach for Causal Generative Modeling

Apr 14, 2024

Meyer Scetbon, Joel Jennings, Agrin Hilmkil, Cheng Zhang, Chao Ma

Figure 1 for FiP: a Fixed-Point Approach for Causal Generative Modeling

Figure 2 for FiP: a Fixed-Point Approach for Causal Generative Modeling

Figure 3 for FiP: a Fixed-Point Approach for Causal Generative Modeling

Figure 4 for FiP: a Fixed-Point Approach for Causal Generative Modeling

Abstract:Modeling true world data-generating processes lies at the heart of empirical science. Structural Causal Models (SCMs) and their associated Directed Acyclic Graphs (DAGs) provide an increasingly popular answer to such problems by defining the causal generative process that transforms random noise into observations. However, learning them from observational data poses an ill-posed and NP-hard inverse problem in general. In this work, we propose a new and equivalent formalism that does not require DAGs to describe them, viewed as fixed-point problems on the causally ordered variables, and we show three important cases where they can be uniquely recovered given the topological ordering (TO). To the best of our knowledge, we obtain the weakest conditions for their recovery when TO is known. Based on this, we design a two-stage causal generative model that first infers the causal order from observations in a zero-shot manner, thus by-passing the search, and then learns the generative fixed-point SCM on the ordered variables. To infer TOs from observations, we propose to amortize the learning of TOs on generated datasets by sequentially predicting the leaves of graphs seen during training. To learn fixed-point SCMs, we design a transformer-based architecture that exploits a new attention mechanism enabling the modeling of causal structures, and show that this parameterization is consistent with our formalism. Finally, we conduct an extensive evaluation of each method individually, and show that when combined, our model outperforms various baselines on generated out-of-distribution problems.

Via

Access Paper or Ask Questions

VMambaMorph: a Multi-Modality Deformable Image Registration Framework based on Visual State Space Model with Cross-Scan Module

Apr 14, 2024

Ziyang Wang, Jian-Qing Zheng, Chao Ma, Tao Guo

Figure 1 for VMambaMorph: a Multi-Modality Deformable Image Registration Framework based on Visual State Space Model with Cross-Scan Module

Figure 2 for VMambaMorph: a Multi-Modality Deformable Image Registration Framework based on Visual State Space Model with Cross-Scan Module

Figure 3 for VMambaMorph: a Multi-Modality Deformable Image Registration Framework based on Visual State Space Model with Cross-Scan Module

Abstract:Image registration, a critical process in medical imaging, involves aligning different sets of medical imaging data into a single unified coordinate system. Deep learning networks, such as the Convolutional Neural Network (CNN)-based VoxelMorph, Vision Transformer (ViT)-based TransMorph, and State Space Model (SSM)-based MambaMorph, have demonstrated effective performance in this domain. The recent Visual State Space Model (VMamba), which incorporates a cross-scan module with SSM, has exhibited promising improvements in modeling global-range dependencies with efficient computational cost in computer vision tasks. This paper hereby introduces an exploration of VMamba with image registration, named VMambaMorph. This novel hybrid VMamba-CNN network is designed specifically for 3D image registration. Utilizing a U-shaped network architecture, VMambaMorph computes the deformation field based on target and source volumes. The VMamba-based block with 2D cross-scan module is redesigned for 3D volumetric feature processing. To overcome the complex motion and structure on multi-modality images, we further propose a fine-tune recursive registration framework. We validate VMambaMorph using a public benchmark brain MR-CT registration dataset, comparing its performance against current state-of-the-art methods. The results indicate that VMambaMorph achieves competitive registration quality. The code for VMambaMorph with all baseline methods is available on GitHub.

Via

Access Paper or Ask Questions

Monocular Identity-Conditioned Facial Reflectance Reconstruction

Mar 30, 2024

Xingyu Ren, Jiankang Deng, Yuhao Cheng, Jia Guo, Chao Ma, Yichao Yan, Wenhan Zhu, Xiaokang Yang

Abstract:Recent 3D face reconstruction methods have made remarkable advancements, yet there remain huge challenges in monocular high-quality facial reflectance reconstruction. Existing methods rely on a large amount of light-stage captured data to learn facial reflectance models. However, the lack of subject diversity poses challenges in achieving good generalization and widespread applicability. In this paper, we learn the reflectance prior in image space rather than UV space and present a framework named ID2Reflectance. Our framework can directly estimate the reflectance maps of a single image while using limited reflectance data for training. Our key insight is that reflectance data shares facial structures with RGB faces, which enables obtaining expressive facial prior from inexpensive RGB data thus reducing the dependency on reflectance data. We first learn a high-quality prior for facial reflectance. Specifically, we pretrain multi-domain facial feature codebooks and design a codebook fusion method to align the reflectance and RGB domains. Then, we propose an identity-conditioned swapping module that injects facial identity from the target image into the pre-trained autoencoder to modify the identity of the source reflectance image. Finally, we stitch multi-view swapped reflectance images to obtain renderable assets. Extensive experiments demonstrate that our method exhibits excellent generalization capability and achieves state-of-the-art facial reflectance reconstruction results for in-the-wild faces. Our project page is https://xingyuren.github.io/id2reflectance/.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

Mar 24, 2024

Han Yan, Yang Li, Zhennan Wu, Shenzhou Chen, Weixuan Sun, Taizhang Shang, Weizhe Liu, Tian Chen, Xiaqiang Dai, Chao Ma(+2 more)

Figure 1 for Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

Figure 2 for Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

Figure 3 for Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

Figure 4 for Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

Abstract:We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in one single tri-plane tensor, from which multiple Singed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and then the denoising diffusion process is employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room or avatar cloth re-targeting.

* Video: https://youtu.be/lRn-HqyCrLI

Via

Access Paper or Ask Questions

Lightning NeRF: Efficient Hybrid Scene Representation for Autonomous Driving

Mar 09, 2024

Junyi Cao, Zhichao Li, Naiyan Wang, Chao Ma

Figure 1 for Lightning NeRF: Efficient Hybrid Scene Representation for Autonomous Driving

Figure 2 for Lightning NeRF: Efficient Hybrid Scene Representation for Autonomous Driving

Figure 3 for Lightning NeRF: Efficient Hybrid Scene Representation for Autonomous Driving

Figure 4 for Lightning NeRF: Efficient Hybrid Scene Representation for Autonomous Driving

Abstract:Recent studies have highlighted the promising application of NeRF in autonomous driving contexts. However, the complexity of outdoor environments, combined with the restricted viewpoints in driving scenarios, complicates the task of precisely reconstructing scene geometry. Such challenges often lead to diminished quality in reconstructions and extended durations for both training and rendering. To tackle these challenges, we present Lightning NeRF. It uses an efficient hybrid scene representation that effectively utilizes the geometry prior from LiDAR in autonomous driving scenarios. Lightning NeRF significantly improves the novel view synthesis performance of NeRF and reduces computational overheads. Through evaluations on real-world datasets, such as KITTI-360, Argoverse2, and our private dataset, we demonstrate that our approach not only exceeds the current state-of-the-art in novel view synthesis quality but also achieves a five-fold increase in training speed and a ten-fold improvement in rendering speed. Codes are available at https://github.com/VISION-SJTU/Lightning-NeRF .

* Accepted to ICRA 2024

Via

Access Paper or Ask Questions

Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation

Feb 16, 2024

Ziyang Wang, Chao Ma

Figure 1 for Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation

Figure 2 for Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation

Figure 3 for Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation

Figure 4 for Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation

Abstract:Medical image segmentation is increasingly reliant on deep learning techniques, yet the promising performance often come with high annotation costs. This paper introduces Weak-Mamba-UNet, an innovative weakly-supervised learning (WSL) framework that leverages the capabilities of Convolutional Neural Network (CNN), Vision Transformer (ViT), and the cutting-edge Visual Mamba (VMamba) architecture for medical image segmentation, especially when dealing with scribble-based annotations. The proposed WSL strategy incorporates three distinct architecture but same symmetrical encoder-decoder networks: a CNN-based UNet for detailed local feature extraction, a Swin Transformer-based SwinUNet for comprehensive global context understanding, and a VMamba-based Mamba-UNet for efficient long-range dependency modeling. The key concept of this framework is a collaborative and cross-supervisory mechanism that employs pseudo labels to facilitate iterative learning and refinement across the networks. The effectiveness of Weak-Mamba-UNet is validated on a publicly available MRI cardiac segmentation dataset with processed scribble annotations, where it surpasses the performance of a similar WSL framework utilizing only UNet or SwinUNet. This highlights its potential in scenarios with sparse or imprecise annotations. The source code is made publicly accessible.

Via

Access Paper or Ask Questions

Semi-Mamba-UNet: Pixel-Level Contrastive Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation

Feb 11, 2024

Ziyang Wang, Chao Ma

Abstract:Medical image segmentation is essential in diagnostics, treatment planning, and healthcare, with deep learning offering promising advancements. Notably, Convolutional Neural Network (CNN) excel in capturing local image features, whereas Vision Transformer (ViT) adeptly model long-range dependencies through multi-head self-attention mechanisms. Despite their strengths, both CNN and ViT face challenges in efficiently processing long-range dependencies within medical images, often requiring substantial computational resources. This issue, combined with the high cost and limited availability of expert annotations, poses significant obstacles to achieving precise segmentation. To address these challenges, this paper introduces the Semi-Mamba-UNet, which integrates a visual mamba-based UNet architecture with a conventional UNet into a semi-supervised learning (SSL) framework. This innovative SSL approach leverages dual networks to jointly generate pseudo labels and cross supervise each other, drawing inspiration from consistency regularization techniques. Furthermore, we introduce a self-supervised pixel-level contrastive learning strategy, employing a projector pair to further enhance feature learning capabilities. Our comprehensive evaluation on a publicly available MRI cardiac segmentation dataset, comparing against various SSL frameworks with different UNet-based segmentation networks, highlights the superior performance of Semi-Mamba-UNet. The source code has been made publicly accessible.

Via

Access Paper or Ask Questions

The Essential Role of Causality in Foundation World Models for Embodied AI

Feb 06, 2024

Tarun Gupta, Wenbo Gong, Chao Ma, Nick Pawlowski, Agrin Hilmkil, Meyer Scetbon, Ade Famoti, Ashley Juan Llorens, Jianfeng Gao, Stefan Bauer(+3 more)

Figure 1 for The Essential Role of Causality in Foundation World Models for Embodied AI

Figure 2 for The Essential Role of Causality in Foundation World Models for Embodied AI

Abstract:Recent advances in foundation models, especially in large multi-modal models and conversational agents, have ignited interest in the potential of generally capable embodied agents. Such agents would require the ability to perform new tasks in many different real-world environments. However, current foundation models fail to accurately model physical interactions with the real world thus not sufficient for Embodied AI. The study of causality lends itself to the construction of veridical world models, which are crucial for accurately predicting the outcomes of possible interactions. This paper focuses on the prospects of building foundation world models for the upcoming generation of embodied agents and presents a novel viewpoint on the significance of causality within these. We posit that integrating causal considerations is vital to facilitate meaningful physical interactions with the world. Finally, we demystify misconceptions about causality in this context and present our outlook for future research.

Via

Access Paper or Ask Questions

Vision-Informed Flow Image Super-Resolution with Quaternion Spatial Modeling and Dynamic Flow Convolution

Jan 29, 2024

Qinglong Cao, Zhengqin Xu, Chao Ma, Xiaokang Yang, Yuntian Chen

Abstract:Flow image super-resolution (FISR) aims at recovering high-resolution turbulent velocity fields from low-resolution flow images. Existing FISR methods mainly process the flow images in natural image patterns, while the critical and distinct flow visual properties are rarely considered. This negligence would cause the significant domain gap between flow and natural images to severely hamper the accurate perception of flow turbulence, thereby undermining super-resolution performance. To tackle this dilemma, we comprehensively consider the flow visual properties, including the unique flow imaging principle and morphological information, and propose the first flow visual property-informed FISR algorithm. Particularly, different from natural images that are constructed by independent RGB channels in the light field, flow images build on the orthogonal UVW velocities in the flow field. To empower the FISR network with an awareness of the flow imaging principle, we propose quaternion spatial modeling to model this orthogonal spatial relationship for improved FISR. Moreover, due to viscosity and surface tension characteristics, fluids often exhibit a droplet-like morphology in flow images. Inspired by this morphological property, we design the dynamic flow convolution to effectively mine the morphological information to enhance FISR. Extensive experiments on the newly acquired flow image datasets demonstrate the state-of-the-art performance of our method. Code and data will be made available.

Via

Access Paper or Ask Questions