Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiatian Zhu

Sebra: Debiasing Through Self-Guided Bias Ranking

Jan 30, 2025

Adarsh Kappiyath, Abhra Chaudhuri, Ajay Jaiswal, Ziquan Liu, Yunpeng Li, Xiatian Zhu, Lu Yin

Figure 1 for Sebra: Debiasing Through Self-Guided Bias Ranking

Figure 2 for Sebra: Debiasing Through Self-Guided Bias Ranking

Figure 3 for Sebra: Debiasing Through Self-Guided Bias Ranking

Figure 4 for Sebra: Debiasing Through Self-Guided Bias Ranking

Abstract:Ranking samples by fine-grained estimates of spuriosity (the degree to which spurious cues are present) has recently been shown to significantly benefit bias mitigation, over the traditional binary biased-\textit{vs}-unbiased partitioning of train sets. However, this spuriosity ranking comes with the requirement of human supervision. In this paper, we propose a debiasing framework based on our novel \ul{Se}lf-Guided \ul{B}ias \ul{Ra}nking (\emph{Sebra}), that mitigates biases (spurious correlations) via an automatic ranking of data points by spuriosity within their respective classes. Sebra leverages a key local symmetry in Empirical Risk Minimization (ERM) training -- the ease of learning a sample via ERM inversely correlates with its spuriousity; the fewer spurious correlations a sample exhibits, the harder it is to learn, and vice versa. However, globally across iterations, ERM tends to deviate from this symmetry. Sebra dynamically steers ERM to correct this deviation, facilitating the sequential learning of attributes in increasing order of difficulty, \ie, decreasing order of spuriosity. As a result, the sequence in which Sebra learns samples naturally provides spuriousity rankings. We use the resulting fine-grained bias characterization in a contrastive learning framework to mitigate biases from multiple sources. Extensive experiments show that Sebra consistently outperforms previous state-of-the-art unsupervised debiasing techniques across multiple standard benchmarks, including UrbanCars, BAR, CelebA, and ImageNet-1K. Code, pre-trained models, and training logs are available at https://kadarsh22.github.io/sebra_iclr25/.

* Accepted to ICLR 2025

Via

Access Paper or Ask Questions

Foundational Models for 3D Point Clouds: A Survey and Outlook

Jan 30, 2025

Vishal Thengane, Xiatian Zhu, Salim Bouzerdoum, Son Lam Phung, Yunpeng Li

Figure 1 for Foundational Models for 3D Point Clouds: A Survey and Outlook

Figure 2 for Foundational Models for 3D Point Clouds: A Survey and Outlook

Figure 3 for Foundational Models for 3D Point Clouds: A Survey and Outlook

Figure 4 for Foundational Models for 3D Point Clouds: A Survey and Outlook

Abstract:The 3D point cloud representation plays a crucial role in preserving the geometric fidelity of the physical world, enabling more accurate complex 3D environments. While humans naturally comprehend the intricate relationships between objects and variations through a multisensory system, artificial intelligence (AI) systems have yet to fully replicate this capacity. To bridge this gap, it becomes essential to incorporate multiple modalities. Models that can seamlessly integrate and reason across these modalities are known as foundation models (FMs). The development of FMs for 2D modalities, such as images and text, has seen significant progress, driven by the abundant availability of large-scale datasets. However, the 3D domain has lagged due to the scarcity of labelled data and high computational overheads. In response, recent research has begun to explore the potential of applying FMs to 3D tasks, overcoming these challenges by leveraging existing 2D knowledge. Additionally, language, with its capacity for abstract reasoning and description of the environment, offers a promising avenue for enhancing 3D understanding through large pre-trained language models (LLMs). Despite the rapid development and adoption of FMs for 3D vision tasks in recent years, there remains a gap in comprehensive and in-depth literature reviews. This article aims to address this gap by presenting a comprehensive overview of the state-of-the-art methods that utilize FMs for 3D visual understanding. We start by reviewing various strategies employed in the building of various 3D FMs. Then we categorize and summarize use of different FMs for tasks such as perception tasks. Finally, the article offers insights into future directions for research and development in this field. To help reader, we have curated list of relevant papers on the topic: https://github.com/vgthengane/Awesome-FMs-in-3D.

* Initial submission

Via

Access Paper or Ask Questions

Robust Low-Light Human Pose Estimation through Illumination-Texture Modulation

Jan 14, 2025

Feng Zhang, Ze Li, Xiatian Zhu, Lei Chen

Figure 1 for Robust Low-Light Human Pose Estimation through Illumination-Texture Modulation

Figure 2 for Robust Low-Light Human Pose Estimation through Illumination-Texture Modulation

Figure 3 for Robust Low-Light Human Pose Estimation through Illumination-Texture Modulation

Figure 4 for Robust Low-Light Human Pose Estimation through Illumination-Texture Modulation

Abstract:As critical visual details become obscured, the low visibility and high ISO noise in extremely low-light images pose a significant challenge to human pose estimation. Current methods fail to provide high-quality representations due to reliance on pixel-level enhancements that compromise semantics and the inability to effectively handle extreme low-light conditions for robust feature learning. In this work, we propose a frequency-based framework for low-light human pose estimation, rooted in the "divide-and-conquer" principle. Instead of uniformly enhancing the entire image, our method focuses on task-relevant information. By applying dynamic illumination correction to the low-frequency components and low-rank denoising to the high-frequency components, we effectively enhance both the semantic and texture information essential for accurate pose estimation. As a result, this targeted enhancement method results in robust, high-quality representations, significantly improving pose estimation performance. Extensive experiments demonstrating its superiority over state-of-the-art methods in various challenging low-light scenarios.

* 5 pages, 2 figures, conference

Via

Access Paper or Ask Questions

AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation

Jan 14, 2025

Feng Zhang, Jinwei Liu, Xiatian Zhu, Lei Chen

Figure 1 for AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation

Figure 2 for AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation

Figure 3 for AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation

Figure 4 for AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation

Abstract:Pose distillation is widely adopted to reduce model size in human pose estimation. However, existing methods primarily emphasize the transfer of teacher knowledge while often neglecting the performance degradation resulted from the curse of capacity gap between teacher and student. To address this issue, we propose AgentPose, a novel pose distillation method that integrates a feature agent to model the distribution of teacher features and progressively aligns the distribution of student features with that of the teacher feature, effectively overcoming the capacity gap and enhancing the ability of knowledge transfer. Our comprehensive experiments conducted on the COCO dataset substantiate the effectiveness of our method in knowledge transfer, particularly in scenarios with a high capacity gap.

* 5 pages, 1 figures

Via

Access Paper or Ask Questions

Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

Jan 07, 2025

Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu

Figure 1 for Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

Figure 2 for Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

Figure 3 for Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

Figure 4 for Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

Abstract:In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects -- we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp! Code will be released at https://github.com/kamwoh/chirpy3d.

* 20 pages

Via

Access Paper or Ask Questions

4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives

Dec 30, 2024

Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang, Yu-Gang Jiang, Philip H. S. Torr

Figure 1 for 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives

Figure 2 for 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives

Figure 3 for 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives

Figure 4 for 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives

Abstract:Dynamic 3D scene representation and novel view synthesis from captured videos are crucial for enabling immersive experiences required by AR/VR and metaverse applications. However, this task is challenging due to the complexity of unconstrained real-world scenes and their temporal dynamics. In this paper, we frame dynamic scenes as a spatio-temporal 4D volume learning problem, offering a native explicit reformulation with minimal assumptions about motion, which serves as a versatile dynamic scene learning framework. Specifically, we represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features, dubbed as 4D Gaussian splatting (4DGS). This approach can capture relevant information in space and time by fitting the underlying spatio-temporal volume. Modeling the spacetime as a whole with 4D Gaussians parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, our model can naturally learn view-dependent and time-evolved appearance with 4D spherindrical harmonics. Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, photorealistic novel views for complex dynamic scenes. To enhance efficiency, we derive several compact variants that effectively reduce memory footprint and mitigate the risk of overfitting. Extensive experiments validate the superiority of 4DGS in terms of visual quality and efficiency across a range of dynamic scene-related tasks (e.g., novel view synthesis, 4D generation, scene understanding) and scenarios (e.g., single object, indoor scenes, driving environments, synthetic and real data).

* Journal extension of ICLR 2024. arXiv admin note: text overlap with arXiv:2310.10642

Via

Access Paper or Ask Questions

Reflective Gaussian Splatting

Dec 26, 2024

Yuxuan Yao, Zixuan Zeng, Chun Gu, Xiatian Zhu, Li Zhang

Figure 1 for Reflective Gaussian Splatting

Figure 2 for Reflective Gaussian Splatting

Figure 3 for Reflective Gaussian Splatting

Figure 4 for Reflective Gaussian Splatting

Abstract:Novel view synthesis has experienced significant advancements owing to increasingly capable NeRF- and 3DGS-based methods. However, reflective object reconstruction remains challenging, lacking a proper solution to achieve real-time, high-quality rendering while accommodating inter-reflection. To fill this gap, we introduce a Reflective Gaussian splatting (\textbf{Ref-Gaussian}) framework characterized with two components: (I) {\em Physically based deferred rendering} that empowers the rendering equation with pixel-level material properties via formulating split-sum approximation; (II) {\em Gaussian-grounded inter-reflection} that realizes the desired inter-reflection function within a Gaussian splatting paradigm for the first time. To enhance geometry modeling, we further introduce material-aware normal propagation and an initial per-Gaussian shading stage, along with 2D Gaussian primitives. Extensive experiments on standard datasets demonstrate that Ref-Gaussian surpasses existing approaches in terms of quantitative metrics, visual quality, and compute efficiency. Further, we show that our method serves as a unified solution for both reflective and non-reflective scenes, going beyond the previous alternatives focusing on only reflective scenes. Also, we illustrate that Ref-Gaussian supports more applications such as relighting and editing.

* 17 pages, 14 figures

Via

Access Paper or Ask Questions

Is Foreground Prototype Sufficient? Few-Shot Medical Image Segmentation with Background-Fused Prototype

Dec 04, 2024

Song Tang, Chunxiao Zu, Wenxin Su, Yuan Dong, Mao Ye, Yan Gan, Xiatian Zhu

Abstract:Few-shot Semantic Segmentation(FSS)aim to adapt a pre-trained model to new classes with as few as a single labeled training sample per class. The existing prototypical work used in natural image scenarios biasedly focus on capturing foreground's discrimination while employing a simplistic representation for background, grounded on the inherent observation separation between foreground and background. However, this paradigm is not applicable to medical images where the foreground and background share numerous visual features, necessitating a more detailed description for background. In this paper, we present a new pluggable Background-fused prototype(Bro)approach for FSS in medical images. Instead of finding a commonality of background subjects in support image, Bro incorporates this background with two pivot designs. Specifically, Feature Similarity Calibration(FeaC)initially reduces noise in the support image by employing feature cross-attention with the query image. Subsequently, Hierarchical Channel Adversarial Attention(HiCA)merges the background into comprehensive prototypes. We achieve this by a channel groups-based attention mechanism, where an adversarial Mean-Offset structure encourages a coarse-to-fine fusion. Extensive experiments show that previous state-of-the-art methods, when paired with Bro, experience significant performance improvements. This demonstrates a more integrated way to represent backgrounds specifically for medical image.

Via

Access Paper or Ask Questions

Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Dec 02, 2024

Wenxin Su, Song Tang, Xiaofeng Liu, Xiaojing Yi, Mao Ye, Chunxiao Zu, Jiahao Li, Xiatian Zhu

Figure 1 for Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Figure 2 for Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Figure 3 for Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Figure 4 for Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Abstract:Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way--learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as pseudo-perturbation labels. Because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling the identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with small batch size.

Via

Access Paper or Ask Questions

Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Dec 02, 2024

Zeyu Yang, Zijie Pan, Yuankun Yang, Xiatian Zhu, Li Zhang

Figure 1 for Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Figure 2 for Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Figure 3 for Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Figure 4 for Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Abstract:Driving scene synthesis along free-form trajectories is essential for driving simulations to enable closed-loop evaluation of end-to-end driving policies. While existing methods excel at novel view synthesis on recorded trajectories, they face challenges with novel trajectories due to limited views of driving videos and the vastness of driving environments. To tackle this challenge, we propose a novel free-form driving view synthesis approach, dubbed DriveX, by leveraging video generative prior to optimize a 3D model across a variety of trajectories. Concretely, we crafted an inverse problem that enables a video diffusion model to be utilized as a prior for many-trajectory optimization of a parametric 3D model (e.g., Gaussian splatting). To seamlessly use the generative prior, we iteratively conduct this process during optimization. Our resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory, enabling free-form trajectory driving simulation. Beyond real driving scenes, DriveX can also be utilized to simulate virtual driving worlds from AI-generated videos.

Via

Access Paper or Ask Questions