Xiatian Zhu

Bayesian Detector Combination for Object Detection with Crowdsourced Annotations

Jul 10, 2024

Zhi Qin Tan, Olga Isupova, Gustavo Carneiro, Xiatian Zhu, Yunpeng Li

Abstract: Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise, especially in crowdsourcing scenarios. Most prior object detection methods assume accurate annotations. A few recent works have studied object detection with noisy crowdsourced annotations, but they are evaluated on distinct synthetic crowdsourced datasets of varying setups under artificial assumptions. To address these algorithmic limitations and evaluation inconsistency, we first propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations, with the unique ability of automatically inferring the annotators' label qualities. Unlike previous approaches, BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models. Due to the scarcity of real-world crowdsourced datasets, we introduce large synthetic datasets by simulating varying crowdsourcing scenarios. This allows consistent evaluation of different models at scale. Extensive experiments on both real and synthetic crowdsourced datasets show that BDC outperforms existing state-of-the-art methods, demonstrating its superiority in leveraging crowdsourced data for object detection. Our code and data are available at https://github.com/zhiqin1998/bdc.

* Accepted at ECCV 2024
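The BDC abstract mentions automatically inferring annotators' label qualities. As a rough illustration of that general idea only, and not the BDC algorithm itself, the sketch below estimates a per-annotator confusion matrix as the posterior mean under a symmetric Dirichlet prior. The tuple format, the `consensus` mapping, and the function name are all hypothetical; in practice the consensus would have to come from somewhere else, such as a detector's current predictions.

```python
from collections import defaultdict


def annotator_confusion_posterior(labels, consensus, num_classes, prior=1.0):
    """Posterior-mean confusion matrix per annotator (Dirichlet-categorical).

    labels: iterable of (annotator_id, item_id, class_label) tuples
            (a hypothetical input format chosen for this sketch).
    consensus: dict mapping item_id -> assumed true class.
    prior: symmetric Dirichlet pseudo-count added to every cell.
    """
    # counts[annotator][true_class][reported_class], initialised with the prior.
    counts = defaultdict(
        lambda: [[prior] * num_classes for _ in range(num_classes)]
    )
    for annotator, item, label in labels:
        counts[annotator][consensus[item]][label] += 1.0

    # Normalise each row so it becomes a distribution over reported classes.
    return {
        annotator: [[c / sum(row) for c in row] for row in rows]
        for annotator, rows in counts.items()
    }


# Toy data: annotator "a1" always agrees with the consensus, "a2" is at chance.
labels = [
    ("a1", 0, 0), ("a1", 1, 1), ("a1", 2, 0), ("a1", 3, 1),
    ("a2", 0, 0), ("a2", 1, 0), ("a2", 2, 1), ("a2", 3, 1),
]
consensus = {0: 0, 1: 1, 2: 0, 3: 1}
posterior = annotator_confusion_posterior(labels, consensus, num_classes=2)
print(posterior["a1"][0])  # row for true class 0, annotator a1
print(posterior["a2"][0])  # row for true class 0, annotator a2
```

A more diagonal matrix indicates a more reliable annotator; the prior keeps estimates away from 0 and 1 when an annotator has labelled only a few items.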

PartCraft: Crafting Creative Objects by Parts

Jul 05, 2024

Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, Tao Xiang

Abstract: This paper propels creative control in generative visual AI by allowing users to "select". Departing from traditional text or sketch-based methods, we allow users, for the first time, to choose visual concepts by parts for their creative endeavors. The outcome is fine-grained generation that precisely captures selected visual concepts, ensuring a holistically faithful and plausible result. To achieve this, we first parse objects into parts through unsupervised feature clustering. Then, we encode parts into text tokens and introduce an entropy-based normalized attention loss that operates on them. This loss design enables our model to learn generic prior topology knowledge about an object's part composition, and to further generalize to novel part compositions so that the generation looks holistically faithful. Lastly, we employ a bottleneck encoder to project the part tokens. This not only enhances fidelity but also accelerates learning, by leveraging shared knowledge and facilitating information exchange among instances. Visual results in the paper and supplementary material showcase the compelling power of PartCraft in crafting highly customized, innovative creations, exemplified by the "charming" and creative birds. Code is released at https://github.com/kamwoh/partcraft.

* ECCV 2024. arXiv admin note: substantial text overlap with arXiv:2311.15477

Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

Jun 26, 2024

Song Tang, Shaxu Yan, Xiaozhi Qi, Jianxin Gao, Mao Ye, Jianwei Zhang, Xiatian Zhu

Abstract: Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with considerably distinct objects and backgrounds that are not highly complex, e.g., natural images. This makes such models suboptimal for medical imaging, where neither condition holds. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes representing the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods.

AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

Jun 14, 2024

Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

Abstract: Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given a mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to low efficiency originating from heavy NeRF rendering, these methods all have a limited ability to characterize the entire scene environment, such as room geometry, material properties, and the spatial relation between the listener and sound source. To address these issues, we propose a novel Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with an audio-guidance parameter on locally initialized Gaussian points, taking into account the spatial relation between the listener and sound source. To make the visual scene model audio adaptive, we propose a point densification and pruning strategy to optimally distribute the Gaussian points according to the per-point contribution to sound propagation (e.g., more points are needed for texture-less wall surfaces as they affect sound path diversion). Extensive experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.

Gaussian Splatting with Localized Points Management

Jun 13, 2024

Haosen Yang, Chenhao Zhang, Wenqing Wang, Marco Volino, Adrian Hilton, Li Zhang, Xiatian Zhu

Abstract: Point management is a critical component in optimizing 3D Gaussian Splatting (3DGS) models, as the point initiation (e.g., via structure from motion) is distributionally inappropriate. Typically, the Adaptive Density Control (ADC) algorithm is applied, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity reset. However, we reveal that this strategy is limited in tackling intricate/special image regions (e.g., transparent) as it is unable to identify all the 3D zones that require point densification, and lacks an appropriate mechanism to handle the ill-conditioned points with negative impacts (occlusion due to false high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in the highest demand for both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, with the guidance of image rendering errors. We apply point densification in the identified zone, whilst resetting the opacity of those points residing in front of these regions so that a new opportunity is created to correct ill-conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing 3D Gaussian Splatting models. Experimental evaluation across both static 3D and dynamic 4D scenes validates the efficacy of our LPM strategy in boosting a variety of existing 3DGS models both quantitatively and qualitatively. Notably, LPM improves both vanilla 3DGS and SpaceTimeGS to achieve state-of-the-art rendering quality while retaining real-time speeds, including on challenging datasets such as Tanks & Temples and the Neural 3D Video Dataset.
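The three ADC ingredients named in the abstract (gradient thresholding for densification, opacity thresholding for pruning, and a regular opacity reset) can be sketched as follows. This is a deliberately simplified baseline illustration, not the LPM method and not the reference 3DGS code: the dictionary keys, the function name, and the default threshold values are assumptions made for this example, and real implementations clone or split Gaussians rather than copying them verbatim.

```python
def adaptive_density_control_step(points, step, grad_threshold=0.0002,
                                  opacity_threshold=0.005,
                                  reset_interval=3000, reset_opacity=0.01):
    """One simplified ADC pass over a list of point dicts.

    Each point uses hypothetical keys chosen for this sketch:
      'grad_sum': accumulated view-space gradient magnitude
      'views':    number of views that contributed to 'grad_sum'
      'opacity':  current opacity in [0, 1]
    Returns a new list; the input points are not modified.
    """
    out = []
    for p in points:
        # Pruning: drop points that have become almost transparent.
        if p["opacity"] < opacity_threshold:
            continue
        # View-averaged gradient magnitude for this point.
        avg_grad = p["grad_sum"] / max(p["views"], 1)
        # Keep the point and clear its gradient statistics for the next round.
        survivor = dict(p, grad_sum=0.0, views=0)
        out.append(survivor)
        # Densification: duplicate points with a large average gradient.
        if avg_grad > grad_threshold:
            out.append(dict(survivor))
    # Regular all-points opacity reset.
    if step > 0 and step % reset_interval == 0:
        for p in out:
            p["opacity"] = min(p["opacity"], reset_opacity)
    return out


points = [
    {"grad_sum": 0.004, "views": 10, "opacity": 0.8},    # densified
    {"grad_sum": 0.001, "views": 10, "opacity": 0.5},    # kept as-is
    {"grad_sum": 0.004, "views": 10, "opacity": 0.001},  # pruned
]
result = adaptive_density_control_step(points, step=100)
reset = adaptive_density_control_step(points, step=3000)
print(len(result), [p["opacity"] for p in result])
```

Because every decision here depends only on per-point statistics, a point hidden behind a falsely opaque neighbour receives little gradient and is never flagged, which is the kind of limitation the abstract attributes to ADC.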

ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery

Jun 12, 2024

Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, Tao Xiang

Abstract: Existing fine-grained hashing methods typically lack code interpretability as they compute hash code bits holistically using both global and local features. To address this limitation, we propose ConceptHash, a novel method that achieves sub-code level interpretability. In ConceptHash, each sub-code corresponds to a human-understandable concept, such as an object part, and these concepts are automatically discovered without human annotations. Specifically, we leverage a Vision Transformer architecture and introduce concept tokens as visual prompts, along with image patch tokens as model inputs. Each concept is then mapped to a specific sub-code at the model output, providing natural sub-code interpretability. To capture subtle visual differences among highly similar sub-categories (e.g., bird species), we incorporate language guidance to ensure that the learned hash codes are distinguishable within fine-grained object classes while maintaining semantic alignment. This approach allows us to develop hash codes that exhibit similarity within families of species while remaining distinct from species in other families. Extensive experiments on four fine-grained image retrieval benchmarks demonstrate that ConceptHash outperforms previous methods by a significant margin, offering unique sub-code interpretability as an additional benefit. Code at: https://github.com/kamwoh/concepthash.

* CVPRW 2024 - FGVC11 best paper award
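To illustrate what sub-code level interpretability could look like at retrieval time, the sketch below splits two binary hash codes into equal-width sub-codes and reports a Hamming distance per sub-code. The equal-width layout, the 12-bit toy codes, and the function name are assumptions for this example; the abstract does not specify how ConceptHash lays out its codes.

```python
def sub_code_distances(code_a, code_b, bits_per_concept):
    """Hamming distance per concept sub-code between two binary hash codes.

    Assumes (for this sketch) that the full code is a concatenation of
    equal-width sub-codes, one per concept.
    """
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    if len(code_a) % bits_per_concept != 0:
        raise ValueError("code length must be a multiple of bits_per_concept")

    distances = []
    for start in range(0, len(code_a), bits_per_concept):
        a = code_a[start:start + bits_per_concept]
        b = code_b[start:start + bits_per_concept]
        distances.append(sum(x != y for x, y in zip(a, b)))
    return distances


# Hypothetical 12-bit codes made of three 4-bit sub-codes.
bird_a = [1, 0, 1, 1,  0, 0, 1, 0,  1, 1, 0, 0]
bird_b = [1, 0, 1, 1,  1, 1, 0, 0,  1, 1, 0, 1]
per_concept = sub_code_distances(bird_a, bird_b, bits_per_concept=4)
print(per_concept, sum(per_concept))
```

The sum of the per-concept distances equals the ordinary Hamming distance, so the breakdown adds an explanation of which concept differs without changing the retrieval ranking.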

Localized Gaussian Point Management

Jun 06, 2024

Haosen Yang, Chenhao Zhang, Wenqing Wang, Marco Volino, Adrian Hilton, Li Zhang, Xiatian Zhu

Tetrahedron Splatting for 3D Generation

Jun 03, 2024

Chun Gu, Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang

Abstract: 3D representation is essential to the significant advance of 3D generation with 2D diffusion priors. As a flexible representation, NeRF was adopted first. With density-based volumetric rendering, however, it suffers from both intensive computational overhead and inaccurate mesh extraction. Using a signed distance field and Marching Tetrahedra, DMTet allows for precise mesh extraction and real-time rendering but is limited in handling large topological changes in meshes, leading to optimization challenges. Alternatively, 3D Gaussian Splatting (3DGS) is favored for its training and rendering efficiency but falls short in mesh extraction. In this work, we introduce a novel 3D representation, Tetrahedron Splatting (TeT-Splatting), that supports easy convergence during optimization, precise mesh extraction, and real-time rendering simultaneously. This is achieved by integrating surface-based volumetric rendering within a structured tetrahedral grid, which preserves the desired ability of precise mesh extraction, together with a tile-based differentiable tetrahedron rasterizer. Furthermore, we incorporate eikonal and normal consistency regularization terms for the signed distance field to improve generation quality and stability. Critically, our representation can be trained without mesh extraction, making the optimization process easier to converge. Our TeT-Splatting can be readily integrated into existing 3D generation pipelines, along with polygonal mesh for texture optimization. Extensive experiments show that our TeT-Splatting strikes a superior tradeoff among convergence speed, render efficiency, and mesh quality as compared to previous alternatives under varying 3D generation settings.

* Code: https://github.com/fudan-zvg/tet-splatting
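The eikonal regularization mentioned in the abstract encourages a signed distance field f to have unit gradient norm. A common form is the mean of (||∇f(x)|| − 1)² over sample points; the exact form and sampling used by TeT-Splatting are not given here, so the sketch below is a generic illustration using central finite differences on an analytic sphere. All function names and sample points are made up for this example.

```python
import math


def sphere_sdf(x, y, z, radius=1.0):
    """Exact signed distance to a sphere centred at the origin."""
    return math.sqrt(x * x + y * y + z * z) - radius


def eikonal_loss(sdf, samples, eps=1e-4):
    """Mean of (||grad f|| - 1)^2 over samples, via central differences."""
    total = 0.0
    for x, y, z in samples:
        gx = (sdf(x + eps, y, z) - sdf(x - eps, y, z)) / (2.0 * eps)
        gy = (sdf(x, y + eps, z) - sdf(x, y - eps, z)) / (2.0 * eps)
        gz = (sdf(x, y, z + eps) - sdf(x, y, z - eps)) / (2.0 * eps)
        norm = math.sqrt(gx * gx + gy * gy + gz * gz)
        total += (norm - 1.0) ** 2
    return total / len(samples)


samples = [(0.5, 0.2, -0.3), (1.5, 0.0, 0.4), (-0.7, 0.9, 0.1)]

# A true distance field satisfies the eikonal equation, so the loss is ~0.
true_loss = eikonal_loss(sphere_sdf, samples)

# Scaling the field by 2 doubles the gradient norm, giving a loss near 1.
scaled_loss = eikonal_loss(lambda x, y, z: 2.0 * sphere_sdf(x, y, z), samples)
print(true_loss, scaled_loss)
```

In a training pipeline the gradient would normally come from automatic differentiation or from the grid structure rather than finite differences; the penalty itself stays the same.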

Proxy Denoising for Source-Free Domain Adaptation

Jun 03, 2024

Song Tang, Wenxin Su, Mao Ye, Jianwei Zhang, Xiatian Zhu

Abstract: Source-free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain with no access to the source data. Inspired by the success of pre-trained large vision-language (ViL) models in many other applications, the latest SFDA methods have also validated the benefit of ViL models by leveraging their predictions as pseudo supervision. However, we observe that ViL's predictions could be noisy and inaccurate at an unknown rate, potentially introducing additional negative effects during adaptation. To address this thus-far ignored challenge, in this paper, we introduce a novel Proxy Denoising (ProDe) approach. Specifically, we leverage the ViL model as a proxy to facilitate the adaptation process towards the latent domain-invariant space. Critically, we design a proxy denoising mechanism for correcting ViL's predictions. This is grounded on a novel proxy confidence theory that elegantly models the domain adaptation effect of the proxy's divergence against the domain-invariant space. To capitalize on the corrected proxy, we further derive a mutual knowledge distilling regularization. Extensive experiments show that our ProDe significantly outperforms the current state-of-the-art alternatives under both the conventional closed-set setting and the more challenging open-set, partial-set and generalized SFDA settings. The code will be released soon.

Automating the Diagnosis of Human Vision Disorders by Cross-modal 3D Generation

May 24, 2024

Li Zhang, Yuankun Yang, Ziyang Xie, Zhiyuan Yuan, Jianfeng Feng, Xiatian Zhu, Yu-Gang Jiang

Abstract: Understanding the hidden mechanisms behind human visual perception is a fundamental quest in neuroscience and underpins a wide variety of critical applications, e.g., clinical diagnosis. To that end, investigating neural responses to human mental activity, for example with functional Magnetic Resonance Imaging (fMRI), has been a significant research vehicle. However, analyzing fMRI signals is challenging, costly, and daunting, and it demands professional training. Despite remarkable progress in artificial intelligence (AI) based fMRI analysis, existing solutions are limited and far from being clinically meaningful. In this context, we demonstrate how AI can go beyond the current state of the art by decoding fMRI into visually plausible 3D visuals, enabling automatic clinical analysis of fMRI data, even without healthcare professionals. We reformulate the task of analyzing fMRI data as a conditional 3D scene reconstruction problem. We design a novel cross-modal 3D scene representation learning method, Brain3D, that takes as input the fMRI data of a subject who was presented with a 2D object image, and yields as output the corresponding 3D object visuals. Importantly, we show that in simulated scenarios our AI agent captures the distinct functionalities of each region of the human visual system as well as their intricate interplay, aligning remarkably with the established discoveries of neuroscience. Non-expert diagnosis indicates that Brain3D can successfully identify the disordered brain regions, such as V1, V2, V3, V4, and the medial temporal lobe (MTL) within the human visual system. We also present results in the cross-modal 3D visual construction setting, showcasing the perception quality of our 3D scene generation.

* 25 pages, 16 figures, project page: https://brain-3d.github.io/
