Jianfei Cai

ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces

Aug 17, 2023
Qianyi Wu, Kaisiyuan Wang, Kejie Li, Jianmin Zheng, Jianfei Cai

In recent years, neural implicit surface reconstruction has emerged as a popular paradigm for multi-view 3D reconstruction. Unlike traditional multi-view stereo approaches, neural implicit surface-based methods leverage neural networks to represent 3D scenes as signed distance functions (SDFs). However, they tend to disregard the reconstruction of individual objects within the scene, which limits their performance and practical applications. To address this issue, the previous work ObjectSDF introduced a framework of object-compositional neural implicit surfaces, which uses 2D instance masks to supervise individual object SDFs. In this paper, we propose a new framework called ObjectSDF++ to overcome the limitations of ObjectSDF. First, in contrast to ObjectSDF, whose performance is primarily restricted by its converted semantic field, the core component of our model is an occlusion-aware object opacity rendering formulation that directly volume-renders object opacity to be supervised with instance masks. Second, we design a novel regularization term for object distinction, which effectively mitigates the issue that ObjectSDF may produce unexpected reconstructions in invisible regions due to the lack of constraints preventing object collisions. Our extensive experiments demonstrate that the new framework not only produces superior object reconstruction results but also significantly improves the quality of scene reconstruction. Code and more resources can be found at https://qianyiwu.github.io/objectsdf++
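
To make the occlusion-aware object opacity rendering concrete, here is a minimal PyTorch sketch of volume-rendering one opacity value per object along a ray from per-object SDFs, so it can be supervised with the pixel's instance mask. The logistic SDF-to-density mapping and the tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def render_object_opacities(object_sdfs, deltas, beta=0.1):
    """Hedged sketch: volume-render one opacity value per object along a ray.

    object_sdfs: (K, N) SDF values of K objects at N ray samples.
    deltas:      (N,)   distances between consecutive samples.
    beta:        scale of the assumed logistic SDF-to-density mapping.
    Returns:     (K,)   per-object opacities for this pixel, which can be
                 supervised with the pixel's 2D instance mask.
    """
    # The scene SDF is the union (minimum) of the object SDFs.
    scene_sdf = object_sdfs.min(dim=0).values                      # (N,)

    def sdf_to_alpha(sdf):
        # Logistic CDF as a simple stand-in for an SDF-based density model.
        density = torch.sigmoid(-sdf / beta) / beta
        return 1.0 - torch.exp(-density * deltas)

    # Transmittance is shared across objects, so an occluded object receives
    # little opacity even where its own SDF is small.
    alpha_scene = sdf_to_alpha(scene_sdf)                          # (N,)
    trans = torch.cumprod(
        torch.cat([alpha_scene.new_ones(1), 1.0 - alpha_scene]), dim=0)[:-1]

    alpha_obj = sdf_to_alpha(object_sdfs)                          # (K, N)
    return (trans * alpha_obj).sum(dim=-1)                         # (K,)

# Usage: supervise with the pixel's instance mask (one-hot over K objects), e.g.
# loss = torch.nn.functional.binary_cross_entropy(
#     render_object_opacities(object_sdfs, deltas), mask)
```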

* ICCV 2023. Project Page: https://qianyiwu.github.io/objectsdf++ Code: https://github.com/QianyiWu/objectsdf_plus 

Unified Open-Vocabulary Dense Visual Prediction

Jul 17, 2023
Hengcan Shi, Munawar Hayat, Jianfei Cai

In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance, and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific and tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce: separate networks can only leverage task-relevant training data, while a unified approach can integrate diverse training data to boost individual tasks. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set predictions, OV networks are usually trained with multi-modal data. We therefore propose a multi-modal, multi-scale, and multi-task (MMM) decoding mechanism to better leverage multi-modal data. Second, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of UOVN.
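
For readers unfamiliar with how open-vocabulary prediction works at inference time, the sketch below illustrates the common recipe that unified OV models build on: dense visual embeddings are classified by their similarity to text embeddings of arbitrary category names. The encoder, shapes, and temperature are placeholders; this is a generic illustration of the OV classification principle, not the UOVN architecture itself.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(pixel_embeds, class_names, text_encoder, temperature=0.07):
    """Hedged sketch of open-vocabulary dense classification.

    pixel_embeds: (H, W, D) visual embeddings from any dense prediction head.
    class_names:  list of arbitrary category names (the "open vocabulary").
    text_encoder: any callable mapping a list of strings to (C, D) embeddings,
                  e.g. a CLIP-style text encoder (placeholder assumption).
    Returns:      (H, W) predicted class indices and (H, W, C) probabilities.
    """
    text_embeds = F.normalize(text_encoder(class_names), dim=-1)   # (C, D)
    pixel_embeds = F.normalize(pixel_embeds, dim=-1)               # (H, W, D)

    # Cosine similarity between every pixel embedding and every class name.
    logits = pixel_embeds @ text_embeds.T / temperature            # (H, W, C)
    probs = logits.softmax(dim=-1)
    return probs.argmax(dim=-1), probs

# Usage with a dummy text encoder (assumption for illustration only):
if __name__ == "__main__":
    names = ["cat", "skateboard", "traffic cone"]
    dummy_encoder = lambda ns: torch.randn(len(ns), 256)
    labels, probs = open_vocab_classify(torch.randn(64, 64, 256), names, dummy_encoder)
```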

CoactSeg: Learning from Heterogeneous Data for New Multiple Sclerosis Lesion Segmentation

Jul 10, 2023
Yicheng Wu, Zhonghua Wu, Hengcan Shi, Bjoern Picker, Winston Chong, Jianfei Cai

New lesion segmentation is essential for estimating disease progression and therapeutic effects during multiple sclerosis (MS) clinical treatment. However, expensive data acquisition and expert annotation restrict the feasibility of applying large-scale deep learning models. Since single-time-point samples with all-lesion labels are relatively easy to collect, exploiting them to train deep models is highly desirable for improving new lesion segmentation. We therefore propose a coaction segmentation (CoactSeg) framework to exploit heterogeneous data (i.e., new-lesion-annotated two-time-point data and all-lesion-annotated single-time-point data) for new MS lesion segmentation. The CoactSeg model is designed as a unified model, with the same three inputs (the baseline scan, the follow-up scan, and their longitudinal brain differences) and the same three outputs (the corresponding all-lesion and new-lesion predictions), regardless of which type of heterogeneous data is being used. Moreover, a simple and effective relation regularization is proposed to enforce the longitudinal relations among the three outputs and improve model learning. Extensive experiments demonstrate that utilizing the heterogeneous data and the proposed longitudinal relation constraint significantly improves performance on both the new-lesion and all-lesion segmentation tasks. We also introduce an in-house MS-23v1 dataset containing 38 single-time-point samples from Oceania with all-lesion labels. Code and the dataset are released at https://github.com/ycwu1997/CoactSeg.
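
The longitudinal relation referred to above can be made explicit: new lesions should appear in the follow-up all-lesion map but not in the baseline one. Below is a minimal sketch of one way such a relation regularization could be written; the exact loss form used in CoactSeg may differ, so treat this as an assumption for illustration.

```python
import torch

def relation_regularization(baseline_all, followup_all, new_lesion):
    """Hedged sketch of a longitudinal relation constraint.

    All inputs are probability maps in [0, 1] with the same shape, e.g.
    (B, 1, D, H, W): all lesions at baseline, all lesions at follow-up,
    and new lesions between the two time points.
    """
    # New lesions should be contained in the follow-up all-lesion map ...
    containment = torch.relu(new_lesion - followup_all).mean()
    # ... and should not overlap lesions already present at baseline.
    exclusion = (new_lesion * baseline_all).mean()
    return containment + exclusion

# Usage: add to the supervised segmentation losses with a small weight, e.g.
# loss = seg_loss + 0.1 * relation_regularization(p_base, p_follow, p_new)
```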

* Accepted by MICCAI 2023 (Early Acceptance) 

Open-Vocabulary Object Detection via Scene Graph Discovery

Jul 07, 2023
Hengcan Shi, Munawar Hayat, Jianfei Cai

In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes objects from a fixed set of categories, OV detection aims to detect objects in an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they only use pairs of nouns and individual objects in the VL data, even though these data usually contain much richer information, such as scene graphs, which are also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. First, we present a scene-graph-based decoder (SGDecoder) with sparse scene-graph-guided attention (SSGA), which captures scene graphs and leverages them to discover OV objects. Second, we propose scene-graph-based prediction (SGPred), in which a scene-graph-based offset regression (SGOR) mechanism enables mutual enhancement between scene graph extraction and object localization. Third, we design a cross-modal learning mechanism in SGPred that uses scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show that our model can perform OV scene graph detection, a task that previous OV scene graph generation methods cannot tackle.
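
The abstract only names the sparse scene-graph-guided attention (SSGA) mechanism. As a rough, hedged illustration of the general idea of letting predicted relations modulate attention between object queries, here is a small sketch; the relation scorer, the top-k sparsification, and all shapes are assumptions, not SGDN's actual design.

```python
import torch

def scene_graph_guided_attention(queries, relation_scores, top_k=5):
    """Hedged sketch: object queries attend to each other along the strongest
    predicted relations, i.e. a sparse scene-graph prior biases attention.

    queries:         (N, D) object query embeddings.
    relation_scores: (N, N) predicted relatedness between query pairs.
    """
    n, d = queries.shape
    attn = queries @ queries.T / d ** 0.5                          # (N, N)

    # Keep only the top-k relations per query to make the graph sparse.
    topk = relation_scores.topk(min(top_k, n), dim=-1).indices
    mask = torch.full_like(attn, float("-inf"))
    mask.scatter_(-1, topk, 0.0)

    # Bias attention toward related objects, then aggregate.
    attn = (attn + relation_scores + mask).softmax(dim=-1)
    return queries + attn @ queries
```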

Stitched ViTs are Flexible Vision Backbones

Jun 30, 2023
Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, Bohan Zhuang

Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works that utilize off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs of individual sizes requires separate training and is restricted to fixed performance-efficiency trade-offs. In this paper, we draw inspiration from stitchable neural networks (SN-Net), a recent framework that cheaply produces a single model covering rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework for downstream task adaptation. Specifically, we first propose a Two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling strategy that takes into account the underlying FLOPs distribution in the space for improved sampling. Finally, we observe that learning the stitching layers is a low-rank update, which plays an essential role in stabilizing training on downstream tasks and ensuring a good Pareto frontier. Extensive experiments on ImageNet-1K, ADE20K, COCO-Stuff-10K, NYUv2, and COCO-2017 demonstrate that SN-Netv2 serves as a flexible vision backbone, achieving clear advantages in both training efficiency and adaptation. Code will be released at https://github.com/ziplab/SN-Netv2.
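
The observation that learning stitching layers is a low-rank update suggests parameterizing each stitching layer as a frozen initialization plus a LoRA-style low-rank residual. The sketch below shows one way this could look; the rank, the module interface, and the frozen-base initialization (see the Stitchable Neural Networks entry further down this page for a least-squares sketch) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LowRankStitchingLayer(nn.Module):
    """Hedged sketch: a stitching layer whose frozen linear base is adapted on
    a downstream task only through a trainable low-rank (LoRA-style) update."""

    def __init__(self, base_weight, rank=4):
        super().__init__()
        dim_out, dim_in = base_weight.shape
        self.register_buffer("base", base_weight)                   # frozen init
        self.down = nn.Parameter(torch.randn(rank, dim_in) * 0.01)  # trainable
        self.up = nn.Parameter(torch.zeros(dim_out, rank))          # trainable

    def forward(self, x):
        # Frozen base mapping plus a low-rank residual learned downstream.
        return x @ self.base.T + x @ self.down.T @ self.up.T
```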

* Tech report 

Explicit Correspondence Matching for Generalizable Neural Radiance Fields

Apr 24, 2023
Yuedong Chen, Haofei Xu, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai

We present a new generalizable NeRF method that can directly generalize to new, unseen scenarios and perform novel view synthesis with as few as two source views. The key to our approach is explicitly modeled correspondence matching information, which provides a geometry prior for predicting NeRF color and density for volume rendering. The explicit correspondence matching is quantified by the cosine similarity between image features sampled at the 2D projections of a 3D point in different views, which provides reliable cues about the surface geometry. Unlike previous methods, where image features are extracted independently for each view, we model cross-view interactions via Transformer cross-attention, which greatly improves the feature matching quality. Our method achieves state-of-the-art results under different evaluation settings, and the experiments show a strong correlation between our learned cosine feature similarity and volume density, demonstrating the effectiveness and superiority of the proposed method. Code is available at https://github.com/donydchen/matchnerf
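
A minimal sketch of the core quantity described above: project a batch of 3D points into two source views, sample image features at the projections, and take their cosine similarity as a matching cue. The projection convention, feature shapes, and bilinear sampling are standard assumptions; the full method additionally refines features with cross-view Transformer attention.

```python
import torch
import torch.nn.functional as F

def project(points, K, R, t):
    """Project 3D points (N, 3) to pixel coords (N, 2) with intrinsics K and
    world-to-camera rotation R / translation t."""
    cam = points @ R.T + t                                          # (N, 3)
    uv = cam @ K.T                                                  # (N, 3)
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

def cross_view_cosine_similarity(points, feats, Ks, Rs, ts):
    """Hedged sketch: cosine similarity of image features sampled at the
    2D projections of each 3D point in two source views.

    points: (N, 3); feats: list of two (C, H, W) feature maps;
    Ks/Rs/ts: per-view intrinsics and extrinsics.
    """
    sampled = []
    for feat, K, R, t in zip(feats, Ks, Rs, ts):
        _, h, w = feat.shape
        uv = project(points, K, R, t)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
        grid = grid.view(1, -1, 1, 2)
        f = F.grid_sample(feat[None], grid, align_corners=True)     # (1, C, N, 1)
        sampled.append(f[0, :, :, 0].T)                             # (N, C)
    return F.cosine_similarity(sampled[0], sampled[1], dim=-1)      # (N,)
```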

* Code and pre-trained models: https://github.com/donydchen/matchnerf Project Page: https://donydchen.github.io/matchnerf/ 

Sensitivity-Aware Visual Parameter-Efficient Tuning

Mar 15, 2023
Haoyu He, Jianfei Cai, Jing Zhang, Dacheng Tao, Bohan Zhuang

Visual Parameter-Efficient Tuning (VPET) has become a powerful alternative to full fine-tuning for adapting pre-trained vision models to downstream tasks: it tunes only a small number of parameters while freezing the vast majority, easing the storage burden and optimization difficulty. However, existing VPET methods introduce trainable parameters at the same positions across different tasks, relying solely on human heuristics and neglecting domain gaps. To this end, we study where to introduce and how to allocate trainable parameters by proposing a novel Sensitivity-aware visual Parameter-efficient Tuning (SPT) scheme, which adaptively allocates trainable parameters to task-specific important positions given a desired tunable parameter budget. Specifically, SPT first quickly identifies the sensitive parameters that require tuning for a given task in a data-dependent way. Next, SPT further boosts the representational capability of the weight matrices whose number of sensitive parameters exceeds a pre-defined threshold by applying an existing structured tuning method, e.g., LoRA or Adapter, instead of directly tuning the selected sensitive parameters (unstructured tuning) under the budget. Extensive experiments on a wide range of downstream recognition tasks show that SPT is complementary to existing VPET methods and largely boosts their performance; for example, SPT improves Adapter with a supervised pre-trained ViT-B/16 backbone by 4.2% and 1.4% mean Top-1 accuracy, reaching state-of-the-art performance on the FGVC and VTAB-1k benchmarks, respectively. Source code is available at https://github.com/ziplab/SPT
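
A hedged sketch of the first step described above: estimating per-parameter sensitivity from a few batches of task data and selecting a global top-k set under a budget. The first-order saliency criterion (|gradient x weight|) and the few-batch setup are assumptions for illustration; whether a weight matrix then receives structured tuning (LoRA or Adapter) or unstructured tuning is decided by comparing its count of selected parameters to a threshold, as described above.

```python
import torch

@torch.no_grad()
def collect_sensitivity(model, grads):
    """Aggregate a simple first-order sensitivity score per parameter tensor,
    given accumulated gradients from a few task batches (assumed criterion)."""
    return {name: (g * p).abs() for (name, p), g in zip(model.named_parameters(), grads)}

def select_sensitive_params(model, loss_fn, batches, budget):
    """Hedged sketch: return a boolean mask per parameter tensor marking the
    globally most sensitive `budget` parameters for this task."""
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in batches:                        # a few data-dependent batches
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for g, p in zip(grads, model.parameters()):
            if p.grad is not None:
                g += p.grad.abs()
    scores = collect_sensitivity(model, grads)
    flat = torch.cat([s.flatten() for s in scores.values()])
    threshold = flat.topk(budget).values.min()  # global top-`budget` cut-off
    return {name: s >= threshold for name, s in scores.items()}
```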

* Tech report 

Reliability-Adaptive Consistency Regularization for Weakly-Supervised Point Cloud Segmentation

Mar 09, 2023
Zhonghua Wu, Yicheng Wu, Guosheng Lin, Jianfei Cai

Weakly-supervised point cloud segmentation with extremely limited labels is highly desirable for alleviating the expensive cost of collecting densely annotated 3D points. This paper explores applying consistency regularization, which is commonly used in weakly-supervised learning, to point cloud segmentation with multiple data-specific augmentations, a setting that has not been well studied. We observe that the straightforward way of applying consistency constraints to weakly-supervised point cloud segmentation has two major limitations: noisy pseudo labels due to conventional confidence-based selection, and insufficient consistency constraints due to discarding unreliable pseudo labels. We therefore propose a novel Reliability-Adaptive Consistency Network (RAC-Net) that uses both prediction confidence and model uncertainty to measure the reliability of pseudo labels and applies consistency training to all unlabeled points, with different consistency constraints for different points depending on the reliability of their pseudo labels. Experimental results on the S3DIS and ScanNet-v2 benchmarks show that our model achieves superior performance in weakly-supervised point cloud segmentation. The code will be released.
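
To make the reliability-adaptive idea concrete, here is a hedged sketch of a consistency loss that scores every unlabeled point with a reliability value combining prediction confidence and model uncertainty (entropy here), applying a stricter constraint to reliable points and a softer one to the rest instead of discarding them. The specific scores, threshold, and loss forms are assumptions, not RAC-Net's exact formulation.

```python
import torch
import torch.nn.functional as F

def reliability_adaptive_consistency(logits_weak, logits_strong, conf_thr=0.8):
    """Hedged sketch for (N, C) per-point logits of two augmented views.

    Reliable points (confident, low-uncertainty pseudo labels) get a hard
    cross-entropy consistency term; the remaining points get a softer
    KL-based consistency term instead of being discarded.
    """
    probs = logits_weak.softmax(dim=-1).detach()
    confidence, pseudo = probs.max(dim=-1)                          # (N,)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)    # (N,)
    reliability = confidence * torch.exp(-entropy)                  # in (0, 1]
    reliable = reliability > conf_thr

    hard = F.cross_entropy(logits_strong, pseudo, reduction="none")
    soft = F.kl_div(logits_strong.log_softmax(dim=-1), probs,
                    reduction="none").sum(dim=-1)
    return torch.where(reliable, hard, soft).mean()
```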

Stitchable Neural Networks

Feb 15, 2023
Zizheng Pan, Jianfei Cai, Bohan Zhuang

The public model zoo of powerful pretrained model families (e.g., ResNet/DeiT) has reached an unprecedented scope, contributing significantly to the success of deep learning. Since each model family consists of pretrained models of diverse scales (e.g., DeiT-Ti/S/B), a fundamental question naturally arises: how to efficiently assemble these readily available models in a family for dynamic accuracy-efficiency trade-offs at runtime. To this end, we present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment that cheaply produces numerous networks with different complexity-performance trade-offs given a family of pretrained neural networks, which we call anchors. Specifically, SN-Net splits the anchors across their blocks/layers and then stitches them together with simple stitching layers that map the activations from one anchor to another. With only a few epochs of training, SN-Net effectively interpolates between the performance of anchors of varying scales. At runtime, SN-Net can instantly adapt to dynamic resource constraints by switching the stitching positions. Extensive experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks while supporting diverse deployment scenarios. For example, by stitching Swin Transformers, we challenge hundreds of models in the timm model zoo with a single network. We believe this new elastic model framework can serve as a strong baseline for further research in wider communities.
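
The stitching layer described above is conceptually just a learnable linear map between the activation spaces of two anchors, commonly initialized with least squares on a small set of paired activations. The sketch below illustrates this idea; the shapes, the least-squares initialization, and the usage comment are assumptions consistent with the description above, not the repository's exact code.

```python
import torch
import torch.nn as nn

class StitchingLayer(nn.Module):
    """Hedged sketch of an SN-Net-style stitching layer: a plain linear map
    from the activation space of one anchor to that of another."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    @torch.no_grad()
    def init_from_activations(self, acts_small, acts_large):
        """Least-squares initialization from paired anchor activations on a
        few images, with tokens flattened to (M, dim_in) and (M, dim_out)."""
        sol = torch.linalg.lstsq(acts_small, acts_large).solution   # (dim_in, dim_out)
        self.proj.weight.copy_(sol.T)
        self.proj.bias.zero_()

    def forward(self, x):
        return self.proj(x)

# Usage sketch: run the first k blocks of the small anchor, stitch, then
# continue with the remaining blocks of the large anchor (splits assumed):
# x = small_anchor.blocks[:k](tokens); x = stitch(x); out = large_anchor.blocks[j:](x)
```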

* Project is available at https://snnet.github.io/ 