Abstract:During endovascular interventions, physicians have to perform accurate and immediate operations based on the available real-time information, such as the shape and position of guidewires observed on the fluoroscopic images, haptic information and the patients' physiological signals. For this purpose, real-time and accurate guidewire segmentation and tracking can enhance the visualization of guidewires and provide visual feedback for physicians during the intervention as well as for robot-assisted interventions. Nevertheless, this task often comes with the challenge of elongated deformable structures that present themselves with low contrast in the noisy fluoroscopic image sequences. To address these issues, a two-stage deep learning framework for real-time guidewire segmentation and tracking is proposed. In the first stage, a Yolov5s detector is trained, using the original X-ray images as well as synthetic ones, which is employed to output the bounding boxes of possible target guidewires. More importantly, a refinement module based on spatiotemporal constraints is incorporated to robustly localize the guidewire and remove false detections. In the second stage, a novel and efficient network is proposed to segment the guidewire in each detected bounding box. The network contains two major modules, namely a hessian-based enhancement embedding module and a dual self-attention module. Quantitative and qualitative evaluations on clinical intra-operative images demonstrate that the proposed approach significantly outperforms our baselines as well as the current state of the art and, in comparison, shows higher robustness to low quality images.
Abstract:The remarkable performance of Vision Transformers (ViTs) typically requires an extremely large training cost. Existing methods have attempted to accelerate the training of ViTs, yet typically disregard method universality with accuracy dropping. Meanwhile, they break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of original transformers, preventing the loss of crucial learnable information in the training process. ToE can not only be seamlessly integrated into the training and fine-tuning process of transformers (e.g., DeiT and LV-ViT), but also effective for efficient training frameworks (e.g., EfficientTrain), without twisting the original training hyper-parameters, architecture, and introducing additional training strategies. Extensive experiments demonstrate that ToE achieves about 1.3x faster for the training of ViTs in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion .
Abstract:Semi-supervised learning (SSL) is a practical challenge in computer vision. Pseudo-label (PL) methods, e.g., FixMatch and FreeMatch, obtain the State Of The Art (SOTA) performances in SSL. These approaches employ a threshold-to-pseudo-label (T2L) process to generate PLs by truncating the confidence scores of unlabeled data predicted by the self-training method. However, self-trained models typically yield biased and high-variance predictions, especially in the scenarios when a little labeled data are supplied. To address this issue, we propose a lightweight channel-based ensemble method to effectively consolidate multiple inferior PLs into the theoretically guaranteed unbiased and low-variance one. Importantly, our approach can be readily extended to any SSL framework, such as FixMatch or FreeMatch. Experimental results demonstrate that our method significantly outperforms state-of-the-art techniques on CIFAR10/100 in terms of effectiveness and efficiency.
Abstract:Self-supervised monocular depth estimation methods have been increasingly given much attention due to the benefit of not requiring large, labelled datasets. Such self-supervised methods require high-quality salient features and consequently suffer from severe performance drop for indoor scenes, where low-textured regions dominant in the scenes are almost indiscriminative. To address the issue, we propose a self-supervised indoor monocular depth estimation framework called $\mathrm{F^2Depth}$. A self-supervised optical flow estimation network is introduced to supervise depth learning. To improve optical flow estimation performance in low-textured areas, only some patches of points with more discriminative features are adopted for finetuning based on our well-designed patch-based photometric loss. The finetuned optical flow estimation network generates high-accuracy optical flow as a supervisory signal for depth estimation. Correspondingly, an optical flow consistency loss is designed. Multi-scale feature maps produced by finetuned optical flow estimation network perform warping to compute feature map synthesis loss as another supervisory signal for depth learning. Experimental results on the NYU Depth V2 dataset demonstrate the effectiveness of the framework and our proposed losses. To evaluate the generalization ability of our $\mathrm{F^2Depth}$, we collect a Campus Indoor depth dataset composed of approximately 1500 points selected from 99 images in 18 scenes. Zero-shot generalization experiments on 7-Scenes dataset and Campus Indoor achieve $\delta_1$ accuracy of 75.8% and 76.0% respectively. The accuracy results show that our model can generalize well to monocular images captured in unknown indoor scenes.
Abstract:The intelligent interpretation of buildings plays a significant role in urban planning and management, macroeconomic analysis, population dynamics, etc. Remote sensing image building interpretation primarily encompasses building extraction and change detection. However, current methodologies often treat these two tasks as separate entities, thereby failing to leverage shared knowledge. Moreover, the complexity and diversity of remote sensing image scenes pose additional challenges, as most algorithms are designed to model individual small datasets, thus lacking cross-scene generalization. In this paper, we propose a comprehensive remote sensing image building understanding model, termed RSBuilding, developed from the perspective of the foundation model. RSBuilding is designed to enhance cross-scene generalization and task universality. Specifically, we extract image features based on the prior knowledge of the foundation model and devise a multi-level feature sampler to augment scale information. To unify task representation and integrate image spatiotemporal clues, we introduce a cross-attention decoder with task prompts. Addressing the current shortage of datasets that incorporate annotations for both tasks, we have developed a federated training strategy to facilitate smooth model convergence even when supervision for some tasks is missing, thereby bolstering the complementarity of different tasks. Our model was trained on a dataset comprising up to 245,000 images and validated on multiple building extraction and change detection datasets. The experimental results substantiate that RSBuilding can concurrently handle two structurally distinct tasks and exhibits robust zero-shot generalization capabilities.
Abstract:Existing Quantization-Aware Training (QAT) methods intensively depend on the complete labeled dataset or knowledge distillation to guarantee the performances toward Full Precision (FP) accuracies. However, empirical results show that QAT still has inferior results compared to its FP counterpart. One question is how to push QAT toward or even surpass FP performances. In this paper, we address this issue from a new perspective by injecting the vicinal data distribution information to improve the generalization performances of QAT effectively. We present a simple, novel, yet powerful method introducing an Consistency Regularization (CR) for QAT. Concretely, CR assumes that augmented samples should be consistent in the latent feature space. Our method generalizes well to different network architectures and various QAT methods. Extensive experiments demonstrate that our approach significantly outperforms the current state-of-the-art QAT methods and even FP counterparts.
Abstract:Semi-supervised learning (SSL), thanks to the significant reduction of data annotation costs, has been an active research topic for large-scale 3D scene understanding. However, the existing SSL-based methods suffer from severe training bias, mainly due to class imbalance and long-tail distributions of the point cloud data. As a result, they lead to a biased prediction for the tail class segmentation. In this paper, we introduce a new decoupling optimization framework, which disentangles feature representation learning and classifier in an alternative optimization manner to shift the bias decision boundary effectively. In particular, we first employ two-round pseudo-label generation to select unlabeled points across head-to-tail classes. We further introduce multi-class imbalanced focus loss to adaptively pay more attention to feature learning across head-to-tail classes. We fix the backbone parameters after feature learning and retrain the classifier using ground-truth points to update its parameters. Extensive experiments demonstrate the effectiveness of our method outperforming previous state-of-the-art methods on both indoor and outdoor 3D point cloud datasets (i.e., S3DIS, ScanNet-V2, Semantic3D, and SemanticKITTI) using 1% and 1pt evaluation.
Abstract:Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing.However, most existing text-to-image editing methods encounter two obstacles: First, the text prompt needs to be carefully crafted to achieve good results, which is not intuitive or user-friendly. Second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., ``make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model. We further develop an edge smoother based on FFT for seamless blending between the layer and the image.Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that our ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods.
Abstract:Federated learning (FL) is a machine learning paradigm in which distributed local nodes collaboratively train a central model without sharing individually held private data. Existing FL methods either iteratively share local model parameters or deploy co-distillation. However, the former is highly susceptible to private data leakage, and the latter design relies on the prerequisites of task-relevant real data. Instead, we propose a data-free FL framework based on local-to-central collaborative distillation with direct input and output space exploitation. Our design eliminates any requirement of recursive local parameter exchange or auxiliary task-relevant data to transfer knowledge, thereby giving direct privacy control to local users. In particular, to cope with the inherent data heterogeneity across locals, our technique learns to distill input on which each local model produces consensual yet unique results to represent each expertise. Our proposed FL framework achieves notable privacy-utility trade-offs with extensive experiments on image classification and segmentation tasks under various real-world heterogeneous federated learning settings on both natural and medical images.
Abstract:Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings.