Abstract:Estimating 3D hand mesh from RGB images is a longstanding track, in which occlusion is one of the most challenging problems. Existing attempts towards this task often fail when the occlusion dominates the image space. In this paper, we propose SiMA-Hand, aiming to boost the mesh reconstruction performance by Single-to-Multi-view Adaptation. First, we design a multi-view hand reconstructor to fuse information across multiple views by holistically adopting feature fusion at image, joint, and vertex levels. Then, we introduce a single-view hand reconstructor equipped with SiMA. Though taking only one view as input at inference, the shape and orientation features in the single-view reconstructor can be enriched by learning non-occluded knowledge from the extra views at training, enhancing the reconstruction precision on the occluded regions. We conduct experiments on the Dex-YCB and HanCo benchmarks with challenging object- and self-caused occlusion cases, manifesting that SiMA-Hand consistently achieves superior performance over the state of the arts. Code will be released on https://github.com/JoyboyWang/SiMA-Hand Pytorch.
Abstract:Sign language recognition (SLR) plays a vital role in facilitating communication for the hearing-impaired community. SLR is a weakly supervised task where entire videos are annotated with glosses, making it challenging to identify the corresponding gloss within a video segment. Recent studies indicate that the main bottleneck in SLR is the insufficient training caused by the limited availability of large-scale datasets. To address this challenge, we present SignVTCL, a multi-modal continuous sign language recognition framework enhanced by visual-textual contrastive learning, which leverages the full potential of multi-modal data and the generalization ability of language model. SignVTCL integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone, thereby yielding more robust visual representations. Furthermore, SignVTCL contains a visual-textual alignment approach incorporating gloss-level and sentence-level alignment to ensure precise correspondence between visual features and glosses at the level of individual glosses and sentence. Experimental results conducted on three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL achieves state-of-the-art results compared with previous methods.
Abstract:Recent studies have made remarkable progress in histopathology classification. Based on current successes, contemporary works proposed to further upgrade the model towards a more generalizable and robust direction through incrementally learning from the sequentially delivered domains. Unlike previous parameter isolation based approaches that usually demand massive computation resources during model updating, we present a memory-efficient prompt tuning framework to cultivate model generalization potential in economical memory cost. For each incoming domain, we reuse the existing parameters of the initial classification model and attach lightweight trainable prompts into it for customized tuning. Considering the domain heterogeneity, we perform decoupled prompt tuning, where we adopt a domain-specific prompt for each domain to independently investigate its distinctive characteristics, and one domain-invariant prompt shared across all domains to continually explore the common content embedding throughout time. All domain-specific prompts will be appended to the prompt bank and isolated from further changes to prevent forgetting the distinctive features of early-seen domains. While the domain-invariant prompt will be passed on and iteratively evolve by style-augmented prompt refining to improve model generalization capability over time. In specific, we construct a graph with existing prompts and build a style-augmented graph attention network to guide the domain-invariant prompt exploring the overlapped latent embedding among all delivered domains for more domain generic representations. We have extensively evaluated our framework with two histopathology tasks, i.e., breast cancer metastasis classification and epithelium-stroma tissue classification, where our approach yielded superior performance and memory efficiency over the competing methods.
Abstract:Brain tumor represents one of the most fatal cancers around the world, and is very common in children and the elderly. Accurate identification of the type and grade of tumor in the early stages plays an important role in choosing a precise treatment plan. The Magnetic Resonance Imaging (MRI) protocols of different sequences provide clinicians with important contradictory information to identify tumor regions. However, manual assessment is time-consuming and error-prone due to big amount of data and the diversity of brain tumor types. Hence, there is an unmet need for MRI automated brain tumor diagnosis. We observe that the predictive capability of uni-modality models is limited and their performance varies widely across modalities, and the commonly used modality fusion methods would introduce potential noise, which results in significant performance degradation. To overcome these challenges, we propose a novel cross-modality guidance-aided multi-modal learning with dual attention for addressing the task of MRI brain tumor grading. To balance the tradeoff between model efficiency and efficacy, we employ ResNet Mix Convolution as the backbone network for feature extraction. Besides, dual attention is applied to capture the semantic interdependencies in spatial and slice dimensions respectively. To facilitate information interaction among modalities, we design a cross-modality guidance-aided module where the primary modality guides the other secondary modalities during the process of training, which can effectively leverage the complementary information of different MRI modalities and meanwhile alleviate the impact of the possible noise.




Abstract:In large-scale storehouses, precise instance masks are crucial for robotic bin picking but are challenging to obtain. Existing instance segmentation methods typically rely on a tedious process of scene collection, mask annotation, and network fine-tuning for every single Stock Keeping Unit (SKU). This paper presents SKU-Patch, a new patch-guided instance segmentation solution, leveraging only a few image patches for each incoming new SKU to predict accurate and robust masks, without tedious manual effort and model re-training. Technical-wise, we design a novel transformer-based network with (i) a patch-image correlation encoder to capture multi-level image features calibrated by patch information and (ii) a patch-aware transformer decoder with parallel task heads to generate instance masks. Extensive experiments on four storehouse benchmarks manifest that SKU-Patch is able to achieve the best performance over the state-of-the-art methods. Also, SKU-Patch yields an average of nearly 100% grasping success rate on more than 50 unseen SKUs in a robot-aided auto-store logistic pipeline, showing its effectiveness and practicality.




Abstract:The Segment Anything Model (SAM), a foundation model for general image segmentation, has demonstrated impressive zero-shot performance across numerous natural image segmentation tasks. However, SAM's performance significantly declines when applied to medical images, primarily due to the substantial disparity between natural and medical image domains. To effectively adapt SAM to medical images, it is important to incorporate critical third-dimensional information, i.e., volumetric or temporal knowledge, during fine-tuning. Simultaneously, we aim to harness SAM's pre-trained weights within its original 2D backbone to the fullest extent. In this paper, we introduce a modality-agnostic SAM adaptation framework, named as MA-SAM, that is applicable to various volumetric and video medical data. Our method roots in the parameter-efficient fine-tuning strategy to update only a small portion of weight increments while preserving the majority of SAM's pre-trained weights. By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from input data. The effectiveness of our method has been comprehensively evaluated on four medical image segmentation tasks, by using 10 public datasets across CT, MRI, and surgical video data. Remarkably, without using any prompt, our method consistently outperforms various state-of-the-art 3D approaches, surpassing nnU-Net by 0.9%, 2.6%, and 9.9% in Dice for CT multi-organ segmentation, MRI prostate segmentation, and surgical scene segmentation respectively. Our model also demonstrates strong generalization, and excels in challenging tumor segmentation when prompts are used. Our code is available at: https://github.com/cchen-cc/MA-SAM.




Abstract:We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and multi-modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) following 3D multi-modal instructions. By parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, which requires no 3D instruction data, but exhibits superior 3D and multi-modal question-answering capacity. We hope our work may cast a light on the community for extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
Abstract:Large-scale well-annotated datasets are of great importance for training an effective object detector. However, obtaining accurate bounding box annotations is laborious and demanding. Unfortunately, the resultant noisy bounding boxes could cause corrupt supervision signals and thus diminish detection performance. Motivated by the observation that the real ground-truth is usually situated in the aggregation region of the proposals assigned to a noisy ground-truth, we propose DIStribution-aware CalibratiOn (DISCO) to model the spatial distribution of proposals for calibrating supervision signals. In DISCO, spatial distribution modeling is performed to statistically extract the potential locations of objects. Based on the modeled distribution, three distribution-aware techniques, i.e., distribution-aware proposal augmentation (DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware confidence estimation (DA-Est), are developed to improve classification, localization, and interpretability, respectively. Extensive experiments on large-scale noisy image datasets (i.e., Pascal VOC and MS-COCO) demonstrate that DISCO can achieve state-of-the-art detection performance, especially at high noise levels.




Abstract:High-accuracy Dichotomous Image Segmentation (DIS) aims to pinpoint category-agnostic foreground objects from natural scenes. The main challenge for DIS involves identifying the highly accurate dominant area while rendering detailed object structure. However, directly using a general encoder-decoder architecture may result in an oversupply of high-level features and neglect the shallow spatial information necessary for partitioning meticulous structures. To fill this gap, we introduce a novel Unite-Divide-Unite Network (UDUN} that restructures and bipartitely arranges complementary features to simultaneously boost the effectiveness of trunk and structure identification. The proposed UDUN proceeds from several strengths. First, a dual-size input feeds into the shared backbone to produce more holistic and detailed features while keeping the model lightweight. Second, a simple Divide-and-Conquer Module (DCM) is proposed to decouple multiscale low- and high-level features into our structure decoder and trunk decoder to obtain structure and trunk information respectively. Moreover, we design a Trunk-Structure Aggregation module (TSA) in our union decoder that performs cascade integration for uniform high-accuracy segmentation. As a result, UDUN performs favorably against state-of-the-art competitors in all six evaluation metrics on overall DIS-TE, i.e., achieving 0.772 weighted F-measure and 977 HCE. Using 1024*1024 input, our model enables real-time inference at 65.3 fps with ResNet-18.




Abstract:We propose a novel approach for RGB-D salient instance segmentation using a dual-branch cross-modal feature calibration architecture called CalibNet. Our method simultaneously calibrates depth and RGB features in the kernel and mask branches to generate instance-aware kernels and mask features. CalibNet consists of three simple modules, a dynamic interactive kernel (DIK) and a weight-sharing fusion (WSF), which work together to generate effective instance-aware kernels and integrate cross-modal features. To improve the quality of depth features, we incorporate a depth similarity assessment (DSA) module prior to DIK and WSF. In addition, we further contribute a new DSIS dataset, which contains 1,940 images with elaborate instance-level annotations. Extensive experiments on three challenging benchmarks show that CalibNet yields a promising result, i.e., 58.0% AP with 320*480 input size on the COME15K-N test set, which significantly surpasses the alternative frameworks. Our code and dataset are available at: https://github.com/PJLallen/CalibNet.