Camouflaged object detection (COD) approaches heavily rely on pixel-level annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse annotations like scribbles or points to reduce annotation effort, but this can lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompt is not always feasible, as it may not be accessible in real-world application. Additionally, it only provides localization information instead of semantic one, which can intrinsically cause ambiguity in interpreting the targets. In this work, we aim to eliminate the need for manual prompt. The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt. To that end, we introduce a test-time adaptation per-instance mechanism called Generalizable SAM (GenSAM) to automatically enerate and optimize visual prompts the generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to test-time adapt the visual prompts, we further propose Progressive Mask Generation (PMG) to iteratively reweight the input image, guiding the model to focus on the targets in a coarse-to-fine manner. Crucially, all network parameters are fixed, avoiding the need for additional training. Experiments demonstrate the superiority of GenSAM. Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches and achieves comparable results to scribble supervision ones, solely relying on general task descriptions as prompts. our codes is in: https://lwpyh.github.io/GenSAM/.
A vision-language foundation model pretrained on very large-scale image-text paired data has the potential to provide generalizable knowledge representation for downstream visual recognition and detection tasks, especially on supplementing the undersampled categories in downstream model training. Recent studies utilizing CLIP for object detection have shown that a two-stage detector design typically outperforms a one-stage detector, while requiring more expensive training resources and longer inference time. In this work, we propose a one-stage detector GridCLIP that narrows its performance gap to those of two-stage detectors, with approximately 43 and 5 times faster than its two-stage counterpart (ViLD) in the training and test process respectively. GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning by expanding the conventional CLIP image-text holistic mapping to a more fine-grained, grid-text alignment. This differs from the region-text mapping in two-stage detectors that apply CLIP directly by treating regions as images. Specifically, GridCLIP performs Grid-level Alignment to adapt the CLIP image-level representations to grid-level representations by aligning to CLIP category representations to learn the annotated (especially frequent) categories. To learn generalizable visual representations of broader categories, especially undersampled ones, we perform Image-level Alignment during training to propagate broad pre-learned categories in the CLIP image encoder from the image-level to the grid-level representations. Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories, reaching comparable detection performance on the LVIS benchmark.
Most Video Super-Resolution (VSR) methods enhance a video reference frame by aligning its neighboring frames and mining information on these frames. Recently, deformable alignment has drawn extensive attention in VSR community for its remarkable performance, which can adaptively align neighboring frames with the reference one. However, we experimentally find that deformable alignment methods still suffer from fast motion due to locally loss-driven offset prediction and lack explicit motion constraints. Hence, we propose a Matching-based Flow Estimation (MFE) module to conduct global semantic feature matching and estimate optical flow as coarse offset for each location. And a Flow-guided Deformable Module (FDM) is proposed to integrate optical flow into deformable convolution. The FDM uses the optical flow to warp the neighboring frames at first. And then, the warped neighboring frames and the reference one are used to predict a set of fine offsets for each coarse offset. In general, we propose an end-to-end deep network called Flow-guided Deformable Alignment Network (FDAN), which reaches the state-of-the-art performance on two benchmark datasets while is still competitive in computation and memory consumption.