Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Youwei Pang

ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection

Oct 31, 2023

Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, Huchuan Lu

Abstract:Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network which mimics human behavior when observing vague images and videos, \textit{i.e.}, zooming in and out. Specifically, our approach employs the zooming strategy to learn discriminative mixed-scale semantics by the multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns. The latter's routing mechanism can effectively propagate inter-frame difference in spatiotemporal scenarios and adaptively ignore static representations. They provides a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks. The code will be available at \url{https://github.com/lartpang/ZoomNeXt}.

* Extensions to the conference version: arXiv:2203.02688

Via

Access Paper or Ask Questions

ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

Jul 23, 2023

Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu

Figure 1 for ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

Figure 2 for ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

Figure 3 for ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

Figure 4 for ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

Abstract:Deep learning (DL) has advanced the field of dense prediction, while gradually dissolving the inherent barriers between different tasks. However, most existing works focus on designing architectures and constructing visual cues only for the specific task, which ignores the potential uniformity introduced by the DL paradigm. In this paper, we attempt to construct a novel \underline{ComP}lementary \underline{tr}ansformer, \textbf{ComPtr}, for diverse bi-source dense prediction tasks. Specifically, unlike existing methods that over-specialize in a single task or a subset of tasks, ComPtr starts from the more general concept of bi-source dense prediction. Based on the basic dependence on information complementarity, we propose consistency enhancement and difference awareness components with which ComPtr can evacuate and collect important visual semantic cues from different image sources for diverse tasks, respectively. ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer. This task-generic design provides a smooth foundation for constructing the unified model that can simultaneously deal with various bi-source information. In extensive experiments across several representative vision tasks, i.e. remote sensing change detection, RGB-T crowd counting, RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed method consistently obtains favorable performance. The code will be available at \url{https://github.com/lartpang/ComPtr}.

Via

Access Paper or Ask Questions

M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation

Mar 20, 2023

Xiaoqi Zhao, Hongpeng Jia, Youwei Pang, Long Lv, Feng Tian, Lihe Zhang, Weibing Sun, Huchuan Lu

$Figure 1 for M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation$

$Figure 2 for M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation$

$Figure 3 for M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation$

$Figure 4 for M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation$

Abstract:Accurate medical image segmentation is critical for early medical diagnosis. Most existing methods are based on U-shape structure and use element-wise addition or concatenation to fuse different level features progressively in decoder. However, both the two operations easily generate plenty of redundant information, which will weaken the complementarity between different level features, resulting in inaccurate localization and blurred edges of lesions. To address this challenge, we propose a general multi-scale in multi-scale subtraction network (M$^{2}$SNet) to finish diverse segmentation from medical image. Specifically, we first design a basic subtraction unit (SU) to produce the difference features between adjacent levels in encoder. Next, we expand the single-scale SU to the intra-layer multi-scale SU, which can provide the decoder with both pixel-level and structure-level difference information. Then, we pyramidally equip the multi-scale SUs at different levels with varying receptive fields, thereby achieving the inter-layer multi-scale feature aggregation and obtaining rich multi-scale difference information. In addition, we build a training-free network ``LossNet'' to comprehensively supervise the task-aware features from bottom layer to top layer, which drives our multi-scale subtraction network to capture the detailed and structural cues simultaneously. Without bells and whistles, our method performs favorably against most state-of-the-art methods under different evaluation metrics on eleven datasets of four different medical image segmentation tasks of diverse image modalities, including color colonoscopy imaging, ultrasound imaging, computed tomography (CT), and optical coherence tomography (OCT). The source code can be available at \url{https://github.com/Xiaoqi-Zhao-DLUT/MSNet}.

* Submitted to IEEE TMI

Via

Access Paper or Ask Questions

Adaptive Multi-source Predictor for Zero-shot Video Object Segmentation

Mar 18, 2023

Xiaoqi Zhao, Shijie Chang, Youwei Pang, Jiaxing Yang, Lihe Zhang, Huchuan Lu

Abstract:Both static and moving objects usually exist in real-life videos. Most video object segmentation methods only focus on exacting and exploiting motion cues to perceive moving objects. Once faced with static objects frames, moving object predictors may predict failed results caused by uncertain motion information, such as low-quality optical flow maps. Besides, many sources such as RGB, depth, optical flow and static saliency can provide useful information about the objects. However, existing approaches only utilize the RGB or RGB and optical flow. In this paper, we propose a novel adaptive multi-source predictor for zero-shot video object segmentation. In the static object predictor, the RGB source is converted to depth and static saliency sources, simultaneously. In the moving object predictor, we propose the multi-source fusion structure. First, the spatial importance of each source is highlighted with the help of the interoceptive spatial attention module (ISAM). Second, the motion-enhanced module (MEM) is designed to generate pure foreground motion attention for improving both static and moving features used in the decoder. Furthermore, we design a feature purification module (FPM) to filter the inter-source incompatible features. By the ISAM, MEM and FPM, the multi-source features are effectively fused. In addition, we put forward an adaptive predictor fusion network (APF) to evaluate the quality of optical flow and fuse the predictions from the static object predictor and the moving object predictor in order to prevent over-reliance on the failed results caused by low-quality optical flow maps. Experiments show that the proposed model outperforms the state-of-the-art methods on three challenging ZVOS benchmarks. And, the static object predictor can precisely predicts a high-quality depth map and static saliency map at the same time.

* Submitted to IJCV. arXiv admin note: substantial text overlap with arXiv:2108.05076

Via

Access Paper or Ask Questions

Towards Diverse Binary Segmentation via A Simple yet General Gated Network

Mar 18, 2023

Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, Lei Zhang

Abstract:In many binary segmentation tasks, most CNNs-based methods use a U-shape encoder-decoder network as their basic structure. They ignore two key problems when the encoder exchanges information with the decoder: one is the lack of interference control mechanism between them, the other is without considering the disparity of the contributions from different encoder levels. In this work, we propose a simple yet general gated network (GateNet) to tackle them all at once. With the help of multi-level gate units, the valuable context information from the encoder can be selectively transmitted to the decoder. In addition, we design a gated dual branch structure to build the cooperation among the features of different levels and improve the discrimination ability of the network. Furthermore, we introduce a ``Fold'' operation to improve the atrous convolution and form a novel folded atrous convolution, which can be flexibly embedded in ASPP or DenseASPP to accurately localize foreground objects of various scales. GateNet can be easily generalized to many binary segmentation tasks, including general and specific object segmentation and multi-modal segmentation. Without bells and whistles, our network consistently performs favorably against the state-of-the-art methods under 10 metrics on 33 datasets of 10 binary segmentation tasks.

* Submitted to IJCV. arXiv admin note: text overlap with arXiv:2007.08074

Via

Access Paper or Ask Questions

Few-Shot Segmentation via Rich Prototype Generation and Recurrent Prediction Enhancement

Oct 03, 2022

Hongsheng Wang, Xiaoqi Zhao, Youwei Pang, Jinqing Qi

Figure 1 for Few-Shot Segmentation via Rich Prototype Generation and Recurrent Prediction Enhancement

Figure 2 for Few-Shot Segmentation via Rich Prototype Generation and Recurrent Prediction Enhancement

Figure 3 for Few-Shot Segmentation via Rich Prototype Generation and Recurrent Prediction Enhancement

Figure 4 for Few-Shot Segmentation via Rich Prototype Generation and Recurrent Prediction Enhancement

Abstract:Prototype learning and decoder construction are the keys for few-shot segmentation. However, existing methods use only a single prototype generation mode, which can not cope with the intractable problem of objects with various scales. Moreover, the one-way forward propagation adopted by previous methods may cause information dilution from registered features during the decoding process. In this research, we propose a rich prototype generation module (RPGM) and a recurrent prediction enhancement module (RPEM) to reinforce the prototype learning paradigm and build a unified memory-augmented decoder for few-shot segmentation, respectively. Specifically, the RPGM combines superpixel and K-means clustering to generate rich prototype features with complementary scale relationships and adapt the scale gap between support and query images. The RPEM utilizes the recurrent mechanism to design a round-way propagation decoder. In this way, registered features can provide object-aware information continuously. Experiments show that our method consistently outperforms other competitors on two popular benchmarks PASCAL-${{5}^{i}}$ and COCO-${{20}^{i}}$.

* Accepted in PRCV 2022

Via

Access Paper or Ask Questions

Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction

Mar 09, 2022

Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu

Figure 1 for Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction

Figure 2 for Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction

Figure 3 for Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction

Figure 4 for Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction

Abstract:Benefiting from color independence, illumination invariance and location discrimination attributed by the depth map, it can provide important supplemental information for extracting salient objects in complex environments. However, high-quality depth sensors are expensive and can not be widely applied. While general depth sensors produce the noisy and sparse depth information, which brings the depth-based networks with irreversible interference. In this paper, we propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD). Specifically, we unify three complementary tasks: depth estimation, salient object detection and contour estimation. The multi-task mechanism promotes the model to learn the task-aware features from the auxiliary tasks. In this way, the depth information can be completed and purified. Moreover, we introduce a multi-modal filtered transformer (MFT) module, which equips with three modality-specific filters to generate the transformer-enhanced feature for each modality. The proposed model works in a depth-free style during the testing phase. Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time. And, the resulted depth map can help existing RGB-D SOD methods obtain significant performance gain.

* Manuscript Version

Via

Access Paper or Ask Questions

TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection

Dec 04, 2021

Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu

Figure 1 for TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection

Figure 2 for TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection

Figure 3 for TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection

Figure 4 for TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection

Abstract:Most of the existing RGB-D salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink this task from the perspective of global information alignment and transformation. Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path (TIPP). TransCMD treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on the transformer. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a patch-wise token re-embedding strategy (PTRE) with acceptable computational cost. Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods when it is equipped with the TIPP.

* Manuscript Version

Via

Access Paper or Ask Questions

Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Aug 11, 2021

Xiaoqi Zhao, Youwei Pang, Jiaxing Yang, Lihe Zhang, Huchuan Lu

Figure 1 for Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Figure 2 for Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Figure 3 for Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Figure 4 for Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Abstract:Location and appearance are the key cues for video object segmentation. Many sources such as RGB, depth, optical flow and static saliency can provide useful information about the objects. However, existing approaches only utilize the RGB or RGB and optical flow. In this paper, we propose a novel multi-source fusion network for zero-shot video object segmentation. With the help of interoceptive spatial attention module (ISAM), spatial importance of each source is highlighted. Furthermore, we design a feature purification module (FPM) to filter the inter-source incompatible features. By the ISAM and FPM, the multi-source features are effectively fused. In addition, we put forward an automatic predictor selection network (APS) to select the better prediction of either the static saliency predictor or the moving object predictor in order to prevent over-reliance on the failed results caused by low-quality optical flow maps. Extensive experiments on three challenging public benchmarks (i.e. DAVIS$_{16}$, Youtube-Objects and FBMS) show that the proposed model achieves compelling performance against the state-of-the-arts. The source code will be publicly available at \textcolor{red}{\url{https://github.com/Xiaoqi-Zhao-DLUT/Multi-Source-APS-ZVOS}}.

* This work was accepted as ACM MM 2021 oral

Via

Access Paper or Ask Questions

Self-Supervised Representation Learning for RGB-D Salient Object Detection

Jan 29, 2021

Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, Xiang Ruan

Figure 1 for Self-Supervised Representation Learning for RGB-D Salient Object Detection

Figure 2 for Self-Supervised Representation Learning for RGB-D Salient Object Detection

Figure 3 for Self-Supervised Representation Learning for RGB-D Salient Object Detection

Figure 4 for Self-Supervised Representation Learning for RGB-D Salient Object Detection

Abstract:Existing CNNs-Based RGB-D Salient Object Detection (SOD) networks are all required to be pre-trained on the ImageNet to learn the hierarchy features which can help to provide a good initialization. However, the collection and annotation of large-scale datasets are time-consuming and expensive. In this paper, we utilize Self-Supervised Representation Learning (SSL) to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation. Our pretext tasks require only a few and unlabeled RGB-D datasets to perform pre-training, which make the network capture rich semantic contexts as well as reduce the gap between two modalities, thereby providing an effective initialization for the downstream task. In addition, for the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion (MPF) module that splits a single feature fusion into multi-path fusion to achieve an adequate perception of consistent and differential information. The MPF module is general and suitable for both cross-modal and cross-level feature fusion. Extensive experiments on six benchmark RGB-D SOD datasets, our model pre-trained on the RGB-D dataset ($6,335$ without any annotations) can perform favorably against most state-of-the-art RGB-D methods pre-trained on ImageNet ($1,280,000$ with image-level annotations).

Via

Access Paper or Ask Questions