Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jie Qin

Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning

Dec 12, 2023

Jinsong Shi, Pan Gao, Jie Qin

Abstract:Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic distortion datasets. To further improve NR-IQA performance, in this paper, a novel supervised contrastive learning (SCL) and Transformer-based NR-IQA model SaTQA is proposed. We first train a model on a large-scale synthetic dataset by SCL (no image subjective score is required) to extract degradation features of images with various distortion types and levels. To further extract distortion information from images, we propose a backbone network incorporating the Multi-Stream Block (MSB) by combining the CNN inductive bias and Transformer long-term dependence modeling capability. Finally, we propose the Patch Attention Block (PAB) to obtain the final distorted image quality score by fusing the degradation features learned from contrastive learning with the perceptual distortion information extracted by the backbone network. Experimental results on seven standard IQA datasets show that SaTQA outperforms the state-of-the-art methods for both synthetic and authentic datasets. Code is available at https://github.com/I2-Multimedia-Lab/SaTQA

* Accepted by AAAI24

Via

Access Paper or Ask Questions

Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation

Dec 11, 2023

Xiaoyi Bao, Jie Qin, Siyang Sun, Yun Zheng, Xingang Wang

Figure 1 for Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation

Figure 2 for Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation

Figure 3 for Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation

Figure 4 for Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation

Abstract:For few-shot semantic segmentation, the primary task is to extract class-specific intrinsic information from limited labeled data. However, the semantic ambiguity and inter-class similarity of previous methods limit the accuracy of pixel-level foreground-background classification. To alleviate these issues, we propose the Relevant Intrinsic Feature Enhancement Network (RiFeNet). To improve the semantic consistency of foreground instances, we propose an unlabeled branch as an efficient data utilization method, which teaches the model how to extract intrinsic features robust to intra-class differences. Notably, during testing, the proposed unlabeled branch is excluded without extra unlabeled data and computation. Furthermore, we extend the inter-class variability between foreground and background by proposing a novel multi-level prototype generation and interaction module. The different-grained complementarity between global and local prototypes allows for better distinction between similar categories. The qualitative and quantitative performance of RiFeNet surpasses the state-of-the-art methods on PASCAL-5i and COCO benchmarks.

* Accepted in AAAI 2024

Via

Access Paper or Ask Questions

Generalizable Person Search on Open-world User-Generated Video Content

Oct 16, 2023

Junjie Li, Guanshuo Wang, Yichao Yan, Fufu Yu, Qiong Jia, Jie Qin, Shouhong Ding, Xiaokang Yang

Figure 1 for Generalizable Person Search on Open-world User-Generated Video Content

Figure 2 for Generalizable Person Search on Open-world User-Generated Video Content

Figure 3 for Generalizable Person Search on Open-world User-Generated Video Content

Figure 4 for Generalizable Person Search on Open-world User-Generated Video Content

Abstract:Person search is a challenging task that involves detecting and retrieving individuals from a large set of un-cropped scene images. Existing person search applications are mostly trained and deployed in the same-origin scenarios. However, collecting and annotating training samples for each scene is often difficult due to the limitation of resources and the labor cost. Moreover, large-scale intra-domain data for training are generally not legally available for common developers, due to the regulation of privacy and public security. Leveraging easily accessible large-scale User Generated Video Contents (\emph{i.e.} UGC videos) to train person search models can fit the open-world distribution, but still suffering a performance gap from the domain difference to surveillance scenes. In this work, we explore enhancing the out-of-domain generalization capabilities of person search models, and propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios. Specifically, we focus on learning domain-invariant representations for both detection and ReID by introducing a multi-task prototype-based domain-specific batch normalization, and a channel-wise ID-relevant feature decorrelation strategy. We also identify and address typical sources of noise in open-world training frames, including inaccurate bounding boxes, the omission of identity labels, and the absence of cross-camera data. Our framework achieves promising performance on two challenging person search benchmarks without using any human annotation or samples from the target domain.

Via

Access Paper or Ask Questions

SIDE: Self-supervised Intermediate Domain Exploration for Source-free Domain Adaptation

Oct 13, 2023

Jiamei Liu, Han Sun, Yizhen Jia, Jie Qin, Huiyu Zhou, Ningzhong Liu

Figure 1 for SIDE: Self-supervised Intermediate Domain Exploration for Source-free Domain Adaptation

Figure 2 for SIDE: Self-supervised Intermediate Domain Exploration for Source-free Domain Adaptation

Figure 3 for SIDE: Self-supervised Intermediate Domain Exploration for Source-free Domain Adaptation

Figure 4 for SIDE: Self-supervised Intermediate Domain Exploration for Source-free Domain Adaptation

Abstract:Domain adaptation aims to alleviate the domain shift when transferring the knowledge learned from the source domain to the target domain. Due to privacy issues, source-free domain adaptation (SFDA), where source data is unavailable during adaptation, has recently become very demanding yet challenging. Existing SFDA methods focus on either self-supervised learning of target samples or reconstruction of virtual source data. The former overlooks the transferable knowledge in the source model, whilst the latter introduces even more uncertainty. To address the above issues, this paper proposes self-supervised intermediate domain exploration (SIDE) that effectively bridges the domain gap with an intermediate domain, where samples are cyclically filtered out in a self-supervised fashion. First, we propose cycle intermediate domain filtering (CIDF) to cyclically select intermediate samples with similar distributions over source and target domains. Second, with the aid of those intermediate samples, an inter-domain gap transition (IDGT) module is developed to mitigate possible distribution mismatches between the source and target data. Finally, we introduce cross-view consistency learning (CVCL) to maintain the intrinsic class discriminability whilst adapting the model to the target domain. Extensive experiments on three popular benchmarks, i.e. Office-31, Office-Home and VisDA-C, show that our proposed SIDE achieves competitive performance against state-of-the-art methods.

* code at https://github.com/se111/SIDE

Via

Access Paper or Ask Questions

DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

Sep 07, 2023

Manlin Zhang, Jie Wu, Yuxi Ren, Ming Li, Jie Qin, Xuefeng Xiao, Wei Liu, Rui Wang, Min Zheng, Andy J. Ma

Figure 1 for DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

Figure 2 for DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

Figure 3 for DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

Figure 4 for DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

Abstract:Data is the cornerstone of deep learning. This paper reveals that the recently developed Diffusion Model is a scalable data engine for object detection. Existing methods for scaling up detection-oriented data often require manual collection or generative models to obtain target images, followed by data augmentation and labeling to produce training pairs, which are costly, complex, or lacking diversity. To address these issues, we presentDiffusionEngine (DE), a data scaling-up engine that provides high-quality detection-oriented training pairs in a single stage. DE consists of a pre-trained diffusion model and an effective Detection-Adapter, contributing to generating scalable, diverse and generalizable detection data in a plug-and-play manner. Detection-Adapter is learned to align the implicit semantic and location knowledge in off-the-shelf diffusion models with detection-aware signals to make better bounding-box predictions. Additionally, we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing detection benchmarks for facilitating follow-up research. Extensive experiments demonstrate that data scaling-up via DE can achieve significant improvements in diverse scenarios, such as various detection algorithms, self-supervised pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised learning. For example, when using DE with a DINO-based adapter to scale up data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.

* Code and Models are publicly available. Project Page: https://mettyz.github.io/DiffusionEngine

Via

Access Paper or Ask Questions

Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

Aug 20, 2023

Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, Di Huang

Figure 1 for Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

Figure 2 for Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

Figure 3 for Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

Figure 4 for Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

Abstract:Anomaly detection (AD), aiming to find samples that deviate from the training distribution, is essential in safety-critical applications. Though recent self-supervised learning based attempts achieve promising results by creating virtual outliers, their training objectives are less faithful to AD which requires a concentrated inlier distribution as well as a dispersive outlier distribution. In this paper, we propose Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation (UniCon-HA), taking into account both the requirements above. Specifically, we explicitly encourage the concentration of inliers and the dispersion of virtual outliers via supervised and unsupervised contrastive losses, respectively. Considering that standard contrastive data augmentation for generating positive views may induce outliers, we additionally introduce a soft mechanism to re-weight each augmented inlier according to its deviation from the inlier distribution, to ensure a purified concentration. Moreover, to prompt a higher concentration, inspired by curriculum learning, we adopt an easy-to-hard hierarchical augmentation strategy and perform contrastive aggregation at different depths of the network based on the strengths of data augmentation. Our method is evaluated under three AD settings including unlabeled one-class, unlabeled multi-class, and labeled multi-class, demonstrating its consistent superiority over other competitors.

* Accepted by ICCV'2023

Via

Access Paper or Ask Questions

Video Frame Interpolation with Flow Transformer

Jul 30, 2023

Pan Gao, Haoyue Tian, Jie Qin

Figure 1 for Video Frame Interpolation with Flow Transformer

Figure 2 for Video Frame Interpolation with Flow Transformer

Figure 3 for Video Frame Interpolation with Flow Transformer

Figure 4 for Video Frame Interpolation with Flow Transformer

Abstract:Video frame interpolation has been actively studied with the development of convolutional neural networks. However, due to the intrinsic limitations of kernel weight sharing in convolution, the interpolated frame generated by it may lose details. In contrast, the attention mechanism in Transformer can better distinguish the contribution of each pixel, and it can also capture long-range pixel dependencies, which provides great potential for video interpolation. Nevertheless, the original Transformer is commonly used for 2D images; how to develop a Transformer-based framework with consideration of temporal self-attention for video frame interpolation remains an open issue. In this paper, we propose Video Frame Interpolation Flow Transformer to incorporate motion dynamics from optical flows into the self-attention mechanism. Specifically, we design a Flow Transformer Block that calculates the temporal self-attention in a matched local area with the guidance of flow, making our framework suitable for interpolating frames with large motion while maintaining reasonably low complexity. In addition, we construct a multi-scale architecture to account for multi-scale motion, further improving the overall performance. Extensive experiments on three benchmarks demonstrate that the proposed method can generate interpolated frames with better visual quality than state-of-the-art methods.

* Accepted to ACM MM23

Via

Access Paper or Ask Questions

AlignDet: Aligning Pre-training and Fine-tuning in Object Detection

Jul 20, 2023

Ming Li, Jie Wu, Xionghui Wang, Chen Chen, Jie Qin, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan

Abstract:The paradigm of large-scale pre-training followed by downstream fine-tuning has been widely employed in various object detection algorithms. In this paper, we reveal discrepancies in data, model, and task between the pre-training and fine-tuning procedure in existing practices, which implicitly limit the detector's performance, generalization ability, and convergence speed. To this end, we propose AlignDet, a unified pre-training framework that can be adapted to various existing detectors to alleviate the discrepancies. AlignDet decouples the pre-training process into two stages, i.e., image-domain and box-domain pre-training. The image-domain pre-training optimizes the detection backbone to capture holistic visual abstraction, and box-domain pre-training learns instance-level semantics and task-aware concepts to initialize the parts out of the backbone. By incorporating the self-supervised pre-trained backbones, we can pre-train all modules for various detectors in an unsupervised paradigm. As depicted in Figure 1, extensive experiments demonstrate that AlignDet can achieve significant improvements across diverse protocols, such as detection algorithm, model backbone, data setting, and training schedule. For example, AlignDet improves FCOS by 5.3 mAP, RetinaNet by 2.1 mAP, Faster R-CNN by 3.3 mAP, and DETR by 2.3 mAP under fewer epochs.

* Accepted by ICCV 2023. Code and Models are publicly available. Project Page: https://liming-ai.github.io/AlignDet

Via

Access Paper or Ask Questions

FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Mar 30, 2023

Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Ren Yuxi, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan(+1 more)

Figure 1 for FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Figure 2 for FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Figure 3 for FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Figure 4 for FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Abstract:Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories of text-based descriptions, which popularizes the segmentation system to more general-purpose application scenarios. However, existing methods devote to designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation between various segmentation tasks, thus hindering the uniformity of segmentation models. Hence in this paper, we propose FreeSeg, a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and employs the same architecture and parameters to handle diverse segmentation tasks seamlessly in the inference procedure. Additionally, adaptive prompt learning facilitates the unified model to capture task-aware and category-sensitive concepts, improving model robustness in multi-task and varied scenarios. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, which outperforms the best task-specific architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, 20.1% PQ on panoptic segmentation for the unseen class on COCO.

* Accepted by CVPR 2023; camera-ready version

Via

Access Paper or Ask Questions

Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

Mar 26, 2023

Zhou Huang, Hang Dai, Tian-Zhu Xiang, Shuo Wang, Huai-Xin Chen, Jie Qin, Huan Xiong

Figure 1 for Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

Figure 2 for Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

Figure 3 for Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

Figure 4 for Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers

Abstract:Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders, which are not conducive to camouflaged object detection that explores subtle cues from indistinguishable backgrounds. To address these issues, in this paper, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features through progressive shrinking for camouflaged object detection. Specifically, we propose a nonlocal token enhancement module (NL-TEM) that employs the non-local mechanism to interact neighboring tokens and explore graph-based high-order relations within tokens to enhance local representations of transformers. Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM), which progressively aggregates adjacent transformer features through a layer-bylayer shrinkage pyramid to accumulate imperceptible but effective cues as much as possible for object information decoding. Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly outperforms the existing 24 competitors on three challenging COD benchmark datasets under six widely-used evaluation metrics. Our code is publicly available at https://github.com/ZhouHuang23/FSPNet.

* CVPR 2023. Project webpage at: https://tzxiang.github.io/project/COD-FSPNet/index.html

Via

Access Paper or Ask Questions