Hefeng Wu

Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search

Nov 15, 2023

Hefeng Wu, Weifeng Chen, Zhibin Liu, Tianshui Chen, Zhiguang Chen, Liang Lin

Abstract: Given a descriptive text query, text-based person search (TBPS) aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data. To better align the two modalities, most existing works focus on introducing sophisticated network structures and auxiliary tasks, which are complex and hard to implement. In this paper, we propose a simple yet effective dual Transformer model for text-based person search. By exploiting a hardness-aware contrastive learning strategy, our model achieves state-of-the-art performance without any special design for local feature alignment or side information. Moreover, we propose a proximity data generation (PDG) module to automatically produce more diverse data for cross-modal training. The PDG module first introduces an automatic generation algorithm based on a text-to-image diffusion model, which generates new text-image pair samples in the proximity space of the original ones. It then combines approximate text generation and feature-level mixup during training to further strengthen the data diversity. The PDG module can largely guarantee the reasonableness of the generated samples, which are used directly for training without any human inspection for noise rejection. It improves the performance of our model significantly, providing a feasible solution to the data insufficiency problem faced by such fine-grained visual-linguistic tasks. Extensive experiments on two popular datasets of the TBPS task (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach clearly outperforms state-of-the-art approaches, e.g., improving by 3.88%, 4.02%, 2.92% in terms of Top1, Top5, Top10 on CUHK-PEDES. The code will be available at https://github.com/HCPLab-SYSU/PersonSearch-CTLG

* Accepted by IEEE T-CSVT
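
To make the contrastive learning idea in the abstract above more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over matched text-image feature pairs. It is an illustration only, not the authors' implementation: the function name `info_nce`, the temperature value of 0.07, and the plain-list feature format are all assumptions, and the paper's specific hardness-aware formulation is not reproduced here. In this family of losses, a smaller temperature concentrates the gradient on the hardest negatives.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(text_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss; index i in both lists is a matched pair.

    Hypothetical helper for illustration; temperature is an assumed value.
    """
    t = [normalize(v) for v in text_feats]
    im = [normalize(v) for v in image_feats]
    n = len(t)
    # Cosine similarities scaled by the temperature.
    logits = [[dot(t[i], im[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    # Text-to-image uses the rows, image-to-text uses the columns.
    cols = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))
```

Calling `info_nce` on features whose matched pairs point in the same direction gives a loss near zero, while swapping the image order so that every pair is mismatched gives a large loss.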

SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction

Oct 10, 2023

Fei Wang, Kongzhang Tang, Hefeng Wu, Baoquan Zhao, Hao Cai, Teng Zhou

Abstract: Reconstructing 3D human shapes from 2D images has received increasing attention recently due to its fundamental support for many high-level 3D applications. Compared with natural images, freehand sketches are much more flexible for depicting various shapes, offering a promising and valuable route to 3D human reconstruction. However, such a task is highly challenging. The sparse, abstract characteristics of sketches add severe difficulties, such as arbitrariness, inaccuracy, and a lack of image details, to the already badly ill-posed problem of 2D-to-3D reconstruction. Although current methods have achieved great success in reconstructing 3D human bodies from a single-view image, they do not work well on freehand sketches. In this paper, we propose a novel sketch-driven multi-faceted decoder network termed SketchBodyNet to address this task. Specifically, the network consists of a backbone and three separate attention decoder branches, where a multi-head self-attention module is exploited in each decoder to obtain enhanced features, followed by a multi-layer perceptron. The multi-faceted decoders aim to predict the camera, shape, and pose parameters, respectively, which are then associated with the SMPL model to reconstruct the corresponding 3D human mesh. In learning, existing 3D meshes are projected via the camera parameters into 2D synthetic sketches with joints, which are combined with the freehand sketches to optimize the model. To verify our method, we collect a large-scale dataset of about 26k freehand sketches and their corresponding 3D meshes containing various poses of human bodies from 14 different angles. Extensive experimental results demonstrate that SketchBodyNet achieves superior performance in reconstructing 3D human meshes from freehand sketches.

* 9 pages, to appear in Pacific Graphics 2023

Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation

Sep 23, 2023

Tao Pu, Tianshui Chen, Hefeng Wu, Yongyi Lu, Liang Lin

Abstract: Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered across the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between visual representation and the knowledge to generate spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving the mR@50 by 8.1%, 4.7%, and 2.1% on different settings over current algorithms.

* Technical Report
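
The abstract above says that spatial co-occurrence and temporal transition correlations are learned "in a statistical manner". One simple reading of that, sketched here under stated assumptions, is frequency counting over annotated relationship triplets. The function names `cooccurrence_prior` and `transition_prior`, the triplet format, and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def cooccurrence_prior(triplets):
    """Estimate P(predicate | subject, object) by counting.

    triplets: iterable of (subject, predicate, object) annotations
    (an assumed input format).
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, c in counts.items():
        total = sum(c.values())
        prior[pair] = {p: k / total for p, k in c.items()}
    return prior


def transition_prior(predicate_sequences):
    """Estimate P(next predicate | current predicate) across frames.

    predicate_sequences: one list of per-frame predicates for each
    tracked subject-object pair (an assumed input format).
    """
    counts = defaultdict(Counter)
    for seq in predicate_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    prior = {}
    for prev, c in counts.items():
        total = sum(c.values())
        prior[prev] = {n: k / total for n, k in c.items()}
    return prior
```

Tables like these could then be supplied to a model as prior knowledge; how STKET actually embeds them into cross-attention is described in the paper itself, not in this sketch.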

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

May 23, 2023

Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin

Abstract: This paper presents a controllable text-to-video (T2V) diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model by incorporating a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy is proposed to help the model generate videos transferred from the image domain as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy to introduce a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet can achieve resource-efficient convergence and generate consistent, high-quality videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality. Project Page: https://controlavideo.github.io/

Multi-object Video Generation from Single Frame Layouts

May 06, 2023

Yang Wu, Zhibin Liu, Hefeng Wu, Liang Lin

Abstract: In this paper, we study video synthesis with an emphasis on simplifying the generation conditions. Most existing video synthesis models or datasets are designed to address complex motions of a single object and lack the ability to comprehensively understand the spatio-temporal relationships among multiple objects. Besides, current methods are usually conditioned on intricate annotations (e.g. video segmentations) to generate new videos, which makes them fundamentally less practical. These limitations motivate us to generate multi-object videos conditioned exclusively on object layouts from a single frame. To solve the above challenges, and inspired by recent research on image generation from layouts, we propose a novel video generative framework capable of synthesizing global scenes with local objects, via implicit neural representations and layout motion self-inference. Our framework is a non-trivial adaptation from image generation methods, and is new to this field. In addition, our model has been evaluated on two widely-used video recognition benchmarks, demonstrating its effectiveness compared to the baseline model.

* 6 pages limite

Category-Adaptive Label Discovery and Noise Rejection for Multi-label Image Recognition with Partial Positive Labels

Nov 15, 2022

Tao Pu, Qianru Lao, Hefeng Wu, Tianshui Chen, Liang Lin

Abstract: As a promising solution for reducing annotation cost, training multi-label models with partial positive labels (MLR-PPL), in which merely a few positive labels are known while others are missing, attracts increasing attention. Due to the absence of any negative labels, previous works regard unknown labels as negative and adopt traditional MLR algorithms. To reject noisy labels, recent works regard large-loss samples as noise but ignore the semantic correlation among different multi-label images. In this work, we propose to explore semantic correlation among different images to facilitate the MLR-PPL task. Specifically, we design a unified framework, Category-Adaptive Label Discovery and Noise Rejection, that discovers unknown labels and rejects noisy labels for each category in an adaptive manner. The framework consists of two complementary modules: (1) the Category-Adaptive Label Discovery module first measures the semantic similarity between positive samples and then complements unknown labels with high similarities; (2) the Category-Adaptive Noise Rejection module first computes the sample weights based on semantic similarities from different samples and then discards noisy labels with low weights. Besides, we propose a novel category-adaptive threshold updating scheme that adaptively adjusts the threshold, avoiding the time-consuming manual tuning process. Extensive experiments demonstrate that our proposed method consistently outperforms current leading algorithms.

* arXiv admin note: text overlap with arXiv:2205.13092
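
As a rough illustration of the label discovery step described in the abstract above, the sketch below marks an unlabeled sample as positive for a category when its feature is similar enough to a known positive of that category. The function name `discover_labels`, the use of cosine similarity, the best-match rule, and the fixed threshold are assumptions made for this example; the paper's module and its adaptive threshold update are not reproduced here.

```python
import math


def cosine(u, v):
    # Assumes neither vector is the zero vector.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def discover_labels(positive_feats, unknown_feats, threshold):
    """Return indices of unknown samples to complement as positive.

    positive_feats: category-specific features of known positives.
    unknown_feats:  category-specific features with unknown labels.
    threshold:      fixed here; the paper updates it per category.
    """
    found = []
    for idx, feat in enumerate(unknown_feats):
        # Similarity to the closest known positive of this category.
        best = max(cosine(feat, p) for p in positive_feats)
        if best >= threshold:
            found.append(idx)
    return found
```

The same similarity scores could be reused in the opposite direction for noise rejection, down-weighting or discarding labels whose features are dissimilar to the other positives of the category.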

Semantic-Aware Representation Blending for Multi-Label Image Recognition with Partial Labels

May 26, 2022

Tao Pu, Tianshui Chen, Hefeng Wu, Yongyi Lu, Liang Lin

Abstract: Despite achieving impressive progress, current multi-label image recognition (MLR) algorithms heavily depend on large-scale datasets with complete labels, and collecting such datasets is extremely time-consuming and labor-intensive. Training multi-label image recognition models with partial labels (MLR-PL) is an alternative way to address this issue, in which merely some labels are known while others are unknown for each image (see Figure 1). However, current MLR-PL algorithms mainly rely on pre-trained image classification or similarity models to generate pseudo labels for the unknown labels. Thus, they depend on a certain amount of data annotations and inevitably suffer from obvious performance drops, especially when the known label proportion is low. To address this dilemma, we propose a unified semantic-aware representation blending (SARB) framework that consists of two crucial modules to blend multi-granularity category-specific semantic representation across different images to transfer information of known labels to complement unknown labels. Extensive experiments on the MS-COCO, Visual Genome, and Pascal VOC 2007 datasets show that the proposed SARB consistently outperforms current state-of-the-art algorithms on all known label proportion settings. Concretely, it obtains average mAP improvements of 1.9%, 4.5%, and 1.0% on the three benchmark datasets compared with the second-best algorithm.

* Technical Report. arXiv admin note: substantial text overlap with arXiv:2203.02172

Semantic Representation and Dependency Learning for Multi-Label Image Recognition

Apr 08, 2022

Tao Pu, Lixian Yuan, Hefeng Wu, Tianshui Chen, Ling Tian, Liang Lin

Abstract: Recently, many multi-label image recognition (MLR) works have made significant progress by introducing pre-trained object detection models to generate lots of proposals or utilizing statistical label co-occurrence to enhance the correlation among different categories. However, these works have some limitations: (1) the effectiveness of the network significantly depends on pre-trained object detection models that bring expensive and unaffordable computation; (2) the network performance degrades when there exist occasional co-occurrence objects in images, especially for the rare categories. To address these problems, we propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category and capture semantic dependency among all categories. Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide the model to focus on semantic-aware regions. We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions to regularize the network training. Extensive experiments and comparisons on two popular MLR benchmark datasets (i.e., MS-COCO and Pascal VOC 2007) demonstrate the effectiveness of the proposed framework over current state-of-the-art algorithms.

* 25 pages, 7 figures

Structured Semantic Transfer for Multi-Label Recognition with Partial Labels

Dec 22, 2021

Tianshui Chen, Tao Pu, Hefeng Wu, Yuan Xie, Liang Lin

Abstract: Multi-label image recognition is a fundamental yet practical task because real-world images inherently possess multiple semantic labels. However, it is difficult to collect large-scale multi-label annotations due to the complexity of both the input images and output label spaces. To reduce the annotation cost, we propose a structured semantic transfer (SST) framework that enables training multi-label recognition models with partial labels, i.e., merely some labels are known while other labels are missing (also called unknown labels) per image. The framework consists of two complementary transfer modules that explore within-image and cross-image semantic correlations to transfer knowledge of known labels to generate pseudo labels for unknown labels. Specifically, an intra-image semantic transfer module learns an image-specific label co-occurrence matrix and maps the known labels to complement unknown labels based on this matrix. Meanwhile, a cross-image transfer module learns category-specific feature similarities and helps complement unknown labels with high similarities. Finally, both known and generated labels are used to train the multi-label recognition models. Extensive experiments on the Microsoft COCO, Visual Genome and Pascal VOC datasets show that the proposed SST framework obtains superior performance over current state-of-the-art algorithms. Code is available at https://github.com/HCPLab-SYSU/HCP-MLR-PL.

* Accepted by AAAI'22
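
The intra-image transfer idea in the abstract above, mapping known labels to unknown ones through a label co-occurrence matrix, can be sketched as follows. This is a simplified illustration under stated assumptions: the matrix here is a fixed input, whereas the paper learns an image-specific one, and the function name `complement_labels`, the max-over-known-labels rule, and the 0.5 threshold are hypothetical.

```python
def complement_labels(known, cooc, threshold=0.5):
    """Generate pseudo labels for unknown categories.

    known: list where 1 marks a known positive label and 0 an unknown one.
    cooc:  cooc[i][j] approximates P(label j present | label i present).
    """
    n = len(known)
    out = list(known)
    for j in range(n):
        if known[j] == 1:
            continue  # already a known positive
        # Strongest evidence for label j from any known positive label.
        score = max(
            (cooc[i][j] for i in range(n) if known[i] == 1),
            default=0.0,
        )
        if score >= threshold:
            out[j] = 1  # pseudo positive
    return out
```

For example, with a known positive for the first of three categories and a matrix in which the first and second categories co-occur with probability 0.9, the second category receives a pseudo positive label while the third stays unknown.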
