Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zongxin Yang

Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

Jan 05, 2025

Chao Liang, Linchao Zhu, Zongxin Yang, Wei Chen, Yi Yang

Figure 1 for Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

Figure 2 for Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

Figure 3 for Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

Figure 4 for Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

Abstract:We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it reduces the expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used to classify or identify other images. However, in the few clean and many noisy scenarios, the class prototype can be severely biased due to the presence of irrelevant noisy images. The resulting prototypes are less compact and discriminative, as previous methods do not take into account the diverse range of images in the noisy web image collections. On the other hand, the relation modeling between noisy and clean images is not learned for the class prototype generation in an end-to-end manner, which results in a suboptimal class prototype. In this article, we introduce a similarity maximization loss named SimNoiPro. Our SimNoiPro first generates noise-tolerant hybrid prototypes composed of clean and noise-tolerant prototypes and then pulls them closer to each other. Our approach considers the diversity of noisy images by explicit division and overcomes the optimization discrepancy issue. This enables better relation modeling between clean and noisy images and helps extract judicious information from the noisy image set. The evaluation results on two extended few-shot classification benchmarks confirm that our SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.

* Accepted by TOMM 2024

Via

Access Paper or Ask Questions

Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models

Jan 04, 2025

Wenhao Wang, Yifan Sun, Zongxin Yang, Zhentao Tan, Zhengdong Hu, Yi Yang

Figure 1 for Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models

Abstract:Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for spreading misinformation, infringing on copyrights, and evading content tracing. This motivates us to introduce the task of origin IDentification for text-guided Image-to-image Diffusion models (ID$^2$), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to visual discrepancy across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, OriPID, contains abundant Origins and guided Prompts, which can be used to train and test potential IDentification models across various diffusion models. In the method section, we first prove the existence of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder (VAE) embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be generalized across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods ($+31.6\%$ mAP), even those with generalization designs.

Via

Access Paper or Ask Questions

Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation

Dec 11, 2024

Kexin Li, Zongxin Yang, Yi Yang, Jun Xiao

Abstract:Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects that accurately align with the corresponding audio. However, existing methods often face temporal misalignment, where audio cues and segmentation results are not temporally coordinated. Audio provides two critical pieces of information: i) target object-level details and ii) the timing of when objects start and stop producing sounds. Current methods focus more on object-level information but neglect the boundaries of audio semantic changes, leading to temporal misalignment. To address this issue, we propose a Collaborative Hybrid Propagator Framework~(Co-Prop). This framework includes two main steps: Preliminary Audio Boundary Anchoring and Frame-by-Frame Audio-Insert Propagation. To Anchor the audio boundary, we employ retrieval-assist prompts with Qwen large language models to identify control points of audio semantic changes. These control points split the audio into semantically consistent audio portions. After obtaining the control point lists, we propose the Audio Insertion Propagator to process each audio portion using a frame-by-frame audio insertion propagation and matching approach. We curated a compact dataset comprising diverse source conversion cases and devised a metric to assess alignment rates. Compared to traditional simultaneous processing methods, our approach reduces memory requirements and facilitates frame alignment. Experimental results demonstrate the effectiveness of our approach across three datasets and two backbones. Furthermore, our method can be integrated with existing AVVS approaches, offering plug-and-play functionality to enhance their performance.

Via

Access Paper or Ask Questions

3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Oct 16, 2024

Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang

Figure 1 for 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Figure 2 for 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Figure 3 for 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Figure 4 for 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Abstract:The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and attributes. However, unlike image-conditional generation methods such as ControlNet, MIG techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL, primarily due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering. In this paper, we introduce Depth-Driven Decoupled Instance Synthesis (3DIS), a novel framework that decouples the MIG process into two stages: (i) generating a coarse scene depth map for accurate instance positioning and scene composition, and (ii) rendering fine-grained attributes using pre-trained ControlNet on any foundational model, without additional training. Our 3DIS framework integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering. Extensive experiments on COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. Notably, 3DIS offers seamless compatibility with diverse foundational models, providing a robust, adaptable solution for advanced multi-instance generation. The code is available at: https://github.com/limuloo/3DIS.

* 10 pages

Via

Access Paper or Ask Questions

MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

Jul 02, 2024

Dewei Zhou, You Li, Fan Ma, Zongxin Yang, Yi Yang

Figure 1 for MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

Figure 2 for MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

Figure 3 for MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

Figure 4 for MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

Abstract:We introduce the Multi-Instance Generation (MIG) task, which focuses on generating multiple instances within a single image, each accurately placed at predefined positions with attributes such as category, color, and shape, strictly following user specifications. MIG faces three main challenges: avoiding attribute leakage between instances, supporting diverse instance descriptions, and maintaining consistency in iterative generation. To address attribute leakage, we propose the Multi-Instance Generation Controller (MIGC). MIGC generates multiple instances through a divide-and-conquer strategy, breaking down multi-instance shading into single-instance tasks with singular attributes, later integrated. To provide more types of instance descriptions, we developed MIGC++. MIGC++ allows attribute control through text \& images and position control through boxes \& masks. Lastly, we introduced the Consistent-MIG algorithm to enhance the iterative MIG ability of MIGC and MIGC++. This algorithm ensures consistency in unmodified regions during the addition, deletion, or modification of instances, and preserves the identity of instances when their attributes are changed. We introduce the COCO-MIG and Multimodal-MIG benchmarks to evaluate these methods. Extensive experiments on these benchmarks, along with the COCO-Position benchmark and DrawBench, demonstrate that our methods substantially outperform existing techniques, maintaining precise control over aspects including position, attribute, and quantity. Project page: https://github.com/limuloo/MIGC.

Via

Access Paper or Ask Questions

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Feb 08, 2024

Dewei Zhou, You Li, Fan Ma, Zongxin Yang, Yi Yang

Figure 1 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Figure 2 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Figure 3 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Figure 4 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Abstract:We present a Multi-Instance Generation (MIG) task, simultaneously generating multiple instances with diverse controls in one image. Given a set of predefined coordinates and their corresponding descriptions, the task is to ensure that generated instances are accurately at the designated locations and that all instances' attributes adhere to their corresponding description. This broadens the scope of current research on Single-instance generation, elevating it to a more versatile and practical dimension. Inspired by the idea of divide and conquer, we introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. Initially, we break down the MIG task into several subtasks, each involving the shading of a single instance. To ensure precise shading for each instance, we introduce an instance enhancement attention mechanism. Lastly, we aggregate all the shaded instances to provide the necessary information for accurately generating multiple instances in stable diffusion (SD). To evaluate how well generation models perform on the MIG task, we provide a COCO-MIG benchmark along with an evaluation pipeline. Extensive experiments were conducted on the proposed COCO-MIG benchmark, as well as on various commonly used benchmarks. The evaluation results illustrate the exceptional control capabilities of our model in terms of quantity, position, attribute, and interaction.

Via

Access Paper or Ask Questions

Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation

Feb 04, 2024

Kexin Li, Tao Jiang, Zongxin Yang, Yi Yang, Yueting Zhuang, Jun Xiao

Figure 1 for Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation

Figure 2 for Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation

Figure 3 for Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation

Figure 4 for Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation

Abstract:Interactive Video Object Segmentation (iVOS) is a challenging task that requires real-time human-computer interaction. To improve the user experience, it is important to consider the user's input habits, segmentation quality, running time and memory consumption.However, existing methods compromise user experience with single input mode and slow running speed. Specifically, these methods only allow the user to interact with one single frame, which limits the expression of the user's intent.To overcome these limitations and better align with people's usage habits, we propose a framework that can accept multiple frames simultaneously and explore synergistic interaction across frames (SIAF). Concretely, we designed the Across-Frame Interaction Module that enables users to annotate different objects freely on multiple frames. The AFI module will migrate scribble information among multiple interactive frames and generate multi-frame masks. Additionally, we employ the id-queried mechanism to process multiple objects in batches. Furthermore, for a more efficient propagation and lightweight model, we design a truncated re-propagation strategy to replace the previous multi-round fusion module, which employs an across-round memory that stores important interaction information. Our SwinB-SIAF achieves new state-of-the-art performance on DAVIS 2017 (89.6%, J&F@60). Moreover, our R50-SIAF is more than 3 faster than the state-of-the-art competitor under challenging multi-object scenarios.

Via

Access Paper or Ask Questions

DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

Jan 16, 2024

Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang

Figure 1 for DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

Figure 2 for DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

Figure 3 for DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

Figure 4 for DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

Abstract:The field of AI agents is advancing at an unprecedented rate due to the capabilities of large language models (LLMs). However, LLM-driven visual agents mainly focus on solving tasks for the image modality, which limits their ability to understand the dynamic nature of the real world, making it still far from real-life applications, e.g., guiding students in laboratory experiments and identifying their mistakes. Considering the video modality better reflects the ever-changing and perceptually intensive nature of real-world scenarios, we devise DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to handle dynamic video tasks. Given a video with a question/task, DoraemonGPT begins by converting the input video with massive content into a symbolic memory that stores \textit{task-related} attributes. This structured representation allows for spatial-temporal querying and reasoning by sub-task tools, resulting in concise and relevant intermediate results. Recognizing that LLMs have limited internal knowledge when it comes to specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to assess external knowledge and address tasks across different domains. Moreover, we introduce a novel LLM-driven planner based on Monte Carlo Tree Search to efficiently explore the large planning space for scheduling various tools. The planner iteratively finds feasible solutions by backpropagating the result's reward, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT in dynamic scenes and provide in-the-wild showcases demonstrating its ability to handle more complex questions than previous studies.

Via

Access Paper or Ask Questions

Controllable 3D Face Generation with Conditional Style Code Diffusion

Jan 11, 2024

Xiaolong Shen, Jianxin Ma, Chang Zhou, Zongxin Yang

Figure 1 for Controllable 3D Face Generation with Conditional Style Code Diffusion

Figure 2 for Controllable 3D Face Generation with Conditional Style Code Diffusion

Figure 3 for Controllable 3D Face Generation with Conditional Style Code Diffusion

Figure 4 for Controllable 3D Face Generation with Conditional Style Code Diffusion

Abstract:Generating photorealistic 3D faces from given conditions is a challenging task. Existing methods often rely on time-consuming one-by-one optimization approaches, which are not efficient for modeling the same distribution content, e.g., faces. Additionally, an ideal controllable 3D face generation model should consider both facial attributes and expressions. Thus we propose a novel approach called TEx-Face(TExt & Expression-to-Face) that addresses these challenges by dividing the task into three components, i.e., 3D GAN Inversion, Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion, we introduce two methods which aim to enhance the representation of style codes and alleviate 3D inconsistencies. Furthermore, we design a style code denoiser to incorporate multiple conditions into the style code and propose a data augmentation strategy to address the issue of insufficient paired visual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and CelebA-Dialog demonstrate the promising performance of our TEx-Face in achieving the efficient and controllable generation of photorealistic 3D faces. The code will be available at https://github.com/sxl142/TEx-Face.

* Accepted by AAAI 2024

Via

Access Paper or Ask Questions

GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

Jan 02, 2024

Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang

Figure 1 for GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

Figure 2 for GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

Figure 3 for GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

Figure 4 for GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

Abstract:In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer the blurry issue due to the encoder-only architecture that highly relies on the limited reference image. On the other hand, recent diffusion-based image-to-3d methods show vivid plausible results via distilling pre-trained 2D diffusion models into a 3D representation, yet require tedious per-scene optimization. Targeting these issues, we propose the GD$^2$-NeRF, a Generative Detail compensation framework via GAN and Diffusion that is both inference-time finetuning-free and with vivid plausible details. In detail, following a coarse-to-fine strategy, GD$^2$-NeRF is mainly composed of a One-stage Parallel Pipeline (OPP) and a 3D-consistent Detail Enhancer (Diff3DE). At the coarse stage, OPP first efficiently inserts the GAN model into the existing OG-NeRF pipeline for primarily relieving the blurry issue with in-distribution priors captured from the training dataset, achieving a good balance between sharpness (LPIPS, FID) and fidelity (PSNR, SSIM). Then, at the fine stage, Diff3DE further leverages the pre-trained image diffusion models to complement rich out-distribution details while maintaining decent 3D consistency. Extensive experiments on both the synthetic and real-world datasets show that GD$^2$-NeRF noticeably improves the details while without per-scene finetuning.

* Reading with Macbook Preview is recommended for best quality; Submitted to Journal

Via

Access Paper or Ask Questions