Yitong Wang

DiffI2I: Efficient Diffusion Model for Image-to-Image Translation

Aug 26, 2023
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Radu Timofte, Luc Van Gool

The Diffusion Model (DM) has emerged as the SOTA approach for image synthesis. However, existing DMs cannot perform well on some image-to-image translation (I2I) tasks. Unlike image synthesis, some I2I tasks, such as super-resolution, require results that are consistent with the ground-truth (GT) images. Traditional DMs for image synthesis require extensive iterations and large denoising models to estimate entire images, which gives them strong generative ability but also leads to artifacts and inefficiency for I2I. To tackle this challenge, we propose a simple, efficient, and powerful DM framework for I2I, called DiffI2I. Specifically, DiffI2I comprises three key components: a compact I2I prior extraction network (CPEN), a dynamic I2I transformer (DI2Iformer), and a denoising network. We train DiffI2I in two stages: pretraining and DM training. For pretraining, GT and input images are fed into CPEN$_{S1}$ to capture a compact I2I prior representation (IPR) that guides the DI2Iformer. In the second stage, the DM is trained to estimate the same IPR as CPEN$_{S1}$ using only the input images. Compared to traditional DMs, the compact IPR enables DiffI2I to obtain more accurate outcomes and to employ a lighter denoising network and fewer iterations. Through extensive experiments on various I2I tasks, we demonstrate that DiffI2I achieves SOTA performance while significantly reducing computational burdens.
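A minimal PyTorch sketch of the two-stage scheme described above, assuming the IPR is a global vector pooled by a small CNN; the module definitions, the toy noising step, and all dimensions are illustrative placeholders, not the authors' released code.

# Minimal sketch of DiffI2I's two-stage training scheme (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPEN(nn.Module):
    """Compresses a (GT, input) image pair into a compact I2I prior representation (IPR)."""
    def __init__(self, in_ch=6, ipr_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, ipr_dim),
        )

    def forward(self, x):
        return self.net(x)

# Stage 1: CPEN_S1 sees GT + input and would guide the (not shown) DI2Iformer.
cpen_s1 = CPEN(in_ch=6)
gt, lq = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
ipr = cpen_s1(torch.cat([gt, lq], dim=1))            # compact vector, not a full image

# Stage 2: a light denoiser is trained to recover the same IPR, so only a short
# diffusion over a small vector is needed (input-image conditioning omitted for brevity).
denoiser = nn.Sequential(nn.Linear(256 + 1, 512), nn.GELU(), nn.Linear(512, 256))
t = torch.rand(2, 1)                                  # toy diffusion timestep
noisy_ipr = ipr + torch.randn_like(ipr) * t           # toy noising of the IPR
pred_ipr = denoiser(torch.cat([noisy_ipr, t], dim=1))
loss = F.l1_loss(pred_ipr, ipr.detach())              # match CPEN_S1's IPR
loss.backward()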

SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

May 26, 2023
Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code will be available.
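A rough sketch of the video-level aggregation and contrastive supervision idea, assuming a transformer decoder over concatenated frame-level object embeddings and language tokens; all shapes, module choices, and the matching target are illustrative, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

T, Q, L, D = 8, 5, 12, 256               # frames, object queries per frame, text tokens, dim
frame_obj = torch.randn(T * Q, 1, D)     # frame-level object embeddings (sequence-first)
text_tok = torch.randn(L, 1, D)          # language tokens of the referring expression

# Video-level object queries attend to all frame-level embeddings and text tokens,
# giving unified temporal modeling and cross-modal alignment in one joint space.
video_queries = nn.Parameter(torch.randn(Q, 1, D))
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=D, nhead=8), num_layers=2)
memory = torch.cat([frame_obj, text_tok], dim=0)     # unified temporal + textual memory
video_obj = decoder(video_queries, memory)           # (Q, 1, D) video-level object embeddings

# Video-level contrastive supervision: pull the referred object's embedding toward
# the pooled sentence embedding and push the other object embeddings away.
sent = text_tok.mean(dim=0)                           # (1, D) pooled sentence embedding
logits = F.normalize(video_obj.squeeze(1), dim=-1) @ F.normalize(sent, dim=-1).t()  # (Q, 1)
target = torch.tensor([0])                            # assume query 0 matches the expression
loss = F.cross_entropy(logits.t() / 0.07, target)     # temperature-scaled contrastive loss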

FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Mar 30, 2023
Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Ren Yuxi, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan, Xingang Wang

Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories from text-based descriptions, which extends segmentation systems to more general-purpose application scenarios. However, existing methods are devoted to designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation between various segmentation tasks, thus hindering the uniformity of segmentation models. Hence, in this paper, we propose FreeSeg, a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and employs the same architecture and parameters to handle diverse segmentation tasks seamlessly at inference. Additionally, adaptive prompt learning facilitates the unified model to capture task-aware and category-sensitive concepts, improving model robustness in multi-task and varied scenarios. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, outperforming the best task-specific architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, and 20.1% PQ on panoptic segmentation for unseen classes on COCO.
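A hedged sketch of the adaptive prompt idea: one set of weights is conditioned on prompts built from learnable context plus task- and category-specific tokens, which is what lets a single network serve semantic, instance, and panoptic segmentation. The prompt layout below is an assumption, not FreeSeg's exact format.

import torch
import torch.nn as nn

D, ctx_len = 512, 4
tasks = ["semantic segmentation", "instance segmentation", "panoptic segmentation"]
categories = ["person", "car", "dog"]

# Learnable context shared across tasks, plus per-task and per-category embeddings.
context = nn.Parameter(torch.randn(ctx_len, D) * 0.02)
task_embed = nn.Embedding(len(tasks), D)
cat_embed = nn.Embedding(len(categories), D)

def build_prompt(task_id: int, cat_id: int) -> torch.Tensor:
    """Concatenate [context | task token | category token] into one prompt sequence."""
    t = task_embed(torch.tensor([task_id]))           # (1, D)
    c = cat_embed(torch.tensor([cat_id]))             # (1, D)
    return torch.cat([context, t, c], dim=0)          # (ctx_len + 2, D)

# The same prompt-conditioned queries feed one segmentation network for any
# task/category combination, which is what makes the model "all-in-one".
prompt = build_prompt(task_id=0, cat_id=1)            # e.g. semantic segmentation of "car"
print(prompt.shape)                                   # torch.Size([6, 512])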

* Accepted by CVPR 2023; camera-ready version 
DiffIR: Efficient Diffusion Model for Image Restoration

Mar 16, 2023
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Luc Van Gool

The diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process as a sequential application of a denoising network. However, unlike image synthesis, which generates each pixel from scratch, most pixels in image restoration (IR) are given. Thus, for IR, it is inefficient for traditional DMs to run massive iterations on a large model to estimate whole images or feature maps. To address this issue, we propose an efficient DM for IR (DiffIR), which consists of a compact IR prior extraction network (CPEN), a dynamic IR transformer (DIRformer), and a denoising network. Specifically, DiffIR has two training stages: pretraining and DM training. In pretraining, we input ground-truth images into CPEN$_{S1}$ to capture a compact IR prior representation (IPR) that guides the DIRformer. In the second stage, we train the DM to directly estimate the same IPR as the pretrained CPEN$_{S1}$ using only LQ images. We observe that since the IPR is only a compact vector, DiffIR can use fewer iterations than traditional DMs to obtain accurate estimations and generate more stable and realistic results. Since the iterations are few, our DiffIR can adopt a joint optimization of CPEN$_{S2}$, the DIRformer, and the denoising network, which further reduces the influence of estimation errors. We conduct extensive experiments on several IR tasks and achieve SOTA performance while consuming lower computational costs.
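A toy illustration of why diffusing a compact IPR is cheap: the reverse process runs only a few deterministic DDIM-style steps over a small vector conditioned on pooled LQ features. The schedule, step count, conditioning, and module names are assumptions made for this sketch.

import torch
import torch.nn as nn

ipr_dim, steps = 256, 4                        # a handful of iterations suffices for a compact vector
betas = torch.linspace(1e-4, 0.02, steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class IPRDenoiser(nn.Module):
    """Tiny denoiser conditioned on LQ-image features (here a pooled vector)."""
    def __init__(self, dim=ipr_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 512), nn.GELU(), nn.Linear(512, dim))

    def forward(self, z, lq_feat, t):
        t_emb = t.float().view(-1, 1) / steps
        return self.net(torch.cat([z, lq_feat, t_emb], dim=1))

denoiser = IPRDenoiser()
lq_feat = torch.randn(2, ipr_dim)              # features pooled from the LQ input
z = torch.randn(2, ipr_dim)                    # start the reverse process from noise

# DDIM-style deterministic reverse steps predicting z_0 (the clean IPR) directly.
with torch.no_grad():
    for t in reversed(range(steps)):
        z0_hat = denoiser(z, lq_feat, torch.full((2,), t))
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = (z - a_t.sqrt() * z0_hat) / (1 - a_t).sqrt()
        z = a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * eps
# z now approximates the IPR that the DIRformer would consume as dynamic guidance.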

Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

Mar 16, 2023
Kunyang Han, Yong Liu, Jun Hao Liew, Henghui Ding, Yunchao Wei, Jiajun Liu, Yitong Wang, Yansong Tang, Yujiu Yang, Jiashi Feng, Yao Zhao

Recent advancements in pre-trained vision-language models, such as CLIP, have enabled the segmentation of arbitrary concepts solely from textual inputs, a process commonly referred to as open-vocabulary semantic segmentation (OVS). However, existing OVS techniques confront a fundamental challenge: the trained classifier tends to overfit to the base classes observed during training, resulting in suboptimal generalization to unseen classes. To mitigate this issue, recent studies have proposed the use of an additional frozen pre-trained CLIP for classification. Nonetheless, this approach incurs heavy computational overhead, as the CLIP vision encoder must be forward-passed repeatedly for each mask, rendering it impractical for real-world applications. To address this challenge, our objective is to develop a fast OVS model that performs comparably or better without the extra computational burden of the CLIP image encoder during inference. To this end, we propose the core idea of preserving the generalizable representation when fine-tuning on known classes. Specifically, we introduce a text diversification strategy that generates a set of synonyms for each training category, which prevents the learned representation from collapsing onto specific known category names. Additionally, we employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP. Extensive experiments demonstrate that our proposed model achieves robust generalization performance across various datasets. Furthermore, we perform a preliminary exploration of open-vocabulary video segmentation and present a benchmark that can facilitate future open-vocabulary research in the video domain.
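A simplified sketch of the two ingredients described above: text diversification that samples a synonym per category each iteration, and a text-guided distillation loss that keeps the student's region-to-text affinities close to a frozen teacher's. Placeholder tensors stand in for real CLIP encoders; the synonym table and the exact loss form are illustrative assumptions.

import random
import torch
import torch.nn.functional as F

synonyms = {
    "sofa": ["sofa", "couch", "settee"],
    "person": ["person", "human", "pedestrian"],
}

def diversify(categories):
    """Replace each category name with a randomly sampled synonym for this iteration."""
    return [random.choice(synonyms.get(c, [c])) for c in categories]

# Placeholder embeddings standing in for real text/visual encoders.
num_regions, num_classes, dim = 10, 2, 512
student_region = torch.randn(num_regions, dim, requires_grad=True)   # trainable mask features
teacher_region = torch.randn(num_regions, dim)                       # frozen CLIP image features
text_embed = torch.randn(num_classes, dim)                           # CLIP text embeddings

def affinity(region, text, tau=0.07):
    return F.normalize(region, dim=-1) @ F.normalize(text, dim=-1).t() / tau

# Text-guided distillation: match the student's region-to-text similarity
# distribution to the frozen teacher's instead of matching raw features.
kd_loss = F.kl_div(
    F.log_softmax(affinity(student_region, text_embed), dim=-1),
    F.softmax(affinity(teacher_region, text_embed), dim=-1),
    reduction="batchmean",
)
kd_loss.backward()
print(diversify(["sofa", "person"]))    # e.g. ['couch', 'pedestrian']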

Knowledge Distillation based Degradation Estimation for Blind Super-Resolution

Nov 30, 2022
Bin Xia, Yulun Zhang, Yitong Wang, Yapeng Tian, Wenming Yang, Radu Timofte, Luc Van Gool

Blind image super-resolution (Blind-SR) aims to recover a high-resolution (HR) image from its corresponding low-resolution (LR) input image with unknown degradations. Most existing works design an explicit degradation estimator for each degradation to guide SR. However, it is infeasible to provide concrete labels for multiple degradation combinations (e.g., blur, noise, JPEG compression) to supervise the degradation estimator training. In addition, these specialized designs for certain degradations, such as blur, impede the models from generalizing to handle different degradations. To this end, it is necessary to design an implicit degradation estimator that can extract a discriminative degradation representation for all degradations without relying on the supervision of degradation ground truth. In this paper, we propose a Knowledge Distillation based Blind-SR network (KDSR). It consists of a knowledge distillation based implicit degradation estimator network (KD-IDE) and an efficient SR network. To learn the KDSR model, we first train a teacher network, KD-IDE$_{T}$. It takes paired HR and LR patches as inputs and is optimized jointly with the SR network. Then, we further train a student network, KD-IDE$_{S}$, which only takes LR images as input and learns to extract the same implicit degradation representation (IDR) as KD-IDE$_{T}$. In addition, to fully use the extracted IDR, we design a simple, strong, and efficient IDR-based dynamic convolution residual block (IDR-DCRB) to build the SR network. We conduct extensive experiments under classic and real-world degradation settings. The results show that KDSR achieves SOTA performance and can generalize to various degradation processes. The source codes and pre-trained models will be released.
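A minimal sketch of the teacher-student distillation described above, under the assumption that the IDR is a global vector pooled by a small CNN; the module definitions and the simple L1 distillation objective are illustrative placeholders, not the released KDSR code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IDE(nn.Module):
    """Implicit degradation estimator: images -> compact degradation representation (IDR)."""
    def __init__(self, in_ch, idr_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, idr_dim),
        )

    def forward(self, x):
        return self.net(x)

# Teacher sees concatenated HR+LR patches (HR downscaled to LR size here for simplicity);
# the student must reproduce the teacher's IDR from the LR image alone.
teacher, student = IDE(in_ch=6), IDE(in_ch=3)
hr = torch.randn(4, 3, 64, 64)
lr = F.interpolate(hr, scale_factor=0.5, mode="bicubic", align_corners=False)   # stand-in LR input
hr_small = F.interpolate(hr, size=lr.shape[-2:], mode="bicubic", align_corners=False)

with torch.no_grad():                                  # teacher assumed pretrained/frozen
    idr_t = teacher(torch.cat([hr_small, lr], dim=1))
idr_s = student(lr)
distill_loss = F.l1_loss(idr_s, idr_t)                 # knowledge distillation objective
distill_loss.backward()
# idr_s would then modulate dynamic convolutions (the IDR-DCRB blocks) in the SR network.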

Global Spectral Filter Memory Network for Video Object Segmentation

Oct 12, 2022
Yong Liu, Ran Yu, Jiahao Wang, Xinyuan Zhao, Yitong Wang, Yansong Tang, Yujiu Yang

This paper studies semi-supervised video object segmentation through boosting intra-frame interaction. Recent memory network-based methods focus on exploiting inter-frame temporal reference while paying little attention to intra-frame spatial dependency. Specifically, these segmentation models tend to be susceptible to interference from unrelated non-target objects in a certain frame. To this end, we propose the Global Spectral Filter Memory network (GSFM), which improves intra-frame interaction through learning long-term spatial dependencies in the spectral domain. The key component of GSFM is the 2D (inverse) discrete Fourier transform for spatial information mixing. Besides, we empirically find that low-frequency features should be enhanced in the encoder (backbone) while high-frequency features should be enhanced in the decoder (segmentation head). We attribute this to the encoder's role of extracting semantic information and the decoder's role of highlighting fine-grained details. Thus, a Low (High) Frequency Module is proposed to suit this circumstance. Extensive experiments on the popular DAVIS and YouTube-VOS benchmarks demonstrate that GSFM noticeably outperforms the baseline method and achieves state-of-the-art performance. Besides, extensive analysis shows that the proposed modules are reasonable and of great generalization ability. Our source code is available at https://github.com/workforai/GSFM.
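A minimal sketch of a global spectral filter in this spirit: a 2D FFT of the feature map, a learnable per-frequency filter, and an inverse transform, so every spatial position interacts with every other one. The filter shape and placement are assumptions rather than GSFM's exact design.

import torch
import torch.nn as nn
import torch.fft

class GlobalSpectralFilter(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        # One complex-valued learnable weight per channel and rFFT frequency bin.
        self.weight = nn.Parameter(torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):                                 # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")           # complex spectrum (B, C, H, W//2+1)
        filt = torch.view_as_complex(self.weight)         # learnable frequency response
        spec = spec * filt                                 # global mixing in the spectral domain
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

x = torch.randn(2, 64, 24, 24)
out = GlobalSpectralFilter(64, 24, 24)(x)
print(out.shape)                                          # torch.Size([2, 64, 24, 24])
# Enhancing low frequencies in the encoder and high frequencies in the decoder, as the
# paper observes, would amount to different constraints or initializations on `weight`.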

* ECCV2022 
Basic Binary Convolution Unit for Binarized Image Restoration Network

Oct 02, 2022
Bin Xia, Yulun Zhang, Yitong Wang, Yapeng Tian, Wenming Yang, Radu Timofte, Luc Van Gool

Lighter and faster image restoration (IR) models are crucial for deployment on resource-limited devices. The binary neural network (BNN), one of the most promising model compression methods, can dramatically reduce the computations and parameters of full-precision convolutional neural networks (CNNs). However, BNNs and full-precision CNNs have different properties, so the experience of designing CNNs can hardly be transferred to developing BNNs. In this study, we reconsider components in binary convolution, such as the residual connection, BatchNorm, activation function, and structure, for IR tasks. We conduct systematic analyses to explain each component's role in binary convolution and discuss the pitfalls. Specifically, we find that the residual connection can reduce the information loss caused by binarization; BatchNorm can bridge the value-range gap between the residual connection and binary convolution; and the position of the activation function dramatically affects the performance of the BNN. Based on our findings and analyses, we design a simple yet efficient basic binary convolution unit (BBCU). Furthermore, we divide IR networks into four parts and specially design variants of BBCU for each part to explore the benefit of binarizing these parts. We conduct experiments on different IR tasks, and our BBCU significantly outperforms other BNNs and lightweight models, which shows that BBCU can serve as a basic unit for binarized IR networks. All codes and models will be released.
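A hedged sketch of a binary convolution unit following the findings listed above: sign-binarized weights and activations with a straight-through estimator, BatchNorm before a full-precision residual connection, and the activation placed after the residual add. The exact scaling factors and layer order in BBCU may differ; this is illustrative only.

import torch
import torch.nn as nn

class BinarySign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()    # clipped straight-through estimator

class BinaryConvUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)  # holds latent weights
        self.bn = nn.BatchNorm2d(channels)           # aligns value ranges before the residual add
        self.act = nn.PReLU(channels)

    def forward(self, x):
        w_bin = BinarySign.apply(self.conv.weight)   # binarized weights
        x_bin = BinarySign.apply(x)                  # binarized activations
        out = nn.functional.conv2d(x_bin, w_bin, padding=1)
        out = self.bn(out)
        out = out + x                                # full-precision residual limits information loss
        return self.act(out)                         # activation placed after the residual add

y = BinaryConvUnit(32)(torch.randn(1, 32, 48, 48))
print(y.shape)                                       # torch.Size([1, 32, 48, 48])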

Mobile Edge Computing, Metaverse, 6G Wireless Communications, Artificial Intelligence, and Blockchain: Survey and Their Convergence

Sep 28, 2022
Yitong Wang, Jun Zhao

With the advances of the Internet of Things (IoT) and 5G/6G wireless communications, the paradigms of mobile computing have developed dramatically in recent years, from centralized mobile cloud computing to distributed fog computing and mobile edge computing (MEC). MEC pushes compute-intensive assignments to the edge of the network and brings resources as close to the endpoints as possible, addressing the shortcomings of mobile devices with regard to storage space, resource optimisation, computational performance and efficiency. Compared to cloud computing, MEC provides a distributed infrastructure closer to the endpoints, and its convergence with other emerging technologies, including the Metaverse, 6G wireless communications, artificial intelligence (AI), and blockchain, also addresses the problems of network resource allocation, increased network load, and latency requirements. Accordingly, this paper investigates the computational paradigms used to meet the stringent requirements of modern applications. The application scenarios of MEC in mobile augmented reality (MAR) are provided. Furthermore, this survey presents the motivation for the MEC-based Metaverse and introduces the applications of MEC to the Metaverse. Particular emphasis is given to the set of technical fusions mentioned above, e.g., 6G with the MEC paradigm, MEC strengthened by blockchain, etc.

* This paper appears in the Proceedings of 2022 IEEE 8th World Forum on Internet of Things (WF-IoT). Please feel free to contact us for questions or remarks 
Human De-occlusion: Invisible Perception and Recovery for Humans

Mar 22, 2021
Qiang Zhou, Shiyin Wang, Yitong Wang, Zilong Huang, Xinggang Wang

In this paper, we tackle the problem of human de-occlusion, which reasons about occluded segmentation masks and the invisible appearance content of humans. In particular, a two-stage framework is proposed to estimate the invisible portions and recover the content inside. For the mask completion stage, a stacked network structure is devised to refine inaccurate masks from a general instance segmentation model and predict integrated masks simultaneously. Additionally, guidance from human parsing and typical pose masks is leveraged to provide prior information. For the content recovery stage, a novel parsing guided attention module is applied to isolate body parts and capture context information across multiple scales. Besides, an Amodal Human Perception dataset (AHP) is collected to address the task of human de-occlusion. AHP has the advantages of providing annotations from real-world scenes and containing comparatively more humans than other amodal perception datasets. Based on this dataset, experiments demonstrate that our method outperforms state-of-the-art techniques on both mask completion and content recovery. Our AHP dataset is available at \url{https://sydney0zq.github.io/ahp/}.
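An illustrative sketch of the parsing-guided attention idea: a soft human-parsing map splits the feature map into per-part regions, and each part aggregates context within its own region before the content is recovered. The shapes, the soft-pooling scheme, and the fusion step are assumptions, not the paper's exact module.

import torch
import torch.nn.functional as F

B, C, H, W, P = 2, 64, 32, 32, 5                  # P = number of parsing classes (body parts)
feat = torch.randn(B, C, H, W)                    # features of the occluded human
parsing_logits = torch.randn(B, P, H, W)          # predicted human-parsing map

part_attn = F.softmax(parsing_logits, dim=1)      # soft assignment of pixels to parts
# Per-part context vectors: weighted average of features inside each part region.
part_ctx = torch.einsum("bphw,bchw->bpc", part_attn, feat) / (
    part_attn.sum(dim=(2, 3)).unsqueeze(-1) + 1e-6
)                                                 # (B, P, C)
# Broadcast each part's context back to its pixels and fuse with the original features.
ctx_map = torch.einsum("bphw,bpc->bchw", part_attn, part_ctx)
recovered_input = feat + ctx_map                  # fed to the content-recovery decoder
print(recovered_input.shape)                      # torch.Size([2, 64, 32, 32])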

* 11 pages, 6 figures, conference 