Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wangmeng Zuo

Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs

Sep 26, 2024

Xinya Shu, Yu Li, Dongwei Ren, Xiaohe Wu, Jin Li, Wangmeng Zuo

Figure 1 for Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs

Figure 2 for Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs

Figure 3 for Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs

Figure 4 for Reblurring-Guided Single Image Defocus Deblurring: A Learning Framework with Misaligned Training Pairs

Abstract:For single image defocus deblurring, acquiring well-aligned training pairs (or training triplets), i.e., a defocus blurry image, an all-in-focus sharp image (and a defocus blur map), is an intricate task for the development of deblurring models. Existing image defocus deblurring methods typically rely on training data collected by specialized imaging equipment, presupposing that these pairs or triplets are perfectly aligned. However, in practical scenarios involving the collection of real-world data, direct acquisition of training triplets is infeasible, and training pairs inevitably encounter spatial misalignment issues. In this work, we introduce a reblurring-guided learning framework for single image defocus deblurring, enabling the learning of a deblurring network even with misaligned training pairs. Specifically, we first propose a baseline defocus deblurring network that utilizes spatially varying defocus blur map as degradation prior to enhance the deblurring performance. Then, to effectively learn the baseline defocus deblurring network with misaligned training pairs, our reblurring module ensures spatial consistency between the deblurred image, the reblurred image and the input blurry image by reconstructing spatially variant isotropic blur kernels. Moreover, the spatially variant blur derived from the reblurring module can serve as pseudo supervision for defocus blur map during training, interestingly transforming training pairs into training triplets. Additionally, we have collected a new dataset specifically for single image defocus deblurring (SDD) with typical misalignments, which not only substantiates our proposed method but also serves as a benchmark for future research.

* The source code and dataset are available at https://github.com/ssscrystal/Reblurring-guided-JDRL

Via

Access Paper or Ask Questions

LPT++: Efficient Training on Mixture of Long-tailed Experts

Sep 17, 2024

Bowen Dong, Pan Zhou, Wangmeng Zuo

Figure 1 for LPT++: Efficient Training on Mixture of Long-tailed Experts

Figure 2 for LPT++: Efficient Training on Mixture of Long-tailed Experts

Figure 3 for LPT++: Efficient Training on Mixture of Long-tailed Experts

Figure 4 for LPT++: Efficient Training on Mixture of Long-tailed Experts

Abstract:We introduce LPT++, a comprehensive framework for long-tailed classification that combines parameter-efficient fine-tuning (PEFT) with a learnable model ensemble. LPT++ enhances frozen Vision Transformers (ViTs) through the integration of three core components. The first is a universal long-tailed adaptation module, which aggregates long-tailed prompts and visual adapters to adapt the pretrained model to the target domain, meanwhile improving its discriminative ability. The second is the mixture of long-tailed experts framework with a mixture-of-experts (MoE) scorer, which adaptively calculates reweighting coefficients for confidence scores from both visual-only and visual-language (VL) model experts to generate more accurate predictions. Finally, LPT++ employs a three-phase training framework, wherein each critical module is learned separately, resulting in a stable and effective long-tailed classification training paradigm. Besides, we also propose the simple version of LPT++ namely LPT, which only integrates visual-only pretrained ViT and long-tailed prompts to formulate a single model method. LPT can clearly illustrate how long-tailed prompts works meanwhile achieving comparable performance without VL pretrained models. Experiments show that, with only ~1% extra trainable parameters, LPT++ achieves comparable accuracy against all the counterparts.

* Extended version of arXiv:2210.01033

Via

Access Paper or Ask Questions

SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Aug 25, 2024

Wenrui Li, Yapeng Mi, Fucheng Cai, Zhe Yang, Wangmeng Zuo, Xingtao Wang, Xiaopeng Fan

Figure 1 for SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Figure 2 for SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Figure 3 for SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Figure 4 for SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Abstract:Text-driven 3D scene generation has seen significant advancements recently. However, most existing methods generate single-view images using generative models and then stitch them together in 3D space. This independent generation for each view often results in spatial inconsistency and implausibility in the 3D scenes. To address this challenge, we proposed a novel text-driven 3D-consistent scene generation model: SceneDreamer360. Our proposed method leverages a text-driven panoramic image generation model as a prior for 3D scene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency across multi-view panoramic images. Specifically, SceneDreamer360 enhances the fine-tuned Panfusion generator with a three-stage panoramic enhancement, enabling the generation of high-resolution, detail-rich panoramic images. During the 3D scene construction, a novel point cloud fusion initialization method is used, producing higher quality and spatially consistent point clouds. Our extensive experiments demonstrate that compared to other methods, SceneDreamer360 with its panoramic image generation and 3DGS can produce higher quality, spatially consistent, and visually appealing 3D scenes from any text prompt. Our codes are available at \url{https://github.com/liwrui/SceneDreamer360}.

Via

Access Paper or Ask Questions

AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Aug 21, 2024

Minheng Ni, Chenfei Wu, Huaying Yuan, Zhengyuan Yang, Ming Gong, Lijuan Wang, Zicheng Liu, Wangmeng Zuo, Nan Duan

Figure 1 for AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Figure 2 for AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Figure 3 for AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Figure 4 for AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Abstract:With the advancement of generative models, the synthesis of different sensory elements such as music, visuals, and speech has achieved significant realism. However, the approach to generate multi-sensory outputs has not been fully explored, limiting the application on high-value scenarios such as of directing a film. Developing a movie director agent faces two major challenges: (1) Lack of parallelism and online scheduling with production steps: In the production of multi-sensory films, there are complex dependencies between different sensory elements, and the production time for each element varies. (2) Diverse needs and clear communication demands with users: Users often cannot clearly express their needs until they see a draft, which requires human-computer interaction and iteration to continually adjust and optimize the film content based on user feedback. To address these issues, we introduce AutoDirector, an interactive multi-sensory composition framework that supports long shots, special effects, music scoring, dubbing, and lip-syncing. This framework improves the efficiency of multi-sensory film production through automatic scheduling and supports the modification and improvement of interactive tasks to meet user needs. AutoDirector not only expands the application scope of human-machine collaboration but also demonstrates the potential of AI in collaborating with humans in the role of a film director to complete multi-sensory films.

Via

Access Paper or Ask Questions

SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

Aug 21, 2024

Wei Shang, Dongwei Ren, Wanying Zhang, Qilong Wang, Pengfei Zhu, Wangmeng Zuo

Figure 1 for SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

Figure 2 for SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

Figure 3 for SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

Figure 4 for SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

Abstract:Modern consumer cameras commonly employ the rolling shutter (RS) imaging mechanism, via which images are captured by scanning scenes row-by-row, resulting in RS distortion for dynamic scenes. To correct RS distortion, existing methods adopt a fully supervised learning manner that requires high framerate global shutter (GS) images as ground-truth for supervision. In this paper, we propose an enhanced Self-supervised learning framework for Dual reversed RS distortion Correction (SelfDRSC++). Firstly, we introduce a lightweight DRSC network that incorporates a bidirectional correlation matching block to refine the joint optimization of optical flows and corrected RS features, thereby improving correction performance while reducing network parameters. Subsequently, to effectively train the DRSC network, we propose a self-supervised learning strategy that ensures cycle consistency between input and reconstructed dual reversed RS images. The RS reconstruction in SelfDRSC++ can be interestingly formulated as a specialized instance of video frame interpolation, where each row in reconstructed RS images is interpolated from predicted GS images by utilizing RS distortion time maps. By achieving superior performance while simplifying the training process, SelfDRSC++ enables feasible one-stage self-supervised training. Additionally, besides start and end RS scanning time, SelfDRSC++ allows supervision of GS images at arbitrary intermediate scanning times, thus enabling the learned DRSC network to generate high framerate GS videos. The code and trained models are available at \url{https://github.com/shangwei5/SelfDRSC_plusplus}.

* 13 pages, 9 figures, and the code is available at \url{https://github.com/shangwei5/SelfDRSC_plusplus}

Via

Access Paper or Ask Questions

Thin-Plate Spline-based Interpolation for Animation Line Inbetweening

Aug 17, 2024

Tianyi Zhu, Wei Shang, Dongwei Ren, Wangmeng Zuo

Figure 1 for Thin-Plate Spline-based Interpolation for Animation Line Inbetweening

Figure 2 for Thin-Plate Spline-based Interpolation for Animation Line Inbetweening

Figure 3 for Thin-Plate Spline-based Interpolation for Animation Line Inbetweening

Figure 4 for Thin-Plate Spline-based Interpolation for Animation Line Inbetweening

Abstract:Animation line inbetweening is a crucial step in animation production aimed at enhancing animation fluidity by predicting intermediate line arts between two key frames. However, existing methods face challenges in effectively addressing sparse pixels and significant motion in line art key frames. In literature, Chamfer Distance (CD) is commonly adopted for evaluating inbetweening performance. Despite achieving favorable CD values, existing methods often generate interpolated frames with line disconnections, especially for scenarios involving large motion. Motivated by this observation, we propose a simple yet effective interpolation method for animation line inbetweening that adopts thin-plate spline-based transformation to estimate coarse motion more accurately by modeling the keypoint correspondence between two key frames, particularly for large motion scenarios. Building upon the coarse estimation, a motion refine module is employed to further enhance motion details before final frame interpolation using a simple UNet model. Furthermore, to more accurately assess the performance of animation line inbetweening, we refine the CD metric and introduce a novel metric termed Weighted Chamfer Distance, which demonstrates a higher consistency with visual perception quality. Additionally, we incorporate Earth Mover's Distance and conduct user study to provide a more comprehensive evaluation. Our method outperforms existing approaches by delivering high-quality interpolation results with enhanced fluidity. The code is available at \url{https://github.com/Tian-one/tps-inbetween}.

Via

Access Paper or Ask Questions

Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Jul 13, 2024

Wei Shang, Dongwei Ren, Wanying Zhang, Yuming Fang, Wangmeng Zuo, Kede Ma

Figure 1 for Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Figure 2 for Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Figure 3 for Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Figure 4 for Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Abstract:Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we first describe a strong baseline for AVSR by putting together three variants of elementary building blocks: 1) a flow-guided recurrent unit that aggregates spatiotemporal information from previous frames, 2) a flow-refined cross-attention unit that selects spatiotemporal information from future frames, and 3) a hyper-upsampling unit that generates scaleaware and content-independent upsampling kernels. We then introduce ST-AVSR by equipping our baseline with a multi-scale structural and textural prior computed from the pre-trained VGG network. This prior has proven effective in discriminating structure and texture across different locations and scales, which is beneficial for AVSR. Comprehensive experiments show that ST-AVSR significantly improves super-resolution quality, generalization ability, and inference speed over the state-of-theart. The code is available at https://github.com/shangwei5/ST-AVSR.

* Accepted by ECCV 2024, the code is available at https://github.com/shangwei5/ST-AVSR

Via

Access Paper or Ask Questions

Multi-modal Crowd Counting via a Broker Modality

Jul 10, 2024

Haoliang Meng, Xiaopeng Hong, Chenhao Wang, Miao Shang, Wangmeng Zuo

Abstract:Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd-counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters, yet achieves promising results. The code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting.

* This is the preprint version of the paper and supplemental material to appear in ECCV 2024. Please cite the final published version. Code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting

Via

Access Paper or Ask Questions

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Jul 01, 2024

Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, Jingdong Wang

Abstract:Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at https://github.com/MingXiangL/DEVIL.

Via

Access Paper or Ask Questions

CPT: Consistent Proxy Tuning for Black-box Optimization

Jul 01, 2024

Yuanyang He, Zitong Huang, Xinxing Xu, Rick Siow Mong Goh, Salman Khan, Wangmeng Zuo, Yong Liu, Chun-Mei Feng

Figure 1 for CPT: Consistent Proxy Tuning for Black-box Optimization

Figure 2 for CPT: Consistent Proxy Tuning for Black-box Optimization

Figure 3 for CPT: Consistent Proxy Tuning for Black-box Optimization

Figure 4 for CPT: Consistent Proxy Tuning for Black-box Optimization

Abstract:Black-box tuning has attracted recent attention due to that the structure or inner parameters of advanced proprietary models are not accessible. Proxy-tuning provides a test-time output adjustment for tuning black-box language models. It applies the difference of the output logits before and after tuning a smaller white-box "proxy" model to improve the black-box model. However, this technique serves only as a decoding-time algorithm, leading to an inconsistency between training and testing which potentially limits overall performance. To address this problem, we introduce Consistent Proxy Tuning (CPT), a simple yet effective black-box tuning method. Different from Proxy-tuning, CPT additionally exploits the frozen large black-box model and another frozen small white-box model, ensuring consistency between training-stage optimization objective and test-time proxies. This consistency benefits Proxy-tuning and enhances model performance. Note that our method focuses solely on logit-level computation, which makes it model-agnostic and applicable to any task involving logit classification. Extensive experimental results demonstrate the superiority of our CPT in both black-box tuning of Large Language Models (LLMs) and Vision-Language Models (VLMs) across various datasets. The code is available at https://github.com/chunmeifeng/CPT.

* 10 pages,2 figures plus supplementary materials

Via

Access Paper or Ask Questions