Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shiqi Wang

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Aug 15, 2024

Pingping Zhang, Jinlong Li, Meng Wang, Nicu Sebe, Sam Kwong, Shiqi Wang

Figure 1 for When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Figure 2 for When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Figure 3 for When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Figure 4 for When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Abstract:Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaption (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.

Via

Access Paper or Ask Questions

High Efficiency Image Compression for Large Visual-Language Models

Jul 24, 2024

Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye

Abstract:In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer to propose a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on the representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained by the losses based on semantic tokens of the large model, which introduce enhanced generalization capability for various data and tasks. {Experimental results demonstrate that the proposed framework could efficiently achieve much better rate-accuracy performance compared to the state-of-the-art coding standard, Versatile Video Coding.} Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.

Via

Access Paper or Ask Questions

DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

Jun 12, 2024

Juncheng Wu, Zhangkai Ni, Hanli Wang, Wenhan Yang, Yuyin Zhou, Shiqi Wang

Figure 1 for DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

Figure 2 for DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

Figure 3 for DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

Figure 4 for DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

Abstract:Image deep features extracted by pre-trained networks are known to contain rich and informative representations. In this paper, we present Deep Degradation Response (DDR), a method to quantify changes in image deep features under varying degradation conditions. Specifically, our approach facilitates flexible and adaptive degradation, enabling the controlled synthesis of image degradation through text-driven prompts. Extensive evaluations demonstrate the versatility of DDR as an image descriptor, with strong correlations observed with key image attributes such as complexity, colorfulness, sharpness, and overall quality. Moreover, we demonstrate the efficacy of DDR across a spectrum of applications. It excels as a blind image quality assessment metric, outperforming existing methodologies across multiple datasets. Additionally, DDR serves as an effective unsupervised learning objective in image restoration tasks, yielding notable advancements in image deblurring and single-image super-resolution. Our code will be made available.

Via

Access Paper or Ask Questions

Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

May 29, 2024

Zhangkai Ni, Yue Liu, Keyan Ding, Wenhan Yang, Hanli Wang, Shiqi Wang

Figure 1 for Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

Figure 2 for Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

Figure 3 for Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

Figure 4 for Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

Abstract:Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field, however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propose integrating deep features from pre-trained visual models with a statistical analysis model into a Multi-scale Deep Feature Statistics (MDFS) model for achieving opinion-unaware BIQA (OU-BIQA), thereby eliminating the reliance on human rating data and significantly improving training efficiency. Specifically, we extract patch-wise multi-scale features from pre-trained vision models, which are subsequently fitted into a multivariate Gaussian (MVG) model. The final quality score is determined by quantifying the distance between the MVG model derived from the test image and the benchmark MVG model derived from the high-quality image set. A comprehensive series of experiments conducted on various datasets show that our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models. Furthermore, it shows improved generalizability across diverse target-specific BIQA tasks. Our code is available at: https://github.com/eezkni/MDFS

* Accepted to IEEE Transactions on Multimedia 2024

Via

Access Paper or Ask Questions

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

May 29, 2024

Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, Shiqi Wang

Figure 1 for Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Figure 2 for Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Figure 3 for Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Figure 4 for Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Abstract:While recent advancements in large multimodal models (LMMs) have significantly improved their abilities in image quality assessment (IQA) relying on absolute quality rating, how to transfer reliable relative quality comparison outputs to continuous perceptual quality scores remains largely unexplored. To address this gap, we introduce Compare2Score-an all-around LMM-based no-reference IQA (NR-IQA) model, which is capable of producing qualitatively comparative responses and effectively translating these discrete comparative levels into a continuous quality score. Specifically, during training, we present to generate scaled-up comparative instructions by comparing images from the same IQA dataset, allowing for more flexible integration of diverse IQA datasets. Utilizing the established large-scale training corpus, we develop a human-like visual quality comparator. During inference, moving beyond binary choices, we propose a soft comparison method that calculates the likelihood of the test image being preferred over multiple predefined anchor images. The quality score is further optimized by maximum a posteriori estimation with the resulting probability matrix. Extensive experiments on nine IQA datasets validate that the Compare2Score effectively bridges text-defined comparative levels during training with converted single image quality score for inference, surpassing state-of-the-art IQA models across diverse scenarios. Moreover, we verify that the probability-matrix-based inference conversion not only improves the rating accuracy of Compare2Score but also zero-shot general-purpose LMMs, suggesting its intrinsic effectiveness.

Via

Access Paper or Ask Questions

Training LLMs to Better Self-Debug and Explain Code

May 28, 2024

Nan Jiang, Xiaopeng Li, Shiqi Wang, Qiang Zhou, Soneya Binta Hossain, Baishakhi Ray, Varun Kumar, Xiaofei Ma, Anoop Deoras

Figure 1 for Training LLMs to Better Self-Debug and Explain Code

Figure 2 for Training LLMs to Better Self-Debug and Explain Code

Figure 3 for Training LLMs to Better Self-Debug and Explain Code

Figure 4 for Training LLMs to Better Self-Debug and Explain Code

Abstract:In the domain of code generation, self-debugging is crucial. It allows LLMs to refine their generated code based on execution feedback. This is particularly important because generating correct solutions in one attempt proves challenging for complex tasks. Prior works on self-debugging mostly focus on prompting methods by providing LLMs with few-shot examples, which work poorly on small open-sourced LLMs. In this work, we propose a training framework that significantly improves self-debugging capability of LLMs. Intuitively, we observe that a chain of explanations on the wrong code followed by code refinement helps LLMs better analyze the wrong code and do refinement. We thus propose an automated pipeline to collect a high-quality dataset for code explanation and refinement by generating a number of explanations and refinement trajectories and filtering via execution verification. We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories with a novel reward design considering code explanation and refinement quality. SFT improves the pass@1 by up to 15.92% and pass@10 by 9.30% over four benchmarks. RL training brings additional up to 3.54% improvement on pass@1 and 2.55% improvement on pass@10. The trained LLMs show iterative refinement ability, and can keep refining code continuously. Lastly, our human evaluation shows that the LLMs trained with our framework generate more useful code explanations and help developers better understand bugs in source code.

Via

Access Paper or Ask Questions

Exposing AI-generated Videos: A Benchmark Dataset and a Local-and-Global Temporal Defect Based Detection Method

May 07, 2024

Peisong He, Leyao Zhu, Jiaxing Li, Shiqi Wang, Haoliang Li

Abstract:The generative model has made significant advancements in the creation of realistic videos, which causes security issues. However, this emerging risk has not been adequately addressed due to the absence of a benchmark dataset for AI-generated videos. In this paper, we first construct a video dataset using advanced diffusion-based video generation algorithms with various semantic contents. Besides, typical video lossy operations over network transmission are adopted to generate degraded samples. Then, by analyzing local and global temporal defects of current AI-generated videos, a novel detection framework by adaptively learning local motion information and global appearance variation is constructed to expose fake videos. Finally, experiments are conducted to evaluate the generalization and robustness of different spatial and temporal domain detection methods, where the results can serve as the baseline and demonstrate the research challenge for future studies.

Via

Access Paper or Ask Questions

Gradient-Congruity Guided Federated Sparse Training

May 02, 2024

Chris Xing Tian, Yibing Liu, Haoliang Li, Ray C. C. Cheung, Shiqi Wang

Figure 1 for Gradient-Congruity Guided Federated Sparse Training

Figure 2 for Gradient-Congruity Guided Federated Sparse Training

Figure 3 for Gradient-Congruity Guided Federated Sparse Training

Figure 4 for Gradient-Congruity Guided Federated Sparse Training

Abstract:Edge computing allows artificial intelligence and machine learning models to be deployed on edge devices, where they can learn from local data and collaborate to form a global model. Federated learning (FL) is a distributed machine learning technique that facilitates this process while preserving data privacy. However, FL also faces challenges such as high computational and communication costs regarding resource-constrained devices, and poor generalization performance due to the heterogeneity of data across edge clients and the presence of out-of-distribution data. In this paper, we propose the Gradient-Congruity Guided Federated Sparse Training (FedSGC), a novel method that integrates dynamic sparse training and gradient congruity inspection into federated learning framework to address these issues. Our method leverages the idea that the neurons, in which the associated gradients with conflicting directions with respect to the global model contain irrelevant or less generalized information for other clients, and could be pruned during the sparse training process. Conversely, the neurons where the associated gradients with consistent directions could be grown in a higher priority. In this way, FedSGC can greatly reduce the local computation and communication overheads while, at the same time, enhancing the generalization abilities of FL. We evaluate our method on challenging non-i.i.d settings and show that it achieves competitive accuracy with state-of-the-art FL methods across various scenarios while minimizing computation and communication costs.

Via

Access Paper or Ask Questions

FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models

Apr 20, 2024

Yixuan Li, Xuelin Liu, Xiaoyang Wang, Shiqi Wang, Weisi Lin

Figure 1 for FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models

Figure 2 for FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models

Figure 3 for FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models

Figure 4 for FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models

Abstract:Recently, fake images generated by artificial intelligence (AI) models have become indistinguishable from the real, exerting new challenges for fake image detection models. To this extent, simple binary judgments of real or fake seem less convincing and credible due to the absence of human-understandable explanations. Fortunately, Large Multimodal Models (LMMs) bring possibilities to materialize the judgment process while their performance remains undetermined. Therefore, we propose FakeBench, the first-of-a-kind benchmark towards transparent defake, consisting of fake images with human language descriptions on forgery signs. FakeBench gropes for two open questions of LMMs: (1) can LMMs distinguish fake images generated by AI, and (2) how do LMMs distinguish fake images? In specific, we construct the FakeClass dataset with 6k diverse-sourced fake and real images, each equipped with a Question&Answer pair concerning the authenticity of images, which are utilized to benchmark the detection ability. To examine the reasoning and interpretation abilities of LMMs, we present the FakeClue dataset, consisting of 15k pieces of descriptions on the telltale clues revealing the falsification of fake images. Besides, we construct the FakeQA to measure the LMMs' open-question answering ability on fine-grained authenticity-relevant aspects. Our experimental results discover that current LMMs possess moderate identification ability, preliminary interpretation and reasoning ability, and passable open-question answering ability for image defake. The FakeBench will be made publicly available soon.

Via

Access Paper or Ask Questions

CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting

Apr 15, 2024

Xiangrui Liu, Xinju Wu, Pingping Zhang, Shiqi Wang, Zhu Li, Sam Kwong

Abstract:Gaussian splatting, renowned for its exceptional rendering quality and efficiency, has emerged as a prominent technique in 3D scene representation. However, the substantial data volume of Gaussian splatting impedes its practical utility in real-world applications. Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. To ensure the compactness of Gaussian primitives, we devise a hybrid primitive structure that captures predictive relationships between each other. Then, we exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms. Moreover, we develop a rate-constrained optimization scheme to eliminate redundancies within such hybrid primitives, steering our CompGS towards an optimal trade-off between bitrate consumption and representation efficacy. Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality. Our code will be released on GitHub for further research.

* Submitted to a conference

Via

Access Paper or Ask Questions