Meng Wang

School of Electronic and Information Engineering Liaoning Technical University Xingcheng City, Liaoning Province, P. R. China

Learning Frequency-Aware Dynamic Transformers for All-In-One Image Restoration

Jun 30, 2024

Zenglin Shi, Tong Su, Pei Liu, Yunpeng Wu, Le Zhang, Meng Wang

Abstract:This work aims to tackle the all-in-one image restoration task, which seeks to handle multiple types of degradation with a single model. The primary challenge is to extract degradation representations from the input degraded images and use them to guide the model's adaptation to specific degradation types. Recognizing that various degradations affect image content differently across frequency bands, we propose a new all-in-one image restoration approach from a frequency perspective, leveraging advanced vision transformers. Our method consists of two main components: a frequency-aware Degradation prior learning transformer (Dformer) and a degradation-adaptive Restoration transformer (Rformer). The Dformer captures the essential characteristics of various degradations by decomposing inputs into different frequency components. By understanding how degradations affect these frequency components, the Dformer learns robust priors that effectively guide the restoration process. The Rformer then employs a degradation-adaptive self-attention module to selectively focus on the most affected frequency components, guided by the learned degradation representations. Extensive experimental results demonstrate that our approach outperforms the existing methods on four representative restoration tasks, including denoising, deraining, dehazing and deblurring. Additionally, our method offers benefits for handling spatially variant degradations and unseen degradation levels.
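The idea of decomposing an input into frequency components can be illustrated with a simple Fourier-domain split. This is only a minimal sketch under stated assumptions: the function name `split_frequency_bands`, the radial `cutoff`, and the use of a hard FFT mask are all hypothetical choices for illustration, and the abstract does not say how the Dformer actually performs its decomposition.

```python
import numpy as np

def split_frequency_bands(image, cutoff=0.1):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `cutoff` is a hypothetical radius in cycles per pixel; frequencies at or
    below it go to the low band, everything else to the high band.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies, shape (h, 1)
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies, shape (1, w)
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high

# Illustration: additive noise shows up mostly in the high band
# of an otherwise smooth image.
x = np.linspace(0.0, 1.0, 64, endpoint=False)
clean = np.sin(2 * np.pi * x)[:, None] * np.cos(2 * np.pi * x)[None, :]
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(clean.shape)

_, clean_high = split_frequency_bands(clean)
_, noisy_high = split_frequency_bands(noisy)
```

Because the two masks are complementary, the low and high parts sum back to the input, so a model could in principle inspect each band separately without losing information.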

* 8 pages

Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis

Jun 24, 2024

Hongkang Li, Meng Wang, Shuai Zhang, Sijia Liu, Pin-Yu Chen

Abstract:Efficient training and inference algorithms, such as low-rank adaption and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex optimization caused by the complicated architecture of Transformers, the theoretical study of why these methods can be applied to learn Transformers is mostly elusive. To the best of our knowledge, this paper shows the first theoretical analysis of the property of low-rank and sparsity of one-layer Transformers by characterizing the trained model after convergence using stochastic gradient descent. By focusing on a data model based on label-relevant and label-irrelevant patterns, we quantify that the gradient updates of trainable parameters are low-rank, which depends on the number of label-relevant patterns. We also analyze how model pruning affects the generalization while improving computation efficiency and conclude that proper magnitude-based pruning has a slight effect on the testing performance. We implement numerical experiments to support our findings.
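For readers unfamiliar with the two techniques named in the abstract, the sketch below shows generic magnitude-based pruning and a rank-truncated approximation of a weight update. It is a textbook illustration, not the paper's analysis; the function names and the `sparsity` and `rank` parameters are assumptions introduced here.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of smallest-magnitude entries.

    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be removed.
    """
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value acts as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def low_rank_approx(update, rank):
    """Best rank-`rank` approximation of a matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(update, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

weights = np.array([[0.1, -2.0], [3.0, -0.05]])
pruned = magnitude_prune(weights, sparsity=0.5)   # keeps -2.0 and 3.0

# A rank-1 update is recovered exactly by a rank-1 approximation.
update = np.outer([1.0, 2.0, 3.0], [0.5, -1.0])
approx = low_rank_approx(update, rank=1)
```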

* IEEE SAM Workshop 2024

Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images

Jun 18, 2024

Yuanyuan Peng, Aidi Lin, Meng Wang, Tian Lin, Ke Zou, Yinglin Cheng, Tingkun Shi, Xulong Liao, Lixia Feng, Zhen Liang(+3 more)

Abstract:Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RETFound and UIOS, and improved further to 98.44% with a thresholding strategy. In the external test sets obtained from other OCT devices, FMUE achieved an accuracy of 88.75% and 92.73% before and after thresholding, respectively. Our model is superior to two ophthalmologists with a higher F1 score (95.17% vs. 61.93% and 71.72%). In addition, our model correctly predicts high uncertainty scores for samples with ambiguous features, samples of non-target-category diseases, and low-quality samples, prompting manual checks and preventing misdiagnosis. FMUE provides a trustworthy method for automatic retinal anomaly detection in the real-world clinical open-set environment.
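The thresholding strategy mentioned in the abstract can be pictured as a triage rule: accept confident predictions and route uncertain ones to a clinician. The sketch below is an assumption-laden illustration, not FMUE's method. It uses normalized predictive entropy as a stand-in uncertainty score, and the function name and the `threshold` value are hypothetical.

```python
import numpy as np

def triage_by_uncertainty(probs, threshold=0.5):
    """Return predicted labels, uncertainty scores, and a manual-review flag.

    Uncertainty here is predictive entropy divided by log(num_classes), so it
    lies in [0, 1]. This is an assumed stand-in, not the FMUE score.
    """
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12  # avoids log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    uncertainty = entropy / np.log(probs.shape[1])
    labels = probs.argmax(axis=1)
    needs_review = uncertainty > threshold
    return labels, uncertainty, needs_review

probs = [[0.98, 0.01, 0.01],   # confident prediction
         [0.34, 0.33, 0.33]]   # nearly uniform, hence ambiguous
labels, uncertainty, needs_review = triage_by_uncertainty(probs)
```

Metrics reported "after thresholding" would then be computed only on the samples whose flag is false, while flagged samples are checked manually.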

* Code is available at https://github.com/yuanyuanpeng0129/FMUE

Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

Jun 13, 2024

Meng Wang, Tian Lin, Kai Yu, Aidi Lin, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen(+26 more)

Abstract:Current retinal artificial intelligence models have been trained on data covering a limited range of disease categories and limited knowledge. In this paper, we present a retinal vision-language foundation model (RetiZero) with knowledge of over 400 fundus diseases. Specifically, we collected 341,896 fundus images paired with text descriptions from 29 publicly available datasets, 180 ophthalmic books, and online resources, encompassing over 400 fundus diseases across multiple countries and ethnicities. RetiZero achieved outstanding performance across various downstream tasks, including zero-shot retinal disease recognition, image-to-image retrieval, internal-domain and cross-domain retinal disease classification, and few-shot fine-tuning. In particular, in the zero-shot scenario, RetiZero achieved Top5 scores of 0.8430 and 0.7561 on 15 and 52 fundus diseases, respectively. In the image-retrieval task, RetiZero achieved Top5 scores of 0.9500 and 0.8860 on 15 and 52 retinal diseases, respectively. Furthermore, clinical evaluations by ophthalmology experts from different countries demonstrate that RetiZero can achieve performance comparable to experienced ophthalmologists using zero-shot and image retrieval methods without requiring model retraining. These capabilities in retinal disease identification strengthen the case for using our RetiZero foundation model in clinical practice.
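A zero-shot Top5 score of the kind reported above is commonly computed by ranking class text embeddings by similarity to each image embedding and checking whether the true class is among the top k. The sketch below assumes that reading (top-k accuracy over cosine similarity). The function name and the toy embeddings are hypothetical, and RetiZero's actual evaluation protocol may differ.

```python
import numpy as np

def top_k_accuracy(image_emb, text_emb, labels, k=5):
    """Fraction of images whose true class is among the k most similar texts.

    image_emb: (num_images, dim) image embeddings
    text_emb:  (num_classes, dim) one text embedding per class
    labels:    (num_images,) true class indices
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T                 # cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]      # best k classes per image
    hits = (top_k == np.asarray(labels)[:, None]).any(axis=1)
    return hits.mean()

# Toy example with 4 classes; the true class is ranked second for both images.
text_emb = np.eye(4)
image_emb = np.array([[1.0, 0.5, 0.0, 0.0],
                      [0.0, 0.0, 0.2, 1.0]])
labels = [1, 2]
```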

Graph Bottlenecked Social Recommendation

Jun 12, 2024

Yonghui Yang, Le Wu, Zihan Wang, Zhuangzhuang He, Richang Hong, Meng Wang

Abstract:With the emergence of social networks, social recommendation has become an essential technique for personalized services. Recently, graph-based social recommendations have shown promising results by capturing high-order social influence. Most empirical studies of graph-based social recommendation take the observed social networks directly into the formulation and produce user preferences based on social homogeneity. Despite their effectiveness, we argue that social networks in the real world are inevitably noisy (containing redundant social relations), which may obstruct precise user preference characterization. Nevertheless, identifying and removing redundant social relations is challenging due to a lack of labels. In this paper, we focus on learning the denoised social structure to facilitate recommendation tasks from an information bottleneck perspective. Specifically, we propose a novel Graph Bottlenecked Social Recommendation (GBSR) framework to tackle the social noise issue. GBSR is a model-agnostic social denoising framework that aims to maximize the mutual information between the denoised social graph and recommendation labels, while minimizing it between the denoised social graph and the original one. This enables GBSR to learn the minimal yet sufficient social structure, effectively reducing redundant social relations and enhancing social recommendations. Technically, GBSR consists of two elaborate components: preference-guided social graph refinement and HSIC-based bottleneck learning. Extensive experimental results demonstrate the superiority of the proposed GBSR, including high performance and good generality when combined with various backbones. Our code is available at: https://github.com/yimutianyang/KDD24-GBSR.
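HSIC (the Hilbert-Schmidt Independence Criterion) is a kernel-based dependence measure that is often used as a tractable surrogate for mutual information. The sketch below shows the standard biased empirical estimator with Gaussian kernels. The function names and the bandwidth `sigma` are assumptions for illustration, and this is not taken from the GBSR code linked above.

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of x, shape (n, n)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T   # squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    k = rbf_kernel(x, sigma)
    l = rbf_kernel(y, sigma)
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 1))
z = rng.standard_normal((200, 1))   # drawn independently of x

dependent = hsic(x, x)     # a variable is maximally dependent on itself
independent = hsic(x, z)   # expected to be close to zero
```

In a bottleneck objective of the kind the abstract describes, a term like this would be minimized between the denoised and original representations, while the recommendation loss keeps the denoised graph informative about the labels.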

* Accepted by KDD 2024

Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement

Jun 07, 2024

Wei Qian, Qi Li, Kun Li, Xinke Wang, Xiao Sun, Meng Wang, Dan Guo

Abstract:This paper briefly introduces the solutions developed by our team, HFUT-VUT, for Track 1 of self-supervised heart rate measurement in the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge hosted at IJCAI 2024. The goal is to develop a self-supervised learning algorithm for heart rate (HR) estimation using unlabeled facial videos. To tackle this task, we present two self-supervised HR estimation solutions that integrate spatial-temporal modeling and contrastive learning, respectively. Specifically, we first propose a non-end-to-end self-supervised HR measurement framework based on spatial-temporal modeling, which can effectively capture subtle remote photoplethysmography (rPPG) clues and leverage the inherent bandwidth and periodicity characteristics of rPPG to constrain the model. Meanwhile, we employ an excellent end-to-end solution based on contrastive learning, aiming to generalize across different scenarios from complementary perspectives. Finally, we combine the strengths of the above solutions through an ensemble strategy to generate the final predictions, leading to a more accurate HR estimation. As a result, our solutions achieved a remarkable RMSE score of 8.85277 on the test dataset, securing 2nd place in Track 1 of the challenge.
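The bandwidth and periodicity properties mentioned in the abstract can be illustrated with a generic spectral estimator: a pulse signal is roughly periodic, and its dominant frequency should fall within a plausible heart-rate band. The sketch below is not the team's framework. The function name and the 0.7 to 3.0 Hz band (42 to 180 beats per minute) are assumptions chosen for illustration.

```python
import numpy as np

def estimate_heart_rate(signal, fs, band=(0.7, 3.0)):
    """Estimate heart rate in beats per minute from a 1-D pulse signal.

    signal: samples of an rPPG-like waveform
    fs:     sampling rate in Hz (e.g. the video frame rate)
    band:   assumed plausible heart-rate band in Hz
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                 # remove the DC offset
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak = freqs[in_band][np.argmax(power[in_band])]
    return 60.0 * peak

# Synthetic 10-second clip at 30 frames per second: a 1.2 Hz pulse
# plus a slow 0.2 Hz drift that lies outside the assumed band.
fs = 30.0
t = np.arange(300) / fs
signal = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 0.2 * t)
hr = estimate_heart_rate(signal, fs)
```

Restricting the peak search to the band is what keeps the slow drift from being mistaken for the pulse.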

Instructing Prompt-to-Prompt Generation for Zero-Shot Learning

Jun 05, 2024

Man Liu, Huihui Bai, Feng Li, Chunjie Zhang, Yunchao Wei, Meng Wang, Tat-Seng Chua, Yao Zhao

Abstract:Zero-shot learning (ZSL) aims to explore the semantic-visual interactions to discover comprehensive knowledge transferred from seen categories to classify unseen categories. Recently, prompt engineering has emerged in ZSL, demonstrating impressive potential as it enables the zero-shot transfer of diverse visual concepts to downstream tasks. However, these methods are still not well generalized to broad unseen domains. A key reason is that the fixed adaption of learnable prompts on seen domains makes it tend to over-emphasize the primary visual features observed during training. In this work, we propose a Prompt-to-Prompt generation methodology (P2P), which addresses this issue by further embracing the instruction-following technique to distill instructive visual prompts for comprehensive transferable knowledge discovery. The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts and then inversely rectify the visual representations with the guidance of the learned instruction prompts. This enforces the compensation for missing visual details to primary contexts and further eliminates the cross-modal disparity, endowing unseen domain generalization. Through extensive experimental results, we demonstrate the efficacy of P2P in achieving superior performance over state-of-the-art methods.

Path-Specific Causal Reasoning for Fairness-aware Cognitive Diagnosis

Jun 05, 2024

Dacao Zhang, Kun Zhang, Le Wu, Mi Tian, Richang Hong, Meng Wang

Abstract:Cognitive Diagnosis (CD), which leverages student and exercise data to predict students' proficiency levels on different knowledge concepts, is one of the fundamental components of Intelligent Education. Due to the scarcity of student-exercise interaction data, most existing methods focus on making the best use of available data, such as exercise content and student information (e.g., educational context). Despite the great progress, the abuse of students' sensitive information has not received enough attention. Given the important position of CD in Intelligent Education, employing sensitive information when making diagnosis predictions will cause serious social issues. Moreover, data-driven neural networks are easily misled by the shortcut between input data and output prediction, exacerbating this problem. Therefore, it is crucial to eliminate the negative impact of sensitive information in CD models. In response, we argue that sensitive attributes of students can also provide useful information, and only the shortcuts directly related to the sensitive information should be eliminated from the diagnosis process. Thus, we employ causal reasoning and design a novel Path-Specific Causal Reasoning Framework (PSCRF) to achieve this goal. Specifically, we first leverage an encoder to extract features and generate embeddings for general information and sensitive information of students. Then, we design a novel attribute-oriented predictor to decouple the sensitive attributes, in which fairness-related sensitive features are eliminated and other useful information is retained. Finally, we design a multi-factor constraint to ensure fairness and diagnosis performance simultaneously. Extensive experiments over real-world datasets (e.g., the PISA dataset) demonstrate the effectiveness of our proposed PSCRF.

* Accepted by KDD 2024

What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding

Jun 04, 2024

Hongkang Li, Meng Wang, Tengfei Ma, Sijia Liu, Zaixi Zhang, Pin-Yu Chen

Abstract:Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper provides the quantitative characterization of the sample complexity and number of iterations for convergence dependent on the fraction of discriminative nodes, the dominant patterns, and the initial model errors. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by making the attention map sparse and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.

* ICML 2024

Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling

Jun 03, 2024

Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang

Abstract:The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos. It is often performed in a weakly-supervised manner, where only video event labels are provided, i.e., the modalities and the timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known video event labels for each modality. However, the labels are still confined to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that can explicitly assign labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit the large-scale pretrained models, namely CLIP and CLAP, to estimate the events in each video segment and generate segment-level visual and audio pseudo labels, respectively. We then propose a new loss function to exploit these pseudo labels by taking into account their category-richness and segment-richness. A label denoising strategy is also adopted to further improve the visual pseudo labels by flipping them whenever abnormally large forward losses occur. We perform extensive experiments on the LLP dataset and demonstrate the effectiveness of each proposed design, achieving state-of-the-art video parsing performance on all types of event parsing, i.e., audio event, visual event, and audio-visual event. We also examine the proposed pseudo label generation strategy on a relevant weakly-supervised audio-visual event localization task, and the experimental results again verify the benefits and generalization of our method.
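One simple way to picture segment-level pseudo labeling is to threshold per-segment event scores from a pretrained model and keep only events that the weak video-level label allows. The sketch below is a hypothetical illustration: the function name, the `threshold`, and the video-label masking step are assumptions, and the scores are taken as given rather than computed with CLIP or CLAP.

```python
import numpy as np

def segment_pseudo_labels(segment_scores, video_labels, threshold=0.5):
    """Turn per-segment event scores into binary segment-level pseudo labels.

    segment_scores: (num_segments, num_classes) scores in [0, 1], assumed to
                    come from a pretrained model
    video_labels:   (num_classes,) weak video-level labels (0 or 1)
    An event is kept for a segment only if its score exceeds `threshold`
    and the event appears in the video-level label.
    """
    scores = np.asarray(segment_scores, dtype=float)
    allowed = np.asarray(video_labels, dtype=bool)
    return ((scores > threshold) & allowed[None, :]).astype(int)

scores = [[0.9, 0.8, 0.1],    # segment 0
          [0.2, 0.7, 0.6]]    # segment 1
video_labels = [1, 0, 1]      # class 1 is absent from the video
pseudo = segment_pseudo_labels(scores, video_labels)
```

Here class 1 scores highly in both segments but is discarded because the video-level label excludes it, which shows how weak labels can filter noisy segment estimates.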

* IJCV 2024 Accepted. arXiv admin note: substantial text overlap with arXiv:2303.02344