Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kun Yuan

Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis

Jan 16, 2025

Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy

Abstract:Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach to generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released in https://github.com/TingxuanSix/Surg-FTDA.

Via

Access Paper or Ask Questions

Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

Jan 13, 2025

Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li

Figure 1 for Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

Figure 2 for Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

Figure 3 for Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

Figure 4 for Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

Abstract:Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.

Via

Access Paper or Ask Questions

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Nov 23, 2024

Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng(+9 more)

Figure 1 for OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Figure 2 for OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Figure 3 for OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Figure 4 for OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Abstract:Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address the gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, medications, as well as more advanced aspects like the causes of eye diseases, surgical objectives, and postoperative recovery recommendations, etc). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.

Via

Access Paper or Ask Questions

SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

Nov 21, 2024

Shuchen Zhu, Boao Kong, Songtao Lu, Xinmeng Huang, Kun Yuan

Figure 1 for SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

Figure 2 for SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

Figure 3 for SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

Figure 4 for SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

Abstract:This paper studies decentralized bilevel optimization, in which multiple agents collaborate to solve problems involving nested optimization structures with neighborhood communications. Most existing literature primarily utilizes gradient tracking to mitigate the influence of data heterogeneity, without exploring other well-known heterogeneity-correction techniques such as EXTRA or Exact Diffusion. Additionally, these studies often employ identical decentralized strategies for both upper- and lower-level problems, neglecting to leverage distinct mechanisms across different levels. To address these limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the flexibility to incorporate various heterogeneitycorrection strategies into the algorithm. Moreover, SPARKLE allows for different strategies to solve upper- and lower-level problems. We present a unified convergence analysis for SPARKLE, applicable to all its variants, with state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. Our results further reveal that EXTRA and Exact Diffusion are more suitable for decentralized bilevel optimization, and using mixed strategies in bilevel algorithms brings more benefits than relying solely on gradient tracking.

* 73 pages, the Thirty-Eighth Annual Conference on Neural Information Processing Systems (2024)

Via

Access Paper or Ask Questions

Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results

Oct 21, 2024

Tao Sun, Xinwang Liu, Kun Yuan

Figure 1 for Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results

Figure 2 for Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results

Abstract:This paper investigates Gradient Normalization Stochastic Gradient Descent without Clipping (NSGDC) and its variance reduction variant (NSGDC-VR) for nonconvex optimization under heavy-tailed noise. We present significant improvements in the theoretical results for both algorithms, including the removal of logarithmic factors from the convergence rates and the recovery of the convergence rate to match the deterministic case when the noise variance {\sigma} is zero. Additionally, we demonstrate that gradient normalization alone, assuming individual Lipschitz smoothness, is sufficient to ensure convergence of SGD under heavy-tailed noise, eliminating the need for gradient clipping. Furthermore, we introduce accelerated nonconvex algorithms that utilize second-order Lipschitz smoothness to achieve enhanced convergence rates in the presence of heavy-tailed noise. Our findings offer a deeper understanding of how gradient normalization and variance reduction techniques can be optimized for robust performance in challenging optimization scenarios.

Via

Access Paper or Ask Questions

Subspace Optimization for Large Language Models with Convergence Guarantees

Oct 15, 2024

Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, Kun Yuan

Figure 1 for Subspace Optimization for Large Language Models with Convergence Guarantees

Figure 2 for Subspace Optimization for Large Language Models with Convergence Guarantees

Figure 3 for Subspace Optimization for Large Language Models with Convergence Guarantees

Figure 4 for Subspace Optimization for Large Language Models with Convergence Guarantees

Abstract:Subspace optimization algorithms, with GaLore (Zhao et al., 2024) as a representative method, have gained popularity for pre-training or fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we unexpectedly discover that GaLore does not always converge to the optimal solution and substantiate this finding with an explicit counterexample. We then investigate the conditions under which GaLore can achieve convergence, demonstrating that it does so either in deterministic scenarios or when using a sufficiently large mini-batch size. More significantly, we introduce GoLore (Gradient random Low-rank projection), a novel variant of GaLore that provably converges in stochastic settings, even with standard batch sizes. Our convergence analysis can be readily extended to other sparse subspace optimization algorithms. Finally, we conduct numerical experiments to validate our theoretical results and empirically explore the proposed mechanisms. Codes are available at https://github.com/pkumelon/Golore.

Via

Access Paper or Ask Questions

Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Oct 10, 2024

Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, Zaiwen Wen

Figure 1 for Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Figure 2 for Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Figure 3 for Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Figure 4 for Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

Abstract:Parameter-efficient fine-tuning (PEFT) significantly reduces memory costs when adapting large language models (LLMs) for downstream applications. However, traditional first-order (FO) fine-tuning algorithms incur substantial memory overhead due to the need to store activation values for back-propagation during gradient computation, particularly in long-context fine-tuning tasks. Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values, thus eliminating the need for activation storage. Nevertheless, existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance. This paper proposes a low-rank ZO gradient estimator and introduces a novel low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs. We provide convergence guarantees for LOZO by framing it as a subspace optimization method. Additionally, its low-rank nature enables LOZO to integrate with momentum techniques while incurring negligible extra memory costs. Extensive experiments across various model sizes and downstream tasks demonstrate that LOZO and its momentum-based variant outperform existing ZO methods and closely approach the performance of FO algorithms.

Via

Access Paper or Ask Questions

S$^3$Attention: Improving Long Sequence Attention with Smoothed Skeleton Sketching

Aug 16, 2024

Xue Wang, Tian Zhou, Jianqing Zhu, Jialin Liu, Kun Yuan, Tao Yao, Wotao Yin, Rong Jin, HanQin Cai

Abstract:Attention based models have achieved many remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes the vanilla Attention based models hard to apply to long sequence tasks. Various improved Attention structures are proposed to reduce the computation cost by inducing low rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer sub-sequences used, the better information is preserved, but at the price of introducing more noise and computational costs. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S$^3$Attention, which significantly improves upon the previous attempts to negotiate this trade-off. S$^3$Attention has two mechanisms to effectively minimize the impact of noise while keeping the linear complexity to the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S$^3$Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting show that S$^3$Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.

Via

Access Paper or Ask Questions

QPT V2: Masked Image Modeling Advances Visual Scoring

Jul 23, 2024

Qizhi Xie, Kun Yuan, Yunpeng Qu, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu

Figure 1 for QPT V2: Masked Image Modeling Advances Visual Scoring

Figure 2 for QPT V2: Masked Image Modeling Advances Visual Scoring

Figure 3 for QPT V2: Masked Image Modeling Advances Visual Scoring

Figure 4 for QPT V2: Masked Image Modeling Advances Visual Scoring

Abstract:Quality assessment and aesthetics assessment aim to evaluate the perceived quality and aesthetics of visual content. Current learning-based methods suffer greatly from the scarcity of labeled data and usually perform sub-optimally in terms of generalization. Although masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification, detection etc.). In this work, we take on a novel perspective to investigate its capabilities in terms of quality- and aesthetics-awareness. To this end, we propose Quality- and aesthetics-aware pretraining (QPT V2), the first pretraining framework based on MIM that offers a unified solution to quality and aesthetics assessment. To perceive the high-level semantics and fine-grained details, pretraining data is curated. To comprehensively encompass quality- and aesthetics-related factors, degradation is introduced. To capture multi-scale quality and aesthetic information, model structure is modified. Extensive experimental results on 11 downstream benchmarks clearly show the superior performance of QPT V2 in comparison with current state-of-the-art approaches and other pretraining paradigms. Code and models will be released at \url{https://github.com/KeiChiTse/QPT-V2}.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions

On the Trade-off between Flatness and Optimization in Distributed Learning

Jun 28, 2024

Ying Cao, Zhaoxian Wu, Kun Yuan, Ali H. Sayed

Figure 1 for On the Trade-off between Flatness and Optimization in Distributed Learning

Figure 2 for On the Trade-off between Flatness and Optimization in Distributed Learning

Figure 3 for On the Trade-off between Flatness and Optimization in Distributed Learning

Figure 4 for On the Trade-off between Flatness and Optimization in Distributed Learning

Abstract:This paper proposes a theoretical framework to evaluate and compare the performance of gradient-descent algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tend to enhance the generalization ability of learning algorithms. This work discovers two interesting results. First, it shows that decentralized learning strategies are able to escape faster away from local minimizers and favor convergence toward flatter minima relative to the centralized solution in the large-batch training regime. Second, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because it strikes a more favorable balance between flatness and optimization performance.

Via

Access Paper or Ask Questions