Weiming Zhuang

UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models

Aug 27, 2025

Yimu Wang, Weiming Zhuang, Chen Chen, Jiabo Huang, Jingtao Li, Lingjuan Lyu

Abstract: In the era of deep learning, the increasing number of pre-trained models available online presents a wealth of knowledge. These models, developed with diverse architectures and trained on varied datasets for different tasks, provide unique interpretations of the real world. Their collective consensus is likely universal and generalizable to unseen data. However, effectively harnessing this collective knowledge poses a fundamental challenge due to the heterogeneity of pre-trained models. Existing knowledge integration solutions typically rely on strong assumptions about training data distributions and network architectures, limiting them to learning only from specific types of models and resulting in data and/or inductive biases. In this work, we introduce a novel framework, namely UNIFORM, for knowledge transfer from a diverse set of off-the-shelf models into one student model without such constraints. Specifically, we propose a dedicated voting mechanism to capture the consensus of knowledge both at the logit level (incorporating teacher models that are capable of predicting target classes of interest) and at the feature level (utilizing visual representations learned on arbitrary label spaces). Extensive experiments demonstrate that UNIFORM effectively enhances unsupervised object recognition performance compared to strong knowledge transfer baselines. Notably, it exhibits remarkable scalability by benefiting from over one hundred teachers, while existing methods saturate at a much smaller scale.
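
The abstract does not spell out the voting mechanism, so the following is only a generic illustration of logit-level consensus, not UNIFORM's actual procedure: each teacher's logits are reduced to a predicted class, and a plain majority vote picks the label. The function names `argmax` and `vote_on_logits` and the agreement ratio are assumptions made for this sketch.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value in a list of logits."""
    return max(range(len(values)), key=lambda i: values[i])


def vote_on_logits(teacher_logits):
    """Majority vote over teachers for a single sample.

    teacher_logits: one list of logits per teacher, all over the same classes.
    Returns (winning_class_index, fraction_of_teachers_that_agree).
    Hypothetical helper: a plain majority vote, not the paper's mechanism.
    """
    predictions = [argmax(logits) for logits in teacher_logits]
    winner, votes = Counter(predictions).most_common(1)[0]
    return winner, votes / len(predictions)


if __name__ == "__main__":
    logits = [
        [0.1, 2.0, 0.3],  # teacher 1 predicts class 1
        [1.5, 0.2, 0.1],  # teacher 2 predicts class 0
        [0.0, 0.9, 0.4],  # teacher 3 predicts class 1
    ]
    print(vote_on_logits(logits))
```

The agreement fraction could serve as a confidence weight when the voted label is used as a pseudo-label for the student, though that use is also an assumption here.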

Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

Aug 11, 2025

Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu

Abstract: Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs' fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over LLaVA-1.5.

* 8 pages for the main paper

Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

Nov 15, 2024

Huancheng Chen, Jingtao Li, Weiming Zhuang, Haris Vikalo, Lingjuan Lyu

Abstract: Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting. To address these challenges, several studies developed layout-to-image (L2I) approaches that incorporate layout instructions into text-to-image models. However, existing L2I methods typically require either fine-tuning pretrained parameters or training additional control modules for the diffusion models. In this work, we propose a novel zero-shot L2I approach, BACON (Boundary Attention Constrained generation), which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions, and then compute loss functions to optimize latent features during the diffusion reverse process. To enhance spatial controllability and mitigate semantic failures in complex layout instructions, we leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features. Comprehensive experimental results on both L2I and non-L2I pretrained diffusion models demonstrate that our method outperforms existing zero-shot L2I techniques both quantitatively and qualitatively in terms of image composition on the DrawBench and HRS benchmarks.

Dual Low-Rank Adaptation for Continual Learning with Pre-Trained Models

Nov 01, 2024

Huancheng Chen, Jingtao Li, Nidham Gazagnadou, Weiming Zhuang, Chen Chen, Lingjuan Lyu

Abstract: In the era of foundation models, we revisit continual learning (CL), which aims to enable vision transformers (ViTs) to learn new tasks over time. However, as the scale of these models increases, catastrophic forgetting remains a persistent challenge, particularly in the presence of significant domain shifts across tasks. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning (PEFT), which focuses on fine-tuning only a small set of trainable parameters to adapt to downstream tasks, such as low-rank adaptation (LoRA). While LoRA achieves faster convergence and requires fewer trainable parameters, it has seldom been explored in the context of continual learning. To address this gap, we propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA), which introduces both an orthogonal LoRA adapter and a residual LoRA adapter parallel to pre-trained weights in each layer. These components are orchestrated by a dynamic memory mechanism to strike a balance between stability and plasticity. The orthogonal LoRA adapter's parameters are updated in an orthogonal subspace of previous tasks to mitigate catastrophic forgetting, while the residual LoRA adapter's parameters are updated in the residual subspace spanned by task-specific bases without interaction across tasks, offering complementary capabilities for fine-tuning new tasks. On ViT-based models, we demonstrate that DualLoRA offers significant advantages in accuracy, inference speed, and memory efficiency over existing CL methods across multiple benchmarks.
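
For readers unfamiliar with the building block, here is a minimal sketch of a plain LoRA adapter placed parallel to a frozen weight matrix, computing y = W x + scale * B (A x). It shows standard LoRA only; DualLoRA's orthogonal and residual adapters and its dynamic memory mechanism are not reproduced. The helper names and the tiny matrices are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Forward pass of a frozen layer with a parallel low-rank adapter.

    W: frozen pre-trained weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in), with r much smaller than d_in
    B: trainable up-projection, shape (d_out, r)
    Returns W x + scale * B (A x).
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, low_rank)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weights
    A = [[1.0, 1.0]]              # rank r = 1
    B = [[0.5], [0.0]]
    print(lora_forward(W, A, B, [2.0, 3.0]))
```

Initializing B to zeros makes the adapted layer start out identical to the pre-trained one, which is the usual LoRA convention.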

A Simple Background Augmentation Method for Object Detection with Diffusion Model

Aug 01, 2024

Yuhang Li, Xin Dong, Chen Chen, Weiming Zhuang, Lingjuan Lyu

Abstract: In computer vision, it is well known that a lack of data diversity will impair model performance. In this study, we address the challenge of enhancing dataset diversity in order to benefit various downstream tasks such as object detection and instance segmentation. We propose a simple yet effective data augmentation approach by leveraging advancements in generative models, specifically text-to-image synthesis technologies like Stable Diffusion. Our method focuses on generating variations of labeled real images, utilizing generative object and background augmentation via inpainting to augment existing training data without the need for additional annotations. We find that background augmentation, in particular, significantly improves the models' robustness and generalization capabilities. We also investigate how to adjust the prompt and mask to ensure the generated content complies with the existing annotations. The efficacy of our augmentation techniques is validated through comprehensive evaluations on the COCO dataset and several other key object detection benchmarks, demonstrating notable enhancements in model performance across diverse scenarios. This approach offers a promising solution to the challenges of dataset enhancement, contributing to the development of more accurate and robust computer vision models.

COALA: A Practical and Vision-Centric Federated Learning Platform

Jul 23, 2024

Weiming Zhuang, Jian Xu, Chen Chen, Jingtao Li, Lingjuan Lyu

Abstract: We present COALA, a vision-centric Federated Learning (FL) platform, and a suite of benchmarks for practical FL scenarios, which we categorize into three levels: task, data, and model. At the task level, COALA extends support from simple classification to 15 computer vision tasks, including object detection, segmentation, pose estimation, and more. It also facilitates federated multiple-task learning, allowing clients to tackle multiple tasks simultaneously. At the data level, COALA goes beyond supervised FL to benchmark both semi-supervised FL and unsupervised FL. It also benchmarks feature distribution shifts in addition to the commonly considered label distribution shifts. Beyond static data, it supports federated continual learning for continuously changing data in real-world scenarios. At the model level, COALA benchmarks FL with split models and with different models in different clients. The COALA platform offers three degrees of customization for these practical FL scenarios: configuration customization, component customization, and workflow customization. We conduct systematic benchmarking experiments for the practical FL scenarios and highlight potential opportunities for further advancements in FL. Code is open-sourced at https://github.com/SonyResearch/COALA.

* ICML'24

Towards Fundamentally Scalable Model Selection: Asymptotically Fast Update and Selection

Jun 11, 2024

Wenxiao Wang, Weiming Zhuang, Lingjuan Lyu

Abstract: The advancement of deep learning technologies is bringing new models every day, motivating the study of scalable model selection. An ideal model selection scheme should minimally support two operations efficiently over a large pool of candidate models: update, which involves either adding a new candidate model or removing an existing candidate model, and selection, which involves locating highly performing models for a given task. However, previous solutions to model selection require high computational complexity for at least one of these two operations. In this work, we target fundamentally (more) scalable model selection that supports asymptotically fast update and asymptotically fast selection at the same time. Firstly, we define isolated model embedding, a family of model selection schemes supporting asymptotically fast update and selection: with respect to the number of candidate models $m$, the update complexity is $O(1)$ and the selection consists of a single sweep over $m$ vectors in addition to $O(1)$ model operations. Isolated model embedding also implies several desirable properties for applications. Secondly, we present Standardized Embedder, an empirical realization of isolated model embedding. We assess its effectiveness by using it to select representations from a pool of 100 pre-trained vision models for classification tasks and measuring the performance gaps between the selected models and the best candidates with a linear probing protocol. Experiments suggest our realization is effective in selecting models with competitive performances and highlight isolated model embedding as a promising direction towards model selection that is fundamentally (more) scalable.

* 19 pages, 8 figures
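
The update and selection operations described in the abstract above can be pictured with a small sketch. It assumes each model and each task has already been mapped to a fixed-length vector by some embedding procedure (not shown, and not Standardized Embedder itself); the `ModelPool` class and the use of cosine similarity as the score are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ModelPool:
    """Pool of candidate models, each stored as an independently computed vector."""

    def __init__(self):
        self._embeddings = {}

    def add(self, name, embedding):
        # Update touches only the new model's entry: O(1) in the pool size m.
        self._embeddings[name] = list(embedding)

    def remove(self, name):
        # Removal is likewise independent of every other model.
        self._embeddings.pop(name, None)

    def select(self, task_embedding, top_k=1):
        # Selection is a single sweep over the m stored vectors.
        scored = [(cosine(vec, task_embedding), name)
                  for name, vec in self._embeddings.items()]
        scored.sort(reverse=True)
        return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    pool = ModelPool()
    pool.add("model_a", [1.0, 0.0])
    pool.add("model_b", [0.0, 1.0])
    pool.add("model_c", [0.7, 0.7])
    print(pool.select([1.0, 0.1], top_k=2))
```

Because no stored vector depends on any other, adding or removing a candidate never requires recomputing the rest of the pool, which is the property the abstract calls isolation.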

FedMef: Towards Memory-efficient Federated Dynamic Pruning

Mar 21, 2024

Hong Huang, Weiming Zhuang, Chen Chen, Lingjuan Lyu

Abstract: Federated learning (FL) promotes decentralized training while prioritizing data confidentiality. However, its application on resource-constrained devices is challenging due to the high demand for computation and memory resources to train deep learning models. Neural network pruning techniques, such as dynamic pruning, could enhance model efficiency, but directly adopting them in FL still poses substantial challenges, including post-pruning performance degradation and high activation memory usage. To address these challenges, we propose FedMef, a novel and memory-efficient federated dynamic pruning framework. FedMef comprises two key components. First, we introduce budget-aware extrusion, which maintains pruning efficiency while preserving post-pruning performance by salvaging crucial information from parameters marked for pruning within a given budget. Second, we propose scaled activation pruning to effectively reduce activation memory footprints, which is particularly beneficial for deploying FL to memory-limited devices. Extensive experiments demonstrate the effectiveness of our proposed FedMef. In particular, it achieves a significant reduction of 28.5% in memory footprint compared to state-of-the-art methods while obtaining superior accuracy.

* Accepted by CVPR2024

MAS: Towards Resource-Efficient Federated Multiple-Task Learning

Jul 21, 2023

Weiming Zhuang, Yonggang Wen, Lingjuan Lyu, Shuai Zhang

Abstract: Federated learning (FL) is an emerging distributed machine learning method that empowers in-situ model training on decentralized edge devices. However, multiple simultaneous FL tasks could overload resource-constrained devices. In this work, we propose the first FL system to effectively coordinate and train multiple simultaneous FL tasks. We first formalize the problem of training simultaneous FL tasks. Then, we present our new approach, MAS (Merge and Split), to optimize the performance of training multiple simultaneous FL tasks. MAS starts by merging FL tasks into an all-in-one FL task with a multi-task architecture. After training for a few rounds, MAS splits the all-in-one FL task into two or more FL tasks by using the affinities among tasks measured during the all-in-one training. It then continues training each split of FL tasks based on model parameters from the all-in-one training. Extensive experiments demonstrate that MAS outperforms other methods while reducing training time by 2x and reducing energy consumption by 40%. We hope this work will inspire the community to further study and optimize training simultaneous FL tasks.

* ICCV'23. arXiv admin note: substantial text overlap with arXiv:2207.04202
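
The split step in the abstract above relies on task affinities measured during all-in-one training. As a rough sketch of how such a split could be chosen, the code below enumerates every two-way partition of the tasks and keeps the one with the highest total within-group affinity. The scoring rule, the function names, and the example affinity values are all assumptions; the abstract does not state how MAS measures affinity or picks splits.

```python
from itertools import combinations


def split_score(groups, affinity):
    """Sum of pairwise affinities inside each group (hypothetical objective)."""
    total = 0.0
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            total += affinity[(a, b)]
    return total


def best_two_way_split(tasks, affinity):
    """Exhaustively search two-way splits of `tasks`.

    affinity: dict keyed by alphabetically sorted task-name pairs.
    Returns (group_a, group_b) as sets; group_a always holds the first task.
    """
    tasks = sorted(tasks)
    first, rest = tasks[0], tasks[1:]
    best = None
    # Pinning `first` to group_a avoids counting mirrored splits twice;
    # stopping before len(rest) keeps group_b non-empty.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            group_a = {first, *chosen}
            group_b = set(tasks) - group_a
            score = split_score((group_a, group_b), affinity)
            if best is None or score > best[0]:
                best = (score, group_a, group_b)
    return best[1], best[2]


if __name__ == "__main__":
    affinity = {
        ("depth", "edge"): 0.1, ("depth", "normal"): 0.9, ("depth", "seg"): 0.1,
        ("edge", "normal"): 0.1, ("edge", "seg"): 0.8, ("normal", "seg"): 0.1,
    }
    print(best_two_way_split(["depth", "normal", "seg", "edge"], affinity))
```

Exhaustive search is only practical for a handful of tasks, and summing positive affinities tends to favor larger groups, so a real system would likely normalize the score or use a different objective.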

Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators

Jul 16, 2023

Sikai Bai, Shuaicheng Li, Weiming Zhuang, Jie Zhang, Song Guo, Kunlin Yang, Jun Hou, Shuai Zhang, Junyu Gao, Shuai Yi

Abstract: Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution is different not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11% on CIFAR-10 and CINIC-10 datasets.
