Abstract:Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, thereby preserving privacy. However, FL often suffers from significant communication and computational overhead, limiting its scalability and sustainability. In this work, we introduce a Full Compression Pipeline (FCP) for FL in communication-constrained environments. FCP integrates three complementary deep compression techniques (pruning, quantization, and Huffman encoding) into a unified end-to-end framework. By compressing local models and communication payloads, FCP substantially reduces transmission costs and resource consumption while maintaining competitive accuracy. To quantify its impact, we develop an evaluation framework that captures both communication and computation overheads as a unified model cost, allowing a holistic assessment of efficiency trade-offs. The pipeline is evaluated in an independent and identically distributed (IID) and non-IID data setting. In one representative scenario, training a ResNet-12 model on the CIFAR-10 dataset with ten clients and a 2 Mbps bandwidth, the FCP achieves more than 11$\times$ reduction in model size, with only a 2% drop in accuracy compared to the uncompressed baseline. This results in an FL training that is more than 60% faster.
Abstract:Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims to minimize annotation costs by strategically selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges vision and language modalities through a teacher-student architecture. CCMA employs a pretrained VLM as a teacher to provide semantically grounded uncertainty estimates, conformally calibrated to guide sample selection for a vision-only student model. By integrating multimodal conformal scoring with diversity-aware selection strategies, CCMA achieves superior data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.
Abstract:Active learning (AL) is a machine learning (ML) approach that strategically selects the most informative samples for annotation during training, aiming to minimize annotation costs. This strategy not only reduces labeling expenses but also results in energy savings during neural network training, thereby enhancing both data and energy efficiency. In this paper, we implement and evaluate various state-of-the-art acquisition functions, analyzing their accuracy and computational costs, while discussing the advantages and disadvantages of each method. Our findings reveal that representativity-based acquisition functions effectively explore the dataset but do not prioritize boundary decisions, whereas uncertainty-based acquisition functions focus on refining boundary decisions already identified by the neural network. This trade-off is known as the exploration-exploitation dilemma. To address this dilemma, we introduce six aggregation structures: series, parallel, hybrid, adaptive feedback, random exploration, and annealing exploration. Our aggregated acquisition functions alleviate common AL pathologies such as batch mode inefficiency and the cold start problem. Additionally, we focus on balancing accuracy and energy consumption, contributing to the development of more sustainable, energy-aware artificial intelligence (AI). We evaluate our proposed structures on various models and datasets. Our results demonstrate the potential of these structures to reduce computational costs while maintaining or even improving accuracy. Innovative aggregation approaches, such as alternating between acquisition functions such as BALD and BADGE, have shown robust results. Sequentially running functions like $K$-Centers followed by BALD has achieved the same performance goals with up to 12\% fewer samples, while reducing the acquisition cost by almost half.
Abstract:Wireless powered integrated sensing and communication (ISAC) faces a fundamental tradeoff between energy supply, communication throughput, and sensing accuracy. This paper investigates a wireless powered ISAC system with target localization requirements, where users harvest energy from wireless power transfer (WPT) and then conduct ISAC transmissions in a time-division manner. In addition to energy supply, the WPT signal also contributes to target sensing, and the localization accuracy is characterized by Cramér-Rao bound (CRB) constraints. Under this setting, we formulate a max-min throughput maximization problem by jointly allocating the WPT duration, ISAC transmission time allocation, and transmit power. Due to the nonconvexity of the resulting problem, a suitable reformulation is developed by exploiting variable substitutions and the monotonicity of logarithmic functions, based on which an efficient successive convex approximation (SCA)-based iterative algorithm is proposed. Simulation results demonstrate convergence and significant performance gains over benchmark schemes, highlighting the importance of coordinated time-power optimization in balancing sensing accuracy and communication performance in wireless powered ISAC systems.
Abstract:Federated learning (FL) has been considered a promising privacy preserving distributed edge learning framework. Over-the-air computation (AirComp) technique leveraging analog transmission enables the aggregation of local updates directly over-the-air by exploiting the superposition properties of wireless multiple-access channel, thereby drastically reducing the communication bottleneck issues of FL compared with digital transmission schemes. This work points out that existing AirComp-FL overlooks a key practical constraint, the instantaneous peak-power constraints imposed by the non-linearity of radiofrequency power amplifiers. We present and analyze the effect of the classic method to deal with this issue, amplitude clipping combined with filtering. We investigate the effect of instantaneous peak-power constraints in AirComp-FL for both single-carrier and multi-carrier orthogonal frequency-division multiplexing (OFDM) systems. We highlight the specificity of AirComp-FL: the samples depend on the gradient value distribution, leading to a higher peak-to-average power ratio (PAPR) than that observed for uniformly distributed signals. Simulation results demonstrate that, in practical settings, the instantaneous transmit power regularly exceeds the power-amplifier limit; however, by applying clipping and filtering, the FL performance can be degraded. The degradation becomes pronounced especially in multi-carrier OFDM systems due to the in-band distortions caused by clipping and filtering.
Abstract:In this article, we consider an industrial internet of things (IIoT) network supporting multi-device dynamic ultra-reliable low-latency communication (URLLC) while the channel state information (CSI) is imperfect. A joint link adaptation (LA) and device scheduling (including the order) design is provided, aiming at maximizing the total transmission rate under strict block error rate (BLER) constraints. In particular, a Bayesian optimization (BO) driven Twin Delayed Deep Deterministic Policy Gradient (TD3) method is proposed, which determines the device served order sequence and the corresponding modulation and coding scheme (MCS) adaptively based on the imperfect CSI. Note that the imperfection of CSI, error sample imbalance in URLLC networks, as well as the parameter sensitivity nature of the TD3 algorithm likely diminish the algorithm's convergence speed and reliability. To address such an issue, we proposed a BO based training mechanism for the convergence speed improvement, which provides a more reliable learning direction and sample selection method to track the imbalance sample problem. Via extensive simulations, we show that the proposed algorithm achieves faster convergence and higher sum-rate performance compared to existing solutions.
Abstract:This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.




Abstract:The expansion of IoT devices and the demands of Deep Learning have highlighted significant challenges in Distributed Deep Learning (DDL) systems. Parallel Split Learning (PSL) has emerged as a promising derivative of Split Learning that is well suited for distributed learning on resource-constrained devices. However, PSL faces several obstacles, such as large effective batch sizes, non-IID data distributions, and the straggler effect. We view these issues as a sampling dilemma and propose to address them by orchestrating the mini-batch sampling process on the server side. We introduce the Uniform Global Sampling (UGS) method to decouple the effective batch size from the number of clients and reduce mini-batch deviation in non-IID settings. To address the straggler effect, we introduce the Latent Dirichlet Sampling (LDS) method, which generalizes UGS to balance the trade-off between batch deviation and training time. Our simulations reveal that our proposed methods enhance model accuracy by up to 34.1% in non-IID settings and reduce the training time in the presence of stragglers by up to 62%. In particular, LDS effectively mitigates the straggler effect without compromising model accuracy or adding significant computational overhead compared to UGS. Our results demonstrate the potential of our methods as a promising solution for DDL in real applications.




Abstract:Millimeter wave (mmWave) cell-free massive MIMO (CF mMIMO) is a promising solution for future wireless communications. However, its optimization is non-trivial due to the challenging channel characteristics. We show that mmWave CF mMIMO optimization is largely an assignment problem between access points (APs) and users due to the high path loss of mmWave channels, the limited output power of the amplifier, and the almost orthogonal channels between users given a large number of AP antennas. The combinatorial nature of the assignment problem, the requirement for scalability, and the distributed implementation of CF mMIMO make this problem difficult. In this work, we propose an unsupervised machine learning (ML) enabled solution. In particular, a graph neural network (GNN) customized for scalability and distributed implementation is introduced. Moreover, the customized GNN architecture is hierarchically permutation-equivariant (HPE), i.e., if the APs or users of an AP are permuted, the output assignment is automatically permuted in the same way. To address the combinatorial problem, we relax it to a continuous problem, and introduce an information entropy-inspired penalty term. The training objective is then formulated using the augmented Lagrangian method (ALM). The test results show that the realized sum-rate outperforms that of the generalized serial dictatorship (GSD) algorithm and is very close to the upper bound in a small network scenario, while the upper bound is impossible to obtain in a large network scenario.




Abstract:This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. Moving away from the resource-intensive trends prevalent in recent literature, our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks with significantly less resource utilization. Using unlabeled audio-visual data only, our baseline model achieves a word error rate (WER) of 47.4% and 54.7% on the LRS2 and LRS3 test benchmarks, respectively. After fine-tuning the model with limited labeled data, the word error rate reduces to 35% (LRS2) and 45.7% (LRS3). Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware, suggesting a path towards more accessible and resource-efficient VSR methodologies.