Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yueyuan Sui

Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

Apr 10, 2026

Yueyuan Sui, Payal Mohapatra, Doğaç Eldenk, Haodong Yang, Yiting Zhang, Haoyan Zhang, Qi Zhu, Stephen Xia

Abstract:Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10\times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and upto to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to $1.63\times$ without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.

Via

Access Paper or Ask Questions

ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation

Nov 22, 2024

Junzhang Liu, Tingkai Liu, Yueyuan Sui, Stephen Xia

Figure 1 for ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation

Figure 2 for ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation

Figure 3 for ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation

Figure 4 for ElastiFormer: Learned Redundancy Reduction in Transformer via Self-Distillation

Abstract:We introduce ElastiFormer, a post-training technique that adapts pretrained Transformer models into an elastic counterpart with variable inference time compute. ElastiFormer introduces small routing modules (as low as .00006% additional trainable parameters) to dynamically selects subsets of network parameters and input tokens to be processed by each layer of the pretrained network in an inputdependent manner. The routing modules are trained using self-distillation losses to minimize the differences between the output of the pretrained-model and their elastic counterparts. As ElastiFormer makes no assumption regarding the modality of the pretrained Transformer model, it can be readily applied to all modalities covering causal language modeling, image modeling as well as visual-language modeling tasks. We show that 20% to 50% compute saving could be achieved for different components of the transformer layer, which could be further reduced by adding very low rank LoRA weights (rank 1) trained via the same distillation objective. Finally, by comparing routing trained on different subsets of ImageNet, we show that ElastiFormer is robust against the training domain.

Via

Access Paper or Ask Questions

TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

May 02, 2024

Yueyuan Sui, Minghui Zhao, Junxi Xia, Xiaofan Jiang, Stephen Xia

Abstract:We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state of-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speed up of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB.

Via

Access Paper or Ask Questions

Graph Neural Network-based Multi-agent Reinforcement Learning for Resilient Distributed Coordination of Multi-Robot Systems

Mar 19, 2024

Anthony Goeckner, Yueyuan Sui, Nicolas Martinet, Xinliang Li, Qi Zhu

Figure 1 for Graph Neural Network-based Multi-agent Reinforcement Learning for Resilient Distributed Coordination of Multi-Robot Systems

Figure 2 for Graph Neural Network-based Multi-agent Reinforcement Learning for Resilient Distributed Coordination of Multi-Robot Systems

Figure 3 for Graph Neural Network-based Multi-agent Reinforcement Learning for Resilient Distributed Coordination of Multi-Robot Systems

Figure 4 for Graph Neural Network-based Multi-agent Reinforcement Learning for Resilient Distributed Coordination of Multi-Robot Systems

Abstract:Existing multi-agent coordination techniques are often fragile and vulnerable to anomalies such as agent attrition and communication disturbances, which are quite common in the real-world deployment of systems like field robotics. To better prepare these systems for the real world, we present a graph neural network (GNN)-based multi-agent reinforcement learning (MARL) method for resilient distributed coordination of a multi-robot system. Our method, Multi-Agent Graph Embedding-based Coordination (MAGEC), is trained using multi-agent proximal policy optimization (PPO) and enables distributed coordination around global objectives under agent attrition, partial observability, and limited or disturbed communications. We use a multi-robot patrolling scenario to demonstrate our MAGEC method in a ROS 2-based simulator and then compare its performance with prior coordination approaches. Results demonstrate that MAGEC outperforms existing methods in several experiments involving agent attrition and communication disturbance, and provides competitive results in scenarios without such anomalies.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

Aug 30, 2023

Payal Mohapatra, Akash Pandey, Yueyuan Sui, Qi Zhu

Figure 1 for Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

Figure 2 for Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

Figure 3 for Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

Figure 4 for Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks

Abstract:Human emotion understanding is pivotal in making conversational technology mainstream. We view speech emotion understanding as a perception task which is a more realistic setting. With varying contexts (languages, demographics, etc.) different share of people perceive the same speech segment as a non-unanimous emotion. As part of the ACM Multimedia 2023 Computational Paralinguistics ChallengE (ComParE) in the EMotion Share track, we leverage their rich dataset of multilingual speakers and multi-label regression target of 'emotion share' or perception of that emotion. We demonstrate that the training scheme of different foundation models dictates their effectiveness for tasks beyond speech recognition, especially for non-semantic speech tasks like emotion understanding. This is a very complex task due to multilingual speakers, variability in the target labels, and inherent imbalance in the regression dataset. Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides 4.6% improvement over the reported baseline.

* Accepted to appear at ACM Multimedia 2023 Multimedia Grand Challenges Track

Via

Access Paper or Ask Questions