Bingna Xu

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

Dec 16, 2025

HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu(+19 more)

Abstract: Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
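The image-tiling idea mentioned in the abstract can be illustrated with a small sketch. This is not the HyperVL implementation: the function names `tile_boxes` and `peak_pixels` and the tile size of 448 pixels are assumptions chosen for illustration. The sketch only shows why tiling bounds the largest input a vision encoder sees at once, regardless of the full image resolution.

```python
def tile_boxes(width, height, tile=448):
    """Cover a width x height image with boxes of at most tile x tile pixels.

    Returns (left, top, right, bottom) tuples in row-major order.
    The default tile size is an illustrative assumption, not a value from the paper.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def peak_pixels(boxes):
    """Largest number of pixels any single tile contains."""
    return max((right - left) * (bottom - top) for left, top, right, bottom in boxes)


boxes = tile_boxes(1000, 700)
print(len(boxes), peak_pixels(boxes))
```

If tiles are encoded one at a time, the encoder's working set scales with the tile size rather than with the full image, which is one plausible reading of how tiling caps peak memory.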

* Technical report of Xiaomi HyperAI Team

Downscaled Representation Matters: Improving Image Rescaling with Collaborative Downscaled Images

Nov 19, 2022

Bingna Xu, Yong Guo, Luoqian Jiang, Mianjie Yu, Jian Chen

Abstract: Deep networks have achieved great success in the image rescaling (IR) task, which seeks to learn the optimal downscaled representations, i.e., low-resolution (LR) images, to reconstruct the original high-resolution (HR) images. Compared with super-resolution methods that consider a fixed downscaling scheme, e.g., bicubic, IR often achieves significantly better reconstruction performance thanks to the learned downscaled representations. This highlights the importance of a good downscaled representation in image reconstruction tasks. Existing IR methods mainly learn the downscaled representation by jointly optimizing the downscaling and upscaling models. Unlike them, we seek to improve the downscaled representation in a different and more direct way: optimizing the downscaled image itself instead of the down-/upscaling models. Specifically, we propose a collaborative downscaling scheme that directly generates the collaborative LR examples by descending the gradient w.r.t. the reconstruction loss on them to benefit the IR process. Furthermore, since LR images are downscaled from the corresponding HR images, one can also improve the downscaled representation given a better representation in the HR domain. Inspired by this, we propose a Hierarchical Collaborative Downscaling (HCD) method that performs gradient descent in both HR and LR domains to improve the downscaled representations. Extensive experiments show that our HCD significantly improves the reconstruction performance both quantitatively and qualitatively. Moreover, we also highlight the flexibility of our HCD since it can generalize well across diverse IR models.
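The core idea of optimizing the downscaled image itself, rather than the models, can be sketched on a toy problem. The example below is an assumption-laden illustration, not the authors' HCD method: it uses a 1-D signal, a fixed 2x linear-interpolation upscaler in place of a learned upscaling network, and only the LR-domain step (the HR-domain step of the hierarchical scheme is omitted). All function names are hypothetical.

```python
def upscale(lr):
    """Fixed 2x linear-interpolation upscaler (stand-in for a learned model)."""
    n = len(lr)
    out = []
    for i in range(n):
        nxt = lr[min(i + 1, n - 1)]  # repeat the last sample at the border
        out.append(lr[i])
        out.append(0.5 * (lr[i] + nxt))
    return out


def recon_loss(lr, hr):
    """Squared reconstruction error between the upscaled LR signal and HR."""
    return sum((u - h) ** 2 for u, h in zip(upscale(lr), hr))


def recon_grad(lr, hr):
    """Analytic gradient of recon_loss with respect to each LR sample."""
    n = len(lr)
    res = [u - h for u, h in zip(upscale(lr), hr)]
    g = [0.0] * n
    for i in range(n):
        j = min(i + 1, n - 1)
        g[i] += 2.0 * res[2 * i]      # up[2i] = lr[i]
        g[i] += res[2 * i + 1]        # up[2i+1] = 0.5 * (lr[i] + lr[j])
        g[j] += res[2 * i + 1]
    return g


def refine_lr(lr, hr, step=0.1, iters=200):
    """Gradient descent on the LR samples; the upscaler stays fixed."""
    lr = list(lr)
    for _ in range(iters):
        g = recon_grad(lr, hr)
        lr = [x - step * gi for x, gi in zip(lr, g)]
    return lr


hr = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
lr0 = hr[::2]                      # naive downscaling by decimation
lr1 = refine_lr(lr0, hr)
print(recon_loss(lr0, hr), recon_loss(lr1, hr))
```

In this toy setting the refined LR signal reconstructs the HR signal with lower error than the decimated one, while the upscaler itself is untouched. With a learned upscaler, the gradient would typically come from automatic differentiation rather than a hand-derived formula.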

* 11 pages, 8 figures
