Ji Luo

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

May 13, 2026

Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang(+2 more)

Abstract: Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

* work in progress

Derivation and analysis of power offset in fiber-longitudinal power profile estimation using pre-FEC hard-decision data

Mar 26, 2025

Du Tang, Yingjie Jiang, Ji Luo, Yu Chen, Bofang Zheng, Yaojun Qiao

Abstract: Utilizing the precise reference waveform regenerated by post-forward error correction (FEC) data, the fiber-longitudinal power profile estimation based on the minimum-mean-square-error method (MMSE-PPE) has been validated as an effective tool for absolute power monitoring. However, when post-FEC data is unavailable, it becomes necessary to rely on pre-FEC hard-decision data, which inevitably introduces hard-decision errors. These hard-decision errors will result in a power offset that undermines the accuracy of absolute power monitoring. In this paper, we present the first analytical expression for power offset in MMSE-PPE when using pre-FEC hard-decision data, achieved by introducing a virtual hard-decision nonlinear perturbation term. Based on this analytical expression, we also establish the first nonlinear relationship between the power offset and the symbol error rate (SER) of M-ary quadrature amplitude modulation (M-QAM) formats based on Gaussian assumptions. Verified in a numerical 130-GBaud single-wavelength coherent optical fiber transmission system, the correctness of the analytical expression of power offset has been confirmed with 4-QAM, 16-QAM, and 64-QAM formats under different SER situations. Furthermore, the nonlinear relationship between the power offset and SER of M-QAM formats has also been thoroughly validated under both linear scale (measured in mW) and logarithmic scale (measured in dB). These theoretical insights offer significant contributions to the design of potential power offset mitigation strategies in MMSE-PPE, thereby enhancing its real-time application.

Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs

Mar 07, 2025

Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan(+64 more)

Abstract: In this technical report, we tackle the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and resource limitations prevalent in such systems. To address these issues, we present two differently sized MoE large language models (LLMs), namely Ling-Lite and Ling-Plus (referred to as "Bailing" in Chinese, spelled Bǎilíng in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. Both models exhibit comparable performance to leading industry benchmarks. This report offers actionable insights to improve the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies. Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency. Additionally, leveraging high-quality data generated from knowledge graphs, our models demonstrate superior capabilities in tool use compared to other models. Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be effectively trained on lower-performance devices while achieving comparable performance to models of a similar scale, including dense and MoE models. Compared to high-performance devices, utilizing a lower-specification hardware system during the pre-training phase demonstrates significant cost savings, reducing computing costs by approximately 20%. The models can be accessed at https://huggingface.co/inclusionAI.

* 34 pages

Unsupervised 3D End-to-End Medical Image Registration with Volume Tweening Network

Feb 13, 2019

Tingfung Lau, Ji Luo, Shengyu Zhao, Eric I-Chao Chang, Yan Xu

Abstract: 3D medical image registration is of great clinical importance. However, supervised learning methods require a large amount of accurately annotated corresponding control points (or morphing). The ground truth for 3D medical images is very difficult to obtain. Unsupervised learning methods ease the burden of manual annotation by exploiting unlabeled data without supervision. In this paper, we propose a new unsupervised learning method using convolutional neural networks under an end-to-end framework, Volume Tweening Network (VTN), to register 3D medical images. Three technical components ameliorate our unsupervised learning system for 3D end-to-end medical image registration: (1) We cascade the registration subnetworks; (2) We integrate affine registration into our network; and (3) We incorporate an additional invertibility loss into the training process. Experimental results demonstrate that our algorithm is 880x faster (or 3.3x faster without GPU acceleration) than traditional optimization-based methods and achieves state-of-the-art performance in medical image registration.
