on behalf of EUCanImage working group
Abstract:Distilling reasoning traces from strong large language models into smaller ones is a promising route to improving intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training the student conditions on teacher-generated prefixes, whereas at inference it autoregresses on self-generated prefixes, so errors compound over long reasoning trajectories. On-policy and self-distillation methods better match the student's inference-time distribution, but they require costly online sampling and often produce low-quality traces early in training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on the mathematical reasoning benchmarks GSM8K, MATH, and MATH500, and on harder held-out competition-style tasks including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.
Abstract:Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \textbf{ForeSight}, which enables VLMs to \textbf{See Further} with low-level visual cues and \textbf{Think Deeper} with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
Abstract:While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
Abstract:Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while retaining up to 96.9% of the reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
Abstract:In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
Abstract:Colorectal liver metastases (CRLM) are a major cause of cancer-related mortality, and reliable detection on CT remains challenging in multi-centre settings. We developed a foundation model-based AI pipeline for patient-level classification and lesion-level detection of CRLM on contrast-enhanced CT, integrating uncertainty quantification and explainability. CT data from the EuCanImage consortium (n=2437) and an external TCIA cohort (n=197) were used. Among several pretrained models, UMedPT achieved the best performance and was fine-tuned with an MLP head for classification and an FCOS-based head for lesion detection. The classification model achieved an AUC of 0.90 and a sensitivity of 0.82 on the combined test set, with a sensitivity of 0.85 on the external cohort. Excluding the most uncertain 20 percent of cases improved AUC to 0.91 and balanced accuracy to 0.86. Decision curve analysis showed clinical benefit for threshold probabilities between 0.30 and 0.40. The detection model identified 69.1 percent of lesions overall, increasing from 30 percent to 98 percent across lesion size quartiles. Grad-CAM highlighted lesion-corresponding regions in high-confidence cases. These results demonstrate that foundation model-based pipelines can support robust and interpretable CRLM detection and classification across heterogeneous CT data.
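The uncertainty-based exclusion reported above (dropping the most uncertain 20 percent of cases before scoring) can be illustrated with a minimal sketch. The abstract does not say how uncertainty is quantified, so the measure used here (closeness of the predicted probability to 0.5) is an assumption, as are the function names and the toy data.

```python
def retain_most_certain(probs, reject_fraction=0.2):
    """Return sorted indices of the cases kept after rejecting the most uncertain ones.

    Assumption: uncertainty is measured as closeness of the predicted
    probability to 0.5. The abstract does not specify the measure.
    """
    n_reject = int(len(probs) * reject_fraction)
    # Most uncertain cases (smallest distance to 0.5) come first.
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return sorted(order[n_reject:])


def balanced_accuracy(labels, probs, threshold=0.5):
    """Mean of sensitivity and specificity at a fixed decision threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy data: predicted probabilities and true patient-level labels.
probs = [0.95, 0.55, 0.10, 0.45, 0.80, 0.05, 0.60, 0.30, 0.90, 0.20]
labels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

kept = retain_most_certain(probs, reject_fraction=0.2)
ba_all = balanced_accuracy(labels, probs)
ba_kept = balanced_accuracy([labels[i] for i in kept], [probs[i] for i in kept])
print(kept, ba_all, ba_kept)
```

In this toy example the two rejected cases are exactly the two misclassified ones, so balanced accuracy rises after exclusion. Real data will not separate this cleanly.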
Abstract:The orthogonal time frequency space (OTFS) signal is considered a promising solution for high-mobility wireless environments. It manages Doppler effects by utilizing delay-Doppler (DD) domain processing. However, the relatively long OTFS frame duration could introduce considerable sensing or communication latency when radar and communication are performed separately. By operating in a dual-functional radar and communication (DFRC) mode, the OTFS system performs sensing and data transmission simultaneously, thereby reducing the resulting latency. Nevertheless, the optimal OTFS DFRC signal strategy remains insufficiently explored. This paper investigates the optimal signal design for OTFS DFRC systems, focusing on pilot symbol design and data symbol power allocation. Specifically, we derive a channel capacity lower bound metric for communication that considers channel estimation errors in OTFS. For sensing, we derive an integrated sidelobe level (ISL), accounting for the randomness of the data symbols alongside the deterministic pilot symbols. Leveraging the above metrics, we formulate an optimization problem that balances radar and communication performance, and then solve it using an alternating optimization framework. We validate the proposed signal through numerical analysis and Monte Carlo simulations. Our analysis shows that OTFS DFRC enforces a deterministic pilot signal that is characterized by a concentrated peak in the DD domain, which furnishes a common structure in the DD domain facilitating sensing and channel estimation, with data multiplexed in other DD grids, thereby unifying sensing and communication within a single OTFS signal. Compared with conventional OTFS signals, the proposed OTFS DFRC signal expands the achievable sensing-communication performance region, delivering at least a 9.45 dB ISL suppression for sensing and a 4.82 dB SINR gain for communication.
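As a rough illustration of the ISL metric mentioned above, the sketch below computes a textbook integrated sidelobe level for a one-dimensional sequence: total sidelobe energy of the aperiodic autocorrelation divided by mainlobe energy. This is an assumption chosen for illustration. The abstract derives its ISL for OTFS signals with random data symbols in the DD domain, which this simplified version does not model. The function name is hypothetical.

```python
def integrated_sidelobe_level(seq):
    """ISL of a sequence: sidelobe energy of the aperiodic autocorrelation
    divided by mainlobe (zero-lag) energy. Simplified 1-D textbook form."""
    n = len(seq)

    def acorr(lag):
        # Aperiodic autocorrelation at a non-negative lag.
        return sum(seq[i + lag] * complex(seq[i]).conjugate() for i in range(n - lag))

    mainlobe = abs(acorr(0)) ** 2
    if mainlobe == 0:
        raise ValueError("sequence has zero energy")
    # Negative lags mirror positive lags in magnitude, hence the factor 2.
    sidelobes = 2.0 * sum(abs(acorr(k)) ** 2 for k in range(1, n))
    return sidelobes / mainlobe


# Barker-13 code, a standard low-sidelobe reference sequence.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
isl = integrated_sidelobe_level(barker13)
print(isl)
```

A lower ISL means less energy leaks outside the mainlobe, which is why ISL suppression is reported as a sensing gain.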
Abstract:Low-Earth-orbit (LEO) satellite communication systems face challenges due to high satellite mobility, which hinders the reliable acquisition of instantaneous channel state information at the transmitter (CSIT) and subsequently degrades multi-user transmission performance. This paper investigates a downlink multi-user multi-antenna system, and tackles the above challenges by introducing orthogonal time frequency space (OTFS) modulation and rate-splitting multiple access (RSMA) transmission. Specifically, OTFS enables stable characterization of time-varying channels by representing them in the delay-Doppler domain. However, realistic propagation introduces various inter-symbol and inter-user interference due to non-orthogonal yet practical rectangular pulse shaping, fractional delays, Doppler shifts, and imperfect (statistical) CSIT. In this context, RSMA offers promising robustness for interference mitigation and CSIT imperfections, and hence is integrated with OTFS to provide a comprehensive solution. A compact cross-domain input-output relationship for RSMA-OTFS is established, and an ergodic sum-rate maximization problem is formulated and solved using a weighted minimum mean-square-error based alternating optimization algorithm that does not depend on channel sparsity. Simulation results reveal that the considered practical propagation effects significantly degrade performance if unaddressed. Furthermore, the RSMA-OTFS scheme demonstrates improved ergodic sum-rate and robustness against CSIT uncertainty across various user deployments and CSIT qualities.
Abstract:This paper proposes a novel pilot scheme for multi-user uplink channel estimation in extra-large-scale massive MIMO (XL-MIMO) systems with extremely large aperture arrays (ELAA). The large aperture of ELAA introduces spatial non-stationarity, where far-apart users have significantly distinct visibility at the antennas, thereby reducing inter-user interference. This insight motivates our novel pilot scheme to group users with distinct visibility regions to share the same frequency subcarriers for channel estimation, so that more users can be served with reduced pilot overhead. Specifically, the proposed pilot scheme employs frequency-division multiplexing for inter-group channel estimation, while intra-group users -- benefiting from strong spatial orthogonality -- are distinguished by shifted cyclic codes, similar to code-division multiplexing. Additionally, we introduce a sub-array structured ELAA, where each sub-array is a traditional MIMO array and treated as spatially stationary, while the distances between sub-arrays can be significantly larger to achieve an expanded aperture. The channel support for sub-arrays features clustered sparsity in the antenna-delay domain and is modeled by a 2-dimensional (2-D) Markov random field (MRF). Based on this, we propose a low-complexity channel estimation algorithm within a turbo Bayesian inference framework that incorporates the 2-D MRF prior model. Simulations show that the proposed scheme and algorithm allow the XL-MIMO system to support more users, and deliver superior channel estimation performance.
Abstract:Purpose: To evaluate the impact of harmonization and multi-region CT image feature integration on survival prediction in non-small cell lung cancer (NSCLC) patients, using handcrafted radiomics, pretrained foundation model (FM) features, and clinical data from a multicenter dataset. Methods: We analyzed CT scans and clinical data from 876 NSCLC patients (604 training, 272 test) across five centers. Features were extracted from the whole lung, tumor, mediastinal nodes, coronary arteries, and coronary artery calcium (CAC). Handcrafted radiomics and FM deep features were harmonized using ComBat, reconstruction kernel normalization (RKN), and RKN+ComBat. Regularized Cox models predicted overall survival; performance was assessed using the concordance index (C-index), 5-year time-dependent area under the curve (t-AUC), and hazard ratio (HR). SHapley Additive exPlanations (SHAP) values explained feature contributions. A consensus model used agreement across top region of interest (ROI) models to stratify patient risk. Results: TNM staging showed prognostic utility (C-index = 0.67; HR = 2.70; t-AUC = 0.85). The clinical + tumor radiomics model with ComBat achieved a C-index of 0.7552 and t-AUC of 0.8820. FM features (50-voxel cubes) combined with clinical data yielded the highest performance (C-index = 0.7616; t-AUC = 0.8866). An ensemble of all ROIs and FM features reached a C-index of 0.7142 and t-AUC of 0.7885. The consensus model, covering 78% of valid test cases, achieved a t-AUC of 0.92, sensitivity of 97.6%, and specificity of 66.7%. Conclusion: Harmonization and multi-region feature integration improve survival prediction in multicenter NSCLC data. Combining interpretable radiomics, FM features, and consensus modeling enables robust risk stratification across imaging centers.
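The concordance index used above to assess the survival models can be sketched as follows. This is a minimal implementation of Harrell's C-index for right-censored data. The abstract does not state which C-index variant was used, so treating it as Harrell's is an assumption, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when subject i has the shorter follow-up
    time and an observed event. The pair is concordant when subject i
    also has the higher predicted risk. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up times, event indicators (1 = death
# observed, 0 = censored), and model risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.2, 0.4])
c_poor = concordance_index(times, events, [0.1, 0.7, 0.2, 0.4])
print(c_good, c_poor)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values (for example 0.67 for TNM staging) should be read.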