Image-to-image translation is the process of converting an image from one domain to another using deep learning techniques.
Contrast-enhanced magnetic resonance imaging (CE-MRI) plays a crucial role in brain tumor assessment; however, its acquisition requires gadolinium-based contrast agents (GBCAs), which increase costs and raise safety concerns. Consequently, synthesizing CE-MRI from non-contrast MRI (NC-MRI) has emerged as a promising alternative. Early Generative Adversarial Network (GAN)-based approaches suffered from instability and mode collapse, while diffusion models, despite impressive synthesis quality, remain computationally expensive and often fail to faithfully reproduce critical tumor contrast patterns. To address these limitations, we propose Tumor-Biased Latent Bridge Matching (TuLaBM), which formulates NC-to-CE MRI translation as Brownian bridge transport between source and target distributions in a learned latent space, enabling efficient training and inference. To enhance tumor-region fidelity, we introduce a Tumor-Biased Attention Mechanism (TuBAM) that amplifies tumor-relevant latent features during bridge evolution, along with a boundary-aware loss that constrains tumor interfaces to improve margin sharpness. While bridge matching has been explored for medical image translation in pixel space, our latent formulation substantially reduces computational cost and inference time. Experiments on BraTS2023-GLI (BraSyn) and an in-house Cleveland Clinic liver MRI dataset show that TuLaBM consistently outperforms state-of-the-art baselines on both whole-image and tumor-region metrics, generalizes effectively to unseen liver MRI data in zero-shot and fine-tuned settings, and achieves inference times under 0.097 seconds per image.
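The Brownian bridge transport mentioned above can be illustrated with a minimal sketch. It is not the TuLaBM implementation: it only shows the standard Brownian bridge interpolation between a source latent and a target latent, plus the drift commonly used as a regression target in bridge matching. The function names, the flat-list latent representation, and the `sigma` parameter are assumptions made for illustration; the tumor-biased attention and boundary-aware loss are not modeled.

```python
import math
import random


def bridge_sample(z0, z1, t, sigma=1.0, rng=None):
    """Draw z_t from a Brownian bridge pinned at z0 (t=0) and z1 (t=1).

    z_t = (1 - t) * z0 + t * z1 + sigma * sqrt(t * (1 - t)) * eps,  eps ~ N(0, I)
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(z0) != len(z1):
        raise ValueError("z0 and z1 must have the same length")
    rng = rng or random.Random()
    # The noise scale vanishes at both endpoints, so the path is pinned there.
    std = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(z0, z1)]


def bridge_drift_target(z_t, z1, t):
    """Drift that steers z_t toward z1: (z1 - z_t) / (1 - t), defined for t < 1."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(b - a) / (1.0 - t) for a, b in zip(z_t, z1)]


if __name__ == "__main__":
    # Hypothetical 4-dimensional latents standing in for encoded NC and CE images.
    z_nc = [0.1, -0.3, 0.8, 0.0]
    z_ce = [0.4, 0.2, 0.5, -0.1]
    z_mid = bridge_sample(z_nc, z_ce, t=0.5, sigma=0.1, rng=random.Random(0))
    print(bridge_drift_target(z_mid, z_ce, t=0.5))
```

In a training loop of this kind, a network would typically be fit to predict the drift target from `z_t` and `t`; at inference, integrating the learned drift from the source latent yields the translated latent.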
Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R, RxR) and real-world deployments demonstrate that SOL-Nav significantly reduces the model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
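The grid-to-text step described above can be sketched as follows. This is an illustrative assumption rather than the SOL-Nav pipeline: it presumes that per-pixel semantic labels and color names have already been computed by some upstream model, picks the most frequent label and color plus the median depth as the "representative" values for each cell, and uses a made-up text format. The function name `describe_grid` and the output layout are hypothetical.

```python
from collections import Counter
from statistics import median


def describe_grid(labels, colors, depth, n):
    """Summarize an H x W observation as n*n lines of structured text.

    labels, colors: H x W nested lists of strings (assumed precomputed).
    depth: H x W nested list of depths in meters.
    """
    h, w = len(depth), len(depth[0])
    if n < 1 or h < n or w < n:
        raise ValueError("need 1 <= n <= min(H, W) so every cell is non-empty")
    lines = []
    for gi in range(n):
        r0, r1 = gi * h // n, (gi + 1) * h // n
        for gj in range(n):
            c0, c1 = gj * w // n, (gj + 1) * w // n
            cell = [(r, c) for r in range(r0, r1) for c in range(c0, c1)]
            # Representative values: majority label, majority color, median depth.
            label = Counter(labels[r][c] for r, c in cell).most_common(1)[0][0]
            color = Counter(colors[r][c] for r, c in cell).most_common(1)[0][0]
            d = median(depth[r][c] for r, c in cell)
            lines.append(f"cell({gi},{gj}): {label}, {color}, {d:.1f} m")
    return "\n".join(lines)


def build_prompt(instruction, observation_text):
    """Concatenate the instruction with the structured observation (format assumed)."""
    return f"Instruction: {instruction}\nObservation:\n{observation_text}"
```

The resulting string is plain text, so it can be passed to any text-only language model together with the navigation instruction, which is the property the abstract emphasizes.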
This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the prevailing paradigm of discrete grid-based deep learning. Instead of operating on fixed pixel or voxel grids, NOIR embeds discrete medical signals into shared Implicit Neural Representations and learns a Neural Operator that maps between their latent modulations, enabling resolution-independent function-to-function transformations. We evaluate NOIR across multiple 2D and 3D downstream tasks, including segmentation, shape completion, image-to-image translation, and image synthesis, on several public datasets such as Shenzhen, OASIS-4, SkullBreak, and fastMRI, as well as an in-house clinical dataset. NOIR achieves competitive performance at native resolution while demonstrating strong robustness to unseen discretizations, and empirically satisfies key theoretical properties of neural operators. The project page is available here: https://github.com/Sidaty1/NOIR-io.
Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.
Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present MEDIC-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (<Ano>) encourage the model to focus on abnormal regions and build more discriminative lesion-centered representations. Second, inter-image difference tokens (<Diff>) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, MEDIC-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed-source and medical-specialized baselines. Evaluations on longitudinal clinical data collected from real hospital workflows further show that MEDIC-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support settings.
Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DMs) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) module and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrates excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.
Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.
Nudging is widely used to promote behavioral change, but its effectiveness is often limited when recipients must repeatedly translate feedback into workable next steps under changing circumstances. Large language models (LLMs) may help reduce part of this cognitive work by generating personalized guidance and updating it iteratively across intervention rounds. We developed an LLM agent for iterative personalization and tested it in a three-arm randomized experiment among 233 university residents in China, using daily electricity and shower hot-water conservation as objectively measured cases differing in friction. LLM-personalized nudges (T2) produced the largest conservation effects, while image-enhanced conventional nudges (T1) and text-based conventional nudges (C) showed similar outcomes (omnibus p = 0.009). Relative to C, T2 reduced electricity consumption by 0.56 kWh per room-day (p = 0.014), corresponding to an 18.3 percentage-point higher adjusted saving rate. This advantage emerged within the first two intervention rounds, alongside iterative updating of personalized guidance, and persisted thereafter. Hot-water outcomes followed the same direction but were smaller, less precisely estimated, and attenuated over time, consistent with stronger friction in this domain. LLM-personalized nudges emphasized prospective and context-specific guidance and were associated with higher participant engagement. This study provides field evidence that LLM-based iterative personalization can enhance behavioral nudging, with behavioral friction as a potential boundary condition. Larger trials and extension to more behaviors are warranted.
Designing a computational imaging system -- selecting operators, setting parameters, validating consistency -- requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce spec.md, a structured specification format, and three autonomous agents -- Plan, Judge, and Execute -- that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs -- composing primitives into chains from 3D to 5D -- demonstrate compositional reach beyond any single-modality tool.