Image-to-image translation is the process of converting an image from one domain to another using deep learning techniques.
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
Medical ultrasound (US) imaging is a frontline tool for the diagnosis of kidney diseases. However, traditional freehand imaging procedure suffers from inconsistent, operator-dependent outcomes, lack of 3D localization information, and risks of work-related musculoskeletal disorders. While robotic ultrasound (RUS) systems offer the potential for standardized, operator-independent 3D kidney data acquisition, the existing scanning methods lack the ability to determine the optimal imaging window for efficient imaging. As a result, the scan is often blindly performed with excessive probe footprint, which frequently leads to acoustic shadowing and incomplete organ coverage. Consequently, there is a critical need for a spatially efficient imaging technique that can maximize the kidney coverage through minimum probe footprint. Here, we propose an autonomous workflow to achieve efficient kidney imaging via template-guided optimal pivoting. The system first performs an explorative imaging to generate partial observations of the kidney. This data is then registered to a kidney template to estimate the organ pose. With the kidney localized, the robot executes a fixed-point pivoting sweep where the imaging plane is aligned with the kidney long axis to minimize the probe translation. The proposed method was validated in simulation and in-vivo. Simulation results indicate that a 60% exploration ratio provides optimal balance between kidney localization accuracy and scanning efficiency. In-vivo evaluation on two male subjects demonstrates a kidney localization accuracy up to 7.36 mm and 13.84 degrees. Moreover, the optimal pivoting approach shortened the probe footprint by around 75 mm when compared with the baselines. These results valid our approach of leveraging anatomical templates to align the probe optimally for volumetric sweep.
Recent advances in image generation have achieved remarkable visual quality, while a fundamental challenge remains: Can image generation be controlled at the element level, enabling intuitive modifications such as adjusting shapes, altering colors, or adding and removing objects? In this work, we address this challenge by introducing layer-wise controllable generation through simplified vector graphics (VGs). Our approach first efficiently parses images into hierarchical VG representations that are semantic-aligned and structurally coherent. Building on this representation, we design a novel image synthesis framework guided by VGs, allowing users to freely modify elements and seamlessly translate these edits into photorealistic outputs. By leveraging the structural and semantic features of VGs in conjunction with noise prediction, our method provides precise control over geometry, color, and object semantics. Extensive experiments demonstrate the effectiveness of our approach in diverse applications, including image editing, object-level manipulation, and fine-grained content creation, establishing a new paradigm for controllable image generation. Project page: https://guolanqing.github.io/Vec2Pix/
Conditional diffusion inversion provides a powerful framework for unpaired image-to-image translation. However, we demonstrate through an extensive analysis that standard deterministic inversion (e.g. DDIM) fails when the source domain is spectrally sparse compared to the target domain (e.g., super-resolution, sketch-to-image). In these contexts, the recovered latent from the input does not follow the expected isotropic Gaussian distribution. Instead it exhibits a signal with lower frequencies, locking target sampling to oversmoothed and texture-poor generations. We term this phenomenon spectral collapse. We observe that stochastic alternatives attempting to restore the noise variance tend to break the semantic link to the input, leading to structural drift. To resolve this structure-texture trade-off, we propose Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics to enforce the theoretical Gaussian noise magnitude within the null-space of the structural gradient. Extensive experiments on microscopy super-resolution (BBBC021) and sketch-to-image (Edges2Shoes) demonstrate that OVG effectively restores photorealistic textures while preserving structural fidelity.
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
Foundation models (FMs) are rapidly reshaping medical imaging, shifting the field from narrowly trained, task-specific networks toward large, general-purpose models that can be adapted across modalities, anatomies, and clinical tasks. In this review, we synthesize the emerging landscape of medical imaging FMs along three major axes: principles of FM design, applications of FMs, and forward-looking challenges and opportunities. Taken together, this review provides a technically grounded, clinically aware, and future-facing roadmap for developing FMs that are not only powerful and versatile but also trustworthy and ready for responsible translation into clinical practice.
Purpose of Review Imaging derived fractional flow reserve (FFR) is rapidly evolving beyond conventional computational fluid dynamics (CFD) based pipelines toward machine learning (ML), deep learning (DL), and physics informed approaches that enable fast, wire free, and scalable functional assessment of coronary stenosis. This review synthesizes recent advances in CT and angiography based FFR, with particular emphasis on emerging physics informed neural networks and neural operators (PINNs and PINOs) and key considerations for their clinical translation. Recent Findings ML/DL approaches have markedly improved automation and computational speed, enabling prediction of pressure and FFR from anatomical descriptors or angiographic contrast dynamics. However, their real-world performance and generalizability can remain variable and sensitive to domain shift, due to multi-center heterogeneity, interpretability challenges, and differences in acquisition protocols and image quality. Physics informed learning introduces conservation structure and boundary condition consistency into model training, improving generalizability and reducing dependence on dense supervision while maintaining rapid inference. Recent evaluation trends increasingly highlight deployment oriented metrics, including calibration, uncertainty quantification, and quality control gatekeeping, as essential for safe clinical use. Summary The field is converging toward imaging derived FFR methods that are faster, more automated, and more reliable. While ML/DL offers substantial efficiency gains, physics informed frameworks such as PINNs and PINOs may provide a more robust balance between speed and physical consistency. Prospective multi center validation and standardized evaluation will be critical to support broad and safe clinical adoption.
We introduce SCAR, a method for long-term auto-calibration refinement of aerial visual-inertial systems that exploits georeferenced satellite imagery as a persistent global reference. SCAR estimates both intrinsic and extrinsic parameters by aligning aerial images with 2D--3D correspondences derived from publicly available orthophotos and elevation models. In contrast to existing approaches that rely on dedicated calibration maneuvers or manually surveyed ground control points, our method leverages external geospatial data to detect and correct calibration degradation under field deployment conditions. We evaluate our approach on six large-scale aerial campaigns conducted over two years under diverse seasonal and environmental conditions. Across all sequences, SCAR consistently outperforms established baselines (Kalibr, COLMAP, VINS-Mono), reducing median reprojection error by a large margin, and translating these calibration gains into substantially lower visual localization rotation errors and higher pose accuracy. These results demonstrate that SCAR provides accurate, robust, and reproducible calibration over long-term aerial operations without the need for manual intervention.
Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding that introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods with a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.
Inverse synthetic aperture radar (ISAR) images generated from single-channel automotive radar data provide critical information about the shape and size of automotive targets. However, the quality of ISAR images degrades due to road clutter and when translational and higher order rotational motions of the targets are not suitably compensated. One method to enhance the signal-to-clutter-and-noise ratio (SCNR) of the systems is to leverage the advantages of the multiple-input-multiple-output (MIMO) framework available in commercial automotive radars to generate MIMO-ISAR images. While substantial research has been devoted to motion compensation of single-channel ISAR images, the effectiveness of these methods for MIMO-ISAR has not been studied extensively. This paper analyzes the performance of three popular motion compensation techniques - entropy minimization, cross-correlation, and phase gradient autofocus - on MIMO-ISAR. The algorithms are evaluated on the measurement data collected using Texas Instruments millimeter-wave MIMO radar. The results indicate that the cross-correlation MOCOMP performs better than the other two MOCOMP algorithms in the MIMO configuration, with an overall improvement of 36%.