Image-to-image translation is the process of converting an image from one domain to another using deep learning techniques.
Glaucoma is a top cause of irreversible blindness globally, making early detection and longitudinal follow-up pivotal to preventing permanent vision loss. Current screening and progression assessment, however, rely on single tests or loosely linked examinations, introducing subjectivity and fragmented care. Limited access to high-quality imaging tools and specialist expertise further compromises consistency and equity in real-world use. To address these gaps, we developed Fair-Eye Net, a fair, reliable multimodal AI system closing the clinical loop from glaucoma screening to follow-up and risk alerting. It integrates fundus photos, OCT structural metrics, VF functional indices, and demographic factors via a dual-stream heterogeneous fusion architecture, with an uncertainty-aware hierarchical gating strategy for selective prediction and safe referral. A fairness constraint reduces missed diagnoses in disadvantaged subgroups. Experimental results show it achieved an AUC of 0.912 (96.7% specificity), cut racial false-negativity disparity by 73.4% (12.31% to 3.28%), maintained stable cross-domain performance, and enabled 3-12 months of early risk alerts (92% sensitivity, 88% specificity). Unlike post hoc fairness adjustments, Fair-Eye Net optimizes fairness as a primary goal with clinical reliability via multitask learning, offering a reproducible path for clinical translation and large-scale deployment to advance global eye health equity.
Synthetic Aperture Radar (SAR) provides robust all-weather imaging capabilities; however, translating SAR observations into photo-realistic optical images remains a fundamentally ill-posed problem. Current approaches are often hindered by the inherent speckle noise and geometric distortions of SAR data, which frequently result in semantic misinterpretation, ambiguous texture synthesis, and structural hallucinations. To address these limitations, a novel SAR-to-Optical (S2O) translation framework is proposed, integrating three core technical contributions: (i) Cross-Modal Semantic Alignment, which establishes an Optical-Aware SAR Encoder by distilling robust semantic priors from an Optical Teacher into a SAR Student (ii) Semantically-Grounded Generative Guidance, realized by a Semantically-Grounded ControlNet that integrates class-aware text prompts for global context with hierarchical visual prompts for local spatial guidance; and (iii) an Uncertainty-Aware Objective, which explicitly models aleatoric uncertainty to dynamically modulate the reconstruction focus, effectively mitigating artifacts caused by speckle-induced ambiguity. Extensive experiments demonstrate that the proposed method achieves superior perceptual quality and semantic consistency compared to state-of-the-art approaches.
Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I towards particular input languages - when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt's semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models' SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit strong surface-level tendency in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.
Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.
Diffractive neural networks have recently emerged as a promising framework for all-optical computing. However, these networks are typically trained for a single task, limiting their potential adoption in systems requiring multiple functionalities. Existing approaches to achieving multi-task functionality either modify the mechanical configuration of the network per task or use a different illumination wavelength or polarization state for each task. In this work, we propose a new control mechanism, which is based on the illumination's angular spectrum. Specifically, we shape the illumination using an amplitude mask that selectively controls its angular spectrum. We employ different illumination masks for achieving different network functionalities, so that the mask serves as a unique task encoder. Interestingly, we show that effective control can be achieved over a very narrow angular range, within the paraxial regime. We numerically illustrate the proposed approach by training a single diffractive network to perform multiple image-to-image translation tasks. In particular, we demonstrate translating handwritten digits into typeset digits of different values, and translating handwritten English letters into typeset numbers and typeset Greek letters, where the type of the output is determined by the illumination's angular components. As we show, the proposed framework can work under different coherence conditions, and can be combined with existing control strategies, such as different wavelengths. Our results establish the illumination angular spectrum as a powerful degree of freedom for controlling diffractive networks, enabling a scalable and versatile framework for multi-task all-optical computing.
We study the online centralized charging scheduling problem (OCCSP). In this problem, a central authority must decide, in real time, when to charge dynamically arriving electric vehicles (EVs), subject to capacity limits, with the objective of balancing load across a finite planning horizon. To solve the problem, we first gamify it; that is, we model it as a game where charging blocks are placed within temporal and capacity constraints on a grid. We design heuristic policies, train learning agents with expert demonstrations, and improve them using Dataset Aggregation (DAgger). From a theoretical standpoint, we show that gamification reduces model complexity and yields tighter generalization bounds than vector-based formulations. Experiments across multiple EV arrival patterns confirm that gamified learning enhances load balancing. In particular, the image-to-movement model trained with DAgger consistently outperforms heuristic baselines, vector-based approaches, and supervised learning agents, while also demonstrating robustness in sensitivity analyses. These operational gains translate into tangible economic value. In a real-world case study for the Greater Montréal Area (Québec, Canada) using utility cost data, the proposed methods lower system costs by tens of millions of dollars per year over the prevailing practice and show clear potential to delay costly grid upgrades.
Image restoration has traditionally required training specialized models on thousands of paired examples per degradation type. We challenge this paradigm by demonstrating that powerful pre-trained text-conditioned image editing models can be efficiently adapted for multiple restoration tasks through parameter-efficient fine-tuning with remarkably few examples. Our approach fine-tunes LoRA adapters on FLUX.1 Kontext, a state-of-the-art 12B parameter flow matching model for image-to-image translation, using only 16-128 paired images per task, guided by simple text prompts that specify the restoration operation. Unlike existing methods that train specialized restoration networks from scratch with thousands of samples, we leverage the rich visual priors already encoded in large-scale pre-trained editing models, dramatically reducing data requirements while maintaining high perceptual quality. A single unified LoRA adapter, conditioned on task-specific text prompts, effectively handles multiple degradations including denoising, deraining, and dehazing. Through comprehensive ablation studies, we analyze: (i) the impact of training set size on restoration quality, (ii) trade-offs between task-specific versus unified multi-task adapters, (iii) the role of text encoder fine-tuning, and (iv) zero-shot baseline performance. While our method prioritizes perceptual quality over pixel-perfect reconstruction metrics like PSNR/SSIM, our results demonstrate that pre-trained image editing models, when properly adapted, offer a compelling and data-efficient alternative to traditional image restoration approaches, opening new avenues for few-shot, prompt-guided image enhancement. The code to reproduce our results are available at: https://github.com/makinyilmaz/Edit2Restore
Scalable and maintainable map representations are fundamental to enabling large-scale visual navigation and facilitating the deployment of robots in real-world environments. While collaborative localization across multi-session mapping enhances efficiency, traditional structure-based methods struggle with high maintenance costs and fail in feature-less environments or under significant viewpoint changes typical of crowd-sourced data. To address this, we propose OPENNAVMAP, a lightweight, structure-free topometric system leveraging 3D geometric foundation models for on-demand reconstruction. Our method unifies dynamic programming-based sequence matching, geometric verification, and confidence-calibrated optimization to robust, coarse-to-fine submap alignment without requiring pre-built 3D models. Evaluations on the Map-Free benchmark demonstrate superior accuracy over structure-from-motion and regression baselines, achieving an average translation error of 0.62m. Furthermore, the system maintains global consistency across 15km of multi-session data with an absolute trajectory error below 3m for map merging. Finally, we validate practical utility through 12 successful autonomous image-goal navigation tasks on simulated and physical robots. Code and datasets will be publicly available in https://rpl-cs-ucl.github.io/OpenNavMap_page.
Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: https://github.com/bohan95/dual-app-bridge
Deep learning has been extensively used in medical imaging applications, assuming that the test and training datasets belong to the same probability distribution. However, a common challenge arises when working with medical images generated by different systems or even the same system with different parameter settings. Such images contain diverse textures and reverberation noise that violate the aforementioned assumption. Consequently, models trained on data from one device or setting often struggle to perform effectively with data from other devices or settings. In addition, retraining models for each specific device or setting is labor-intensive and costly. To address these issues in ultrasound images, we propose a novel Generative Adversarial Network (GAN)-based model. We formulated the domain adaptation tasks as an image-to-image translation task, in which we modified the texture patterns and removed reverberation noise in the test data images from the source domain to align with those in the target domain images while keeping the image content unchanged. We applied the proposed method to two datasets containing carotid ultrasound images from three different domains. The experimental results demonstrate that the model successfully translated the texture pattern of images and removed reverberation noise from the ultrasound images. Furthermore, we evaluated the CycleGAN approaches for a comparative study with the proposed model. The experimental findings conclusively demonstrated that the proposed model achieved domain adaptation (histogram correlation (0.960 (0.019), & 0.920 (0.043) and bhattacharya distance (0.040 (0.020), & 0.085 (0.048)), compared to no adaptation (0.916 (0.062) & 0.890 (0.077), 0.090 (0.070) & 0.121 (0.095)) for both datasets.