Image-to-image translation is the process of converting an image from one domain to another using deep learning techniques.
Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.
Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
We study the online centralized charging scheduling problem (OCCSP). In this problem, a central authority must decide, in real time, when to charge dynamically arriving electric vehicles (EVs), subject to capacity limits, with the objective of balancing load across a finite planning horizon. To solve the problem, we first gamify it; that is, we model it as a game where charging blocks are placed within temporal and capacity constraints on a grid. We design heuristic policies, train learning agents with expert demonstrations, and improve them using Dataset Aggregation (DAgger). From a theoretical standpoint, we show that gamification reduces model complexity and yields tighter generalization bounds than vector-based formulations. Experiments across multiple EV arrival patterns confirm that gamified learning enhances load balancing. In particular, the image-to-movement model trained with DAgger consistently outperforms heuristic baselines, vector-based approaches, and supervised learning agents, while also demonstrating robustness in sensitivity analyses. These operational gains translate into tangible economic value. In a real-world case study for the Greater Montréal Area (Québec, Canada) using utility cost data, the proposed methods lower system costs by tens of millions of dollars per year over the prevailing practice and show clear potential to delay costly grid upgrades.
This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
Scalable and maintainable map representations are fundamental to enabling large-scale visual navigation and facilitating the deployment of robots in real-world environments. While collaborative localization across multi-session mapping enhances efficiency, traditional structure-based methods struggle with high maintenance costs and fail in feature-less environments or under significant viewpoint changes typical of crowd-sourced data. To address this, we propose OPENNAVMAP, a lightweight, structure-free topometric system leveraging 3D geometric foundation models for on-demand reconstruction. Our method unifies dynamic programming-based sequence matching, geometric verification, and confidence-calibrated optimization to robust, coarse-to-fine submap alignment without requiring pre-built 3D models. Evaluations on the Map-Free benchmark demonstrate superior accuracy over structure-from-motion and regression baselines, achieving an average translation error of 0.62m. Furthermore, the system maintains global consistency across 15km of multi-session data with an absolute trajectory error below 3m for map merging. Finally, we validate practical utility through 12 successful autonomous image-goal navigation tasks on simulated and physical robots. Code and datasets will be publicly available in https://rpl-cs-ucl.github.io/OpenNavMap_page.
Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely heavily on machine translation, while advances in multilingual text modeling remain underutilized. We introduce METAL, a lightweight alignment method that learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. Despite its simplicity, METAL matches baseline performance in English (94.9 percent Recall at 10) and achieves strong zero-shot transfer (89.5 percent Recall at 10 averaged across 11 languages, 10 unseen) on XTD text-to-image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, METAL generalizes to audio-text retrieval and cross-lingual text-to-image generation. We release code and checkpoints at https://github.com/m2m-codebase/M2M , as well as multilingual evaluation datasets including MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k ), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual ), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual ), to facilitate further research.
Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, these intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of multimodal input, co-forming an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method, which achieves 98.5%, 84.1%, and 47.4% on LIBERO, LIBERO-Plus and VLABench, respectively.
Attention mechanisms represent a fundamental paradigm shift in neural network architectures, enabling models to selectively focus on relevant portions of input sequences through learned weighting functions. This monograph provides a comprehensive and rigorous mathematical treatment of attention mechanisms, encompassing their theoretical foundations, computational properties, and practical implementations in contemporary deep learning systems. Applications in natural language processing, computer vision, and multimodal learning demonstrate the versatility of attention mechanisms. We examine language modeling with autoregressive transformers, bidirectional encoders for representation learning, sequence-to-sequence translation, Vision Transformers for image classification, and cross-modal attention for vision-language tasks. Empirical analysis reveals training characteristics, scaling laws that relate performance to model size and computation, attention pattern visualizations, and performance benchmarks across standard datasets. We discuss the interpretability of learned attention patterns and their relationship to linguistic and visual structures. The monograph concludes with a critical examination of current limitations, including computational scalability, data efficiency, systematic generalization, and interpretability challenges.
Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.
Sickle cell disease causes erythrocytes to become sickle-shaped, affecting their movement in the bloodstream and reducing oxygen delivery. It has a high global prevalence and places a significant burden on healthcare systems, especially in resource-limited regions. Automated classification of sickle cells in blood images is crucial, allowing the specialist to reduce the effort required and avoid errors when quantifying the deformed cells and assessing the severity of a crisis. Recent studies have proposed various erythrocyte representation and classification methods. Since classification depends solely on cell shape, a suitable approach models erythrocytes as closed planar curves in shape space. This approach employs elastic distances between shapes, which are invariant under rotations, translations, scaling, and reparameterizations, ensuring consistent distance measurements regardless of the curves' position, starting point, or traversal speed. While previous methods exploiting shape space distances had achieved high accuracy, we refined the model by considering the geometric characteristics of healthy and sickled erythrocytes. Our method proposes (1) to employ a fixed parameterization based on the major axis of each cell to compute distances and (2) to align each cell with two templates using this parameterization before computing distances. Aligning shapes to templates before distance computation, a concept successfully applied in areas such as molecular dynamics, and using a fixed parameterization, instead of minimizing distances across all possible parameterizations, simplifies calculations. This strategy achieves 96.03\% accuracy rate in both supervised classification and unsupervised clustering. Our method ensures efficient erythrocyte classification, maintaining or improving accuracy over shape space models while significantly reducing computational costs.