Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Image To Image Translation

What is Image To Image Translation? Image-to-image translation is the process of converting an image from one domain to another using deep learning techniques.

Language translation, and change of accent for speech-to-speech task using diffusion model

May 04, 2025

Abhishek Mishra, Ritesh Sur Chowdhury, Vartul Bahuguna, Isha Pandey, Ganesh Ramakrishnan

Abstract:Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously - translating content while adapting the speaker's accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel spectrograms representing the target speech with desired linguistic and accentual attributes. This integrated framework enables joint optimization of translation and accent adaptation, offering a more parameter-efficient and effective model compared to traditional pipelines.

Via

Access Paper or Ask Questions

EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation

Apr 28, 2025

Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, Yanfeng Gu

Abstract:Satellite imagery and maps, as two fundamental data modalities in remote sensing, offer direct observations of the Earth's surface and human-interpretable geographic abstractions, respectively. The task of bidirectional translation between satellite images and maps (BSMT) holds significant potential for applications in urban planning and disaster response. However, this task presents two major challenges: first, the absence of precise pixel-wise alignment between the two modalities substantially complicates the translation process; second, it requires achieving both high-level abstraction of geographic features and high-quality visual synthesis, which further elevates the technical complexity. To address these limitations, we introduce EarthMapper, a novel autoregressive framework for controllable bidirectional satellite-map translation. EarthMapper employs geographic coordinate embeddings to anchor generation, ensuring region-specific adaptability, and leverages multi-scale feature alignment within a geo-conditioned joint scale autoregression (GJSA) process to unify bidirectional translation in a single training cycle. A semantic infusion (SI) mechanism is introduced to enhance feature-level consistency, while a key point adaptive guidance (KPAG) mechanism is proposed to dynamically balance diversity and precision during inference. We further contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities, enabling robust benchmarking. Extensive experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance, achieving significant improvements in visual realism, semantic consistency, and structural fidelity over state-of-the-art methods. Additionally, EarthMapper excels in zero-shot tasks like in-painting, out-painting and coordinate-conditional generation, underscoring its versatility.

Via

Access Paper or Ask Questions

Seeing the Abstract: Translating the Abstract Language for Vision Language Models

May 06, 2025

Davide Talon, Federico Girella, Ziyue Liu, Marco Cristani, Yiming Wang

Abstract:Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.

* Accepted to CVPR25. Project page: https://davidetalon.github.io/fashionact-page/

Via

Access Paper or Ask Questions

Making Physical Objects with Generative AI and Robotic Assembly: Considering Fabrication Constraints, Sustainability, Time, Functionality, and Accessibility

Apr 27, 2025

Alexander Htet Kyaw, Se Hwan Jeon, Miana Smith, Neil Gershenfeld

Abstract:3D generative AI enables rapid and accessible creation of 3D models from text or image inputs. However, translating these outputs into physical objects remains a challenge due to the constraints in the physical world. Recent studies have focused on improving the capabilities of 3D generative AI to produce fabricable outputs, with 3D printing as the main fabrication method. However, this workshop paper calls for a broader perspective by considering how fabrication methods align with the capabilities of 3D generative AI. As a case study, we present a novel system using discrete robotic assembly and 3D generative AI to make physical objects. Through this work, we identified five key aspects to consider in a physical making process based on the capabilities of 3D generative AI. 1) Fabrication Constraints: Current text-to-3D models can generate a wide range of 3D designs, requiring fabrication methods that can adapt to the variability of generative AI outputs. 2) Time: While generative AI can generate 3D models in seconds, fabricating physical objects can take hours or even days. Faster production could enable a closer iterative design loop between humans and AI in the making process. 3) Sustainability: Although text-to-3D models can generate thousands of models in the digital world, extending this capability to the real world would be resource-intensive, unsustainable and irresponsible. 4) Functionality: Unlike digital outputs from 3D generative AI models, the fabrication method plays a crucial role in the usability of physical objects. 5) Accessibility: While generative AI simplifies 3D model creation, the need for fabrication equipment can limit participation, making AI-assisted creation less inclusive. These five key aspects provide a framework for assessing how well a physical making process aligns with the capabilities of 3D generative AI and values in the world.

Via

Access Paper or Ask Questions

Using street view imagery and deep generative modeling for estimating the health of urban forests

Apr 20, 2025

Akshit Gupta, Remko Uijlenhoet

Abstract:Healthy urban forests comprising of diverse trees and shrubs play a crucial role in mitigating climate change. They provide several key advantages such as providing shade for energy conservation, and intercepting rainfall to reduce flood runoff and soil erosion. Traditional approaches for monitoring the health of urban forests require instrumented inspection techniques, often involving a high amount of human labor and subjective evaluations. As a result, they are not scalable for cities which lack extensive resources. Recent approaches involving multi-spectral imaging data based on terrestrial sensing and satellites, are constrained respectively with challenges related to dedicated deployments and limited spatial resolutions. In this work, we propose an alternative approach for monitoring the urban forests using simplified inputs: street view imagery, tree inventory data and meteorological conditions. We propose to use image-to-image translation networks to estimate two urban forest health parameters, namely, NDVI and CTD. Finally, we aim to compare the generated results with ground truth data using an onsite campaign utilizing handheld multi-spectral and thermal imaging sensors. With the advent and expansion of street view imagery platforms such as Google Street View and Mapillary, this approach should enable effective management of urban forests for the authorities in cities at scale.

* Accepted at ICLR 2025 Workshop

Via

Access Paper or Ask Questions

ClearVision: Leveraging CycleGAN and SigLIP-2 for Robust All-Weather Classification in Traffic Camera Imagery

Apr 28, 2025

Anush Lakshman Sivaraman, Kojo Adu-Gyamfi, Ibne Farabi Shihab, Anuj Sharma

Abstract:Accurate weather classification from low-quality traffic camera imagery remains a challenging task, particularly under adverse nighttime conditions. In this study, we propose a scalable framework that combines generative domain adaptation with efficient contrastive learning to enhance classification performance. Using CycleGAN-based domain translation, we improve the quality of nighttime images, enabling better feature extraction by downstream models. While the baseline EVA-02 model employing CLIP-based contrastive loss achieves an overall accuracy of 96.55\%, it exhibits a significant performance gap between daytime (97.21\%) and nighttime conditions (63.40\%). Replacing CLIP with the lightweight SigLIP-2 (Sigmoid contrastive loss) achieves a competitive overall accuracy of 94.00\%, with substantial improvements in nighttime performance (85.90\% accuracy). The combination of Vision-SigLIP-2, Text-SigLIP-2, CycleGAN, and contrastive training achieves the best nighttime accuracy (85.90\%) among all models tested, while EVA-02 with CycleGAN maintains the highest overall accuracy (97.01\%) and per-class accuracies. These findings demonstrate the potential of combining domain adaptation and efficient contrastive learning to build practical, resource-efficient weather classification systems for intelligent transportation infrastructure.

Via

Access Paper or Ask Questions

Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

Apr 13, 2025

Jiahua Xu, Dawei Zhou, Lei Hu, Zaiyi Liu, Nannan Wang, Xinbo Gao

Abstract:Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.

* Medical image translation, Diffusion model, 16 pages

Via

Access Paper or Ask Questions

Development of a WAZOBIA-Named Entity Recognition System

May 10, 2025

S. E Emedem, I. E Onyenwe, E. G Onyedinma

Abstract:Named Entity Recognition NER is very crucial for various natural language processing applications, including information extraction, machine translation, and sentiment analysis. Despite the ever-increasing interest in African languages within computational linguistics, existing NER systems focus mainly on English, European, and a few other global languages, leaving a significant gap for under-resourced languages. This research presents the development of a WAZOBIA-NER system tailored for the three most prominent Nigerian languages: Hausa, Yoruba, and Igbo. This research begins with a comprehensive compilation of annotated datasets for each language, addressing data scarcity and linguistic diversity challenges. Exploring the state-of-the-art machine learning technique, Conditional Random Fields (CRF) and deep learning models such as Bidirectional Long Short-Term Memory (BiLSTM), Bidirectional Encoder Representation from Transformers (Bert) and fine-tune with a Recurrent Neural Network (RNN), the study evaluates the effectiveness of these approaches in recognizing three entities: persons, organizations, and locations. The system utilizes optical character recognition (OCR) technology to convert textual images into machine-readable text, thereby enabling the Wazobia system to accept both input text and textual images for extraction purposes. The system achieved a performance of 0.9511 in precision, 0.9400 in recall, 0.9564 in F1-score, and 0.9301 in accuracy. The model's evaluation was conducted across three languages, with precision, recall, F1-score, and accuracy as key assessment metrics. The Wazobia-NER system demonstrates that it is feasible to build robust NER tools for under-resourced African languages using current NLP frameworks and transfer learning.

* 6 pages, 3 figures, 1 table

Via

Access Paper or Ask Questions

OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model

Apr 08, 2025

Xiaochen Wei, Weiwei Guo, Wenxian Yu, Feiming Wei, Dongying Li

Abstract:Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, current methods often fail to extract modality-invariant features when aligning image pairs with large nonlinear radiometric differences. To address this issues, we propose OSDM-MReg, a novel multimodal image registration framework based image-to-image translation to eliminate the gap of multimodal images. Firstly, we propose a novel one-step unaligned target-guided conditional denoising diffusion probabilistic models(UTGOS-CDDPM)to translate multimodal images into a unified domain. In the inference stage, traditional conditional DDPM generate translated source image by a large number of iterations, which severely slows down the image registration task. To address this issues, we use the unaligned traget image as a condition to promote the generation of low-frequency features of the translated source image. Furthermore, during the training stage, we add the inverse process of directly predicting the translated image to ensure that the translated source image can be generated in one step during the testing stage. Additionally, to supervised the detail features of translated source image, we propose a new perceptual loss that focuses on the high-frequency feature differences between the translated and ground-truth images. Finally, a multimodal multiscale image registration network (MM-Reg) fuse the multimodal feature of the unimodal images and multimodal images by proposed multimodal feature fusion strategy. Experiments demonstrate superior accuracy and efficiency across various multimodal registration tasks, particularly for SAR-optical image pairs.

Via

Access Paper or Ask Questions

Enhancing Lidar Point Cloud Sampling via Colorization and Super-Resolution of Lidar Imagery

May 04, 2025

Sier Ha, Honghao Du, Xianjia Yu, Tomi Westerlund

Abstract:Recent advancements in lidar technology have led to improved point cloud resolution as well as the generation of 360 degrees, low-resolution images by encoding depth, reflectivity, or near-infrared light within each pixel. These images enable the application of deep learning (DL) approaches, originally developed for RGB images from cameras to lidar-only systems, eliminating other efforts, such as lidar-camera calibration. Compared with conventional RGB images, lidar imagery demonstrates greater robustness in adverse environmental conditions, such as low light and foggy weather. Moreover, the imaging capability addresses the challenges in environments where the geometric information in point clouds may be degraded, such as long corridors, and dense point clouds may be misleading, potentially leading to drift errors. Therefore, this paper proposes a novel framework that leverages DL-based colorization and super-resolution techniques on lidar imagery to extract reliable samples from lidar point clouds for odometry estimation. The enhanced lidar images, enriched with additional information, facilitate improved keypoint detection, which is subsequently employed for more effective point cloud downsampling. The proposed method enhances point cloud registration accuracy and mitigates mismatches arising from insufficient geometric information or misleading extra points. Experimental results indicate that our approach surpasses previous methods, achieving lower translation and rotation errors while using fewer points.

* 7 pages. arXiv admin note: substantial text overlap with arXiv:2409.11532

Via

Access Paper or Ask Questions

Topic:Image To Image Translation

Papers and Code