Image-to-image translation is the process of converting an image from one domain to another using deep learning techniques.
Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning are limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach is not relying on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.
With the development of foundational models, model compression has become a critical requirement. Various model compression approaches have been proposed such as low-rank decomposition, pruning, quantization, ergodic dynamic systems, and knowledge distillation, which are based on different heuristics. To elevate the field from fragmentation to a principled discipline, we construct a unifying mathematical framework for model compression grounded in measure theory. We further demonstrate that each model compression technique is mathematically equivalent to a neural network subject to a regularization. Building upon this mathematical and structural equivalence, we propose an experimentally-verified data-free model compression framework, termed \textit{Big2Small}, which translates Implicit Neural Representations (INRs) from data domain to the domain of network parameters. \textit{Big2Small} trains compact INRs to encode the weights of larger models and reconstruct the weights during inference. To enhance reconstruction fidelity, we introduce Outlier-Aware Preprocessing to handle extreme weight values and a Frequency-Aware Loss function to preserve high-frequency details. Experiments on image classification and segmentation demonstrate that \textit{Big2Small} achieves competitive accuracy and compression ratios compared to state-of-the-art baselines.
Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.
Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, limiting scalability and generalization, and often degrading navigation and localization performance. We propose ViBA, a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams. Embedded in a standard visual odometry pipeline, it consists of an implicitly differentiable geometric residual framework: (i) an initial tracking network for inter-frame correspondences, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment that jointly refines camera poses and feature positions by minimizing reprojection errors. By combining geometric consistency from BA with long-term temporal consistency across frames, ViBA enforces stable and accurate feature representations. We evaluate ViBA on EuRoC and UMA datasets. Compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error (ATE) by 12-18% and absolute rotation error (ARE) by 5-10% across sequences, while maintaining real-time inference speeds (FPS 36-91). When evaluated on unseen sequences, it retains over 90% localization accuracy, demonstrating robust generalization. These results show that ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios.
Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
Although diffusion models have achieved remarkable progress in multi-modal magnetic resonance imaging (MRI) translation tasks, existing methods still tend to suffer from anatomical inconsistencies or degraded texture details when handling arbitrary missing-modality scenarios. To address these issues, we propose a latent diffusion-based multi-modal MRI translation framework, termed MSG-LDM. By leveraging the available modalities, the proposed method infers complete structural information, which preserves reliable boundary details. Specifically, we introduce a style--structure disentanglement mechanism in the latent space, which explicitly separates modality-specific style features from shared structural representations, and jointly models low-frequency anatomical layouts and high-frequency boundary details in a multi-scale feature space. During the structure disentanglement stage, high-frequency structural information is explicitly incorporated to enhance feature representations, guiding the model to focus on fine-grained structural cues while learning modality-invariant low-frequency anatomical representations. Furthermore, to reduce interference from modality-specific styles and improve the stability of structure representations, we design a style consistency loss and a structure-aware loss. Extensive experiments on the BraTS2020 and WMH datasets demonstrate that the proposed method outperforms existing MRI synthesis approaches, particularly in reconstructing complete structures. The source code is publicly available at https://github.com/ziyi-start/MSG-LDM.
Cosine similarity is often used to measure the similarity of vectors. These vectors might be the representations of neural network models. However, it is not guaranteed that cosine similarity of model representations will tell us anything about model behaviour. In this paper we show that when using a softmax classifier, be it an image classifier or an autoregressive language model, measuring the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that for any softmax classifier model, given two label representations, it is possible to make another model which gives the same probabilities for all labels and inputs, but where the cosine similarity between the representations is now either 1 or -1. We give specific examples of models with very high or low cosine simlarity between representations and show how to we can make equivalent models where the cosine similarity is now -1 or 1. This translation ambiguity can be fixed by centering the label representations, however, labels with representations with low cosine similarity can still have high probability for the same inputs. Fixing the length of the representations still does not give a guarantee that high(or low) cosine similarity will give high(or low) probability to the labels for the same inputs. This means that when working with softmax classifiers, cosine similarity values between label representations should not be used to explain model probabilities.
Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 +- 0.01, HD95 = 1.17 +- 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.
As plots play a critical role in modern data visualization and analysis, Plot2API is launched to help non-experts and beginners create their desired plots by directly recommending graphical APIs from reference plot images by neural networks. However, previous works on Plot2API have primarily focused on the recommendation for standard plot images, while overlooking the hand-drawn plot images that are more accessible to non-experts and beginners. To make matters worse, both Plot2API models trained on standard plot images and powerful multi-modal large language models struggle to effectively recommend APIs for hand-drawn plot images due to the domain gap and lack of expertise. To facilitate non-experts and beginners, we introduce a hand-drawn plot dataset named HDpy-13 to improve the performance of graphical API recommendations for hand-drawn plot images. Additionally, to alleviate the considerable strain of parameter growth and computational resource costs arising from multi-domain and multi-language challenges in Plot2API, we propose Plot-Adapter that allows for the training and storage of separate adapters rather than requiring an entire model for each language and domain. In particular, Plot-Adapter incorporates a lightweight CNN block to improve the ability to capture local features and implements projection matrix sharing to reduce the number of fine-tuning parameters further. Experimental results demonstrate both the effectiveness of HDpy-13 and the efficiency of Plot-Adapter.
Existing computational spectral imaging systems typically rely on coded aperture and beam splitters that block a substantial fraction of incident light, degrading reconstruction quality under light-starved conditions. To address this limitation, we develop the Oscillating Dispersion Imaging Spectrometer (ODIS), which for the first time achieves near-full light throughput by axially translating a disperser between the conjugate image plane and a defocused position, sequentially capturing a panchromatic (PAN) image and a dispersed measurement along a single optical path. We further propose a PAN-guided Dispersion-Aware Deep Unfolding Network (PDAUN) that recovers high-fidelity spectral information from maskless dispersion under PAN structural guidance. Its data-fidelity step derives an FFT-Woodbury preconditioned solver by exploiting the cyclic-convolution property of the ODIS forward model, while a Dispersion-Aware Deformable Convolution module (DADC) corrects sub-pixel spectral misalignment using PAN features. Experiments show state-of-the-art performance on standard benchmarks, and cross-system comparisons confirm that ODIS yields decisive gains under low illumination. High-fidelity reconstruction is validated on a physical prototype.