Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Synthetic Data Generation

What is Synthetic Data Generation? Synthetic data generation is the process of creating artificial data samples to augment or balance training datasets.

Physics-Informed Neural Networks and Sequence Encoder: Application to heating and early cooling of thermo-stamping process

Mar 27, 2026

Mouad Elaarabi, Domenico Borzacchiello, Philippe Le Bot, Nathan Lauzeral, Sebastien Comas-Cardona

Abstract:In a previous work (Elaarabi et al., 2025b), the Sequence Encoder for online dynamical system identification (Elaarabi et al., 2025a) and its combination with PINN (PINN-SE) were introduced and tested on both synthetic and real data case scenarios. The sequence encoder is able to effectively encode time series into feature vectors, which the PINN then uses to map to dynamical behavior, predicting system response under changes in parameters, ICs and BCs. Previously (Elaarabi et al., 2025b), the tests on real data were limited to simple 1D problems and only 1D time series inputs of the Sequence Encoder. In this work, the possibility of applying PINN-SE to a more realistic case is investigated: heating and early cooling of the thermo-stamping process, which is a critical stage in the forming process of continuous fiber reinforced composite materials with thermoplastic polymer. The possibility of extending the PINN-SE inputs to multimodal data, such as sequences of temporal 2D images and to scenarios involving variable geometries, is also explored. The results show that combining multiple encoders with the previously proposed method (Elaarabi et al., 2025b) is feasible, we also show that training the model on synthetic data generated based on experimental data can help the model to generalize well for real experimental data, unseen during the training phase.

Via

Access Paper or Ask Questions

DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation

Mar 27, 2026

Tomoya Miyawaki, Kazuto Nakashima, Yumi Iwashita, Ryo Kurazume

Abstract:LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at https://miya-tomoya.github.io/drum.

* ICRA 2026

Via

Access Paper or Ask Questions

Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

Mar 26, 2026

Adam Jakobsen, Sushant Gautam, Hugo Lewi Hammer, Susanne Olofsdotter, Miriam S Johanson, Pål Halvorsen, Vajira Thambawita

Abstract:AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.

* Submitted to CBMS 2026

Via

Access Paper or Ask Questions

Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Mar 26, 2026

Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich

Abstract:Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

* Preprint

Via

Access Paper or Ask Questions

Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception

Mar 26, 2026

Jingpei Lu, Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Omid Mohareri

Abstract:Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities. We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts smoke-free image and corresponding smoke map. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding over 80,000 paired samples for supervised training. We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments on both a public benchmark and our dataset demonstrate state-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.

* 8 pages, 4 figures, 3 tables

Via

Access Paper or Ask Questions

SoftMimicGen: A Data Generation System for Scalable Robot Learning in Deformable Object Manipulation

Mar 26, 2026

Masoud Moghani, Mahdi Azizian, Animesh Garg, Yuke Zhu, Sean Huver, Ajay Mandlekar

Abstract:Large-scale robot datasets have facilitated the learning of a wide range of robot manipulation skills, but these datasets remain difficult to collect and scale further, owing to the intractable amount of human time, effort, and cost required. Simulation and synthetic data generation have proven to be an effective alternative to fuel this need for data, especially with the advent of recent work showing that such synthetic datasets can dramatically reduce real-world data requirements and facilitate generalization to novel scenarios unseen in real-world demonstrations. However, this paradigm has been limited to rigid-body tasks, which are easy to simulate. Deformable object manipulation encompasses a large portion of real-world manipulation and remains a crucial gap to address towards increasing adoption of the synthetic simulation data paradigm. In this paper, we introduce SoftMimicGen, an automated data generation pipeline for deformable object manipulation tasks. We introduce a suite of high-fidelity simulation environments that encompasses a wide range of deformable objects (stuffed animal, rope, tissue, towel) and manipulation behaviors (high-precision threading, dynamic whipping, folding, pick-and-place), across four robot embodiments: a single-arm manipulator, bimanual arms, a humanoid, and a surgical robot. We apply SoftMimicGen to generate datasets across the task suite, train high-performing policies from the data, and systematically analyze the data generation system. Project website: \href{https://softmimicgen.github.io}{softmimicgen.github.io}.

Via

Access Paper or Ask Questions

Identifying Connectivity Distributions from Neural Dynamics Using Flows

Mar 27, 2026

Timothy Doyeon Kim, Ulises Pereira-Obilinovic, Yiliu Wang, Eric Shea-Brown, Uygar Sümbül

Abstract:Connectivity structure shapes neural computation, but inferring this structure from population recordings is degenerate: multiple connectivity structures can generate identical dynamics. Recent work uses low-rank recurrent neural networks (lrRNNs) to infer low-dimensional latent dynamics and connectivity structure from observed activity, enabling a mechanistic interpretation of the dynamics. However, standard approaches for training lrRNNs can recover spurious structures irrelevant to the underlying dynamics. We first characterize the identifiability of connectivity structures in lrRNNs and determine conditions under which a unique solution exists. Then, to find such solutions, we develop an inference framework based on maximum entropy and continuous normalizing flows (CNFs), trained via flow matching. Instead of estimating a single connectivity matrix, our method learns the maximally unbiased distribution over connection weights consistent with observed dynamics. This approach captures complex yet necessary distributions such as heavy-tailed connectivity found in empirical data. We validate our method on synthetic datasets with connectivity structures that generate multistable attractors, limit cycles, and ring attractors, and demonstrate its applicability in recordings from rat frontal cortex during decision-making. Our framework shifts circuit inference from recovering connectivity to identifying which connectivity structures are computationally required, and which are artifacts of underconstrained inference.

Via

Access Paper or Ask Questions

Zero-Shot Depth from Defocus

Mar 27, 2026

Yiming Zuo, Hongyu Wen, Venkat Subramanian, Patrick Chen, Karhan Kayan, Mario Bijelic, Felix Heide, Jia Deng

Abstract:Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experiment results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at https://zedd.cs.princeton.edu. The code and checkpoints are released at https://github.com/princeton-vl/FOSSA.

Via

Access Paper or Ask Questions

From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion

Mar 27, 2026

Dávid Pukanec, Tibor Kubík, Michal Španěl

Abstract:We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of an automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: https://github.com/ikarus1211/VISAPP_ToothCraft

* VISAPP 2026 Conference

Via

Access Paper or Ask Questions

Generative Score Inference for Multimodal Data

Mar 27, 2026

Xinyu Tian, Xiaotong Shen

Abstract:Accurate uncertainty quantification is crucial for making reliable decisions in various supervised learning scenarios, particularly when dealing with complex, multimodal data such as images and text. Current approaches often face notable limitations, including rigid assumptions and limited generalizability, constraining their effectiveness across diverse supervised learning tasks. To overcome these limitations, we introduce Generative Score Inference (GSI), a flexible inference framework capable of constructing statistically valid and informative prediction and confidence sets across a wide range of multimodal learning problems. GSI utilizes synthetic samples generated by deep generative models to approximate conditional score distributions, facilitating precise uncertainty quantification without imposing restrictive assumptions about the data or tasks. We empirically validate GSI's capabilities through two representative scenarios: hallucination detection in large language models and uncertainty estimation in image captioning. Our method achieves state-of-the-art performance in hallucination detection and robust predictive uncertainty in image captioning, and its performance is positively influenced by the quality of the underlying generative model. These findings underscore the potential of GSI as a versatile inference framework, significantly enhancing uncertainty quantification and trustworthiness in multimodal learning.

* 25 pages, 4 figures

Via

Access Paper or Ask Questions

Topic:Synthetic Data Generation

Papers and Code