Abstract: Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, which allows single-response generation at repository scale and lets the model learn cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the largest gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. One-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.
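
The path-preserving JSON encoding mentioned above can be illustrated with a minimal sketch. The abstract does not give the actual schema, so everything here is an assumption: the nested-dictionary layout, the function names `encode_tree`, `decode_tree`, and `structural_fidelity`, and the example file names are all hypothetical. In particular, the fidelity score below is a simple path-set overlap and is not necessarily the metric defined in the paper.

```python
import json
from pathlib import PurePosixPath


def encode_tree(files: dict[str, str]) -> dict:
    """Nest a flat {relative_path: content} mapping into a folder-shaped dict."""
    root: dict = {}
    for path, content in sorted(files.items()):
        parts = PurePosixPath(path).parts
        node = root
        for folder in parts[:-1]:
            # Create intermediate folders on demand so the path is preserved.
            node = node.setdefault(folder, {})
        node[parts[-1]] = content
    return root


def decode_tree(tree: dict, prefix: str = "") -> dict[str, str]:
    """Flatten the nested dict back into {relative_path: content}."""
    flat: dict[str, str] = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(decode_tree(value, path))
        else:
            flat[path] = value
    return flat


def structural_fidelity(expected: dict[str, str], generated: dict[str, str]) -> float:
    """Hypothetical score: Jaccard overlap of the two sets of file paths."""
    exp_paths, gen_paths = set(expected), set(generated)
    if not exp_paths and not gen_paths:
        return 1.0
    return len(exp_paths & gen_paths) / len(exp_paths | gen_paths)


# Hypothetical DSL project with two folders and three files.
files = {
    "config.dsl": "settings {}",
    "model/vehicle.dsl": "entity Vehicle {}",
    "model/parts/engine.dsl": "entity Engine {}",
}
tree = encode_tree(files)
payload = json.dumps(tree, indent=2)  # one JSON document for the whole repository
restored = decode_tree(json.loads(payload))
```

Because the whole folder tree travels as one JSON document, a model could in principle emit every changed file in a single response, and the decoded paths can be compared against the reference layout.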




Abstract: This work demonstrates the ability to produce readily interpretable statistical metrics for model fit, fixed-effects covariance coefficients, and prediction confidence. Importantly, it compares four suitable and commonly applied epistemic UQ approaches (BNN, SWAG, MC dropout, and ensembles) in their ability to calculate these statistical metrics for the ARMED MEDL models. In our experiment on AD prognosis, not only do the UQ methods provide these benefits, but several of them also maintain the high performance of the original ARMED method, and some even provide a modest (but not statistically significant) performance improvement. The ensemble models, especially the ensemble with 90% subsampling, performed well across all metrics we tested: they (1) achieved performance comparable to the non-UQ ARMED model, (2) properly deweighted the confound probes and assigned them statistically insignificant p-values, and (3) attained relatively high calibration of the output prediction confidence. Based on these results, the ensemble approaches, especially with 90% subsampling, provided the best all-round performance for prediction and uncertainty estimation, and achieved our goals of providing statistical significance for model fit, statistical significance for covariate coefficients, and confidence in prediction, while maintaining the baseline performance of MEDL using ARMED.
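
To make the ensemble and calibration ideas above concrete, here is a minimal sketch under stated assumptions. The abstract does not say how the subsamples are drawn or which calibration measure is used, so sampling without replacement, the binary-classification setting, and the use of expected calibration error (ECE) are all assumptions, and the function names are hypothetical. It does not reproduce the ARMED model itself.

```python
import numpy as np


def subsample_indices(n_samples, frac=0.9, seed=0):
    """Draw one training subsample (assumed: without replacement)."""
    rng = np.random.default_rng(seed)
    size = int(round(frac * n_samples))
    return rng.choice(n_samples, size=size, replace=False)


def ensemble_confidence(member_probs):
    """Average positive-class probabilities over ensemble members.

    member_probs has shape (n_members, n_samples).
    """
    return np.asarray(member_probs, dtype=float).mean(axis=0)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions: weighted gap between confidence and accuracy."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)
    # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

In use, each ensemble member would be trained on its own `subsample_indices` draw (with a different seed), the members' predicted probabilities would be averaged with `ensemble_confidence`, and a lower ECE on held-out data would indicate better-calibrated confidence.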