Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiahui Wu

Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations

Mar 14, 2026

Jiahui Wu

Abstract:Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We selected MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.

* 8 pages, 4 figures, Under ICCC'26 review

Via

Access Paper or Ask Questions

UAMTERS: Uncertainty-Aware Mutation Analysis for DL-enabled Robotic Software

Feb 23, 2026

Chengjie Lu, Jiahui Wu, Shaukat Ali, Malaika Din Hashmi, Sebastian Mathias Thomle Mason, Francois Picard, Mikkel Labori Olsen, Thomas Peyrucain

Abstract:Self-adaptive robots adjust their behaviors in response to unpredictable environmental changes. These robots often incorporate deep learning (DL) components into their software to support functionality such as perception, decision-making, and control, enhancing autonomy and self-adaptability. However, the inherent uncertainty of DL-enabled software makes it challenging to ensure its dependability in dynamic environments. Consequently, test generation techniques have been developed to test robot software, and classical mutation analysis injects faults into the software to assess the test suite's effectiveness in detecting the resulting failures. However, there is a lack of mutation analysis techniques to assess the effectiveness under the uncertainty inherent to DL-enabled software. To this end, we propose UAMTERS, an uncertainty-aware mutation analysis framework that introduces uncertainty-aware mutation operators to explicitly inject stochastic uncertainty into DL-enabled robotic software, simulating uncertainty in its behavior. We further propose mutation score metrics to quantify a test suite's ability to detect failures under varying levels of uncertainty. We evaluate UAMTERS across three robotic case studies, demonstrating that UAMTERS more effectively distinguishes test suite quality and captures uncertainty-induced failures in DL-enabled software.

* 23 pages, 6 figures, 7 tables

Via

Access Paper or Ask Questions

Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits

Jan 28, 2026

Zelong Sun, Jiahui Wu, Ying Ba, Dong Jing, Zhiwu Lu

Abstract:As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through an Large Vison-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance.

Via

Access Paper or Ask Questions

StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

Aug 07, 2025

Xiangxiang Zhang, Jingxuan Wei, Donghong Zhong, Qi Chen, Caijun Jia, Cheng Tan, Jinming Gu, Xiaobo Qin, Zhiping Liu, Liang Hu(+24 more)

Figure 1 for StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

Figure 2 for StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

Figure 3 for StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

Figure 4 for StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

Abstract:Existing Vision-Language Models often struggle with complex, multi-question reasoning tasks where partial correctness is crucial for effective learning. Traditional reward mechanisms, which provide a single binary score for an entire response, are too coarse to guide models through intricate problems with multiple sub-parts. To address this, we introduce StructVRM, a method that aligns multimodal reasoning with Structured and Verifiable Reward Models. At its core is a model-based verifier trained to provide fine-grained, sub-question-level feedback, assessing semantic and mathematical equivalence rather than relying on rigid string matching. This allows for nuanced, partial credit scoring in previously intractable problem formats. Extensive experiments demonstrate the effectiveness of StructVRM. Our trained model, Seed-StructVRM, achieves state-of-the-art performance on six out of twelve public multimodal benchmarks and our newly curated, high-difficulty STEM-Bench. The success of StructVRM validates that training with structured, verifiable rewards is a highly effective approach for advancing the capabilities of multimodal models in complex, real-world reasoning domains.

Via

Access Paper or Ask Questions

UniPTMs: The First Unified Multi-type PTM Site Prediction Model via Master-Slave Architecture-Based Multi-Stage Fusion Strategy and Hierarchical Contrastive Loss

Jun 05, 2025

Yiyu Lin, Yan Wang, You Zhou, Xinye Ni, Jiahui Wu, Sen Yang

Abstract:As a core mechanism of epigenetic regulation in eukaryotes, protein post-translational modifications (PTMs) require precise prediction to decipher dynamic life activity networks. To address the limitations of existing deep learning models in cross-modal feature fusion, domain generalization, and architectural optimization, this study proposes UniPTMs: the first unified framework for multi-type PTM prediction. The framework innovatively establishes a "Master-Slave" dual-path collaborative architecture: The master path dynamically integrates high-dimensional representations of protein sequences, structures, and evolutionary information through a Bidirectional Gated Cross-Attention (BGCA) module, while the slave path optimizes feature discrepancies and recalibration between structural and traditional features using a Low-Dimensional Fusion Network (LDFN). Complemented by a Multi-scale Adaptive convolutional Pyramid (MACP) for capturing local feature patterns and a Bidirectional Hierarchical Gated Fusion Network (BHGFN) enabling multi-level feature integration across paths, the framework employs a Hierarchical Dynamic Weighting Fusion (HDWF) mechanism to intelligently aggregate multimodal features. Enhanced by a novel Hierarchical Contrastive loss function for feature consistency optimization, UniPTMs demonstrates significant performance improvements (3.2%-11.4% MCC and 4.2%-14.3% AP increases) over state-of-the-art models across five modification types and transcends the Single-Type Prediction Paradigm. To strike a balance between model complexity and performance, we have also developed a lightweight variant named UniPTMs-mini.

Via

Access Paper or Ask Questions

Network-Based Transfer Learning Helps Improve Short-Term Crime Prediction Accuracy

Jun 10, 2024

Jiahui Wu, Vanessa Frias-Martinez

Figure 1 for Network-Based Transfer Learning Helps Improve Short-Term Crime Prediction Accuracy

Figure 2 for Network-Based Transfer Learning Helps Improve Short-Term Crime Prediction Accuracy

Figure 3 for Network-Based Transfer Learning Helps Improve Short-Term Crime Prediction Accuracy

Figure 4 for Network-Based Transfer Learning Helps Improve Short-Term Crime Prediction Accuracy

Abstract:Deep learning architectures enhanced with human mobility data have been shown to improve the accuracy of short-term crime prediction models trained with historical crime data. However, human mobility data may be scarce in some regions, negatively impacting the correct training of these models. To address this issue, we propose a novel transfer learning framework for short-term crime prediction models, whereby weights from the deep learning crime prediction models trained in source regions with plenty of mobility data are transferred to target regions to fine-tune their local crime prediction models and improve crime prediction accuracy. Our results show that the proposed transfer learning framework improves the F1 scores for target cities with mobility data scarcity, especially when the number of months of available mobility data is small. We also show that the F1 score improvements are pervasive across different types of crimes and diverse cities in the US.

* 19 pages, 3 figures, 7 tables. arXiv admin note: substantial text overlap with arXiv:2406.04382

Via

Access Paper or Ask Questions

Improving the Fairness of Deep-Learning, Short-term Crime Prediction with Under-reporting-aware Models

Jun 06, 2024

Jiahui Wu, Vanessa Frias-Martinez

Figure 1 for Improving the Fairness of Deep-Learning, Short-term Crime Prediction with Under-reporting-aware Models

Figure 2 for Improving the Fairness of Deep-Learning, Short-term Crime Prediction with Under-reporting-aware Models

Figure 3 for Improving the Fairness of Deep-Learning, Short-term Crime Prediction with Under-reporting-aware Models

Figure 4 for Improving the Fairness of Deep-Learning, Short-term Crime Prediction with Under-reporting-aware Models

Abstract:Deep learning crime predictive tools use past crime data and additional behavioral datasets to forecast future crimes. Nevertheless, these tools have been shown to suffer from unfair predictions across minority racial and ethnic groups. Current approaches to address this unfairness generally propose either pre-processing methods that mitigate the bias in the training datasets by applying corrections to crime counts based on domain knowledge or in-processing methods that are implemented as fairness regularizers to optimize for both accuracy and fairness. In this paper, we propose a novel deep learning architecture that combines the power of these two approaches to increase prediction fairness. Our results show that the proposed model improves the fairness of crime predictions when compared to models with in-processing de-biasing approaches and with models without any type of bias correction, albeit at the cost of reducing accuracy.

* 25 pages, 4 figures

Via

Access Paper or Ask Questions

Random-coupled Neural Network

Mar 26, 2024

Haoran Liu, Mingzhe Liu, Peng Li, Jiahui Wu, Xin Jiang, Zhuo Zuo, Bingqi Liu

Abstract:Improving the efficiency of current neural networks and modeling them in biological neural systems have become popular research directions in recent years. Pulse-coupled neural network (PCNN) is a well applicated model for imitating the computation characteristics of the human brain in computer vision and neural network fields. However, differences between the PCNN and biological neural systems remain: limited neural connection, high computational cost, and lack of stochastic property. In this study, random-coupled neural network (RCNN) is proposed. It overcomes these difficulties in PCNN's neuromorphic computing via a random inactivation process. This process randomly closes some neural connections in the RCNN model, realized by the random inactivation weight matrix of link input. This releases the computational burden of PCNN, making it affordable to achieve vast neural connections. Furthermore, the image and video processing mechanisms of RCNN are researched. It encodes constant stimuli as periodic spike trains and periodic stimuli as chaotic spike trains, the same as biological neural information encoding characteristics. Finally, the RCNN is applicated to image segmentation, fusion, and pulse shape discrimination subtasks. It is demonstrated to be robust, efficient, and highly anti-noised, with outstanding performance in all applications mentioned above.

Via

Access Paper or Ask Questions