Purdue
Abstract:The proliferation of unmanned aerial vehicle (UAV) swarms has enabled a wide range of mission-critical applications, but also exposes UAV networks to severe Denial-of-Service (DoS) threats due to their open wireless environment, dynamic topology, and resource constraints. Traditional static or centralized defense mechanisms are often inadequate for such dynamic and distributed scenarios. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive and adaptive DoS mitigation in UAV swarm networks. Specifically, we design three lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, that leverage the inherent flexibility of UAV swarms to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process (POMDP), capturing the distributed, resource-constrained, and uncertain nature of UAV swarms under attack. Each UAV is equipped with a local policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based FMADRL algorithm, UAVs collaboratively optimize their defense policies via reward-weighted aggregation, enabling distributed learning without sharing raw data and thus reducing communication overhead. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, while maintaining robust mission continuity under various DoS attack strategies.
Abstract:In recent years, model collapse has become a critical issue in language model training, making it essential to understand the underlying mechanisms driving this phenomenon. In this paper, we investigate recursive parametric model training from a probabilistic perspective, aiming to characterize the conditions under which model collapse occurs and, crucially, how it can be mitigated. We conceptualize the recursive training process as a random walk of the model estimate, highlighting how the sample size influences the step size and how the estimation procedure determines the direction and potential bias of the random walk. Under mild conditions, we rigorously show that progressively increasing the sample size at each training step is necessary to prevent model collapse. In particular, when the estimation is unbiased, the required growth rate follows a superlinear pattern. This rate needs to be accelerated even further in the presence of substantial estimation bias. Building on this probabilistic framework, we also investigate the probability that recursive training on synthetic data yields models that outperform those trained solely on real data. Moreover, we extend these results to general parametric model family in an asymptotic regime. Finally, we validate our theoretical results through extensive simulations and a real-world dataset.
Abstract:Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
Abstract:Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies have become central challenges in generative model research. In this paper, we investigate this phenomenon theoretically within a novel framework, where generative models are iteratively trained on a combination of newly collected real data and synthetic data from the previous training step. To develop an optimal training strategy for integrating real and synthetic data, we evaluate the performance of a weighted training scheme in various scenarios, including Gaussian distribution estimation and linear regression. We theoretically characterize the impact of the mixing proportion and weighting scheme of synthetic data on the final model's performance. Our key finding is that, across different settings, the optimal weighting scheme under different proportions of synthetic data asymptotically follows a unified expression, revealing a fundamental trade-off between leveraging synthetic data and generative model performance. Notably, in some cases, the optimal weight assigned to real data corresponds precisely to the reciprocal of the golden ratio. Finally, we validate our theoretical results on extensive simulated datasets and a real tabular dataset.
Abstract:The rapid evolution of cloud computing technologies and the increasing number of cloud applications have provided a large number of benefits in daily lives. However, the diversity and complexity of different components pose a significant challenge to cloud security, especially when dealing with sophisticated and advanced cyberattacks. Recent advancements in generative foundation models (GFMs), particularly in the large language models (LLMs), offer promising solutions for security intelligence. By exploiting the powerful abilities in language understanding, data analysis, task inference, action planning, and code generation, we present LLM-PD, a novel proactive defense architecture that defeats various threats in a proactive manner. LLM-PD can efficiently make a decision through comprehensive data analysis and sequential reasoning, as well as dynamically creating and deploying actionable defense mechanisms on the target cloud. Furthermore, it can flexibly self-evolve based on experience learned from previous interactions and adapt to new attack scenarios without additional training. The experimental results demonstrate its remarkable ability in terms of defense effectiveness and efficiency, particularly highlighting an outstanding success rate when compared with other existing methods.
Abstract:In this paper, we propose a novel statistical framework for watermarking generative categorical data. Our method systematically embeds pre-agreed secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other, ensuring the watermark is embedded at the distribution-level. To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution. Unlike previous categorical watermarking methods, which primarily focus on embedding watermarks into a given dataset, our approach operates at the distribution-level, allowing for verification from a statistical distributional perspective. This makes it particularly well-suited for the modern paradigm of synthetic data generation, where the underlying data distribution, rather than specific data points, is of primary importance. The effectiveness of our method is demonstrated through both theoretical analysis and empirical validation.
Abstract:Data collaboration via Data Clean Room offers value but raises privacy concerns, which can be addressed through synthetic data and multi-table synthesizers. Common multi-table synthesizers fail to perform when subjects occur repeatedly in both tables. This is an urgent yet unresolved problem, since having both tables with repeating subjects is common. To improve performance in this scenario, we present the DEREC 3-step pre-processing pipeline to generalize adaptability of multi-table synthesizers. We also introduce the SIMPRO 3-aspect evaluation metrics, which leverage conditional distribution and large-scale simultaneous hypothesis testing to provide comprehensive feedback on synthetic data fidelity at both column and table levels. Results show that using DEREC improves fidelity, and multi-table synthesizers outperform single-table counterparts in collaboration settings. Together, the DEREC-SIMPRO pipeline offers a robust solution for generalizing data collaboration, promoting a more efficient, data-driven society.
Abstract:In the current era of big data and machine learning, it's essential to find ways to shrink the size of training dataset while preserving the training performance to improve efficiency. However, the challenge behind it includes providing practical ways to find points that can be deleted without significantly harming the training result and suffering from problems like underfitting. We therefore present the perfect deleted point problem for 1-step noisy SGD in the classical linear regression task, which aims to find the perfect deleted point in the training dataset such that the model resulted from the deleted dataset will be identical to the one trained without deleting it. We apply the so-called signal-to-noise ratio and suggest that its value is closely related to the selection of the perfect deleted point. We also implement an algorithm based on this and empirically show the effectiveness of it in a synthetic dataset. Finally we analyze the consequences of the perfect deleted point, specifically how it affects the training performance and privacy budget, therefore highlighting its potential. This research underscores the importance of data deletion and calls for urgent need for more studies in this field.
Abstract:In this paper, we leverage the power of latent diffusion models to generate synthetic time series tabular data. Along with the temporal and feature correlations, the heterogeneous nature of the feature in the table has been one of the main obstacles in time series tabular data modeling. We tackle this problem by combining the ideas of the variational auto-encoder (VAE) and the denoising diffusion probabilistic model (DDPM). Our model named as \texttt{TimeAutoDiff} has several key advantages including (1) Generality: the ability to handle the broad spectrum of time series tabular data from single to multi-sequence datasets; (2) Good fidelity and utility guarantees: numerical experiments on six publicly available datasets demonstrating significant improvements over state-of-the-art models in generating time series tabular data, across four metrics measuring fidelity and utility; (3) Fast sampling speed: entire time series data generation as opposed to the sequential data sampling schemes implemented in the existing diffusion-based models, eventually leading to significant improvements in sampling speed, (4) Entity conditional generation: the first implementation of conditional generation of multi-sequence time series tabular data with heterogenous features in the literature, enabling scenario exploration across multiple scientific and engineering domains. Codes are in preparation for release to the public, but available upon request.
Abstract:The evaluation of synthetic data generation is crucial, especially in the retail sector where data accuracy is paramount. This paper introduces a comprehensive framework for assessing synthetic retail data, focusing on fidelity, utility, and privacy. Our approach differentiates between continuous and discrete data attributes, providing precise evaluation criteria. Fidelity is measured through stability and generalizability. Stability ensures synthetic data accurately replicates known data distributions, while generalizability confirms its robustness in novel scenarios. Utility is demonstrated through the synthetic data's effectiveness in critical retail tasks such as demand forecasting and dynamic pricing, proving its value in predictive analytics and strategic planning. Privacy is safeguarded using Differential Privacy, ensuring synthetic data maintains a perfect balance between resembling training and holdout datasets without compromising security. Our findings validate that this framework provides reliable and scalable evaluation for synthetic retail data. It ensures high fidelity, utility, and privacy, making it an essential tool for advancing retail data science. This framework meets the evolving needs of the retail industry with precision and confidence, paving the way for future advancements in synthetic data methodologies.