Abstract: Relational databases are central to modern data management, yet most data exists in unstructured forms such as text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.
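The abstract does not enumerate SQUiD's four stages, so the following is only a minimal sketch of the overall schema-then-population idea it describes: one LLM call proposes a relational schema as executable DDL, a second call extracts rows conforming to it, and a symbolic component (here SQLite) rejects ill-formed output. The call_llm helper is a hypothetical stand-in for any chat-completion client; the prompts and database layout are illustrative assumptions, not the authors' method.

import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real API call."""
    raise NotImplementedError

def synthesize_database(document: str, db_path: str = "synth.db") -> None:
    # Assumed first step: derive a schema from the raw text as executable DDL.
    ddl = call_llm(
        "Read the document and output only SQLite CREATE TABLE statements "
        f"for a schema that captures its facts:\n\n{document}"
    )
    # Assumed second step: populate the schema with rows extracted from the text.
    inserts = call_llm(
        "Given this schema:\n" + ddl +
        "\noutput only INSERT statements for the facts in the document:\n\n"
        + document
    )
    conn = sqlite3.connect(db_path)
    conn.executescript(ddl)      # symbolic check: the DDL must parse and execute
    conn.executescript(inserts)  # ill-formed extractions fail here, not silently
    conn.commit()
    conn.close()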
Abstract: Cyber attacks are often identified from system and network logs. Significant prior work has used provenance graphs and machine learning techniques to detect attacks, in particular advanced persistent threats (APTs), which are notoriously difficult to detect. More recently, transformer-based language models have been applied to detect various types of attacks from system logs, but no such attempt has been made for APTs. In addition, existing state-of-the-art techniques that use system provenance graphs lack a data processing framework that generalizes across datasets for optimal performance. To mitigate this limitation and to explore the effectiveness of transformer-based language models, this paper proposes LogShield, a framework that detects APT attack patterns by leveraging the power of self-attention in transformers. We incorporate customized embedding layers to effectively capture the context of event sequences derived from provenance graphs. While training transformer networks incurs computational overhead, our framework surpasses existing LSTM-based and language models at APT detection. We adopted the model parameters and training procedure of RoBERTa and conducted extensive experiments on well-known APT datasets (DARPA OpTC and DARPA TC E3). Our framework achieved F1 scores of 98% and 95% on the two datasets respectively, surpassing the 96% and 94% obtained by LSTM models. Our findings suggest that LogShield's performance benefits from larger datasets and demonstrate its potential to generalize across diverse domains. These findings advance APT attack detection methods and underscore the significance of transformer-based architectures in addressing security challenges in computer systems.
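To make the architecture described above concrete, here is a minimal sketch (not the authors' code) of a transformer encoder with customized embedding layers over a vocabulary of provenance-graph event types, classifying an event sequence as benign or APT-related. The vocabulary size, model dimensions, learned positional embeddings (RoBERTa-style), and mean-pooled classification head are illustrative assumptions.

import torch
import torch.nn as nn

class EventSequenceClassifier(nn.Module):
    def __init__(self, num_event_types=512, d_model=256, nhead=8,
                 num_layers=4, max_len=1024, num_classes=2):
        super().__init__()
        # Customized embeddings: one table for the event-type token and one
        # for its position in the sequence (learned, as in RoBERTa).
        self.event_emb = nn.Embedding(num_event_types, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, event_ids, padding_mask=None):
        # event_ids: (batch, seq_len) integer IDs of events derived from a
        # provenance graph; padding_mask: True at padded positions.
        positions = torch.arange(event_ids.size(1), device=event_ids.device)
        x = self.event_emb(event_ids) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        # Mean-pool over non-padded positions, then classify the sequence.
        if padding_mask is not None:
            keep = (~padding_mask).unsqueeze(-1).float()
            pooled = (x * keep).sum(1) / keep.sum(1).clamp(min=1)
        else:
            pooled = x.mean(dim=1)
        return self.classifier(pooled)

# Usage: score a batch of two event sequences of length 16.
model = EventSequenceClassifier()
ids = torch.randint(0, 512, (2, 16))
logits = model(ids)  # shape: (2, 2)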