Yang Ba

Beyond Predefined Schemas: TRACE-KG for Context-Enriched Knowledge Graphs from Complex Documents

Apr 03, 2026

Mohammad Sadeq Abolhasani, Yang Ba, Yixuan He, Rong Pan

Abstract: Knowledge graph construction typically relies either on predefined ontologies or on schema-free extraction. Ontology-driven pipelines enforce consistent typing but require costly schema design and maintenance, whereas schema-free methods often produce fragmented graphs with weak global organization, especially in long technical documents with dense, context-dependent information. We propose TRACE-KG (Text-dRiven schemA for Context-Enriched Knowledge Graphs), a multimodal framework that jointly constructs a context-enriched knowledge graph and an induced schema without assuming a predefined ontology. TRACE-KG captures conditional relations through structured qualifiers and organizes entities and relations using a data-driven schema that serves as a reusable semantic scaffold while preserving full traceability to the source evidence. Experiments show that TRACE-KG produces structurally coherent, traceable knowledge graphs and offers a practical alternative to both ontology-driven and schema-free construction pipelines.

MemArchitect: A Policy Driven Memory Governance Layer

Mar 18, 2026

Lingavasan Suresh Kumar, Yang Ba, Rong Pan

Abstract: Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as passive storage, lacking mechanisms to resolve contradictions, enforce privacy, or prevent outdated information ("zombie memories") from contaminating the context window. We introduce MemArchitect, a governance layer that decouples memory lifecycle management from model weights. MemArchitect enforces explicit, rule-based policies, including memory decay, conflict resolution, and privacy controls. We demonstrate that governed memory consistently outperforms unmanaged memory in agentic settings, highlighting the necessity of structured memory governance for reliable and safe autonomous systems.

* This is ongoing research and will be updated periodically.
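
To make the idea of a rule-based memory decay policy concrete, here is a minimal sketch. It is not MemArchitect's implementation: the abstract does not specify how decay is computed, so the exponential half-life rule and every name below (`MemoryItem`, `decayed_score`, `apply_decay_policy`, `half_life`, `min_score`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    """Hypothetical memory record; the fields are illustrative assumptions."""
    text: str
    created_at: float        # timestamp in seconds
    importance: float = 1.0  # base weight before decay


def decayed_score(item: MemoryItem, now: float, half_life: float) -> float:
    """Halve the item's importance for every `half_life` seconds of age."""
    age = max(0.0, now - item.created_at)
    return item.importance * 0.5 ** (age / half_life)


def apply_decay_policy(items, now, half_life=86400.0, min_score=0.1):
    """Keep only items whose decayed score is still at least `min_score`."""
    return [m for m in items if decayed_score(m, now, half_life) >= min_score]
```

Under these assumptions, an item that is four half-lives old has a score of 0.0625 and is dropped at the default threshold, which is one simple way to keep stale entries out of the context window.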

Measuring Dataset Diversity from a Geometric Perspective

Feb 10, 2026

Yang Ba, Mohammad Sadeq Abolhasani, Michelle V Mancenido, Rong Pan

Abstract: Diversity can be broadly defined as the presence of meaningful variation across elements, which can be viewed from multiple perspectives, including statistical variation and geometric structural richness in the dataset. Existing diversity metrics, such as feature-space dispersion and metric-space magnitude, primarily capture distributional variation or entropy, while largely neglecting the geometric structure of datasets. To address this gap, we introduce a framework based on topological data analysis (TDA) and persistence landscapes (PLs) to extract and quantify geometric features from data. This approach provides a theoretically grounded means of measuring diversity beyond entropy, capturing the rich geometric and structural properties of datasets. Through extensive experiments across diverse modalities, we demonstrate that our proposed PLs-based diversity metric (PLDiv) is powerful, reliable, and interpretable, directly linking data diversity to its underlying geometry and offering a foundational tool for dataset construction, augmentation, and evaluation.

How Does Data Diversity Shape the Weight Landscape of Neural Networks?

Oct 18, 2024

Yang Ba, Michelle V. Mancenido, Rong Pan

Abstract: To enhance the generalization of machine learning models to unseen data, techniques such as dropout, weight decay ($L_2$ regularization), and noise augmentation are commonly employed. While regularization methods (i.e., dropout and weight decay) are geared toward adjusting model parameters to prevent overfitting, data augmentation increases the diversity of the input training set, a method purported to improve accuracy and reduce calibration error. In this paper, we investigate the impact of each of these techniques on the parameter space of neural networks, with the goal of understanding how they alter the weight landscape in transfer learning scenarios. To accomplish this, we employ Random Matrix Theory to analyze the eigenvalue distributions of pre-trained models, fine-tuned with these techniques but at different levels of data diversity, for the same downstream tasks. We observe that diverse data influences the weight landscape in a manner similar to dropout. Additionally, we compare commonly used data augmentation methods with synthetic data created by generative models. We conclude that synthetic data can bring more diversity into real input data, resulting in better performance on out-of-distribution test instances.
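
As a rough illustration of what an eigenvalue-distribution analysis can look like, the sketch below uses the Marchenko-Pastur law, a standard Random Matrix Theory reference for purely random weights. The abstract does not say which RMT diagnostics the paper uses, so this choice, along with the function names and the `sigma2` parameter, is an assumption. For an N x M weight matrix W (N >= M) with i.i.d. entries of variance `sigma2`, the eigenvalues of (1/N) WᵀW concentrate between the two edges computed here; eigenvalues above the upper edge are often read as learned structure.

```python
import math


def marchenko_pastur_edges(n_rows, n_cols, sigma2=1.0):
    """Bulk edges of the Marchenko-Pastur law for (1/N) W^T W, with N >= M."""
    if n_rows < n_cols:
        raise ValueError("expected n_rows >= n_cols; transpose the matrix first")
    root = math.sqrt(n_cols / n_rows)
    return sigma2 * (1.0 - root) ** 2, sigma2 * (1.0 + root) ** 2


def outlier_fraction(eigenvalues, n_rows, n_cols, sigma2=1.0):
    """Fraction of the given eigenvalues lying above the upper bulk edge."""
    _, upper = marchenko_pastur_edges(n_rows, n_cols, sigma2)
    above = sum(1 for lam in eigenvalues if lam > upper)
    return above / len(eigenvalues)
```

The eigenvalues themselves would come from a numerical routine such as `numpy.linalg.eigvalsh` applied to (1/N) WᵀW for a layer's weight matrix; comparing the outlier fraction across fine-tuning settings is one simple way to contrast their weight spectra.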

Fill In The Gaps: Model Calibration and Generalization with Synthetic Data

Oct 07, 2024

Yang Ba, Michelle V. Mancenido, Rong Pan

Abstract: As machine learning models continue to advance swiftly, calibrating them has become a major concern prior to practical and widespread deployment. Existing calibration methods often degrade model accuracy because the validation data lacks diversity, resulting in reduced generalizability. To address this, we propose a calibration method that incorporates synthetic data without compromising accuracy. We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework. Large language models (LLMs), known for their ability to mimic real data and generate text with mixed class labels, are utilized as a synthetic data generation strategy to lower the ECE bound and improve model accuracy on real test data. Additionally, we propose data generation mechanisms for efficient calibration. Testing our method on four different natural language processing tasks, we observed an average increase in accuracy of up to 34% and an average decrease in ECE of up to 33%.

* Accepted to EMNLP 2024 Main Conference (Long paper)
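
For readers unfamiliar with the metric, the following is a minimal sketch of the commonly used binned estimate of expected calibration error: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. The abstract does not state the paper's exact estimator, so the equal-width binning, the half-open bin convention, and the default `n_bins=10` are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (|bin| / n) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assumption: equal-width bins [k/n_bins, (k+1)/n_bins); 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

For example, four predictions made with confidence 0.9, of which three are correct, fall into a single bin with accuracy 0.75, giving an ECE of about 0.15.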
