Weijie Xu

Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation

Oct 28, 2023
Yixin Wan, Fanyou Wu, Weijie Xu, Srinivasan H. Sengamedu

Model hallucination has been a crucial focus of research in Natural Language Generation (NLG). In this work, we propose sequence-level certainty as a common theme over hallucination in NLG, and explore the correlation between sequence-level certainty and the level of hallucination in model responses. We categorize sequence-level certainty into two aspects: probabilistic certainty and semantic certainty, and reveal through experiments on the Knowledge-Grounded Dialogue Generation (KGDG) task that both a higher level of probabilistic certainty and a higher level of semantic certainty in model responses are significantly correlated with a lower level of hallucination. Moreover, we provide theoretical proof and analysis to show that semantic certainty is a good estimator of probabilistic certainty, and therefore has the potential to serve as an alternative to probability-based certainty estimation in black-box scenarios. Based on these observations on the relationship between certainty and hallucination, we further propose Certainty-based Response Ranking (CRR), a decoding-time method for mitigating hallucination in NLG. Following our categorization of sequence-level certainty, we propose two types of CRR approaches: Probabilistic CRR (P-CRR) and Semantic CRR (S-CRR). P-CRR ranks individually sampled model responses by the arithmetic mean log-probability of the entire sequence. S-CRR approaches certainty estimation from the meaning space, and ranks a number of model response candidates by their semantic certainty level, which is estimated with the entailment-based Agreement Score (AS). Through extensive experiments across 3 KGDG datasets, 3 decoding methods, and 4 different models, we validate the effectiveness of our two proposed CRR methods in reducing model hallucination.
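
As a rough sketch of the two ranking schemes (not the authors' released implementation), the Python below assumes per-token log-probabilities from the generator and a hypothetical `entails(premise, hypothesis)` predicate backed by an off-the-shelf NLI model; the Agreement Score is simplified here to a mutual-entailment count.

```python
# Illustrative sketch of P-CRR and S-CRR (assumptions, not the paper's code).
import numpy as np

def p_crr_rank(token_logprobs: list[list[float]]) -> int:
    """P-CRR: rank sampled responses by the arithmetic mean
    log-probability of their tokens; return the best candidate's index."""
    return int(np.argmax([np.mean(lp) for lp in token_logprobs]))

def s_crr_rank(candidates: list[str], entails) -> int:
    """S-CRR: rank candidates by a simplified Agreement Score, here the
    number of other candidates each one mutually entails."""
    n = len(candidates)
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j and entails(candidates[i], candidates[j]) \
                      and entails(candidates[j], candidates[i]):
                scores[i] += 1
    return int(np.argmax(scores))

# Toy usage: three sampled responses with per-token log-probs and a
# stand-in entailment predicate (a real system would call an NLI model).
responses = ["Paris is the capital.", "It is Paris.", "I think it's Lyon."]
logps = [[-0.2, -0.1, -0.3], [-0.5, -0.4], [-1.2, -0.9, -1.1]]
entails = lambda a, b: ("Paris" in a) == ("Paris" in b)
print(p_crr_rank(logps), s_crr_rank(responses, entails))
```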

DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM

Oct 23, 2023
Weijie Xu, Wenxiang Hu, Fanyou Wu, Srinivasan Sengamedu

In the burgeoning field of natural language processing, Neural Topic Models (NTMs) and Large Language Models (LLMs) have emerged as areas of significant research interest. Despite this, NTMs primarily utilize contextual embeddings from LLMs, which are not optimal for clustering or capable of topic generation. Our study addresses this gap by introducing a novel framework named Diffusion-Enhanced Topic Modeling using Encoder-Decoder-based LLMs (DeTiME). DeTiME leverages Encoder-Decoder-based LLMs to produce highly clusterable embeddings that can generate topics exhibiting both superior clusterability and enhanced semantic coherence compared to existing methods. Additionally, by exploiting the power of diffusion, our framework also provides the capability to generate content relevant to the identified topics. This dual functionality allows users to efficiently produce highly clustered topics and related content simultaneously. DeTiME's potential extends to generating clustered embeddings as well. Notably, our proposed framework proves efficient to train and exhibits high adaptability, demonstrating its potential for a wide array of applications.
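
A minimal sketch of the embedding-and-clustering idea, assuming a t5-base encoder and mean pooling (both illustrative choices, not the paper's exact setup); DeTiME's diffusion-based content generation stage is omitted:

```python
# Sketch only: embed documents with an encoder-decoder LLM's encoder,
# then cluster the embeddings into topics.
import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("t5-base")
enc = T5EncoderModel.from_pretrained("t5-base")

def embed(docs):
    batch = tok(docs, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state      # (batch, seq, dim)
    mask = batch.attention_mask.unsqueeze(-1)        # zero out padding
    return (hidden * mask).sum(1) / mask.sum(1)      # mean-pool per doc

docs = ["the stock market fell sharply today",
        "the home team won the championship game"]
topics = KMeans(n_clusters=2, n_init=10).fit_predict(embed(docs).numpy())
```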

* EMNLP 2023  
* 19 pages, 4 figures, EMNLP 2023 

S2vNTM: Semi-supervised vMF Neural Topic Modeling

Jul 06, 2023
Weijie Xu, Jay Desai, Srinivasan Sengamedu, Xiaoyu Jiang, Francis Iannacci

Language-model-based methods are powerful techniques for text classification. However, these models have several shortcomings. (1) It is difficult to integrate human knowledge, such as keywords. (2) They require substantial resources to train. (3) They rely on large text corpora for pretraining. In this paper, we propose Semi-Supervised vMF Neural Topic Modeling (S2vNTM) to overcome these difficulties. S2vNTM takes a few seed keywords as input for each topic. It leverages the patterns of these keywords to identify potential topics, as well as to optimize the quality of topics' keyword sets. Across a variety of datasets, S2vNTM outperforms existing semi-supervised topic modeling methods in classification accuracy when only limited keywords are provided. S2vNTM is also at least twice as fast as the baselines.
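
As a deliberately simplified stand-in for the keyword-seeding interface (the topic names and keyword sets below are invented, and the overlap scoring replaces S2vNTM's actual vMF-based inference):

```python
# Invented seed sets; overlap scoring stands in for vMF-based inference.
seed_keywords = {
    "sports":  ["game", "team", "player"],
    "finance": ["stock", "market", "bank"],
}

def assign_topic(doc: str) -> str:
    tokens = set(doc.lower().split())
    overlap = {t: len(tokens & set(kws)) for t, kws in seed_keywords.items()}
    return max(overlap, key=overlap.get)

print(assign_topic("the stock market rallied as bank shares rose"))  # finance
```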

* ICLR Workshop 2023  
* 17 pages, 9 figures, ICLR Workshop 2023. arXiv admin note: text overlap with arXiv:2307.01226 

KDSTM: Neural Semi-supervised Topic Modeling with Knowledge Distillation

Jul 04, 2023
Weijie Xu, Xiaoyu Jiang, Jay Desai, Bin Han, Fuqin Yan, Francis Iannacci

In text classification tasks, fine-tuning pretrained language models like BERT and GPT-3 yields competitive accuracy; however, both methods require pretraining on large text datasets. In contrast, general topic modeling methods possess the advantage of analyzing documents to extract meaningful patterns of words without the need for pretraining. To leverage topic modeling's unsupervised insight extraction for text classification tasks, we develop Knowledge Distillation Semi-supervised Topic Modeling (KDSTM). KDSTM requires no pretrained embeddings and few labeled documents, and is efficient to train, making it ideal in resource-constrained settings. Across a variety of datasets, our method outperforms existing supervised topic modeling methods in classification accuracy, robustness, and efficiency, and achieves performance comparable to state-of-the-art weakly supervised text classification methods.
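
A minimal sketch of the kind of distillation objective such a setup relies on, with tensor shapes and the temperature chosen purely for illustration; the paper's actual architecture differs in detail:

```python
# Sketch of a temperature-scaled distillation loss (shapes and T assumed).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence from teacher soft labels to the student's topic
    distribution, both softened with temperature T and scaled by T^2."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student = torch.randn(8, 5, requires_grad=True)  # topic logits per document
teacher = torch.randn(8, 5)                      # teacher classifier logits
distillation_loss(student, teacher).backward()
```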

* ICLR 2022 Workshop PML4DC  
* 12 pages, 4 figures, ICLR 2022 Workshop 

Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning

Jul 04, 2023
Tamas Madl, Weijie Xu, Olivia Choudhury, Matthew Howard

The availability of large amounts of informative data is crucial for successful machine learning. However, in domains with sensitive information, the release of high-utility data that protects the privacy of individuals has proven challenging. Despite progress in differential privacy and generative modeling for privacy-preserving data release in the literature, only a few approaches optimize for machine learning utility: most approaches only take into account statistical metrics on the data itself and fail to explicitly preserve the loss metrics of machine learning models that are to be subsequently trained on the generated data. In this paper, we introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning while preserving differential privacy. We also describe a specific implementation of this framework that leverages mixture models to approximate, kernel-inducing points to adapt, and Gaussian differential privacy to anonymize a dataset, in order to ensure that the resulting data is both privacy-preserving and of high utility. We present experimental evidence showing minimal discrepancy between the performance metrics of models trained on real versus privatized datasets, when evaluated on held-out real data. We also compare our results with several privacy-preserving synthetic data generation models (such as differentially private generative adversarial networks) and report significant increases in classification performance metrics compared to state-of-the-art models. These favorable comparisons show that the presented framework is a promising direction of research, increasing the utility of low-risk synthetic data release for machine learning.
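
A condensed sketch of two of the three stages named above, assuming scikit-learn's Gaussian mixture for "approximate" and additive Gaussian noise in the spirit of "anonymize"; the kernel-inducing-points "adapt" stage is omitted, and the noise scale is illustrative rather than calibrated to a formal privacy budget:

```python
# Sketch of "approximate" (mixture model) and "anonymize" (Gaussian noise).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real_data = rng.normal(size=(1000, 4))   # stand-in for a sensitive dataset

gmm = GaussianMixture(n_components=5, random_state=0).fit(real_data)
synthetic, _ = gmm.sample(1000)          # draw records from the approximation

noise_scale = 0.1                        # would be set by the privacy budget
release = synthetic + rng.normal(scale=noise_scale, size=synthetic.shape)
```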

* AAAI 2023 Workshop on Privacy-Preserving Artificial Intelligence  
* 10 pages, 3 figures, AAAI Workshop 

vONTSS: vMF based semi-supervised neural topic modeling with optimal transport

Jul 03, 2023
Weijie Xu, Xiaoyu Jiang, Srinivasan H. Sengamedu, Francis Iannacci, Jinjin Zhao

Recently, Neural Topic Models (NTMs), inspired by variational autoencoders, have attracted significant research interest; however, these methods have limited applications in the real world due to the challenge of incorporating human knowledge. This work presents a semi-supervised neural topic modeling method, vONTSS, which uses von Mises-Fisher (vMF) based variational autoencoders and optimal transport. When a few keywords per topic are provided, vONTSS in the semi-supervised setting generates potential topics and optimizes topic-keyword quality and topic classification. Experiments show that vONTSS outperforms existing semi-supervised topic modeling methods in classification accuracy and diversity. vONTSS also supports unsupervised topic modeling. Quantitative and qualitative experiments show that vONTSS in the unsupervised setting outperforms recent NTMs on multiple aspects: vONTSS discovers highly clustered and coherent topics on benchmark datasets. It is also much faster than the state-of-the-art weakly supervised text classification method while achieving similar classification performance. We further prove the equivalence of optimal transport loss and cross-entropy loss at the global minimum.
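
A hedged sketch of the optimal-transport ingredient, using a standard entropic-regularized Sinkhorn iteration with an invented cost matrix; the vMF-based variational autoencoder around it is omitted:

```python
# Entropic-regularized Sinkhorn iteration; cost matrix and marginals are
# invented for illustration.
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Return the entropic-OT transport plan between marginals a and b."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

cost = np.random.rand(4, 3)   # e.g., document-to-topic distances
a = np.full(4, 1 / 4)         # uniform mass over documents
b = np.full(3, 1 / 3)         # uniform mass over topics
plan = sinkhorn(cost, a, b)
ot_loss = (plan * cost).sum()
```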

* ACL Findings 2023  
* 24 pages, 12 figures, ACL findings 2023 

FFPDG: Fast, Fair and Private Data Generation

Jun 30, 2023
Weijie Xu, Jinjin Zhao, Francis Iannacci, Bo Wang

Generative modeling has been used frequently in synthetic data generation, and fairness and privacy are two major concerns for synthetic data. Although recent GAN-based methods (Goodfellow et al., 2014) show good results in preserving privacy, the generated data may be more biased. At the same time, these methods require high computational resources. In this work, we design a fast, fair, flexible, and private data generation method. We show the effectiveness of our method theoretically and empirically, and we show that models trained on data generated by the proposed method can perform well at inference time in real application scenarios.

* ICLR 2021 Workshop on Synthetic Data Generation  
* 12 pages, 2 figures, ICLR 2021 Workshop on Synthetic Data Generation 