Aitzaz Ahmad

Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection

Oct 16, 2024

Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau

Abstract: We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns, while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text than that of baselines, which lets us train hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.

Efficient Continual Pre-training for Building Domain Specific Large Language Models

Nov 14, 2023

Yong Xie, Karan Aggarwal, Aitzaz Ahmad

Abstract: Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored for a domain are trained from scratch to excel at handling domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. Continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on open-domain standard tasks. Our work proposes a cost-effective alternative to building domain-specific LLMs from scratch.
