Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiaojun Wan

Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Feb 18, 2025

Jiatao Li, Xiaojun Wan

Figure 1 for Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Figure 2 for Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Figure 3 for Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Figure 4 for Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Abstract:The rise of Large Language Models (LLMs) necessitates accurate AI-generated text detection. However, current approaches largely overlook the influence of author characteristics. We investigate how sociolinguistic attributes-gender, CEFR proficiency, academic field, and language environment-impact state-of-the-art AI text detectors. Using the ICNALE corpus of human-authored texts and parallel AI-generated texts from diverse LLMs, we conduct a rigorous evaluation employing multi-factor ANOVA and weighted least squares (WLS). Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects. These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups. We offer novel empirical evidence, a robust statistical framework, and actionable insights for developing more equitable and reliable detection systems in real-world, out-of-domain contexts. This work paves the way for future research on bias mitigation, inclusive evaluation benchmarks, and socially responsible LLM detectors.

* Under Review

Via

Access Paper or Ask Questions

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

Feb 17, 2025

Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan

Abstract:In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.

* 23 pages

Via

Access Paper or Ask Questions

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Dec 31, 2024

Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, Arman Cohan

Figure 1 for Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Figure 2 for Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Figure 3 for Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Figure 4 for Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Abstract:Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation of LLMs. Furthermore, we discovered that when evaluating LLMs with similar performance, the performance of the automatic LLM bencher declines sharply, underscoring the limitations of current benchers and calling for future work. Lastly, we found that the evaluation models' performance at the instance level (e.g., the accuracy of selecting the best output) does not always align with their effectiveness when used as a component of a bencher, highlighting the importance of dedicated system-level evaluation of benchers.

Via

Access Paper or Ask Questions

DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models

Dec 17, 2024

Jinxiang Xie, Yilin Li, Xunjian Yin, Xiaojun Wan

Abstract:Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.

* Extended version of a paper to appear in AAAI-25

Via

Access Paper or Ask Questions

$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Nov 02, 2024

Baizhou Huang, Xiao Pu, Xiaojun Wan

Figure 1 for $B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Figure 2 for $B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Figure 3 for $B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Figure 4 for $B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Abstract:Watermarking has emerged as a prominent technique for LLM-generated content detection by embedding imperceptible patterns. Despite supreme performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some even necessitates knowledge about hyperparameters of the watermarking method. Such prerequisites are unattainable in real-world scenarios. Targeting at a more realistic black-box threat model with fewer assumptions, we here propose $\mathcal{B}^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of $\mathcal{B}^4$ compared with other baselines.

Via

Access Paper or Ask Questions

Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Oct 22, 2024

Mingqi Gao, Xinyu Hu, Li Lin, Xiaojun Wan

Figure 1 for Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Figure 2 for Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Figure 3 for Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Figure 4 for Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Abstract:The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients result in various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics and differences between these measures have not gotten sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely-used NLG evaluation datasets and 32 evaluation metrics, revealing that different measures indeed impact the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation and find that the measure using global grouping and Pearson correlation exhibits the best overall performance, involving the discriminative power, ranking consistency, and sensitivity to score granularity.

Via

Access Paper or Ask Questions

Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Oct 17, 2024

Xiao Pu, Tianxing He, Xiaojun Wan

Figure 1 for Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Figure 2 for Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Figure 3 for Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Figure 4 for Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Abstract:Prompt compression condenses contexts while maintaining their informativeness for different usage scenarios. It not only shortens the inference time and reduces computational costs during the usage of large language models, but also lowers expenses when using closed-source models. In a preliminary study, we discover that when instructing language models to compress prompts, different compression styles (e.g., extractive or abstractive) impact performance of compressed prompts on downstream tasks. Building on this insight, we propose Style-Compress, a lightweight framework that adapts a smaller language model to compress prompts for a larger model on a new task without additional training. Our approach iteratively generates and selects effective compressed prompts as task-specific demonstrations through style variation and in-context learning, enabling smaller models to act as efficient compressors with task-specific examples. Style-Compress outperforms two baseline compression models in four tasks: original prompt reconstruction, text summarization, multi-hop QA, and CoT reasoning. In addition, with only 10 samples and 100 queries for adaptation, prompts compressed by Style-Compress achieve performance on par with or better than original prompts at a compression ratio of 0.25 or 0.5.

* EMNLP 2024 Findings

Via

Access Paper or Ask Questions

Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Oct 17, 2024

Jiatao Li, Xinyu Hu, Xunjian Yin, Xiaojun Wan

Figure 1 for Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Figure 2 for Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Figure 3 for Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Figure 4 for Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Abstract:In retrieval-augmented generation systems, the integration of self-generated documents (SGDs) alongside retrieved content has emerged as a promising strategy for enhancing the performance of large language model. However, previous research primarily focuses on optimizing the use of SGDs, with the inherent properties of SGDs remaining underexplored. Therefore, this paper conducts a comprehensive analysis of different types of SGDs and experiments on various knowledge-intensive tasks. We develop a taxonomy of SGDs grounded in Systemic Functional Linguistics (SFL) to compare the influence of different SGD categories. Our findings offer key insights into what kinds of SGDs most effectively contribute to improving LLM's performance. The results and further fusion methods based on SGD categories also provide practical guidelines for taking better advantage of SGDs to achieve significant advancements in knowledge-driven QA tasks with RAG.

* Under Review

Via

Access Paper or Ask Questions

Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

Oct 06, 2024

Xunjian Yin, Xinyi Wang, Liangming Pan, Xiaojun Wan, William Yang Wang

Figure 1 for Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

Figure 2 for Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

Figure 3 for Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

Figure 4 for Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

Abstract:The rapid advancement of large language models (LLMs) has significantly enhanced the capabilities of AI-driven agents across various tasks. However, existing agentic systems, whether based on fixed pipeline algorithms or pre-defined meta-learning frameworks, cannot search the whole agent design space due to the restriction of human-designed components, and thus might miss the globally optimal agent design. In this paper, we introduce G\"odel Agent, a self-evolving framework inspired by the G\"odel machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms. G\"odel Agent leverages LLMs to dynamically modify its own logic and behavior, guided solely by high-level objectives through prompting. Experimental results on mathematical reasoning and complex agent tasks demonstrate that implementation of G\"odel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.

* Work in progress

Via

Access Paper or Ask Questions

SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

Sep 21, 2024

Jiatao Li, Xinyu Hu, Xiaojun Wan

Figure 1 for SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

Figure 2 for SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

Figure 3 for SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

Figure 4 for SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

Abstract:Retrieval-Augmented Generation (RAG) has greatly improved large language models (LLMs) by enabling them to generate accurate, contextually grounded responses through the integration of external information. However, conventional RAG approaches, which prioritize top-ranked documents based solely on query-context relevance, often introduce redundancy and conflicting information. This issue is particularly evident in unsupervised retrieval settings, where there are no mechanisms to effectively mitigate these problems, leading to suboptimal context selection. To address this, we propose Selection using Matrices for Augmented Retrieval (SMART) in question answering tasks, a fully unsupervised and training-free framework designed to optimize context selection in RAG. SMART leverages Determinantal Point Processes (DPPs) to simultaneously model relevance, diversity and conflict, ensuring the selection of potentially high-quality contexts. Experimental results across multiple datasets demonstrate that SMART significantly enhances QA performance and surpasses previous unsupervised context selection methods, showing a promising strategy for RAG.

* Under Review

Via

Access Paper or Ask Questions