Xinyu Hu

CFunModel: A "Funny" Language Model Capable of Chinese Humor Generation and Processing

Mar 26, 2025

Zhenghan Yu, Xinyu Hu, Xiaojun Wan

Abstract: Humor plays a significant role in daily language communication. With the rapid development of large language models (LLMs), natural language processing has made significant strides in understanding and generating various genres of texts. However, most LLMs exhibit poor performance in generating and processing Chinese humor. In this study, we introduce a comprehensive Chinese humor-related dataset, the Chinese Fun Set (CFunSet). This dataset aggregates existing Chinese humor datasets and includes over 20,000 jokes collected from Tieba-JokeBar, a Chinese online platform known for joke sharing. The resulting corpus comprises more than 160,000 entries. Leveraging CFunSet, we developed the Chinese Fun Model (CFunModel), the first large language model designed to handle various Chinese humor-related tasks including Crosstalk Response Selection, Humor Recognition, Joke Generation, etc. Experimental results demonstrate that CFunModel outperforms popular large language models in these tasks. Our CFunSet is available at https://huggingface.co/datasets/ZhenghanYU/CFunSet and CFunModel is available at https://huggingface.co/ZhenghanYU/CFunModel. A demonstration video of our work is available at https://youtu.be/MOsISOJ66Ms.

* 9 pages

Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators

Mar 06, 2025

Jiayi Chang, Mingqi Gao, Xinyu Hu, Xiaojun Wan

Abstract: Previous research has shown that LLMs have potential in multilingual NLG evaluation tasks. However, existing research has not fully explored the differences in the evaluation capabilities of LLMs across different languages. To this end, this study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs, spanning high-resource and low-resource languages through correlation analysis, perturbation attacks, and fine-tuning. We found that 1) excluding the reference answer from the prompt and using large-parameter LLM-based evaluators lead to better performance across various languages; 2) most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages; 3) in the languages where they are most sensitive to such attacks, they also tend to exhibit the highest correlation with human judgments; and 4) fine-tuning with data from a particular language yields a broadly consistent enhancement in the model's evaluation performance across diverse languages. Our findings highlight the imbalance in LLMs' evaluation capabilities across different languages and suggest that low-resource language scenarios deserve more attention.

GRNFormer: A Biologically-Guided Framework for Integrating Gene Regulatory Networks into RNA Foundation Models

Mar 03, 2025

Mufan Qiu, Xinyu Hu, Fengwei Zhan, Sukwon Yun, Jie Peng, Ruichen Zhang, Bhavya Kailkhura, Jiekun Yang, Tianlong Chen

Abstract: Foundation models for single-cell RNA sequencing (scRNA-seq) have shown promising capabilities in capturing gene expression patterns. However, current approaches face critical limitations: they ignore biological prior knowledge encoded in gene regulatory relationships and fail to leverage multi-omics signals that could provide complementary regulatory insights. In this paper, we propose GRNFormer, a new framework that systematically integrates multi-scale Gene Regulatory Networks (GRNs) inferred from multi-omics data into RNA foundation model training. Our framework introduces two key innovations. First, we introduce a pipeline for constructing hierarchical GRNs that capture regulatory relationships at both cell-type-specific and cell-specific resolutions. Second, we design a structure-aware integration framework that addresses the information asymmetry in GRNs through two technical advances: (1) a graph topological adapter using multi-head cross-attention to weight regulatory relationships dynamically, and (2) a novel edge perturbation strategy that perturbs GRNs with biologically informed co-expression links to augment graph neural network training. Comprehensive experiments have been conducted on three representative downstream tasks across multiple model architectures to demonstrate the effectiveness of GRNFormer. It achieves consistent improvements over state-of-the-art (SoTA) baselines: a 3.6% increase in drug response prediction correlation, a 9.6% improvement in single-cell drug classification AUC, and a 1.1% average gain in gene perturbation prediction accuracy.

Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review

Feb 18, 2025

Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan

Abstract: We propose an aspect-guided, multi-level perturbation framework to evaluate the robustness of Large Language Models (LLMs) in automated peer review. Our framework explores perturbations in three key components of the peer review process (papers, reviews, and rebuttals) across several quality aspects, including contribution, soundness, presentation, tone, and completeness. By applying targeted perturbations and examining their effects on both LLM-as-Reviewer and LLM-as-Meta-Reviewer, we investigate how aspect-based manipulations, such as omitting methodological details from papers or altering reviewer conclusions, can introduce significant biases in the review process. We identify several potential vulnerabilities: review conclusions that recommend a strong reject may significantly influence meta-reviews, negative or misleading reviews may be wrongly interpreted as thorough, and incomplete or hostile rebuttals can unexpectedly lead to higher acceptance rates. Statistical tests show that these biases persist under various Chain-of-Thought prompting strategies, highlighting the lack of robust critical evaluation in current LLMs. Our framework offers a practical methodology for diagnosing these vulnerabilities, thereby contributing to the development of more reliable and robust automated reviewing systems.

* Under Review

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

Feb 17, 2025

Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan

Abstract: In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.

* 23 pages

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Dec 31, 2024

Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, Arman Cohan

Abstract: Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the Elo rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation of LLMs. Furthermore, we discovered that when evaluating LLMs with similar performance, the performance of the automatic LLM bencher declines sharply, underscoring the limitations of current benchers and calling for future work. Lastly, we found that the evaluation models' performance at the instance level (e.g., the accuracy of selecting the best output) does not always align with their effectiveness when used as a component of a bencher, highlighting the importance of dedicated system-level evaluation of benchers.
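
As a rough illustration of the aggregation component named in the abstract, the sketch below turns pairwise comparison outcomes into a system ranking using standard Elo updates. The function names `elo_ratings` and `rank_systems`, the K-factor of 32, the starting rating of 1000, and the toy outcomes are assumptions made for illustration; they are not taken from the paper.

```python
def elo_ratings(comparisons, k=32.0, start=1000.0):
    """Aggregate (winner, loser) pairs into Elo ratings.

    `k` and `start` are illustrative defaults, not values from the paper.
    """
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        # Probability that the winner was expected to win before this comparison.
        expected_w = 1.0 / (1.0 + 10.0 ** ((r_l - r_w) / 400.0))
        # The winner gains what the loser gives up, so total rating is conserved.
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


def rank_systems(ratings):
    """Return system names ordered from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical judgments from an evaluation model: (preferred system, other system).
outcomes = [("A", "B"), ("A", "C"), ("B", "C")]
print(rank_systems(elo_ratings(outcomes)))
```

Elo ratings depend on the order in which comparisons are processed, which is one reason the choice of aggregation method can influence the final ranking.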

What You See Is What Matters: A Novel Visual and Physics-Based Metric for Evaluating Video Generation Quality

Nov 20, 2024

Zihan Wang, Songlin Li, Lingyan Hao, Bowen Song, Xinyu Hu

Abstract: As video generation models advance rapidly, assessing the quality of generated videos has become increasingly critical. Existing metrics, such as Fréchet Video Distance (FVD), Inception Score (IS), and ClipSim, measure quality primarily in latent space rather than from a human visual perspective, often overlooking key aspects such as appearance and the consistency of motion with physical laws. In this paper, we propose a novel metric, VAMP (Visual Appearance and Motion Plausibility), that evaluates both the visual appearance and physical plausibility of generated videos. VAMP is composed of two main components: an appearance score, which assesses color, shape, and texture consistency across frames, and a motion score, which evaluates the realism of object movements. We validate VAMP through two experiments: corrupted video evaluation and generated video evaluation. In the corrupted video evaluation, we introduce various types of corruptions into real videos and measure the correlation between corruption severity and VAMP scores. In the generated video evaluation, we use state-of-the-art models to generate videos from carefully designed prompts and compare VAMP's performance to human evaluators' rankings. Our results demonstrate that VAMP effectively captures both visual fidelity and temporal consistency, offering a more comprehensive evaluation of video quality than traditional methods.

Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Oct 22, 2024

Mingqi Gao, Xinyu Hu, Li Lin, Xiaojun Wan

Abstract: The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients result in various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics and differences between these measures have not received sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely used NLG evaluation datasets and 32 evaluation metrics, revealing that different measures indeed impact the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation and find that the measure using global grouping and Pearson correlation exhibits the best overall performance in terms of discriminative power, ranking consistency, and sensitivity to score granularity.
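
To make the idea of grouping methods concrete, here is a minimal sketch that computes Pearson correlation between metric scores and human scores under three groupings. Only "global grouping" is named in the abstract; the "input" and "system" groupings, the function names, the `[system, input]` array layout, and the toy scores are assumptions made for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D score vectors."""
    return float(np.corrcoef(x, y)[0, 1])


def correlation_by_grouping(metric, human):
    """Pearson correlation under three groupings.

    Both arguments are arrays of shape [n_systems, n_inputs] (assumed layout).
    """
    metric = np.asarray(metric, dtype=float)
    human = np.asarray(human, dtype=float)
    n_inputs = metric.shape[1]
    return {
        # Global: pool every (system, input) score pair into one correlation.
        "global": pearson(metric.ravel(), human.ravel()),
        # Input-level: correlate across systems for each input, then average.
        "input": float(np.mean([pearson(metric[:, j], human[:, j])
                                for j in range(n_inputs)])),
        # System-level: average over inputs first, then correlate across systems.
        "system": pearson(metric.mean(axis=1), human.mean(axis=1)),
    }


# Hypothetical scores for 3 systems on 3 inputs.
human_scores = [[1, 2, 3], [2, 4, 1], [5, 3, 2]]
metric_scores = [[0.2, 0.5, 0.4], [0.3, 0.9, 0.1], [0.8, 0.4, 0.5]]
print(correlation_by_grouping(metric_scores, human_scores))
```

The three values generally differ on the same data, which is the kind of discrepancy between correlation measures that the abstract describes.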

Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models

Oct 17, 2024

Jiatao Li, Xinyu Hu, Xunjian Yin, Xiaojun Wan

Abstract: In retrieval-augmented generation systems, the integration of self-generated documents (SGDs) alongside retrieved content has emerged as a promising strategy for enhancing the performance of large language models. However, previous research primarily focuses on optimizing the use of SGDs, with the inherent properties of SGDs remaining underexplored. Therefore, this paper conducts a comprehensive analysis of different types of SGDs and experiments on various knowledge-intensive tasks. We develop a taxonomy of SGDs grounded in Systemic Functional Linguistics (SFL) to compare the influence of different SGD categories. Our findings offer key insights into what kinds of SGDs most effectively contribute to improving LLM performance. The results and further fusion methods based on SGD categories also provide practical guidelines for taking better advantage of SGDs to achieve significant advancements in knowledge-driven QA tasks with RAG.

* Under Review

SMART-RAG: Selection using Determinantal Matrices for Augmented Retrieval

Sep 21, 2024

Jiatao Li, Xinyu Hu, Xiaojun Wan

Abstract: Retrieval-Augmented Generation (RAG) has greatly improved large language models (LLMs) by enabling them to generate accurate, contextually grounded responses through the integration of external information. However, conventional RAG approaches, which prioritize top-ranked documents based solely on query-context relevance, often introduce redundancy and conflicting information. This issue is particularly evident in unsupervised retrieval settings, where there are no mechanisms to effectively mitigate these problems, leading to suboptimal context selection. To address this, we propose Selection using Matrices for Augmented Retrieval (SMART) in question answering tasks, a fully unsupervised and training-free framework designed to optimize context selection in RAG. SMART leverages Determinantal Point Processes (DPPs) to simultaneously model relevance, diversity and conflict, ensuring the selection of potentially high-quality contexts. Experimental results across multiple datasets demonstrate that SMART significantly enhances QA performance and surpasses previous unsupervised context selection methods, showing a promising strategy for RAG.
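
The sketch below illustrates how a Determinantal Point Process can trade off relevance against redundancy when selecting contexts. It is a generic greedy DPP selection, not the SMART method itself: it omits conflict modeling entirely, and the kernel construction (relevance-weighted cosine similarity), the function name `greedy_dpp_select`, and the toy inputs are assumptions made for illustration.

```python
import numpy as np


def greedy_dpp_select(relevance, embeddings, k):
    """Greedily pick up to k items that maximize log det of the DPP kernel submatrix.

    Kernel (assumed form): L[i, j] = relevance[i] * cos_sim(i, j) * relevance[j].
    """
    relevance = np.asarray(relevance, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    kernel = relevance[:, None] * (emb @ emb.T) * relevance[None, :]

    selected = []
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in range(len(relevance)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # A non-positive determinant means the candidate adds no new volume.
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:
            break
        selected.append(best_item)
    return selected


# Hypothetical contexts: items 0 and 1 are near-duplicates, item 2 is distinct.
scores = [0.9, 0.85, 0.6]
vectors = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(greedy_dpp_select(scores, vectors, k=2))  # [0, 2]
```

Ranking by relevance alone would pick items 0 and 1 here; the determinant penalizes their near-identical embeddings, so the less relevant but distinct item 2 is chosen instead.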

* Under Review
