Yujia Hu

HateXScore: A Metric Suite for Evaluating Reasoning Quality in Hate Speech Explanations

Jan 20, 2026

Yujia Hu, Roy Ka-Wei Lee

Abstract: Hateful speech detection is a key component of content moderation, yet current evaluation frameworks rarely assess why a text is deemed hateful. We introduce HateXScore, a four-component metric suite designed to evaluate the reasoning quality of model explanations. It assesses (i) conclusion explicitness, (ii) faithfulness and causal grounding of quoted spans, (iii) protected group identification (policy-configurable), and (iv) logical consistency among these elements. Evaluated on six diverse hate speech datasets, HateXScore is intended as a diagnostic complement to reveal interpretability failures and annotation inconsistencies that are invisible to standard metrics like Accuracy or F1. Moreover, human evaluation shows strong agreement with HateXScore, validating it as a practical tool for trustworthy and transparent moderation. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

* EACL 2026 Main Conference

Cell Behavior Video Classification Challenge, a benchmark for computer vision methods in time-lapse microscopy

Jan 15, 2026

Raffaella Fiamma Cabini, Deborah Barkauskas, Guangyu Chen, Zhi-Qi Cheng, David E Cicchetti, Judith Drazba, Rodrigo Fernandez-Gonzalez, Raymond Hawkins, Yujia Hu, Jyoti Kini(+12 more)

Abstract: The classification of microscopy videos capturing complex cellular behaviors is crucial for understanding and quantifying the dynamics of biological processes over time. However, it remains a frontier in computer vision, requiring approaches that effectively model the shape and motion of objects without rigid boundaries, extract hierarchical spatiotemporal features from entire image sequences rather than static frames, and account for multiple objects within the field of view. To this end, we organized the Cell Behavior Video Classification Challenge (CBVCC), benchmarking 35 methods based on three approaches: classification of tracking-derived features, end-to-end deep learning architectures to directly learn spatiotemporal features from the entire video sequence without explicit cell tracking, or ensembling tracking-derived with image-derived features. We discuss the results achieved by the participants and compare the potential and limitations of each approach, serving as a basis to foster the development of computer vision methods for studying cellular dynamics.

Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

Oct 08, 2025

Shrestha Ghosh, Luca Giordano, Yujia Hu, Tuan-Phong Nguyen, Simon Razniewski

Abstract: LLMs are remarkable artifacts that have revolutionized a range of NLP and AI tasks. A significant contributor is their factual knowledge, which, to date, remains poorly understood, and is usually analyzed from biased samples. In this paper, we take a deep tour into the factual knowledge (or beliefs) of a frontier LLM, based on GPTKB v1.5 (Hu et al., 2025a), a recursively elicited set of 100 million beliefs of one of the strongest currently available frontier LLMs, GPT-4.1. We find that the model's factual knowledge differs quite significantly from established knowledge bases, and that its accuracy is significantly lower than indicated by previous benchmarks. We also find that inconsistency, ambiguity and hallucinations are major issues, shedding light on future research opportunities concerning factual LLM knowledge.

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Sep 18, 2025

Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee

Abstract: The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. The dataset is available at https://github.com/Social-AI-Studio/SGToxicGuard. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

* 9 pages, EMNLP 2025

Image Editing As Programs with Diffusion Models

Jun 04, 2025

Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang

Abstract: While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Code is available at https://github.com/YujiaHu1109/IEAP.

Flash Sculptor: Modular 3D Worlds from Objects

Apr 08, 2025

Yujia Hu, Songhua Liu, Xingyi Yang, Xinchao Wang

Abstract: Existing text-to-3D and image-to-3D models often struggle with complex scenes involving multiple objects and intricate interactions. Although some recent attempts have explored such compositional scenarios, they still require an extensive process of optimizing the entire layout, which is highly cumbersome, if not infeasible. To overcome these challenges, we propose Flash Sculptor, a simple yet effective framework for compositional 3D scene/object reconstruction from a single image. At the heart of Flash Sculptor lies a divide-and-conquer strategy, which decouples compositional scene reconstruction into a sequence of sub-tasks, including handling the appearance, rotation, scale, and translation of each individual instance. Specifically, for rotation, we introduce a coarse-to-fine scheme that combines efficiency and accuracy, while for translation, we develop an outlier-removal-based algorithm that ensures robust and precise parameters in a single step, without any iterative optimization. Extensive experiments demonstrate that Flash Sculptor achieves at least a 3x speedup over existing compositional 3D methods, while setting new benchmarks in compositional 3D reconstruction performance. Code is available at https://github.com/YujiaHu1109/Flash-Sculptor.

Analyst Reports and Stock Performance: Evidence from the Chinese Market

Nov 13, 2024

Rui Liu, Jiayou Liang, Haolong Chen, Yujia Hu

Abstract: This article applies natural language processing (NLP) to extract and quantify textual information to predict stock performance. Using an extensive dataset of Chinese analyst reports and employing a customized BERT deep learning model for Chinese text, this study categorizes the sentiment of the reports as positive, neutral, or negative. The findings underscore the predictive capacity of this sentiment indicator for stock volatility, excess returns, and trading volume. Specifically, analyst reports with strong positive sentiment increase excess return and intraday volatility, whereas reports with strong negative sentiment also increase volatility and trading volume but decrease future excess return. The magnitude of this effect is greater for positive sentiment reports than for negative sentiment reports. This article contributes to the empirical literature on sentiment analysis and the response of the stock market to news in the Chinese stock market.

GPTKB: Building Very Large Knowledge Bases from Language Models

Nov 07, 2024

Yujia Hu, Shrestha Ghosh, Tuan-Phong Nguyen, Simon Razniewski

Abstract: General-domain knowledge bases (KBs), in particular the "big three" (Wikidata, Yago and DBpedia), are the backbone of many intelligent applications. While these three have seen steady development, comprehensive KB construction at large has seen few fresh attempts. In this work, we propose to build a large general-domain KB entirely from a large language model (LLM). We demonstrate the feasibility of large-scale KB construction from LLMs, while highlighting specific challenges arising around entity recognition, entity and property canonicalization, and taxonomy construction. As a prototype, we use GPT-4o-mini to construct GPTKB, which contains 105 million triples for more than 2.9 million entities, at a cost 100x lower than previous KBC projects. Our work is a landmark for two fields: For NLP, for the first time, it provides constructive insights into the knowledge (or beliefs) of LLMs. For the Semantic Web, it shows novel ways forward for the long-standing challenge of general-domain KB construction. GPTKB is accessible at https://gptkb.org.

* 11 pages, 4 tables

InstructAV: Instruction Fine-tuning Large Language Models for Authorship Verification

Jul 16, 2024

Yujia Hu, Zhiqiang Hu, Chun-Wei Seah, Roy Ka-Wei Lee

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in a wide range of NLP tasks. However, when it comes to authorship verification (AV) tasks, which involve determining whether two given texts share the same authorship, even advanced models like ChatGPT exhibit notable limitations. This paper introduces a novel approach, termed InstructAV, for authorship verification. This approach utilizes LLMs in conjunction with a parameter-efficient fine-tuning (PEFT) method to simultaneously improve accuracy and explainability. The distinctiveness of InstructAV lies in its ability to align classification decisions with transparent and understandable explanations, representing a significant progression in the field of authorship verification. Through comprehensive experiments conducted across various datasets, InstructAV demonstrates state-of-the-art performance on the AV task, offering high classification accuracy coupled with enhanced explanation reliability.

ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Jun 18, 2024

Yunze Xiao, Yujia Hu, Kenny Tsu Wei Choo, Roy Ka-Wei Lee

Abstract: Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce ToxiCloakCN, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness. Our work highlights the urgent need for more advanced techniques in offensive language detection to combat the evolving tactics used to evade detection mechanisms.

* 10 pages, 5 tables, 2 figures
