Toxic Spans Detection


Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

Mar 19, 2026

MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

Mar 05, 2026

Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning

Mar 01, 2026

Trust The Typical

Feb 04, 2026

KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection

Jan 29, 2026

SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

Jan 28, 2026

Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP

Jan 14, 2026

Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Nov 14, 2025

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Nov 13, 2025

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

Sep 16, 2025