Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Liang Yang

VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs

Mar 25, 2025

Kelaiti Xiao, Liang Yang, Paerhati Tulajiang, Hongfei Lin

Abstract:This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models (LLMs) to interpret non-traditional, stylized imagery. Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements, requiring the integration of domain-specific knowledge and advanced reasoning. The dataset was meticulously curated through multiple stages of filtering, annotation, and standardization to ensure high quality and diversity. Our evaluations using several state-of-the-art multimodal LLMs reveal significant performance variations that underscore the importance of both factual background knowledge and inferential capabilities in visual recognition tasks. VisualQuest thus provides a robust and comprehensive benchmark for advancing research in multimodal reasoning and model architecture design.

Via

Access Paper or Ask Questions

DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

Mar 14, 2025

Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, Yaohua Tang

Figure 1 for DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

Figure 2 for DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

Figure 3 for DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

Figure 4 for DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

Abstract:The limitation of graphical user interface (GUI) data has been a significant barrier to the development of GUI agents today, especially for the desktop / computer use scenarios. To address this, we propose an automated GUI data generation pipeline, AutoCaptioner, which generates data with rich descriptions while minimizing human effort. Using AutoCaptioner, we created a novel large-scale desktop GUI dataset, DeskVision, along with the largest desktop test benchmark, DeskVision-Eval, which reflects daily usage and covers diverse systems and UI elements, each with rich descriptions. With DeskVision, we train a new GUI understanding model, GUIExplorer. Results show that GUIExplorer achieves state-of-the-art (SOTA) performance in understanding/grounding visual elements without the need for complex architectural designs. We further validated the effectiveness of the DeskVision dataset through ablation studies on various large visual language models (LVLMs). We believe that AutoCaptioner and DeskVision will significantly advance the development of GUI agents, and will open-source them for the community.

Via

Access Paper or Ask Questions

Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Mar 11, 2025

Shengpeng Xiao, Yuanfang Guo, Heqi Peng, Zeming Liu, Liang Yang, Yunhong Wang

Figure 1 for Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Figure 2 for Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Figure 3 for Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Figure 4 for Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Abstract:The generalization performance of AI-generated image detection remains a critical challenge. Although most existing methods perform well in detecting images from generative models included in the training set, their accuracy drops significantly when faced with images from unseen generators. To address this limitation, we propose a novel detection method based on the fractal self-similarity of the spectrum, a common feature among images generated by different models. Specifically, we demonstrate that AI-generated images exhibit fractal-like spectral growth through periodic extension and low-pass filtering. This observation motivates us to exploit the similarity among different fractal branches of the spectrum. Instead of directly analyzing the spectrum, our method mitigates the impact of varying spectral characteristics across different generators, improving detection performance for images from unseen models. Experiments on a public benchmark demonstrated the generalized detection performance across both GANs and diffusion models.

Via

Access Paper or Ask Questions

Secure Wireless-Powered zeRIS Communications

Mar 10, 2025

Jingyu Chen, Kunrui Cao, Panagiotis D. Diamantoulakis, Lu Lv, Liang Yang, Haolian Chi, Haiyang Ding

Figure 1 for Secure Wireless-Powered zeRIS Communications

Figure 2 for Secure Wireless-Powered zeRIS Communications

Figure 3 for Secure Wireless-Powered zeRIS Communications

Figure 4 for Secure Wireless-Powered zeRIS Communications

Abstract:This paper introduces the concept of wireless-powered zero-energy reconfigurable intelligent surface (zeRIS), and investigates a wireless-powered zeRIS aided communication system in terms of security, reliability and energy efficiency. In particular, we propose three new wireless-powered zeRIS modes: 1) in mode-I, N reconfigurable reflecting elements are adjusted to the optimal phase shift design of information user to maximize the reliability of the system; 2) in mode-II, N reconfigurable reflecting elements are adjusted to the optimal phase shift design of cooperative jamming user to maximize the security of the system; 3) in mode-III, N1 and N2 (N1+N2=N) reconfigurable reflecting elements are respectively adjusted to the optimal phase shift designs of information user and cooperative jamming user to balance the reliability and security of the system. Then, we propose three new metrics, i.e., joint outage probability (JOP), joint intercept probability (JIP), and secrecy energy efficiency (SEE), and analyze their closed-form expressions in three modes, respectively. The results show that under high transmission power, all the diversity gains of three modes are 1, and the JOPs of mode-I, mode-II and mode-III are improved by increasing the number of zeRIS elements, which are related to N2, N, and N^2_1, respectively. In addition, mode-I achieves the best JOP, while mode-II achieves the best JIP among three modes. We exploit two security-reliability trade-off (SRT) metrics, i.e., JOP versus JIP, and normalized joint intercept and outage probability (JIOP), to reveal the SRT performance of the proposed three modes. It is obtained that mode-II outperforms the other two modes in the JOP versus JIP, while mode-III and mode-II achieve the best performance of normalized JIOP at low and high transmission power, respectively.

* 13 pages, 7 figures

Via

Access Paper or Ask Questions

Unveiling the Capabilities of Large Language Models in Detecting Offensive Language with Annotation Disagreement

Feb 10, 2025

Junyu Lu, Kai Ma, Kaichun Wang, Kelaiti Xiao, Roy Ka-Wei Lee, Bo Xu, Liang Yang, Hongfei Lin

Abstract:LLMs are widely used for offensive language detection due to their advanced capability. However, the challenges posed by human annotation disagreement in real-world datasets remain underexplored. These disagreement samples are difficult to detect due to their ambiguous nature. Additionally, the confidence of LLMs in processing disagreement samples can provide valuable insights into their alignment with human annotators. To address this gap, we systematically evaluate the ability of LLMs to detect offensive language with annotation disagreement. We compare the binary accuracy of multiple LLMs across varying annotation agreement levels and analyze the relationship between LLM confidence and annotation agreement. Furthermore, we investigate the impact of disagreement samples on LLM decision-making during few-shot learning and instruction fine-tuning. Our findings highlight the challenges posed by disagreement samples and offer guidance for improving LLM-based offensive language detection.

* 17 pages, submitted to the ACL 2025

Via

Access Paper or Ask Questions

Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition

Feb 07, 2025

Haohao Zhu, Junyu Lu, Zeyuan Zeng, Zewen Bai, Xiaokun Zhang, Liang Yang, Hongfei Lin

Figure 1 for Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition

Figure 2 for Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition

Figure 3 for Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition

Figure 4 for Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition

Abstract:Humor recognition aims to identify whether a specific speaker's text is humorous. Current methods for humor recognition mainly suffer from two limitations: (1) they solely focus on one aspect of humor commonalities, ignoring the multifaceted nature of humor; and (2) they typically overlook the critical role of speaker individuality, which is essential for a comprehensive understanding of humor expressions. To bridge these gaps, we introduce the Commonality and Individuality Incorporated Network for Humor Recognition (CIHR), a novel model designed to enhance humor recognition by integrating multifaceted humor commonalities with the distinctive individuality of speakers. The CIHR features a Humor Commonality Analysis module that explores various perspectives of multifaceted humor commonality within user texts, and a Speaker Individuality Extraction module that captures both static and dynamic aspects of a speaker's profile to accurately model their distinctive individuality. Additionally, Static and Dynamic Fusion modules are introduced to effectively incorporate the humor commonality with speaker's individuality in the humor recognition process. Extensive experiments demonstrate the effectiveness of CIHR, underscoring the importance of concurrently addressing both multifaceted humor commonality and distinctive speaker individuality in humor recognition.

* Accepted by NAACL 2025

Via

Access Paper or Ask Questions

STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

Jan 26, 2025

Zewen Bai, Yuanyuan Sun, Shengdi Yin, Junyu Lu, Jingjie Zeng, Haohao Zhu, Liang Yang, Hongfei Lin

Figure 1 for STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

Figure 2 for STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

Figure 3 for STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

Figure 4 for STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

Abstract:The proliferation of hate speech has caused significant harm to society. The intensity and directionality of hate are closely tied to the target and argument it is associated with. However, research on hate speech detection in Chinese has lagged behind, and existing datasets lack span-level fine-grained annotations. Furthermore, the lack of research on Chinese hateful slang poses a significant challenge. In this paper, we provide a solution for fine-grained detection of Chinese hate speech. First, we construct a dataset containing Target-Argument-Hateful-Group quadruples (STATE ToxiCN), which is the first span-level Chinese hate speech dataset. Secondly, we evaluate the span-level hate speech detection performance of existing models using STATE ToxiCN. Finally, we conduct the first study on Chinese hateful slang and evaluate the ability of LLMs to detect such expressions. Our work contributes valuable resources and insights to advance span-level hate speech detection in Chinese

Via

Access Paper or Ask Questions

SAM-Guided Masked Token Prediction for 3D Scene Understanding

Oct 17, 2024

Zhimin Chen, Liang Yang, Yingwei Li, Longlong Jing, Bing Li

Figure 1 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Figure 2 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Figure 3 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Figure 4 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Abstract:Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding through knowledge distillation, marking considerable advancements. Nonetheless, challenges such as the misalignment between 2D and 3D representations and the persistent long-tail distribution in 3D datasets still restrict the effectiveness of knowledge distillation from 2D to 3D using foundation models. To tackle these issues, we introduce a novel SAM-guided tokenization method that seamlessly aligns 3D transformer structures with region-level knowledge distillation, replacing the traditional KNN-based tokenization techniques. Additionally, we implement a group-balanced re-weighting strategy to effectively address the long-tail problem in knowledge distillation. Furthermore, inspired by the recent success of masked feature prediction, our framework incorporates a two-stage masked token prediction process in which the student model predicts both the global embeddings and the token-wise local embeddings derived from the teacher models trained in the first stage. Our methodology has been validated across multiple datasets, including SUN RGB-D, ScanNet, and S3DIS, for tasks like 3D object detection and semantic segmentation. The results demonstrate significant improvements over current State-of-the-art self-supervised methods, establishing new benchmarks in this field.

* Accepted by NeurIPS 2024

Via

Access Paper or Ask Questions

Towards Comprehensive Detection of Chinese Harmful Memes

Oct 03, 2024

Junyu Lu, Bo Xu, Xiaokun Zhang, Hongbo Wang, Haohao Zhu, Dongyu Zhang, Liang Yang, Hongfei Lin

Figure 1 for Towards Comprehensive Detection of Chinese Harmful Memes

Figure 2 for Towards Comprehensive Detection of Chinese Harmful Memes

Figure 3 for Towards Comprehensive Detection of Chinese Harmful Memes

Figure 4 for Towards Comprehensive Detection of Chinese Harmful Memes

Abstract:This paper has been accepted in the NeurIPS 2024 D & B Track. Harmful memes have proliferated on the Chinese Internet, while research on detecting Chinese harmful memes significantly lags behind due to the absence of reliable datasets and effective detectors. To this end, we focus on the comprehensive detection of Chinese harmful memes. We construct ToxiCN MM, the first Chinese harmful meme dataset, which consists of 12,000 samples with fine-grained annotations for various meme types. Additionally, we propose a baseline detector, Multimodal Knowledge Enhancement (MKE), incorporating contextual information of meme content generated by the LLM to enhance the understanding of Chinese memes. During the evaluation phase, we conduct extensive quantitative experiments and qualitative analyses on multiple baselines, including LLMs and our MKE. The experimental results indicate that detecting Chinese harmful memes is challenging for existing models while demonstrating the effectiveness of MKE. The resources for this paper are available at https://github.com/DUT-lujunyu/ToxiCN_MM.

Via

Access Paper or Ask Questions

Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector

Sep 10, 2024

Hongbo Wang, Junyu Lu, Yan Han, Kai Ma, Liang Yang, Hongfei Lin

Figure 1 for Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector

Figure 2 for Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector

Figure 3 for Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector

Figure 4 for Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector

Abstract:Patronizing and Condescending Language (PCL) is a form of discriminatory toxic speech targeting vulnerable groups, threatening both online and offline safety. While toxic speech research has mainly focused on overt toxicity, such as hate speech, microaggressions in the form of PCL remain underexplored. Additionally, dominant groups' discriminatory facial expressions and attitudes toward vulnerable communities can be more impactful than verbal cues, yet these frame features are often overlooked. In this paper, we introduce the PCLMM dataset, the first Chinese multimodal dataset for PCL, consisting of 715 annotated videos from Bilibili, with high-quality PCL facial frame spans. We also propose the MultiPCL detector, featuring a facial expression detection module for PCL recognition, demonstrating the effectiveness of modality complementarity in this challenging task. Our work makes an important contribution to advancing microaggression detection within the domain of toxic speech.

* Under review in ICASSP 2025

Via

Access Paper or Ask Questions