Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yeshuang Zhu

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Apr 24, 2025

Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang

Abstract:Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

* 23 pages, 5 figures

Via

Access Paper or Ask Questions

Control-CLIP: Decoupling Category and Style Guidance in CLIP for Specific-Domain Generation

Feb 17, 2025

Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Jinchao Zhang, Jie Zhou

Abstract:Text-to-image diffusion models have shown remarkable capabilities of generating high-quality images closely aligned with textual inputs. However, the effectiveness of text guidance heavily relies on the CLIP text encoder, which is trained to pay more attention to general content but struggles to capture semantics in specific domains like styles. As a result, generation models tend to fail on prompts like "a photo of a cat in Pokemon style" in terms of simply producing images depicting "a photo of a cat". To fill this gap, we propose Control-CLIP, a novel decoupled CLIP fine-tuning framework that enables the CLIP model to learn the meaning of category and style in a complement manner. With specially designed fine-tuning tasks on minimal data and a modified cross-attention mechanism, Control-CLIP can precisely guide the diffusion model to a specific domain. Moreover, the parameters of the diffusion model remain unchanged at all, preserving the original generation performance and diversity. Experiments across multiple domains confirm the effectiveness of our approach, particularly highlighting its robust plug-and-play capability in generating content with various specific styles.

Via

Access Paper or Ask Questions

Semantic to Structure: Learning Structural Representations for Infringement Detection

Feb 11, 2025

Chuanwei Huang, Zexi Jia, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Jinchao Zhang, Jie Zhou

Abstract:Structural information in images is crucial for aesthetic assessment, and it is widely recognized in the artistic field that imitating the structure of other works significantly infringes on creators' rights. The advancement of diffusion models has led to AI-generated content imitating artists' structural creations, yet effective detection methods are still lacking. In this paper, we define this phenomenon as "structural infringement" and propose a corresponding detection method. Additionally, we develop quantitative metrics and create manually annotated datasets for evaluation: the SIA dataset of synthesized data, and the SIR dataset of real data. Due to the current lack of datasets for structural infringement detection, we propose a new data synthesis strategy based on diffusion models and LLM, successfully training a structural infringement detection model. Experimental results show that our method can successfully detect structural infringements and achieve notable improvements on annotated test sets.

Via

Access Paper or Ask Questions

ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Dec 30, 2024

Ting Zhang, Zhiqiang Yuan, Yeshuang Zhu, Jinchao Zhang

Figure 1 for ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Figure 2 for ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Figure 3 for ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Figure 4 for ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Abstract:High-quality animated stickers usually contain transparent channels, which are often ignored by current video generation models. To generate fine-grained animated transparency channels, existing methods can be roughly divided into video matting algorithms and diffusion-based algorithms. The methods based on video matting have poor performance in dealing with semi-open areas in stickers, while diffusion-based methods are often used to model a single image, which will lead to local flicker when modeling animated stickers. In this paper, we firstly propose an ILDiff method to generate animated transparent channels through implicit layout distillation, which solves the problems of semi-open area collapse and no consideration of temporal information in existing methods. Secondly, we create the Transparent Animated Sticker Dataset (TASD), which contains 0.32M high-quality samples with transparent channel, to provide data support for related fields. Extensive experiments demonstrate that ILDiff can produce finer and smoother transparent channels compared to other methods such as Matting Anything and Layer Diffusion. Our code and dataset will be released at link https://xiaoyuan1996.github.io.

Via

Access Paper or Ask Questions

AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning

Nov 29, 2022

Jiaxin Wen, Yeshuang Zhu, Jinchao Zhang, Jie Zhou, Minlie Huang

Figure 1 for AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning

Figure 2 for AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning

Figure 3 for AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning

Figure 4 for AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning

Abstract:Recent studies have shown the impressive efficacy of counterfactually augmented data (CAD) for reducing NLU models' reliance on spurious features and improving their generalizability. However, current methods still heavily rely on human efforts or task-specific designs to generate counterfactuals, thereby impeding CAD's applicability to a broad range of NLU tasks. In this paper, we present AutoCAD, a fully automatic and task-agnostic CAD generation framework. AutoCAD first leverages a classifier to unsupervisedly identify rationales as spans to be intervened, which disentangles spurious and causal features. Then, AutoCAD performs controllable generation enhanced by unlikelihood training to produce diverse counterfactuals. Extensive evaluations on multiple out-of-domain and challenge benchmarks demonstrate that AutoCAD consistently and significantly boosts the out-of-distribution performance of powerful pre-trained models across different NLU tasks, which is comparable or even better than previous state-of-the-art human-in-the-loop or task-specific CAD methods. The code is publicly available at https://github.com/thu-coai/AutoCAD.

* Accepted by EMNLP 2022 findings

Via

Access Paper or Ask Questions

Selecting Stickers in Open-Domain Dialogue through Multitask Learning

Sep 16, 2022

Zhexin Zhang, Yeshuang Zhu, Zhengcong Fei, Jinchao Zhang, Jie Zhou

Figure 1 for Selecting Stickers in Open-Domain Dialogue through Multitask Learning

Figure 2 for Selecting Stickers in Open-Domain Dialogue through Multitask Learning

Figure 3 for Selecting Stickers in Open-Domain Dialogue through Multitask Learning

Figure 4 for Selecting Stickers in Open-Domain Dialogue through Multitask Learning

Abstract:With the increasing popularity of online chatting, stickers are becoming important in our online communication. Selecting appropriate stickers in open-domain dialogue requires a comprehensive understanding of both dialogues and stickers, as well as the relationship between the two types of modalities. To tackle these challenges, we propose a multitask learning method comprised of three auxiliary tasks to enhance the understanding of dialogue history, emotion and semantic meaning of stickers. Extensive experiments conducted on a recent challenging dataset show that our model can better combine the multimodal information and achieve significantly higher accuracy over strong baselines. Ablation study further verifies the effectiveness of each auxiliary task. Our code is available at \url{https://github.com/nonstopfor/Sticker-Selection}

* ACL 2022 findings, camera-ready

Via

Access Paper or Ask Questions