Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ming Yan

DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Jul 18, 2024

Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

Figure 1 for DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Figure 2 for DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Figure 3 for DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Figure 4 for DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Abstract:Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be utilized to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a textaudio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase of diversity with the help of the guidance of visual information.

Via

Access Paper or Ask Questions

Modeling Comparative Logical Relation with Contrastive Learning for Text Generation

Jun 13, 2024

Yuhao Dan, Junfeng Tian, Jie Zhou, Ming Yan, Ji Zhang, Qin Chen, Liang He

Figure 1 for Modeling Comparative Logical Relation with Contrastive Learning for Text Generation

Figure 2 for Modeling Comparative Logical Relation with Contrastive Learning for Text Generation

Figure 3 for Modeling Comparative Logical Relation with Contrastive Learning for Text Generation

Figure 4 for Modeling Comparative Logical Relation with Contrastive Learning for Text Generation

Abstract:Data-to-Text Generation (D2T), a classic natural language generation problem, aims at producing fluent descriptions for structured input data, such as a table. Existing D2T works mainly focus on describing the superficial associative relations among entities, while ignoring the deep comparative logical relations, such as A is better than B in a certain aspect with a corresponding opinion, which is quite common in our daily life. In this paper, we introduce a new D2T task named comparative logical relation generation (CLRG). Additionally, we propose a Comparative Logic (CoLo) based text generation method, which generates texts following specific comparative logical relations with contrastive learning. Specifically, we first construct various positive and negative samples by fine-grained perturbations in entities, aspects and opinions. Then, we perform contrastive learning in the encoder layer to have a better understanding of the comparative logical relations, and integrate it in the decoder layer to guide the model to correctly generate the relations. Noting the data scarcity problem, we construct a Chinese Comparative Logical Relation Dataset (CLRD), which is a high-quality human-annotated dataset and challenging for text generation with descriptions of multiple entities and annotations on their comparative logical relations. Extensive experiments show that our method achieves impressive performance in both automatic and human evaluations.

Via

Access Paper or Ask Questions

Text-like Encoding of Collaborative Information in Large Language Models for Recommendation

Jun 05, 2024

Yang Zhang, Keqin Bao, Ming Yan, Wenjie Wang, Fuli Feng, Xiangnan He

Figure 1 for Text-like Encoding of Collaborative Information in Large Language Models for Recommendation

Figure 2 for Text-like Encoding of Collaborative Information in Large Language Models for Recommendation

Figure 3 for Text-like Encoding of Collaborative Information in Large Language Models for Recommendation

Figure 4 for Text-like Encoding of Collaborative Information in Large Language Models for Recommendation

Abstract:When adapting Large Language Models for Recommendation (LLMRec), it is crucial to integrate collaborative information. Existing methods achieve this by learning collaborative embeddings in LLMs' latent space from scratch or by mapping from external models. However, they fail to represent the information in a text-like format, which may not align optimally with LLMs. To bridge this gap, we introduce BinLLM, a novel LLMRec method that seamlessly integrates collaborative information through text-like encoding. BinLLM converts collaborative embeddings from external models into binary sequences -- a specific text format that LLMs can understand and operate on directly, facilitating the direct usage of collaborative information in text-like format by LLMs. Additionally, BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths. Extensive experiments validate that BinLLM introduces collaborative information in a manner better aligned with LLMs, resulting in enhanced performance. We release our code at https://github.com/zyang1580/BinLLM.

* Accepted by ACL 2024

Via

Access Paper or Ask Questions

Initialization-enhanced Physics-Informed Neural Network with Domain Decomposition (IDPINN)

Jun 05, 2024

Chenhao Si, Ming Yan

Abstract:We propose a new physics-informed neural network framework, IDPINN, based on the enhancement of initialization and domain decomposition to improve prediction accuracy. We train a PINN using a small dataset to obtain an initial network structure, including the weighted matrix and bias, which initializes the PINN for each subdomain. Moreover, we leverage the smoothness condition on the interface to enhance the prediction performance. We numerically evaluated it on several forward problems and demonstrated the benefits of IDPINN in terms of accuracy.

* 20 pages, 14 figures

Via

Access Paper or Ask Questions

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Jun 03, 2024

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Abstract:Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.

* 22 pages, 11 figures, 10 Tables

Via

Access Paper or Ask Questions

Time-Varying Graph Signal Recovery Using High-Order Smoothness and Adaptive Low-rankness

May 16, 2024

Weihong Guo, Yifei Lou, Jing Qin, Ming Yan

Figure 1 for Time-Varying Graph Signal Recovery Using High-Order Smoothness and Adaptive Low-rankness

Figure 2 for Time-Varying Graph Signal Recovery Using High-Order Smoothness and Adaptive Low-rankness

Figure 3 for Time-Varying Graph Signal Recovery Using High-Order Smoothness and Adaptive Low-rankness

Figure 4 for Time-Varying Graph Signal Recovery Using High-Order Smoothness and Adaptive Low-rankness

Abstract:Time-varying graph signal recovery has been widely used in many applications, including climate change, environmental hazard monitoring, and epidemic studies. It is crucial to choose appropriate regularizations to describe the characteristics of the underlying signals, such as the smoothness of the signal over the graph domain and the low-rank structure of the spatial-temporal signal modeled in a matrix form. As one of the most popular options, the graph Laplacian is commonly adopted in designing graph regularizations for reconstructing signals defined on a graph from partially observed data. In this work, we propose a time-varying graph signal recovery method based on the high-order Sobolev smoothness and an error-function weighted nuclear norm regularization to enforce the low-rankness. Two efficient algorithms based on the alternating direction method of multipliers and iterative reweighting are proposed, and convergence of one algorithm is shown in detail. We conduct various numerical experiments on synthetic and real-world data sets to demonstrate the proposed method's effectiveness compared to the state-of-the-art in graph signal recovery.

Via

Access Paper or Ask Questions

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Apr 25, 2024

Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang

Figure 1 for TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Figure 2 for TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Figure 3 for TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Figure 4 for TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Abstract:Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLM with up to 13B parameters such as ChartLlama and ChartAst, and close-sourced general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart.

* 13 pages, 11 figures

Via

Access Paper or Ask Questions

Adaptive Feature Fusion Neural Network for Glaucoma Segmentation on Unseen Fundus Images

Apr 02, 2024

Jiyuan Zhong, Hu Ke, Ming Yan

Abstract:Fundus image segmentation on unseen domains is challenging, especially for the over-parameterized deep models trained on the small medical datasets. To address this challenge, we propose a method named Adaptive Feature-fusion Neural Network (AFNN) for glaucoma segmentation on unseen domains, which mainly consists of three modules: domain adaptor, feature-fusion network, and self-supervised multi-task learning. Specifically, the domain adaptor helps the pretrained-model fast adapt from other image domains to the medical fundus image domain. Feature-fusion network and self-supervised multi-task learning for the encoder and decoder are introduced to improve the domain generalization ability. In addition, we also design the weighted-dice-loss to improve model performance on complex optic-cup segmentation tasks. Our proposed method achieves a competitive performance over existing fundus segmentation methods on four public glaucoma datasets.

* 17 pages, 11 figures

Via

Access Paper or Ask Questions

Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based Learning

Mar 30, 2024

Xiaopeng Xie, Ming Yan, Xiwen Zhou, Chenlong Zhao, Suli Wang, Yong Zhang, Joey Tianyi Zhou

Abstract:Prompt-based learning paradigm has demonstrated remarkable efficacy in enhancing the adaptability of pretrained language models (PLMs), particularly in few-shot scenarios. However, this learning paradigm has been shown to be vulnerable to backdoor attacks. The current clean-label attack, employing a specific prompt as a trigger, can achieve success without the need for external triggers and ensure correct labeling of poisoned samples, which is more stealthy compared to the poisoned-label attack, but on the other hand, it faces significant issues with false activations and poses greater challenges, necessitating a higher rate of poisoning. Using conventional negative data augmentation methods, we discovered that it is challenging to trade off between effectiveness and stealthiness in a clean-label setting. In addressing this issue, we are inspired by the notion that a backdoor acts as a shortcut and posit that this shortcut stems from the contrast between the trigger and the data utilized for poisoning. In this study, we propose a method named Contrastive Shortcut Injection (CSI), by leveraging activation values, integrates trigger design and data selection strategies to craft stronger shortcut features. With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates. Notably, we found that the two approaches play leading roles in full-shot and few-shot settings, respectively.

* 10 pages, 6 figures, conference

Via

Access Paper or Ask Questions

Collaborative Knowledge Infusion for Low-resource Stance Detection

Mar 28, 2024

Ming Yan, Joey Tianyi Zhou, Ivor W. Tsang

Abstract:Stance detection is the view towards a specific target by a given context (\textit{e.g.} tweets, commercial reviews). Target-related knowledge is often needed to assist stance detection models in understanding the target well and making detection correctly. However, prevailing works for knowledge-infused stance detection predominantly incorporate target knowledge from a singular source that lacks knowledge verification in limited domain knowledge. The low-resource training data further increases the challenge for the data-driven large models in this task. To address those challenges, we propose a collaborative knowledge infusion approach for low-resource stance detection tasks, employing a combination of aligned knowledge enhancement and efficient parameter learning techniques. Specifically, our stance detection approach leverages target background knowledge collaboratively from different knowledge sources with the help of knowledge alignment. Additionally, we also introduce the parameter-efficient collaborative adaptor with a staged optimization algorithm, which collaboratively addresses the challenges associated with low-resource stance detection tasks from both network structure and learning perspectives. To assess the effectiveness of our method, we conduct extensive experiments on three public stance detection datasets, including low-resource and cross-target settings. The results demonstrate significant performance improvements compared to the existing stance detection approaches.

* 13 pages, 3 figures, Big Data Mining and Analysis

Via

Access Paper or Ask Questions