Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiaodong He

Department of R and D, UnionString Technology Co. Ltd

Leveraging Label Information for Multimodal Emotion Recognition

Sep 05, 2023

Peiying Wang, Sunlu Zeng, Junqing Chen, Lu Fan, Meng Chen, Youzheng Wu, Xiaodong He

Abstract:Multimodal emotion recognition (MER) aims to detect the emotional status of a given expression by combining the speech and text information. Intuitively, label information should be capable of helping the model locate the salient tokens/frames relevant to the specific emotion, which finally facilitates the MER task. Inspired by this, we propose a novel approach for MER by leveraging label information. Specifically, we first obtain the representative label embeddings for both text and speech modalities, then learn the label-enhanced text/speech representations for each utterance via label-token and label-frame interactions. Finally, we devise a novel label-guided attentive fusion module to fuse the label-aware text and speech representations for emotion classification. Extensive experiments were conducted on the public IEMOCAP dataset, and experimental results demonstrate that our proposed approach outperforms existing baselines and achieves new state-of-the-art performance.

* Accepted by Interspeech 2023

Via

Access Paper or Ask Questions

AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Jun 16, 2023

Yu Lu, Junwei Bao, Zichen Ma, Xiaoguang Han, Youzheng Wu, Shuguang Cui, Xiaodong He

Figure 1 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Figure 2 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Figure 3 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Figure 4 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Abstract:High-quality data is essential for conversational recommendation systems and serves as the cornerstone of the network architecture development and training strategy design. Existing works contribute heavy human efforts to manually labeling or designing and extending recommender dialogue templates. However, they suffer from (i) the limited number of human annotators results in that datasets can hardly capture rich and large-scale cases in the real world, (ii) the limited experience and knowledge of annotators account for the uninformative corpus and inappropriate recommendations. In this paper, we propose a novel automatic dataset synthesis approach that can generate both large-scale and high-quality recommendation dialogues through a data2text generation process, where unstructured recommendation conversations are generated from structured graphs based on user-item information from the real world. In doing so, we comprehensively exploit: (i) rich personalized user profiles from traditional recommendation datasets, (ii) rich external knowledge from knowledge graphs, and (iii) the conversation ability contained in human-to-human conversational recommendation datasets. Extensive experiments validate the benefit brought by the automatically synthesized data under low-resource scenarios and demonstrate the promising potential to facilitate the development of a more effective conversational recommendation system.

* 10 pages

Via

Access Paper or Ask Questions

OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Jun 05, 2023

Li Fu, Siqi Li, Qingtao Li, Fangzhu Li, Liping Deng, Lu Fan, Meng Chen, Youzheng Wu, Xiaodong He

Figure 1 for OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Figure 2 for OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Figure 3 for OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Figure 4 for OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Abstract:Self-Supervised Learning (SSL) Automatic Speech Recognition (ASR) models have shown great promise over Supervised Learning (SL) ones in low-resource settings. However, the advantages of SSL are gradually weakened when the amount of labeled data increases in many industrial applications. To further improve the ASR performance when abundant labels are available, we first explore the potential of combining SL and SSL ASR models via analyzing their complementarity in recognition accuracy and optimization property. Then, we propose a novel Optimal Transport based Fusion (OTF) method for SL and SSL models without incurring extra computation cost in inference. Specifically, optimal transport is adopted to softly align the layer-wise weights to unify the two different networks into a single one. Experimental results on the public 1k-hour English LibriSpeech dataset and our in-house 2.6k-hour Chinese dataset show that OTF largely outperforms the individual models with lower error rates.

* Accepted by Interspeech 2023

Via

Access Paper or Ask Questions

DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation

Jun 02, 2023

Guanqun Bi, Lei Shen, Yanan Cao, Meng Chen, Yuqiang Xie, Zheng Lin, Xiaodong He

Figure 1 for DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation

Figure 2 for DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation

Figure 3 for DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation

Figure 4 for DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation

Abstract:Empathy is a crucial factor in open-domain conversations, which naturally shows one's caring and understanding to others. Though several methods have been proposed to generate empathetic responses, existing works often lead to monotonous empathy that refers to generic and safe expressions. In this paper, we propose to use explicit control to guide the empathy expression and design a framework DiffusEmp based on conditional diffusion language model to unify the utilization of dialogue context and attribute-oriented control signals. Specifically, communication mechanism, intent, and semantic frame are imported as multi-grained signals that control the empathy realization from coarse to fine levels. We then design a specific masking strategy to reflect the relationship between multi-grained signals and response tokens, and integrate it into the diffusion model to influence the generative process. Experimental results on a benchmark dataset EmpatheticDialogue show that our framework outperforms competitive baselines in terms of controllability, informativeness, and diversity without the loss of context-relatedness.

* accepted by ACL 2023 main conference (Oral)

Via

Access Paper or Ask Questions

Learning to Generate Poetic Chinese Landscape Painting with Calligraphy

May 08, 2023

Shaozu Yuan, Aijun Dai, Zhiling Yan, Ruixue Liu, Meng Chen, Baoyang Chen, Zhijie Qiu, Xiaodong He

Figure 1 for Learning to Generate Poetic Chinese Landscape Painting with Calligraphy

Figure 2 for Learning to Generate Poetic Chinese Landscape Painting with Calligraphy

Figure 3 for Learning to Generate Poetic Chinese Landscape Painting with Calligraphy

Figure 4 for Learning to Generate Poetic Chinese Landscape Painting with Calligraphy

Abstract:In this paper, we present a novel system (denoted as Polaca) to generate poetic Chinese landscape painting with calligraphy. Unlike previous single image-to-image painting generation, Polaca takes the classic poetry as input and outputs the artistic landscape painting image with the corresponding calligraphy. It is equipped with three different modules to complete the whole piece of landscape painting artwork: the first one is a text-to-image module to generate landscape painting image, the second one is an image-to-image module to generate stylistic calligraphy image, and the third one is an image fusion module to fuse the two images into a whole piece of aesthetic artwork.

* Accepted by IJCAI 2022

Via

Access Paper or Ask Questions

A Novel Vector-Field-Based Motion Planning for 3D Nonholonomic Robots

Feb 22, 2023

Xiaodong He, Zhiyong Sun, Zhongkui Li

Abstract:This paper focuses on the motion planning for mobile robots in 3D, which are modelled by 6-DOF rigid body systems with nonholonomic constraints. We not only specify the target position, but also bring in the requirement of the heading direction at the terminal time, which gives rise to a new and more challenging 3D motion planning problem. The proposed planning algorithm involves a novel velocity vector field (VF) over the workspace, and by following the VF, the robot can be navigated to the destination with the specified heading direction. In order to circumvent potential collisions with obstacles and other robots, a composite VF is presented by composing the navigation VF and an additional VF tangential to the boundary of the dangerous area. Moreover, we propose a priority-based algorithm to deal with the motion coupling among multiple robots. Finally, numerical simulations are conducted to verify the theoretical results.

Via

Access Paper or Ask Questions

MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

Nov 27, 2022

Meihuizi Jia, Lei Shen, Xin Shen, Lejian Liao, Meng Chen, Xiaodong He, Zhendong Chen, Jiaqi Li

Figure 1 for MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

Figure 2 for MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

Figure 3 for MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

Figure 4 for MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

Abstract:Multimodal named entity recognition (MNER) is a critical step in information extraction, which aims to detect entity spans and classify them to corresponding entity types given a sentence-image pair. Existing methods either (1) obtain named entities with coarse-grained visual clues from attention mechanisms, or (2) first detect fine-grained visual regions with toolkits and then recognize named entities. However, they suffer from improper alignment between entity types and visual regions or error propagation in the two-stage manner, which finally imports irrelevant visual information into texts. In this paper, we propose a novel end-to-end framework named MNER-QG that can simultaneously perform MRC-based multimodal named entity recognition and query grounding. Specifically, with the assistance of queries, MNER-QG can provide prior knowledge of entity types and visual regions, and further enhance representations of both texts and images. To conduct the query grounding task, we provide manual annotations and weak supervisions that are obtained via training a highly flexible visual grounding model with transfer learning. We conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MNER-QG outperforms the current state-of-the-art models on the MNER task, and also improves the query grounding performance.

* 13 pages, 6 figures, published to AAAI

Via

Access Paper or Ask Questions

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Nov 27, 2022

Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, Tianrui Li

Figure 1 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Figure 2 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Figure 3 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Figure 4 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Abstract:Recently, the contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. The pre-trained model can capture enriched visual concepts for images by learning from a large scale of text-image data. However, transferring the learned visual knowledge to open-vocabulary semantic segmentation is still under-explored. In this paper, we propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation in an annotation-free manner. The SegCLIP achieves segmentation based on ViT and the main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs. The gathering operation can dynamically capture the semantic groups, which can be used to generate the final segmentation results. We further propose a reconstruction loss on masked patches and a superpixel-based KL loss with pseudo-labels to enhance the visual representation. Experimental results show that our model achieves comparable or superior segmentation accuracy on the PASCAL VOC 2012 (+1.4% mIoU), PASCAL Context (+2.4% mIoU), and COCO (+5.6% mIoU) compared with baselines. We release the code at https://github.com/ArrowLuo/SegCLIP.

Via

Access Paper or Ask Questions

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

Nov 22, 2022

Wei Song, Yanghao Yue, Ya-jie Zhang, Zhengchen Zhang, Youzheng Wu, Xiaodong He

Abstract:Disentanglement of a speaker's timbre and style is very important for style transfer in multi-speaker multi-style text-to-speech (TTS) scenarios. With the disentanglement of timbres and styles, TTS systems could synthesize expressive speech for a given speaker with any style which has been seen in the training corpus. However, there are still some shortcomings with the current research on timbre and style disentanglement. The current method either requires single-speaker multi-style recordings, which are difficult and expensive to collect, or uses a complex network and complicated training method, which is difficult to reproduce and control the style transfer behavior. To improve the disentanglement effectiveness of timbres and styles, and to remove the reliance on single-speaker multi-style corpus, a simple but effective timbre and style disentanglement method is proposed in this paper. The FastSpeech2 network is employed as the backbone network, with explicit duration, pitch, and energy trajectory to represent the style. Each speaker's data is considered as a separate and isolated style, then a speaker embedding and a style embedding are added to the FastSpeech2 network to learn disentangled representations. Utterance level pitch and energy normalization are utilized to improve the decoupling effect. Experimental results demonstrate that the proposed model could synthesize speech with any style seen during training with high style similarity while maintaining very high speaker similarity.

Via

Access Paper or Ask Questions

MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Nov 11, 2022

Haoning Zhang, Junwei Bao, Haipeng Sun, Youzheng Wu, Wenye Li, Shuguang Cui, Xiaodong He

Figure 1 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Figure 2 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Figure 3 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Figure 4 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Abstract:Dialogue state tracking (DST) aims to convert the dialogue history into dialogue states which consist of slot-value pairs. As condensed structural information memorizing all history information, the dialogue state in the last turn is typically adopted as the input for predicting the current state by DST models. However, these models tend to keep the predicted slot values unchanged, which is defined as state momentum in this paper. Specifically, the models struggle to update slot values that need to be changed and correct wrongly predicted slot values in the last turn. To this end, we propose MoNET to tackle state momentum via noise-enhanced training. First, the previous state of each turn in the training data is noised via replacing some of its slot values. Then, the noised previous state is used as the input to learn to predict the current state, improving the model's ability to update and correct slot values. Furthermore, a contrastive context matching framework is designed to narrow the representation distance between a state and its corresponding noised variant, which reduces the impact of noised state and makes the model better understand the dialogue history. Experimental results on MultiWOZ datasets show that MoNET outperforms previous DST methods. Ablations and analysis verify the effectiveness of MoNET in alleviating state momentum and improving anti-noise ability.

* 8 pages, 6 figures, 3 tables

Via

Access Paper or Ask Questions