Jin-Mao Wei

A General Framework for Learning Prosodic-Enhanced Representation of Rap Lyrics

Mar 23, 2021

Hongru Liang, Haozheng Wang, Qian Li, Jun Wang, Guandong Xu, Jiawei Chen, Jin-Mao Wei, Zhenglu Yang

Abstract: Learning and analyzing rap lyrics is a significant basis for many web applications, such as music recommendation, automatic music categorization, and music information retrieval, given the abundance of digital music on the World Wide Web. Although numerous studies have explored the topic, knowledge in this field remains far from satisfactory because critical issues, such as prosodic information and its effective representation, as well as the appropriate integration of various features, are usually ignored. In this paper, we propose a hierarchical attention variational autoencoder framework (HAVAE), which simultaneously considers semantic and prosodic features for rap lyrics representation learning. Specifically, the representation of the prosodic features is encoded from phonetic transcriptions with a novel and effective strategy (i.e., rhyme2vec). Moreover, a feature aggregation strategy is proposed to appropriately integrate the various features and generate a prosodic-enhanced representation. A comprehensive empirical evaluation demonstrates that the proposed framework outperforms state-of-the-art approaches under various metrics in different rap lyrics learning tasks.

JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features

Jun 05, 2018

Hongru Liang, Haozheng Wang, Jun Wang, Shaodi You, Zhe Sun, Jin-Mao Wei, Zhenglu Yang

Abstract: Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems, among others. In contrast with previous works that focus mainly on single-modal or bi-modal learning, we propose to learn social media content by jointly fusing textual, acoustic, and visual information (JTAV). Effective strategies, namely attBiGRU and DCRNN, are proposed to extract fine-grained features of each modality. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates that our proposed model outperforms state-of-the-art approaches by a large margin.
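
The abstract does not spell out how attentive pooling is computed, so the following is only a generic sketch of the idea, not the JTAV implementation. It assumes dot-product scoring against a query vector and softmax weights; the function name `attentive_pool`, the `query` parameter, and the toy feature vectors are all hypothetical.

```python
import math


def attentive_pool(vectors, query):
    """Return a softmax-weighted average of `vectors` and the weights used.

    Each vector is scored by its dot product with `query` (an assumed
    scoring function; a learned scorer would normally replace it).
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]

    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]

    # Weighted sum over the vectors, one output value per dimension.
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


# Toy stand-ins for textual, acoustic, and visual feature vectors.
modalities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero query scores every vector equally, so pooling reduces to a plain mean.
pooled, weights = attentive_pool(modalities, query=[0.0, 0.0])
```

With a non-zero query, vectors that align more closely with it receive larger weights, which is how a pooling layer of this kind can emphasize the more informative modality.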
