
Dorien Herremans

Constructing Time-Series Momentum Portfolios with Deep Multi-Task Learning

Jun 08, 2023
Joel Ong, Dorien Herremans

A diversified risk-adjusted time-series momentum (TSMOM) portfolio can deliver substantial abnormal returns and offer some degree of tail risk protection during extreme market events. The performance of existing TSMOM strategies, however, relies not only on the quality of the momentum signal but also on the efficacy of the volatility estimator. Yet existing studies have typically treated these two factors as independent. Inspired by recent progress in Multi-Task Learning (MTL), we present a new approach using MTL in a deep neural network architecture that jointly learns portfolio construction and various auxiliary tasks related to volatility, such as forecasting realized volatility as measured by different volatility estimators. Through backtesting from January 2000 to December 2020 on a diversified portfolio of continuous futures contracts, we demonstrate that even after accounting for transaction costs of up to 3 basis points, our approach outperforms existing TSMOM strategies. Moreover, experiments confirm that adding auxiliary tasks indeed boosts the portfolio's performance. These findings demonstrate that MTL can be a powerful tool in finance.
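
To make the architecture concrete, here is a minimal sketch of a hard-parameter-sharing multi-task network of the kind described above: a shared recurrent trunk, a main head that outputs portfolio positions, and auxiliary heads that forecast realized volatility under several estimators. The LSTM trunk, layer sizes, and the simple return-based training objective are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTLMomentumNet(nn.Module):
    """Shared trunk with one head for portfolio positions and auxiliary heads
    for several realized-volatility estimators. Sizes and layer choices are
    illustrative, not the paper's."""

    def __init__(self, n_features: int, n_assets: int, n_vol_targets: int = 3, hidden: int = 64):
        super().__init__()
        self.trunk = nn.LSTM(n_features, hidden, batch_first=True)
        # Main task: a position in [-1, 1] per asset (time-series momentum weight).
        self.position_head = nn.Sequential(nn.Linear(hidden, n_assets), nn.Tanh())
        # Auxiliary tasks: forecast realized volatility under different estimators.
        self.vol_heads = nn.ModuleList(
            [nn.Linear(hidden, n_assets) for _ in range(n_vol_targets)]
        )

    def forward(self, x):
        # x: (batch, time, n_features); use the last hidden state of the trunk.
        out, _ = self.trunk(x)
        h = out[:, -1, :]
        positions = self.position_head(h)
        vol_forecasts = [head(h) for head in self.vol_heads]
        return positions, vol_forecasts

def mtl_loss(positions, next_returns, vol_forecasts, vol_targets, aux_weight=0.5):
    """Negative mean portfolio return (a stand-in for a Sharpe-style objective)
    plus weighted auxiliary MSE terms for the volatility forecasts."""
    main_loss = -(positions * next_returns).sum(dim=-1).mean()
    aux_loss = sum(F.mse_loss(pred, tgt) for pred, tgt in zip(vol_forecasts, vol_targets))
    return main_loss + aux_weight * aux_loss
```

The auxiliary weight balances the volatility-forecasting losses against the main portfolio objective; in practice it would be tuned on a validation period.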

* Expert Systems with Applications, Volume 230, 15 November 2023, 120587

Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

Feb 02, 2023
Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Ju-Chiang Wang, Yun-Ning Hung, Dorien Herremans

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist. Demo available at https://jointist.github.io/Demo.
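
The conditioning pattern described above can be sketched as follows: an instrument-recognition module (or a user-supplied instrument id) yields an instrument embedding, the transcription module consumes the spectrogram plus that embedding, and the separation module additionally consumes the predicted piano roll. Module internals, feature sizes, and the concatenation-based conditioning are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class JointistLikeModel(nn.Module):
    """Sketch of the conditioning pattern: instrument recognition conditions
    transcription and source separation. Internals (GRU sizes, concatenation
    conditioning) are illustrative assumptions."""

    def __init__(self, n_bins=229, n_instruments=39, emb_dim=64, n_pitches=88):
        super().__init__()
        self.instrument_recognizer = nn.Sequential(
            nn.Linear(n_bins, 256), nn.ReLU(), nn.Linear(256, n_instruments)
        )
        self.instrument_emb = nn.Embedding(n_instruments, emb_dim)
        self.transcriber = nn.GRU(n_bins + emb_dim, 256, batch_first=True)
        self.roll_head = nn.Linear(256, n_pitches)
        self.separator = nn.GRU(n_bins + emb_dim + n_pitches, 256, batch_first=True)
        self.mask_head = nn.Linear(256, n_bins)

    def forward(self, spec, instrument_id):
        # spec: (batch, time, n_bins); instrument_id: (batch,)
        emb = self.instrument_emb(instrument_id)
        emb_t = emb.unsqueeze(1).expand(-1, spec.size(1), -1)
        # Instrument-specific piano roll.
        h, _ = self.transcriber(torch.cat([spec, emb_t], dim=-1))
        roll = torch.sigmoid(self.roll_head(h))
        # Separation mask conditioned on both instrument and transcription.
        s, _ = self.separator(torch.cat([spec, emb_t, roll], dim=-1))
        mask = torch.sigmoid(self.mask_head(s))
        return roll, mask * spec

    def recognize(self, spec):
        # Clip-level instrument logits (mean-pooled over time); optional at
        # inference, since the condition can also be set by the user.
        return self.instrument_recognizer(spec.mean(dim=1))
```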

* arXiv admin note: text overlap with arXiv:2206.10805 

SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech

Nov 14, 2022
Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models can improve on dense models via pruning and extra retraining, or converge faster than dense models with some performance loss. Inspired by these results, we propose training TTS models using a decaying sparsity rate, i.e. a high initial sparsity to accelerate training at first, followed by a progressive rate reduction to obtain better eventual performance. This decremental approach differs from current methods of incrementing sparsity to a desired target, which costs significantly more time than dense training. We call our method SNIPER training: Single-shot Initialization Pruning Evolving-Rate training. Our experiments on FastSpeech2 show that although we were only able to obtain better losses in the first few epochs before being overtaken by the baseline, the final SNIPER-trained models beat constant-sparsity models and narrowly outperform dense models in performance.
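
A minimal sketch of the decaying-sparsity idea is shown below, using simple magnitude pruning as a stand-in for the SNIP-style single-shot pruning the method's name suggests; the schedule values and the per-epoch masking are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def sparsity_schedule(epoch, start=0.8, end=0.0, decay_epochs=20):
    """Decaying sparsity rate: start highly sparse, relax toward dense.
    The start/end values and linear decay are illustrative assumptions."""
    if epoch >= decay_epochs:
        return end
    return start + (end - start) * (epoch / decay_epochs)

def apply_magnitude_mask(model: nn.Module, sparsity: float):
    """Zero out the smallest-magnitude weights of every Linear layer so that
    a `sparsity` fraction of them is pruned for this epoch."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear) and sparsity > 0.0:
                w = module.weight
                k = int(sparsity * w.numel())
                if k == 0:
                    continue
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > threshold).float())

# Usage inside a training loop (model, optimizer, train_one_epoch are placeholders):
# for epoch in range(num_epochs):
#     apply_magnitude_mask(model, sparsity_schedule(epoch))
#     train_one_epoch(model, optimizer)
```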

Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

Nov 07, 2022
Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

Accent plays a significant role in speech communication, influencing how well speech is understood and also conveying a person's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder. It can synthesize a selected speaker's speech converted to any desired target accent. Our thorough experiments validate the effectiveness of the proposed framework using both objective and subjective evaluations. The results also show remarkable performance in terms of the ability to manipulate accents in the synthesized speech and provide a promising avenue for future accented TTS research.
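
As a rough illustration of the conditioning mechanism, the sketch below shows a minimal conditional VAE whose encoder and decoder are conditioned on speaker and accent embeddings; at synthesis time the accent id can be swapped to convert the selected speaker's speech to a target accent. The frame-level formulation, feature sizes, and loss weighting are assumptions; the actual system operates inside a full TTS pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccentCVAE(nn.Module):
    """Minimal conditional VAE: encode an acoustic feature into a latent, then
    decode it conditioned on speaker and accent embeddings. Sizes are assumptions."""

    def __init__(self, feat_dim=80, latent_dim=16, n_speakers=10, n_accents=6, emb_dim=32):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, emb_dim)
        self.accent_emb = nn.Embedding(n_accents, emb_dim)
        self.encoder = nn.Sequential(nn.Linear(feat_dim + 2 * emb_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 2 * emb_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim)
        )

    def forward(self, x, speaker_id, accent_id):
        cond = torch.cat([self.speaker_emb(speaker_id), self.accent_emb(accent_id)], dim=-1)
        h = self.encoder(torch.cat([x, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

def cvae_loss(recon, x, mu, logvar, beta=1.0):
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl
```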

* preprint submitted to a conference, under review 

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

Oct 11, 2022
Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji

In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic-looking piano rolls from pure Gaussian noise conditioned on spectrograms. This new AMT formulation enables DiffRoll to transcribe, generate, and even inpaint music. Thanks to its classifier-free conditioning, DiffRoll can also be trained on unpaired datasets where only piano rolls are available. Our experiments show that DiffRoll outperforms its discriminative counterpart by 17.9 percentage points (ppt.) and our ablation studies also indicate that it outperforms similar existing methods by 3.70 ppt.
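
The following sketch shows the core of such a conditional diffusion training step with classifier-free condition dropping: noise a piano roll according to the diffusion schedule, randomly replace the spectrogram condition with a constant placeholder, and train the network to predict the noise. The toy denoiser, the -1 placeholder, and the frame-level shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RollDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a piano-roll frame, conditioned
    on a spectrogram frame and the diffusion step. Architecture is an assumption."""

    def __init__(self, roll_dim=88, spec_dim=229, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(roll_dim + spec_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, roll_dim),
        )

    def forward(self, noisy_roll, spec, t):
        return self.net(torch.cat([noisy_roll, spec, t], dim=-1))

def training_step(model, roll, spec, alphas_cumprod, p_uncond=0.1):
    """One DDPM-style step with classifier-free condition dropping: with
    probability p_uncond the spectrogram is replaced by a constant, which is
    what allows training on unpaired piano rolls as well.
    alphas_cumprod: 1-D tensor of cumulative noise-schedule products,
    e.g. torch.cumprod(1 - betas, dim=0)."""
    b = roll.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (b,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)                  # (b, 1)
    noise = torch.randn_like(roll)
    noisy = a_bar.sqrt() * roll + (1 - a_bar).sqrt() * noise
    drop = (torch.rand(b, 1) < p_uncond).float()
    spec_in = spec * (1 - drop) + (-1.0) * drop              # -1 as the "no condition" token (assumed)
    pred = model(noisy, spec_in, t.float().unsqueeze(-1) / alphas_cumprod.size(0))
    return F.mse_loss(pred, noise)
```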

Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

Jun 28, 2022
Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Amy Hung, Ju-Chiang Wang, Dorien Herremans

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for explicit multi-instrument functionality, while the connection between the transcription and source separation modules is intended to improve transcription performance. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be utilized as a preprocessing module for other music analysis tasks. In experiments on several downstream tasks, the symbolic representation provided by our transcription model turned out to be a helpful complement to spectrograms for downbeat detection, chord recognition, and key estimation.
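
The downstream-preprocessing idea in the last sentence can be sketched as follows: a frozen transcription model produces a piano roll that is concatenated with the spectrogram and fed to a downstream classifier such as a key or chord estimator. The frozen transcriber, concatenation-based fusion, and feature sizes are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class DownstreamWithTranscription(nn.Module):
    """Use a transcription model as a preprocessing module: its predicted piano
    roll is concatenated with the spectrogram for a downstream classifier."""

    def __init__(self, transcriber: nn.Module, spec_dim=229, roll_dim=88, n_classes=24):
        super().__init__()
        self.transcriber = transcriber
        for p in self.transcriber.parameters():
            p.requires_grad = False            # transcription used purely as a feature extractor
        self.classifier = nn.Sequential(
            nn.Linear(spec_dim + roll_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, spec):
        # spec: (batch, time, spec_dim); the transcriber is assumed to return a
        # piano roll of shape (batch, time, roll_dim) for the same frames.
        with torch.no_grad():
            roll = self.transcriber(spec)
        fused = torch.cat([spec, roll], dim=-1)
        return self.classifier(fused).mean(dim=1)  # clip-level logits via temporal mean-pooling
```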

* Submitted to ISMIR 

A multimodal model with Twitter FinBERT embeddings for extreme price movement prediction of Bitcoin

May 30, 2022
Yanzhao Zou, Dorien Herremans

Bitcoin, with its ever-growing popularity, has demonstrated extreme price volatility since its origin. This volatility, together with its decentralised nature, makes Bitcoin highly subject to speculative trading compared to more traditional assets. In this paper, we propose a multimodal model for predicting extreme price fluctuations. This model takes as input a variety of correlated assets, technical indicators, as well as Twitter content. In an in-depth study, we explore whether social media discussions from the general public on Bitcoin have predictive power for extreme price movements. A dataset of 5,000 tweets per day containing the keyword `Bitcoin' was collected from 2015 to 2021. This dataset, called PreBit, is made available online. In our hybrid model, we use sentence-level FinBERT embeddings, pretrained on financial lexicons, so as to capture the full contents of the tweets and feed them to the model in an understandable way. By combining these embeddings with a Convolutional Neural Network, we built a predictive model for significant market movements. The final multimodal ensemble model includes this NLP model together with a model based on candlestick data, technical indicators and correlated asset prices. In an ablation study, we explore the contribution of the individual modalities. Finally, we propose and backtest a trading strategy based on the predictions of our models with a varying prediction threshold, and show that it can be used to build a profitable trading strategy with reduced risk compared to a `hold' or moving average strategy.
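
A minimal sketch of such a multimodal classifier is given below: a 1-D CNN pools over the day's sentence-level tweet embeddings, an MLP encodes candlestick/technical/correlated-asset features, and a joint head predicts whether an extreme move follows. Embedding size, channel counts, and the fusion scheme are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MultimodalExtremeMoveModel(nn.Module):
    """Sketch of a multimodal classifier for 'extreme price move tomorrow?':
    a 1-D CNN over daily sentence-level tweet embeddings (e.g. 768-d FinBERT)
    plus an MLP over market features. Sizes and fusion are assumptions."""

    def __init__(self, emb_dim=768, n_market_feats=32, n_classes=2):
        super().__init__()
        self.tweet_cnn = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),          # pool over the tweets of the day
        )
        self.market_mlp = nn.Sequential(nn.Linear(n_market_feats, 64), nn.ReLU())
        self.head = nn.Linear(128 + 64, n_classes)

    def forward(self, tweet_embs, market_feats):
        # tweet_embs: (batch, n_tweets, emb_dim); market_feats: (batch, n_market_feats)
        t = self.tweet_cnn(tweet_embs.transpose(1, 2)).squeeze(-1)   # (batch, 128)
        m = self.market_mlp(market_feats)                            # (batch, 64)
        return self.head(torch.cat([t, m], dim=-1))
```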

* 18 pages, submitted preprint to Elsevier Expert Systems with Applications 

Understanding Audio Features via Trainable Basis Functions

Apr 25, 2022
Kwan Yee Heung, Kin Wai Cheuk, Dorien Herremans

In this paper we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two different tasks, namely keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are typically fine-tuned and optimized in experiments. Input features, however, are often treated as fixed. In the case of audio, signals can be expressed in two main ways: raw waveforms (time domain) or spectrograms (time-frequency domain). In addition, different spectrogram types are often used and tailored to fit different applications. In our experiments, we allow for this tailoring directly as part of the network. Our experimental results show that using trainable basis functions can boost KWS accuracy by 14.2 percentage points, and lower the Phone Error Rate (PER) by 9.5 percentage points. Although models using trainable basis functions become less effective as the model complexity increases, the trained filter shapes could still provide us with insights on which frequency bins are important for that specific task. From our experiments, we conclude that trainable basis functions are a useful tool to boost performance when the model complexity is limited.
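
One common way to make spectrogram basis functions trainable is to implement the time-frequency front end as a convolution initialized with Fourier kernels and left free to be updated by backpropagation; the sketch below follows that idea. The window length, hop size, and the absence of a window function are simplifying assumptions, not the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn

class TrainableSpectrogram(nn.Module):
    """STFT-like front end whose basis functions are trainable: a Conv1d is
    initialized with cosine and sine kernels and then updated by backprop
    together with the rest of the network."""

    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        n_bins = n_fft // 2 + 1
        self.conv = nn.Conv1d(1, 2 * n_bins, kernel_size=n_fft, stride=hop, bias=False)
        # Initialize with Fourier basis functions; they remain trainable.
        n = torch.arange(n_fft).float()
        k = torch.arange(n_bins).float().unsqueeze(1)
        cos_kernels = torch.cos(2 * math.pi * k * n / n_fft)
        sin_kernels = -torch.sin(2 * math.pi * k * n / n_fft)
        init = torch.cat([cos_kernels, sin_kernels], dim=0).unsqueeze(1)  # (2*n_bins, 1, n_fft)
        with torch.no_grad():
            self.conv.weight.copy_(init)

    def forward(self, waveform):
        # waveform: (batch, samples) -> magnitude "spectrogram" (batch, n_bins, frames)
        out = self.conv(waveform.unsqueeze(1))
        real, imag = out.chunk(2, dim=1)
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)
```

After training, inspecting the learned kernels (instead of the fixed Fourier ones) is what allows the kind of analysis of important frequency bins that the abstract mentions.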

* under review in Interspeech 2022 

HEAR 2021: Holistic Evaluation of Audio Representations

Mar 26, 2022
Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.
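
For context, the sketch below shows a minimal embedding module in the spirit of the challenge's common API (a load_model entry point plus clip-level and timestamp-level embedding functions); the exact function names, attributes, and signatures are quoted from memory and should be treated as illustrative rather than authoritative.

```python
import torch
import torch.nn as nn

class TinyEmbedder(nn.Module):
    """Toy audio embedder: a single strided convolution over the waveform."""
    sample_rate = 16000
    scene_embedding_size = 128
    timestamp_embedding_size = 128

    def __init__(self):
        super().__init__()
        # ~25 ms window / 10 ms hop at 16 kHz.
        self.frontend = nn.Conv1d(1, 128, kernel_size=400, stride=160)

    def forward(self, audio):
        # audio: (batch, samples) -> (batch, frames, 128)
        return torch.relu(self.frontend(audio.unsqueeze(1))).transpose(1, 2)

def load_model(model_file_path: str = "") -> nn.Module:
    return TinyEmbedder()

def get_timestamp_embeddings(audio: torch.Tensor, model: nn.Module):
    emb = model(audio)
    frames = emb.size(1)
    hop_ms = 1000.0 * 160 / model.sample_rate
    timestamps = torch.arange(frames).float().mul(hop_ms).expand(audio.size(0), frames)
    return emb, timestamps

def get_scene_embeddings(audio: torch.Tensor, model: nn.Module) -> torch.Tensor:
    emb, _ = get_timestamp_embeddings(audio, model)
    return emb.mean(dim=1)  # one embedding per clip
```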

* to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track 