We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
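To make the residual vector quantizer concrete, the following is a minimal NumPy sketch of the cascade idea: each stage quantizes the residual left by the previous stage, and using fewer stages at inference time lowers the bitrate. The codebook sizes, dimensions, and nearest-neighbor rule are illustrative assumptions, not SoundStream's actual configuration.

```python
import numpy as np

def residual_vq(x, codebooks):
    """Quantize x with a cascade of codebooks: each stage encodes
    the residual that the previous stages left behind."""
    residual = x.copy()
    reconstruction = np.zeros_like(x)
    codes = []
    for cb in codebooks:                       # cb: (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)                      # nearest codeword index
        reconstruction += cb[idx]
        residual -= cb[idx]                    # pass the residual on
    return codes, reconstruction

rng = np.random.default_rng(0)
dim, n_stages, cb_size = 8, 4, 16
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(n_stages)]
x = rng.normal(size=dim)

# Dropping later quantizer stages (as structured dropout trains the
# model to tolerate) trades reconstruction quality for bitrate.
for k in (1, 2, 4):
    _, recon = residual_vq(x, codebooks[:k])
    print(k, "stages, reconstruction error:", np.linalg.norm(x - recon))
```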
The rise of deep learning algorithms has led many researchers to move away from classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation, and the choice of architecture is tightly coupled to the audio representation. A sound's raw waveform can be too dense and rich for deep learning models to handle efficiently, and this complexity increases training time and computational cost. The raw waveform also does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio is transformed into a compressed and more meaningful form through downsampling, feature extraction, or the adoption of a higher-level representation of the waveform. Furthermore, depending on the chosen representation, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
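As a concrete instance of such a transformation, raw audio is commonly reduced to a log-mel spectrogram, which is both more compact than the waveform and closer to how loudness and pitch are perceived. A minimal sketch using librosa follows; the file path and parameter values are illustrative assumptions.

```python
import numpy as np
import librosa

# Load a waveform (path and sampling rate are illustrative).
y, sr = librosa.load("example.wav", sr=22050)

# The log-mel spectrogram compresses the dense raw waveform into a
# perceptually motivated time-frequency representation.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

print("raw samples:", y.shape, "-> mel frames:", log_mel.shape)
```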
Contextual bandit algorithms have been successfully deployed in various industrial applications for their trade-off between exploration and exploitation and their state-of-the-art performance in minimizing online costs. However, their applicability is limited by over-simplified assumptions about the problem, such as assuming that rewards depend linearly on the contexts, or assuming a static environment where the states are not affected by previous actions. In this work, we put forward an alternative method for general contextual bandits that uses actor-critic neural networks to optimize directly in the policy space, coined policy gradient for contextual bandits (PGCB). It optimizes over a class of policies in which the marginal probability of choosing an arm (in expectation over the other arms) has a simple closed form, so that the objective is differentiable. In particular, the gradient of this class of policies takes a succinct form. Moreover, we propose two useful heuristic techniques called Time-Dependent Greed and Actor-Dropout. The former ensures that PGCB is empirically greedy in the limit, while the latter balances the trade-off between exploration and exploitation by using the actor network with dropout as a Bayesian approximation. PGCB can solve contextual bandits in the standard case as well as in the Markov Decision Process generalization, where there is a state that determines the distribution of arm contexts and affects the immediate reward when choosing an arm; it can therefore be applied to a wide range of realistic settings such as personalized recommender systems and natural language generation. We evaluate PGCB on toy datasets as well as on a music recommender dataset. Experiments show that PGCB converges quickly, achieves low regret, and outperforms both classic contextual-bandit methods and vanilla policy gradient methods.
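The Actor-Dropout heuristic can be sketched in a few lines of PyTorch: keep dropout active when choosing an arm, so that each forward pass acts as an approximate posterior sample and supplies exploration. The network sizes and dropout rate below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Scores each arm's context vector; with dropout left active,
    every forward pass is an approximate posterior sample."""
    def __init__(self, context_dim, hidden=64, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, contexts):                # (n_arms, context_dim)
        return self.net(contexts).squeeze(-1)   # one score per arm

actor = Actor(context_dim=10)
actor.train()  # keep dropout on at decision time: this drives exploration

contexts = torch.randn(5, 10)                # contexts of 5 candidate arms
arm = torch.argmax(actor(contexts)).item()   # randomized by dropout noise
print("chosen arm:", arm)
```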
Most scene graph generators use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g. multiple cups). The second, Proximal Relationship Ambiguity, arises when multiple subject-predicate-object triplets appear in close proximity with the same predicate, and the model struggles to infer the correct subject-object pairings (e.g. mis-pairing musicians and their instruments). We propose a set of contrastive loss formulations that specifically target these types of errors within the scene graph generation problem, collectively termed the Graphical Contrastive Losses. These losses explicitly force the model to disambiguate related and unrelated instances through margin constraints specific to each type of confusion. We further construct a relationship detector, called RelDN, using the aforementioned pipeline to demonstrate the efficacy of our proposed losses. Our model outperforms the winning method of the OpenImages Relationship Detection Challenge by 4.7% (16.5% relative) on the test set. We also show improved results over the best previous methods on the Visual Genome and Visual Relationship Detection datasets.
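Each of these losses follows the standard margin pattern: the score of a ground-truth pairing must exceed the score of a confusable pairing by at least a margin. The schematic PyTorch version below uses an illustrative margin value and made-up scores; it shows the general hinge form, not the exact RelDN formulation.

```python
import torch
import torch.nn.functional as F

def margin_contrast_loss(pos_scores, neg_scores, margin=0.2):
    """Hinge on the gap between the score of a correct
    subject-predicate-object pairing and a confusable one
    (e.g. the wrong instance of the same entity class)."""
    return F.relu(margin + neg_scores - pos_scores).mean()

# Scores for matched triplets vs. their hardest confusable pairings.
pos = torch.tensor([0.9, 0.7, 0.8])
neg = torch.tensor([0.5, 0.65, 0.85])
print(margin_contrast_loss(pos, neg))  # pairs violating the margin are penalized
```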
The shuffle mode, in which songs are played in a randomized order decided for all tracks at once, is a common feature of music player systems. Few music enthusiasts use this mode, however, since it is either too random to suit their mood or keeps repeating the same list every time. In this paper, we propose to build a convolutional deep belief network (CDBN) trained to perform genre recognition based on audio features retrieved from the records of the Million Song Dataset. The learned parameters are used to initialize a multi-layer perceptron that takes extracted features of the user's playlist as input, alongside the metadata, and classifies them into various categories. These categories are then shuffled, guided by the metadata, to autonomously produce a playlist of songs that listeners actually want to hear under normal conditions.
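The transfer step, reusing generatively pre-trained parameters to initialize a supervised classifier, can be sketched as follows. The layer sizes, the flattened (non-convolutional) weights, and the number of genre classes are illustrative assumptions; a real CDBN would first be pre-trained layer-wise on the audio features.

```python
import torch
import torch.nn as nn

# Stand-ins for parameters learned during unsupervised pre-training
# on Million Song Dataset audio features.
pretrained = [torch.randn(256, 128), torch.randn(128, 256)]

mlp = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),              # supervised head: 10 genre classes
)

# Initialize the hidden layers from the pre-trained parameters; only
# the final classification layer starts from scratch.
with torch.no_grad():
    mlp[0].weight.copy_(pretrained[0])
    mlp[2].weight.copy_(pretrained[1])

features = torch.randn(4, 128)       # extracted playlist features
print(mlp(features).shape)           # (4, 10) genre logits
```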
One challenge with open-domain dialogue systems is the need to produce truthful, high-quality responses on any topic. We aim to improve the quality and coverage of Athena, an Alexa Prize dialogue system. We experiment with few-shot prompt-based learning, comparing GPT-Neo to Jurassic-1, for the movies, music, TV, sports, and video game domains, both within and cross-domain, with different prompt set sizes (2, 3, 10), formats, and meaning representations consisting of either sets of WikiData KG triples or dialogue acts. Our evaluation uses BLEURT and human metrics, and shows that with 10-shot prompting, Athena-Jurassic's performance is significantly better for coherence and semantic accuracy. Experiments with 2-shot cross-domain prompts result in a large performance drop for Athena-GPT-Neo, whose semantic accuracy falls to 0.41 and whose untrue hallucination rate increases to 12%. Experiments with dialogue acts for video games show that with 10-shot prompting, both models learn to control dialogue acts, but Athena-Jurassic has significantly higher coherence and only 4% untrue hallucinations. Our results suggest that Athena-Jurassic produces outputs of high enough quality to be useful in live systems with real users. To our knowledge, these are the first results demonstrating that few-shot semantic prompt-based learning can create NLGs that generalize to new domains, and produce high-quality, semantically controlled, conversational responses directly from meaning representations.
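A few-shot prompt of this kind can be assembled mechanically from the meaning representations. The sketch below pairs WikiData-style KG triples with reference responses and leaves the final response for the model to complete; the triples, template, and separators are illustrative assumptions, not Athena's actual prompt format.

```python
# Each example pairs a meaning representation (KG triples) with a
# reference response; the final item carries only triples, and the
# language model is asked to complete the response.
examples = [
    ([("The Beatles", "genre", "rock"),
      ("The Beatles", "origin", "Liverpool")],
     "The Beatles were a rock band from Liverpool."),
    ([("Miles Davis", "instrument", "trumpet")],
     "Miles Davis was famous for playing the trumpet."),
]
query = [("Queen", "member", "Freddie Mercury")]

def render(triples):
    return "; ".join(f"{s} | {p} | {o}" for s, p, o in triples)

prompt = "\n\n".join(
    f"Triples: {render(t)}\nResponse: {r}" for t, r in examples
)
prompt += f"\n\nTriples: {render(query)}\nResponse:"
print(prompt)
```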