A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention was paid to explore song synthesis. In this work, we propose a novel task called text-to-song synthesis which incorporating both vocals and accompaniments generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built up to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found in https://text2songMelodist.github.io/Sample/.
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io .
Domain adaptive semantic segmentation is recognized as a promising technique to alleviate the domain shift between the labeled source domain and the unlabeled target domain in many real-world applications, such as automatic pilot. However, large amounts of source domain data often introduce significant costs in storage and training, and sometimes the source data is inaccessible due to privacy policies. To address these problems, we investigate domain adaptive semantic segmentation without source data, which assumes that the model is pre-trained on the source domain, and then adapting to the target domain without accessing source data anymore. Since there is no supervision from the source domain data, many self-training methods tend to fall into the ``winner-takes-all'' dilemma, where the {\it majority} classes totally dominate the segmentation networks and the networks fail to classify the {\it minority} classes. Consequently, we propose an effective framework for this challenging problem with two components: positive learning and negative learning. In positive learning, we select the class-balanced pseudo-labeled pixels with intra-class threshold, while in negative learning, for each pixel, we investigate which category the pixel does not belong to with the proposed heuristic complementary label selection. Notably, our framework can be easily implemented and incorporated with other methods to further enhance the performance. Extensive experiments on two widely-used synthetic-to-real benchmarks demonstrate our claims and the effectiveness of our framework, which outperforms the baseline with a large margin. Code is available at \url{https://github.com/fumyou13/LDBE}.
Deep neural networks have a clear degradation when applying to the unseen environment due to the covariate shift. Conventional approaches like domain adaptation requires the pre-collected target data for iterative training, which is impractical in real-world applications. In this paper, we propose to adapt the deep models to the novel environment during inference. An previous solution is test time normalization, which substitutes the source statistics in BN layers with the target batch statistics. However, we show that test time normalization may potentially deteriorate the discriminative structures due to the mismatch between target batch statistics and source parameters. To this end, we present a general formulation $\alpha$-BN to calibrate the batch statistics by mixing up the source and target statistics for both alleviating the domain shift and preserving the discriminative structures. Based on $\alpha$-BN, we further present a novel loss function to form a unified test time adaptation framework Core, which performs the pairwise class correlation online optimization. Extensive experiments show that our approaches achieve the state-of-the-art performance on total twelve datasets from three topics, including model robustness to corruptions, domain generalization on image classification and semantic segmentation. Particularly, our $\alpha$-BN improves 28.4\% to 43.9\% on GTA5 $\rightarrow$ Cityscapes without any training, even outperforms the latest source-free domain adaptation method.