Yi Yu

Melody-conditioned lyrics generation via fine-tuning language model and its evaluation with ChatGPT

Oct 02, 2023
Zhe Zhang, Karol Lasocki, Yi Yu, Atsuhiro Takasu

We leverage character-level language models for syllable-level lyrics generation from symbolic melody. By fine-tuning a character-level pre-trained model, we integrate language knowledge into the beam search of a syllable-level Transformer generator. Using ChatGPT-based evaluations, we demonstrate enhanced coherence and correctness in the generated lyrics.
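
A minimal sketch of the fusion idea described above: beam-search hypotheses from a syllable-level generator are re-scored with a character-level language model. The scoring functions, vocabulary, and fusion weight here are hypothetical stand-ins, not the paper's models.

```python
import math
from typing import List, Tuple

# Hypothetical scoring stubs -- stand-ins for the syllable-level Transformer
# and the fine-tuned character-level language model described in the abstract.
def syllable_model_logprobs(prefix: List[str], vocab: List[str]) -> List[float]:
    # Uniform scores; a real generator would condition on the melody and the prefix.
    return [math.log(1.0 / len(vocab))] * len(vocab)

def char_lm_logprob(text: str) -> float:
    # Toy character LM: shorter strings score higher; a real fine-tuned
    # character-level LM would return the log-probability of the text.
    return -0.1 * len(text)

def beam_search_step(beams: List[Tuple[List[str], float]],
                     vocab: List[str],
                     beam_width: int = 3,
                     lm_weight: float = 0.5) -> List[Tuple[List[str], float]]:
    """Expand each beam by one syllable, fusing generator and char-LM scores."""
    candidates = []
    for prefix, score in beams:
        gen_scores = syllable_model_logprobs(prefix, vocab)
        for syllable, g in zip(vocab, gen_scores):
            hyp = prefix + [syllable]
            # Shallow fusion: generator log-prob + weighted char-LM log-prob.
            fused = score + g + lm_weight * char_lm_logprob("".join(hyp))
            candidates.append((hyp, fused))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]

if __name__ == "__main__":
    vocab = ["la", "love", "night", "light"]
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(4):
        beams = beam_search_step(beams, vocab)
    print(beams[0][0])
```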


Music- and Lyrics-driven Dance Synthesis

Sep 30, 2023
Wenjie Yin, Qingyuan Yao, Yi Yu, Hang Yin, Danica Kragic, Mårten Björkman

Lyrics often convey information about a song that goes beyond the auditory dimension, enriching the semantic meaning of movements and musical themes. Such insights are important in the dance choreography domain. However, most existing dance synthesis methods focus on music-to-dance generation and ignore this semantic information. To fill this gap, we introduce JustLMD, a new multimodal dataset of 3D dance motion with music and lyrics. To the best of our knowledge, this is the first dataset with triplet information comprising dance motion, music, and lyrics. Additionally, we showcase a cross-modal diffusion-based network designed to generate 3D dance motion conditioned on music and lyrics. The proposed JustLMD dataset encompasses 4.6 hours of 3D dance motion in 1,867 sequences, accompanied by musical tracks and their corresponding English lyrics.
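
For illustration only, a toy single training step of a diffusion-style denoiser conditioned on both a music embedding and a lyrics embedding. The network, feature dimensions, and noise schedule are invented for the sketch and are not the JustLMD architecture.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts noise on a motion frame given music + lyrics features."""
    def __init__(self, motion_dim=63, music_dim=32, lyrics_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + music_dim + lyrics_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, t, music_emb, lyrics_emb):
        # Concatenate the noisy motion with both conditioning signals and the timestep.
        h = torch.cat([noisy_motion, music_emb, lyrics_emb, t], dim=-1)
        return self.net(h)

# One DDPM-style training step (simplified, per-frame rather than per-sequence).
model = ConditionalDenoiser()
motion = torch.randn(8, 63)          # ground-truth motion features (placeholder)
music = torch.randn(8, 32)           # music embedding (placeholder)
lyrics = torch.randn(8, 32)          # lyrics embedding (placeholder)
t = torch.rand(8, 1)                 # continuous timestep in [0, 1]
alpha = 1.0 - t                      # toy noise schedule
noise = torch.randn_like(motion)
noisy = alpha.sqrt() * motion + (1 - alpha).sqrt() * noise
loss = ((model(noisy, t, music, lyrics) - noise) ** 2).mean()
loss.backward()
print(float(loss))
```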


Towards Integrated Traffic Control with Operating Decentralized Autonomous Organization

Jul 25, 2023
Shengyue Yao, Jingru Yu, Yi Yu, Jia Xu, Xingyuan Dai, Honghai Li, Fei-Yue Wang, Yilun Lin

With the growing complexity of intelligent traffic systems (ITS), integrated control that can coordinate a large number of heterogeneous intelligent agents is desired. However, existing control methods based on centralized or decentralized schemes have not achieved optimality and scalability simultaneously. To address this issue, we propose an integrated control method based on the framework of the Decentralized Autonomous Organization (DAO). The proposed method achieves a global consensus on energy consumption efficiency (ECE) while optimizing the local objectives of all involved intelligent agents through a consensus and incentive mechanism. Furthermore, an operation algorithm is proposed to address the issue of structural rigidity in the DAO. Specifically, the proposed operation approach identifies critical agents to execute the smart contract in the DAO, which ultimately extends the capability of DAO-based control. In addition, a numerical experiment is designed to examine the performance of the proposed method. The experimental results indicate that, compared to existing decentralized control methods, the controlled agents reach consensus on the global objective faster while improving their local objectives. In general, the proposed method shows great potential for developing an integrated control system for ITS.

* 6 pages, 6 figures. To be published in 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC) 
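
A small numerical sketch of the consensus-with-incentive idea described above: agents average their local ECE estimates over a communication graph while an incentive term pulls each agent toward its own objective. The topology, weights, and incentive value are made up for illustration and do not reflect the paper's smart-contract mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents = 6
ece = rng.uniform(0.4, 0.9, n_agents)         # initial local ECE estimates
local_pref = rng.uniform(0.5, 0.8, n_agents)  # each agent's locally optimal ECE

# Ring communication topology (row-stochastic weight matrix).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    for j in (i - 1, i, i + 1):
        W[i, j % n_agents] = 1.0 / 3.0

incentive = 0.1  # weight pulling agents toward their local objectives
for _ in range(50):
    consensus_part = W @ ece                  # average with neighbours
    ece = (1 - incentive) * consensus_part + incentive * local_pref

print("spread of ECE estimates after consensus:", ece.max() - ece.min())
```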

Modify Training Directions in Function Space to Reduce Generalization Error

Jul 25, 2023
Yi Yu, Wenlian Lu, Boyu Chen

We propose a theoretical analysis of a modified natural gradient descent method in the neural network function space, based on the eigendecompositions of the neural tangent kernel and the Fisher information matrix. We first present an analytical expression for the function learned by this modified natural gradient under the assumptions of a Gaussian distribution and the infinite-width limit. We then explicitly derive the generalization error of the learned neural network function using tools from eigendecomposition and statistical theory. By decomposing the total generalization error into contributions from different eigenspaces of the kernel in function space, we propose a criterion for balancing the error stemming from the training set against the error from the distribution discrepancy between the training set and the true data. Through this approach, we establish that modifying the training direction of the neural network in function space reduces the total generalization error. Furthermore, we demonstrate that this theoretical framework can explain many existing results on generalization-enhancing methods. These theoretical results are also illustrated by numerical examples on synthetic data.
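
A numerical illustration (not the paper's derivation) of the kind of eigenspace decomposition mentioned above: an empirical kernel on training inputs is eigendecomposed and the target's residual energy per eigendirection is examined, using a random-features kernel as a stand-in for the neural tangent kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 5
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])                                # target function values

# Fixed random-features kernel as a stand-in for the neural tangent kernel.
W = rng.standard_normal((d, 200))
features = np.tanh(X @ W)
K = features @ features.T / features.shape[1]

eigvals, eigvecs = np.linalg.eigh(K)               # ascending eigenvalues
coeffs = eigvecs.T @ y                             # target projected onto eigenvectors

# Gradient descent in function space learns large-eigenvalue directions fastest;
# after time t, the unlearned residual in direction i scales like exp(-eigval_i * t).
t = 5.0
residual_energy = (coeffs ** 2) * np.exp(-2.0 * eigvals * t)
print("total squared training residual ~", residual_energy.sum())
```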


ExposureDiffusion: Learning to Expose for Low-light Image Enhancement

Jul 15, 2023
Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen

Previous raw-image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they fail to capture critical distributional information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Unlike a vanilla diffusion model, which has to perform Gaussian denoising, our restoration process with the injected physics-based exposure model can start directly from a noisy image instead of pure noise. As a result, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out side effects of the iterative refinement when the intermediate results are already well exposed. The proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements across different exposure models and backbones. Moreover, the proposed method generalizes better to unseen amplification ratios and outperforms a larger feed-forward neural model when few parameters are used.

* accepted by ICCV2023 
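
A minimal sketch of two ideas named in the abstract: starting iterative refinement from the amplified noisy raw image rather than pure noise, and gating the refinement with a per-pixel residual weight. The network, channel counts, and amplification ratio are assumptions for illustration, not the ExposureDiffusion implementation.

```python
import torch
import torch.nn as nn

class RefineStep(nn.Module):
    """Toy one-step refiner used inside an iterative restoration loop."""
    def __init__(self, ch=4):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, ch, 3, padding=1))
        # Adaptive residual gate: predicts per-pixel how much refinement to apply.
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        update = self.body(x)
        g = self.gate(x)             # small where x already looks well exposed
        return x + g * update

# Start from the amplified noisy raw image rather than pure Gaussian noise.
noisy_raw = torch.rand(1, 4, 32, 32) * 0.1   # dark, noisy 4-channel raw patch (toy)
amplify = 8.0                                 # hypothetical exposure/amplification ratio
x = noisy_raw * amplify
steps = [RefineStep() for _ in range(3)]
with torch.no_grad():
    for step in steps:
        x = step(x)
print(x.shape)
```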

Beyond Learned Metadata-based Raw Image Reconstruction

Jun 21, 2023
Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen

While raw images have distinct advantages over sRGB images, e.g., linearity and fine-grained quantization levels, they are not widely adopted by general users due to their substantial storage requirements. Very recent studies propose to compress raw images by designing sampling masks within the pixel space of the raw image. However, these approaches often leave room for more effective image representations and more compact metadata. In this work, we propose a novel framework that learns a compact representation in the latent space, serving as metadata, in an end-to-end manner. Compared with lossy image compression, we analyze the intrinsic difference of the raw image reconstruction task caused by the rich information available from the sRGB image. Based on this analysis, we propose a novel backbone design with asymmetric and hybrid spatial feature resolutions, which significantly improves the rate-distortion performance. Besides, we propose a novel context model that better predicts the order masks for encoding/decoding based on both the sRGB image and the masks of already processed features. Benefiting from this better modeling of the correlation between order masks, the already processed information can be better utilized. Moreover, a novel sRGB-guided adaptive quantization precision strategy, which dynamically assigns varying levels of quantization precision to different regions, further enhances the representation ability of the model. Finally, based on the iterative properties of the proposed context model, we propose a novel strategy to achieve variable bit rates with a single model. This strategy allows for continuous convergence over a wide range of bit rates. Extensive experimental results demonstrate that the proposed method achieves better reconstruction quality with a smaller metadata size.
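
A rough sketch of what "sRGB-guided adaptive quantization precision" could look like: regions where the rendered sRGB image has more texture receive more bits for the latent metadata. The block size, three-level bit policy, and thresholds are invented for illustration and are not the paper's strategy.

```python
import numpy as np

rng = np.random.default_rng(2)
latent = rng.standard_normal((64, 64))          # latent metadata to be stored (toy)
srgb_gray = rng.random((64, 64))                # luminance of the rendered sRGB (toy)

def block_variance(img, bs=8):
    """Per-block variance, upsampled back to the pixel grid."""
    h, w = img.shape
    v = img.reshape(h // bs, bs, w // bs, bs).var(axis=(1, 3))
    return np.kron(v, np.ones((bs, bs)))

var_map = block_variance(srgb_gray)
# More bits where the sRGB image has more texture (hypothetical 3-level policy).
bits = np.where(var_map > np.quantile(var_map, 0.66), 8,
        np.where(var_map > np.quantile(var_map, 0.33), 6, 4))

levels = 2.0 ** bits
scale = latent.max() - latent.min() + 1e-8
quantized = np.round((latent - latent.min()) / scale * (levels - 1))
print("average bits per latent element:", bits.mean())
```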


Controllable Lyrics-to-Melody Generation

Jun 05, 2023
Zhe Zhang, Yi Yu, Atsuhiro Takasu

Lyrics-to-melody generation is an interesting and challenging topic in the field of AI music research. Due to the difficulty of learning the correlations between lyrics and melody, previous methods suffer from low generation quality and a lack of controllability. Controllability of generative models enables human interaction with the model to generate desired content, which is especially important for music generation tasks aiming at human-centered AI that can assist musicians in creative activities. To address these issues, we propose ConL2M, a controllable lyrics-to-melody generation network that generates realistic melodies from lyrics in a user-desired musical style. Our work contains three main novelties: 1) to model the dependencies of music attributes across multiple sequences, inter-branch memory fusion (Memofu) is proposed to enable information flow between the branches of a multi-branch stacked LSTM architecture; 2) a reference style embedding (RSE) is proposed to improve generation quality as well as to control the musical style of the generated melodies; 3) a sequence-level statistical loss (SeqLoss) is proposed to help the model learn sequence-level features of melodies given lyrics. Evaluated with metrics for music quality and controllability, this initial study of controllable lyrics-to-melody generation shows better generation quality and the feasibility of interacting with users to generate melodies in desired musical styles for given lyrics.
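
A minimal sketch of the inter-branch memory idea: two LSTM branches (say, pitch and duration) process the same lyric features while their cell states are mixed at every step so information flows between branches. The layer sizes, branch choice, and fusion rule are assumptions for illustration, not the Memofu module itself.

```python
import torch
import torch.nn as nn

class TwoBranchMemoryFusion(nn.Module):
    """Two LSTMCell branches that exchange cell-state information at every step."""
    def __init__(self, in_dim=16, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.pitch_cell = nn.LSTMCell(in_dim, hidden)
        self.dur_cell = nn.LSTMCell(in_dim, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)   # mixes the two cell states

    def forward(self, lyric_feats):                 # (batch, steps, in_dim)
        b, steps, _ = lyric_feats.shape
        hp = cp = hd = cd = lyric_feats.new_zeros(b, self.hidden)
        outs = []
        for t in range(steps):
            x = lyric_feats[:, t]
            hp, cp = self.pitch_cell(x, (hp, cp))
            hd, cd = self.dur_cell(x, (hd, cd))
            fused = torch.tanh(self.fuse(torch.cat([cp, cd], dim=-1)))
            cp, cd = fused, fused                   # shared memory flows to both branches
            outs.append(torch.stack([hp, hd], dim=1))
        return torch.stack(outs, dim=1)             # (batch, steps, 2, hidden)

model = TwoBranchMemoryFusion()
print(model(torch.randn(4, 10, 16)).shape)
```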


MIPI 2023 Challenge on RGB+ToF Depth Completion: Methods and Results

Apr 27, 2023
Qingpeng Zhu, Wenxiu Sun, Yuekun Dai, Chongyi Li, Shangchen Zhou, Ruicheng Feng, Qianhui Sun, Chen Change Loy, Jinwei Gu, Yi Yu, Yangke Huang, Kang Zhang, Meiya Chen, Yu Wang, Yongchao Li, Hao Jiang, Amrit Kumar Muduli, Vikash Kumar, Kunal Swami, Pankaj Kumar Bajpai, Yunchao Ma, Jiajun Xiao, Zhi Ling

Depth completion from RGB images and sparse Time-of-Flight (ToF) measurements is an important problem in computer vision and robotics. While traditional methods for depth completion have relied on stereo vision or structured light techniques, recent advances in deep learning have enabled more accurate and efficient completion of depth maps from RGB images and sparse ToF measurements. To evaluate the performance of different depth completion methods, we organized an RGB+sparse ToF depth completion competition. The competition aimed to encourage research in this area by providing a standardized dataset and evaluation metrics to compare the accuracy of different approaches. In this report, we present the results of the competition and analyze the strengths and weaknesses of the top-performing methods. We also discuss the implications of our findings for future research in RGB+sparse ToF depth completion. We hope that this competition and report will help to advance the state-of-the-art in this important area of research. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2023.

* arXiv admin note: substantial text overlap with arXiv:2209.07057 

H2RBox-v2: Boosting HBox-supervised Oriented Object Detection via Symmetric Learning

Apr 11, 2023
Yi Yu, Xue Yang, Qingyun Li, Yue Zhou, Gefan Zhang, Feipeng Da, Junchi Yan

With the increasing demand for oriented object detection, e.g., in autonomous driving and remote sensing, oriented annotation has become a labor-intensive task. To make full use of existing horizontally annotated datasets and reduce the annotation cost, a weakly-supervised detector, H2RBox, which learns the rotated box (RBox) from the horizontal box (HBox), has been proposed and has received great attention. This paper presents a new version, H2RBox-v2, to further bridge the gap between HBox-supervised and RBox-supervised oriented object detection. Guided by our theoretical analysis of exploiting axisymmetry via flipping and rotating consistencies, H2RBox-v2 couples a weakly-supervised branch similar to that of H2RBox with a novel self-supervised branch that learns orientations from the symmetry inherent in the image of objects. Complemented by modules that cope with peripheral issues, e.g., angular periodicity, a stable and effective solution is achieved. To our knowledge, H2RBox-v2 is the first symmetry-supervised paradigm for oriented object detection. Compared to H2RBox, our method is less susceptible to low annotation quality and insufficient training data, and in such cases it is expected to give competitive performance much closer to fully-supervised oriented object detectors. Specifically, the performance comparison between H2RBox-v2 and Rotated FCOS on DOTA-v1.0/1.5/2.0 is 72.31%/64.76%/50.33% vs. 72.44%/64.53%/51.77%, 89.66% vs. 88.99% on HRSC, and 42.27% vs. 41.25% on FAIR1M.

* 13 pages, 4 figures, 7 tables, the source code is available at https://github.com/open-mmlab/mmrotate 
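
For illustration, a toy flip-consistency objective in the spirit of the symmetry-based self-supervision described above: the angle predicted for a horizontally flipped image should be the negation of the original, modulo pi. The angle head, wrapping rule, and image sizes are invented for the sketch and are not the H2RBox-v2 detector.

```python
import torch
import torch.nn as nn

def angle_consistency_loss(theta, theta_flipped):
    """Flip-consistency sketch: theta + theta_flipped should be ~0 (mod pi)."""
    diff = theta + theta_flipped
    # Wrap to [-pi/2, pi/2) so the loss respects angular periodicity.
    wrapped = torch.remainder(diff + torch.pi / 2, torch.pi) - torch.pi / 2
    return (wrapped ** 2).mean()

# Tiny stand-in for a self-supervised orientation branch.
angle_head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 1))
img = torch.randn(8, 3, 16, 16)
flipped = torch.flip(img, dims=[-1])                 # horizontal flip
loss = angle_consistency_loss(angle_head(img), angle_head(flipped))
loss.backward()
print(float(loss))
```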

Emotionally Enhanced Talking Face Generation

Mar 26, 2023
Sahil Goyal, Shagun Uppal, Sarthak Bhagat, Yi Yu, Yifang Yin, Rajiv Ratn Shah

Several works have developed end-to-end pipelines for generating lip-synced talking faces with various real-world applications, such as teaching and language translation in videos. However, these prior works fail to create realistic-looking videos since they focus little on people's expressions and emotions. Moreover, these methods' effectiveness largely depends on the faces in the training dataset, which means they may not perform well on unseen faces. To mitigate this, we build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions, making them more realistic and convincing. With a broad range of six emotions, i.e., \emph{happiness}, \emph{sadness}, \emph{fear}, \emph{anger}, \emph{disgust}, and \emph{neutral}, we show that our model can adapt to arbitrary identities, emotions, and languages. Our proposed framework is equipped with a user-friendly web interface with a real-time experience for talking face generation with emotions. We also conduct a user study for subjective evaluation of our interface's usability, design, and functionality. Project page: https://midas.iiitd.edu.in/emo/
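
A toy sketch of categorical emotion conditioning as described above: a decoder fuses an identity embedding, an audio embedding, and an embedded emotion label to produce a frame. All module names, feature sizes, and the flattened 32x32 output are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "sadness", "fear", "anger", "disgust", "neutral"]

class EmotionConditionedDecoder(nn.Module):
    """Toy decoder: identity + audio + categorical emotion -> mouth-region frame."""
    def __init__(self, id_dim=64, audio_dim=32, out_pixels=32 * 32):
        super().__init__()
        self.emotion_embed = nn.Embedding(len(EMOTIONS), 16)
        self.decode = nn.Sequential(
            nn.Linear(id_dim + audio_dim + 16, 256), nn.ReLU(),
            nn.Linear(256, out_pixels), nn.Sigmoid(),
        )

    def forward(self, id_emb, audio_emb, emotion_idx):
        e = self.emotion_embed(emotion_idx)
        return self.decode(torch.cat([id_emb, audio_emb, e], dim=-1))

model = EmotionConditionedDecoder()
frame = model(torch.randn(2, 64), torch.randn(2, 32),
              torch.tensor([EMOTIONS.index("happiness"), EMOTIONS.index("anger")]))
print(frame.shape)   # (2, 1024): flattened 32x32 crop in this sketch
```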
