Abstract: Automatic Singing Assessment and Singing Information Processing have evolved over the past three decades to support singing pedagogy, performance analysis, and vocal training. While the former objectively evaluates a singer's performance through computational metrics ranging from real-time visual feedback and acoustical biofeedback to sophisticated pitch tracking and spectral analysis, the latter compares a predictor vocal signal with a target reference to capture nuanced information embedded in the singing voice. Notable advancements include the development of interactive systems that have significantly improved real-time visual feedback, and the integration of machine learning and deep neural network architectures that enhance the precision of vocal signal processing. This survey critically examines the literature to map the historical evolution of these technologies while identifying and discussing key gaps. The analysis reveals persistent challenges, such as the lack of standardized evaluation frameworks, difficulties in reliably separating vocal signals from various noise sources, and the underutilization of advanced digital signal processing and artificial intelligence methodologies for capturing artistic expressivity. By detailing these limitations and the corresponding technological advances, this review demonstrates how addressing these issues can bridge the gap between objective computational assessments and subjective human-like evaluations of singing performance, ultimately enhancing both the technical accuracy and pedagogical relevance of automated singing evaluation systems.
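The abstract above mentions pitch-tracking-based comparison of a sung performance against a target reference. The sketch below is a minimal, hypothetical illustration of that idea only, not a method from the surveyed systems: it extracts F0 contours with pYIN and scores the deviation in cents. It assumes librosa and numpy are installed, and the file names and scoring rule are invented for the example.

```python
# Illustrative sketch (not from the surveyed literature): compare a student's
# sung pitch contour against a reference rendition via pitch tracking.
# Assumes librosa and numpy are available; file names are hypothetical.
import numpy as np
import librosa

def pitch_contour(path, fmin=80.0, fmax=1000.0):
    """Extract an F0 contour (Hz) with pYIN; unvoiced frames become NaN."""
    y, sr = librosa.load(path, sr=None, mono=True)
    f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    return f0

def mean_cent_error(f0_student, f0_reference):
    """Mean absolute pitch deviation in cents over frames voiced in both signals."""
    n = min(len(f0_student), len(f0_reference))
    s, r = f0_student[:n], f0_reference[:n]
    voiced = ~np.isnan(s) & ~np.isnan(r)
    cents = 1200.0 * np.log2(s[voiced] / r[voiced])
    return float(np.mean(np.abs(cents)))

if __name__ == "__main__":
    student = pitch_contour("student_take.wav")      # hypothetical recordings
    reference = pitch_contour("reference_take.wav")
    print(f"Mean pitch deviation: {mean_cent_error(student, reference):.1f} cents")
```

A cent-based error of this kind is one simple objective metric; the survey itself discusses the gap between such metrics and human-like, expressivity-aware evaluation.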
Abstract: One key aspect differentiating data-driven single- and multi-channel speech enhancement and dereverberation methods is that both the problem formulation and the solutions are considerably more complex in the latter case. Additionally, with limited computational resources, it is cumbersome to train models that require larger datasets or more complex designs. In this scenario, the unverified hypothesis that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently holds significant implications: it improves compatibility between sound-scene capture and system input-output formats, while also allowing modern research to focus on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study verifies this hypothesis by comparing the enhancement achieved by a basic single-channel speech enhancement and dereverberation model with two multi-channel models tailored to separate clean speech from noisy 3D mixes. A direction-of-arrival estimation model was used to objectively evaluate each model's capacity to preserve spatial information by comparing the output signals with ground-truth coordinate values. Consequently, a trade-off arises: the more straightforward single-channel solution better preserves spatial information, at the cost of lower gains in intelligibility scores.
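The evaluation protocol summarized above compares direction-of-arrival estimates on the enhanced output against ground-truth coordinates. The sketch below illustrates that check only; the study itself used a learned DOA estimation model, whereas here a classical GCC-PHAT estimate for a two-microphone pair stands in purely for illustration. Microphone spacing, sample rate, and the usage snippet are hypothetical assumptions.

```python
# Illustrative spatial-preservation check (stand-in for the study's DOA model):
# estimate the azimuth of the enhanced signals via GCC-PHAT and compare it with
# the ground-truth angle. Array geometry and sample rate are assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gcc_phat(sig, ref, fs, max_tau):
    """Estimate the inter-channel time delay (seconds) with GCC-PHAT."""
    n = sig.size + ref.size
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def doa_azimuth(ch_left, ch_right, fs, mic_distance=0.1):
    """Map the estimated delay between two channels to an azimuth in degrees."""
    max_tau = mic_distance / SPEED_OF_SOUND
    tau = gcc_phat(ch_left, ch_right, fs, max_tau)
    return np.degrees(np.arcsin(np.clip(tau / max_tau, -1.0, 1.0)))

def angular_error(est_deg, true_deg):
    """Absolute angular error wrapped to [0, 180] degrees."""
    return abs((est_deg - true_deg + 180.0) % 360.0 - 180.0)

# Hypothetical usage: `enhanced` holds two channels of the channel-wise enhanced
# mix, and `true_azimuth` comes from the dataset's ground-truth coordinates.
# err = angular_error(doa_azimuth(enhanced[:, 0], enhanced[:, 1], fs=16000), true_azimuth)
```

Averaging this angular error over the test set gives one concrete way to quantify how much spatial information survives channel-independent single-channel processing, alongside the intelligibility scores mentioned in the abstract.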