Recently, researchers have shown increasing interest in automatically predicting subjective evaluations of speech synthesis systems. This prediction is a challenging task, especially on out-of-domain test sets. In this paper, we propose a novel fusion model for MOS prediction that combines supervised and unsupervised approaches. On the supervised side, we develop an SSL-based predictor called LE-SSL-MOS, which builds on pre-trained self-supervised learning models and further improves prediction accuracy by exploiting the per-utterance opinion scores in a listener enhancement branch. On the unsupervised side, we take two steps: first, we fine-tune the unit language model (ULM) on highly intelligible domain data to improve the correlation of an unsupervised metric, SpeechLMScore; second, we use ASR confidence as an additional metric via ensemble learning. To our knowledge, this is the first architecture that fuses supervised and unsupervised methods for MOS prediction. Our experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS outperforms the baseline, and our fusion system achieves an absolute improvement of 13% over LE-SSL-MOS on the noisy and enhanced speech track. Our system ranked 1st in the French speech synthesis track and 2nd in the challenge's noisy and enhanced speech track.
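For illustration, here is a minimal sketch of the fusion step under our own assumptions (the score names, the linear form, and the dev-set fit are illustrative, not the authors' released system): the supervised LE-SSL-MOS score, SpeechLMScore, and ASR confidence are combined with weights fitted on labeled development data.

```python
# Hypothetical linear ensemble over three per-utterance scores; all names
# and data below are illustrative stand-ins, not the paper's actual system.
import numpy as np

def fit_fusion_weights(dev_scores: np.ndarray, dev_mos: np.ndarray) -> np.ndarray:
    """dev_scores: (N, 3) columns = [ssl_mos, speechlm_score, asr_confidence]."""
    # Least-squares fit of fusion weights (plus bias) on a labeled dev set.
    X = np.hstack([dev_scores, np.ones((len(dev_scores), 1))])
    w, *_ = np.linalg.lstsq(X, dev_mos, rcond=None)
    return w

def fuse(scores: np.ndarray, w: np.ndarray) -> np.ndarray:
    X = np.hstack([scores, np.ones((len(scores), 1))])
    return X @ w

# Toy usage with random stand-in scores.
rng = np.random.default_rng(0)
dev = rng.normal(size=(100, 3))
mos = rng.uniform(1, 5, size=100)
w = fit_fusion_weights(dev, mos)
print(fuse(dev[:5], w))
```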
Current deep models provide remarkable object detection in terms of object classification and localization. However, estimating the rotation of an object with respect to other objects in the visual context of an input image remains understudied, owing to the unavailability of object datasets with rotation annotations. This paper tackles these two challenges by solving the rotation estimation of a parked bike with respect to its parking area. First, we leverage the power of 3D graphics to build a camera-agnostic, well-annotated Synthetic Bike Rotation Dataset (SynthBRSet). Then, we propose an object-to-spot rotation estimator (OSRE) that extends the object detection task to additionally regress the bike rotation about two axes. Since our model is trained purely on synthetic data, we apply image smoothing techniques when deploying it on real-world images. The proposed OSRE is evaluated on synthetic and real-world data and yields promising results. Our data and code are available at \href{https://github.com/saghiralfasly/OSRE-Project}{https://github.com/saghiralfasly/OSRE-Project}.
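As a hedged sketch of the general idea (not the released OSRE code; the head layout, feature dimension, and (sin, cos) parameterization are assumptions), a detection head can be extended with a branch that regresses rotation about two axes:

```python
# Illustrative detection head with a two-axis rotation branch. Predicting
# (sin, cos) pairs avoids the wrap-around discontinuity of raw angles.
import torch
import torch.nn as nn

class RotationAwareHead(nn.Module):
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(in_dim, num_classes)   # object class logits
        self.box = nn.Linear(in_dim, 4)             # box regression (cx, cy, w, h)
        self.rot = nn.Linear(in_dim, 4)             # (sin, cos) for each of two axes

    def forward(self, feats: torch.Tensor):
        sincos = torch.tanh(self.rot(feats)).view(-1, 2, 2)
        angles = torch.atan2(sincos[..., 0], sincos[..., 1])  # radians per axis
        return self.cls(feats), self.box(feats), angles

head = RotationAwareHead(in_dim=256, num_classes=2)
cls, box, angles = head(torch.randn(8, 256))
print(angles.shape)  # (8, 2): one rotation per axis per detection
```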
In many applications, it is necessary to retrieve the sub-signal building blocks of a multi-component signal, which is usually non-stationary in real-world applications. Empirical mode decomposition (EMD), the synchrosqueezing transform (SST), the signal separation operation (SSO), and iterative filtering decomposition (IFD) have been proposed and developed for this purpose. However, these computational methods are restricted by the requirement that the frequency curves of the sub-signals be well separated. On the other hand, the chirplet transform-based signal separation scheme (CT3S), which extends SSO from the two-dimensional "time-frequency" plane to the three-dimensional "time-frequency-chirp rate" space, was proposed in our recent work to remove the frequency-separation requirement and thereby allow "frequency crossing". The main objective of the present paper is to carry out an in-depth error analysis of instantaneous frequency estimation and component recovery for the CT3S method.
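To make the lifting from the time-frequency plane to time-frequency-chirp-rate space concrete, here is a toy discretized chirplet transform (the Gaussian window, the grids, and the normalization are illustrative choices, not the paper's exact scheme):

```python
# Toy chirplet transform: correlate the signal with Gaussian-windowed linear
# chirplets over a (time, frequency, chirp-rate) grid. Illustrative only.
import numpy as np

def chirplet_transform(x, t, centers, freqs, chirp_rates, sigma=0.05):
    """Return |CT| of shape (len(centers), len(freqs), len(chirp_rates))."""
    dt = t[1] - t[0]
    out = np.empty((len(centers), len(freqs), len(chirp_rates)), dtype=complex)
    for i, tc in enumerate(centers):
        win = np.exp(-0.5 * ((t - tc) / sigma) ** 2)        # Gaussian window
        for j, f in enumerate(freqs):
            for k, c in enumerate(chirp_rates):
                # chirplet with carrier f and linear sweep rate c around tc
                phase = 2j * np.pi * (f * (t - tc) + 0.5 * c * (t - tc) ** 2)
                out[i, j, k] = np.sum(x * win * np.exp(-phase)) * dt
    return np.abs(out)

t = np.linspace(0, 1, 2000)
x = np.cos(2 * np.pi * (50 * t + 40 * t**2))   # linear chirp, IF = 50 + 80 t
freqs = np.arange(40, 200, 5.0)
rates = np.arange(0, 160, 20.0)
mag = chirplet_transform(x, t, centers=[0.5], freqs=freqs, chirp_rates=rates)
j, k = np.unravel_index(mag[0].argmax(), mag[0].shape)
print("estimated IF at t=0.5:", freqs[j], "Hz; chirp rate:", rates[k], "Hz/s")
# Expected peak: 90 Hz and 80 Hz/s, matching the true IF and chirp rate.
```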
Under the assumption that a video dataset is multimodally annotated, i.e., both the auditory and visual modalities are labeled or class-relevant, current multimodal methods apply modality fusion or cross-modality attention. However, effectively leveraging the audio modality in vision-specific annotated videos for action recognition is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-specific annotated dataset. We adopt language models (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels; SAVLD thus serves as a bridge between audio and video datasets. Then, SAVLD, along with a pretrained audio multi-label model, is used to estimate the audio-visual modality relevance during the training phase. Accordingly, a novel learnable irrelevant modality dropout (IMD) is proposed to completely drop out the irrelevant audio modality and fuse only the relevant modalities. Moreover, we present a new two-stream video Transformer for efficiently modeling the visual modalities. Results on several vision-specific annotated datasets, including Kinetics-400 and UCF-101, validate our framework, as it outperforms most relevant action recognition methods.
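A minimal sketch of the dropout idea under stated assumptions (the feature shapes, gate network, and straight-through trick are illustrative; the paper's IMD may differ): a relevance score derived from the label dictionary gates the audio features to a hard 0/1 decision while keeping gradients soft.

```python
# Illustrative relevance-gated audio dropout: irrelevant audio is zeroed
# before fusion; a straight-through estimator keeps the gate learnable.
import torch
import torch.nn as nn

class IrrelevantModalityDropout(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim + 1, 1), nn.Sigmoid())

    def forward(self, audio_feat, relevance):
        # relevance: (B, 1) score, e.g. from an audio-video label dictionary
        g = self.gate(torch.cat([audio_feat, relevance], dim=-1))
        hard = (g > 0.5).float()
        g = hard + g - g.detach()   # straight-through: hard 0/1, soft gradients
        return audio_feat * g       # irrelevant audio is completely dropped

audio = torch.randn(4, 512)
rel = torch.tensor([[0.9], [0.1], [0.7], [0.0]])
gated = IrrelevantModalityDropout(512)(audio, rel)
print(gated.norm(dim=-1))  # some rows gated to zero before fusion
```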
In this paper, we propose a new GPU implementation of the Katsevich algorithm for helical CT reconstruction. Our implementation divides the sinograms and reconstructs the CT images pitch by pitch. By exploiting the periodicity of the parameters of the Katsevich algorithm, our method needs to calculate these parameters only once for all the pitches, and thus has a lower GPU-memory burden and is well suited for deep learning. By embedding our implementation into a network, we propose an end-to-end deep network for high-pitch helical CT reconstruction with sparse detectors. Since our network utilizes features extracted from both sinograms and CT images, it can simultaneously reduce the streak artifacts caused by the sparsity of the sinograms and preserve fine details in the CT images. Experiments show that our network outperforms related methods in both subjective and objective evaluations.
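The pitch-by-pitch reuse can be sketched schematically (this is not the GPU implementation; the parameter table and the "backprojection" below are placeholders standing in for the real filtered-backprojection steps):

```python
# Schematic sketch only: Katsevich parameters repeat with the helix period,
# so they are precomputed once and reused for every pitch of the sinogram.
import numpy as np

def precompute_pitch_params(n_views_per_pitch, n_det):
    # Stand-in for the per-view weights/interpolation tables that the real
    # algorithm derives from the helix geometry; periodic over one pitch.
    angles = 2 * np.pi * np.arange(n_views_per_pitch) / n_views_per_pitch
    return np.cos(angles)[:, None] * np.ones((1, n_det))

def reconstruct(sinogram, n_views_per_pitch, n_det):
    params = precompute_pitch_params(n_views_per_pitch, n_det)  # computed once
    n_pitches = sinogram.shape[0] // n_views_per_pitch
    volume = []
    for p in range(n_pitches):                    # reconstruct pitch by pitch
        chunk = sinogram[p * n_views_per_pitch:(p + 1) * n_views_per_pitch]
        volume.append((chunk * params).sum(axis=0))  # placeholder backprojection
    return np.stack(volume)

sino = np.random.rand(8 * 360, 256)
print(reconstruct(sino, 360, 256).shape)  # (8, 256): one block per pitch
```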
Recently, it has been demonstrated that the performance of a deep convolutional neural network can be effectively improved by embedding an attention module into it. In this work, a novel, lightweight, and effective attention module named Pyramid Split Attention (PSA) is proposed. By replacing the 3x3 convolution with the PSA module in the bottleneck blocks of ResNet, a novel representational block named Efficient Pyramid Split Attention (EPSA) is obtained. The EPSA block can be easily added as a plug-and-play component to a well-established backbone network, yielding significant improvements in model performance. Hence, a simple and efficient backbone architecture named EPSANet is developed in this work by stacking these ResNet-style EPSA blocks. Correspondingly, the proposed EPSANet offers stronger multi-scale representation ability for various computer vision tasks, including but not limited to image classification, object detection, and instance segmentation. Without bells and whistles, the proposed EPSANet outperforms most state-of-the-art channel attention methods: compared with SENet-50, the Top-1 accuracy on the ImageNet dataset is improved by 1.93%, and with Mask R-CNN on the MS-COCO dataset we obtain a larger margin of +2.7 box AP for object detection and an improvement of +1.7 mask AP for instance segmentation. Our source code is available at: https://github.com/murufeng/EPSANet.
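A simplified sketch of a PSA-style block consistent with the description above (the kernel sizes, reduction ratio, and grouping are illustrative, not the exact released configuration): the input channels are split, processed at multiple kernel sizes, and recalibrated by a softmax over per-scale SE-style attention.

```python
# Simplified PSA-style module: channel split -> multi-scale convs ->
# per-scale channel attention -> softmax recalibration across scales.
import torch
import torch.nn as nn

class PSAModule(nn.Module):
    def __init__(self, channels: int, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.splits = len(kernels)
        c = channels // self.splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2) for k in kernels)
        self.se = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1),
                          nn.Conv2d(c, c // 4, 1), nn.ReLU(),
                          nn.Conv2d(c // 4, c, 1)) for _ in kernels)

    def forward(self, x):
        chunks = x.chunk(self.splits, dim=1)
        feats = [conv(c) for conv, c in zip(self.convs, chunks)]  # multi-scale
        attn = torch.stack([se(f) for se, f in zip(self.se, feats)], dim=1)
        attn = torch.softmax(attn, dim=1)        # recalibrate across scales
        out = [f * attn[:, i] for i, f in enumerate(feats)]
        return torch.cat(out, dim=1)

x = torch.randn(2, 64, 32, 32)
print(PSAModule(64)(x).shape)  # (2, 64, 32, 32): drop-in for a 3x3 conv
```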
Graphs play a crucial role in many fields, since they are powerful tools for unveiling intrinsic relationships among signals. In many scenarios, an accurate graph structure representing the signals is not available at all, which motivates learning a reliable graph structure directly from the observed signals. However, in practice, the observed signals inevitably contain uncertainty due to noisy measurements or limited observability, which reduces the reliability of the learned graph. To this end, we propose a graph learning framework based on Wasserstein distributionally robust optimization (WDRO), which handles uncertainty in the data by defining an uncertainty set over the distributions of the observed data. Specifically, two models are developed: one assumes all distributions in the uncertainty set are Gaussian, and the other makes no prior distributional assumption. Instead of applying the interior-point method directly, we propose two algorithms to solve the corresponding models and show that our algorithms are more time-efficient. In addition, we reformulate both models as semi-definite programs (SDPs) and show that these reformulations are intractable for large-scale graphs. Experiments on both synthetic and real-world data are carried out to validate the proposed framework; the results show that our scheme can learn a reliable graph in the presence of uncertainty.
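As a loose toy illustration of the robust idea (not the paper's WDRO algorithms; the covariance inflation and the kernel heuristic below are our own stand-ins): guarding against the worst case in an uncertainty ball can be mimicked by inflating the sample covariance before edge weights are read off.

```python
# Toy stand-in for distributionally robust graph learning: inflate the
# empirical covariance (a crude worst-case surrogate), then derive edge
# weights from the implied pairwise distances via a Gaussian kernel.
import numpy as np

def robust_graph_weights(X, rho=0.1):
    """X: (n_signals, n_nodes). Returns a nonnegative weight matrix."""
    n = X.shape[1]
    cov = np.cov(X, rowvar=False) + rho * np.eye(n)  # inflated covariance
    # var_i + var_j - 2*cov_ij: small for nodes with smooth joint behavior,
    # which the kernel maps to large edge weights.
    d = np.diag(cov)[:, None] + np.diag(cov)[None, :] - 2 * cov
    W = np.exp(-d / d.mean())
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
W = robust_graph_weights(X)
print(W.shape, float(W.max()))
```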
In this paper, we first present an arc-based algorithm for fan-beam computed tomography (CT) reconstruction by applying Katsevich's helical CT formula to 2D fan-beam CT reconstruction. Then, we propose a new weighting function to deal with the redundant projection data. By extending the weighted arc-based fan-beam algorithm to the circular cone-beam geometry, we also obtain a new FDK-like algorithm for circular cone-beam CT reconstruction. Experiments show that our methods achieve higher PSNR and SSIM than the Parker-weighted conventional fan-beam algorithm and the FDK algorithm for super-short-scan trajectories.
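For context, the standard Parker weighting that our methods are compared against can be written down directly (the new weighting function proposed here differs; this sketch only shows where a redundancy weight enters a short-scan pipeline):

```python
# Standard Parker redundancy weights for short-scan fan-beam data, for
# reference only. beta: source angle in [0, pi + 2*gamma_m]; gamma: fan
# angle in (-gamma_m, gamma_m); all angles in radians.
import numpy as np

def parker_weight(beta, gamma, gamma_m):
    w = np.ones_like(beta * gamma)
    # Smooth ramp-up over the start-of-scan redundant region...
    w1 = np.sin(np.pi / 4 * beta / (gamma_m - gamma)) ** 2
    # ...and a matching ramp-down at the end of the scan.
    w2 = np.sin(np.pi / 4 * (np.pi + 2 * gamma_m - beta) / (gamma_m + gamma)) ** 2
    w = np.where(beta <= 2 * (gamma_m - gamma), w1, w)
    w = np.where(beta >= np.pi - 2 * gamma, w2, w)
    return w

gamma_m = np.deg2rad(10.0)
beta = np.linspace(0.0, np.pi + 2 * gamma_m, 360)[:, None]   # source angles
gamma = np.linspace(-0.99, 0.99, 64)[None, :] * gamma_m      # fan angles
W = parker_weight(beta, gamma, gamma_m)
print(W.shape, float(W.min()), float(W.max()))  # weights lie in [0, 1]
```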