With the rapid development of smart manufacturing, data-driven machinery health management has been of growing attention. In situations where some classes are more difficult to be distinguished compared to others and where classes might be organised in a hierarchy of categories, current DL methods can not work well. In this study, a novel hierarchical cavitation intensity recognition framework using Sub-Main Transfer Network, termed SMTNet, is proposed to classify acoustic signals of valve cavitation. SMTNet model outputs multiple predictions ordered from coarse to fine along a network corresponding to a hierarchy of target cavitation states. Firstly, a data augmentation method based on Sliding Window with Fast Fourier Transform (Swin-FFT) is developed to solve few-shot problem. Secondly, a 1-D double hierarchical residual block (1-D DHRB) is presented to capture sensitive features of the frequency domain valve acoustic signals. Thirdly, hierarchical multi-label tree is proposed to assist the embedding of the semantic structure of target cavitation states into SMTNet. Fourthly, experience filtering mechanism is proposed to fully learn a prior knowledge of cavitation detection model. Finally, SMTNet has been evaluated on two cavitation datasets without noise (Dataset 1 and Dataset 2), and one cavitation dataset with real noise (Dataset 3) provided by SAMSON AG (Frankfurt). The prediction accurcies of SMTNet for cavitation intensity recognition are as high as 95.32%, 97.16% and 100%, respectively. At the same time, the testing accuracies of SMTNet for cavitation detection are as high as 97.02%, 97.64% and 100%. In addition, SMTNet has also been tested for different frequencies of samples and has achieved excellent results of the highest frequency of samples of mobile phones.
With the rapid development of smart manufacturing, data-driven machinery health management has received a growing attention. As one of the most popular methods in machinery health management, deep learning (DL) has achieved remarkable successes. However, due to the issues of limited samples and poor separability of different cavitation states of acoustic signals, which greatly hinder the eventual performance of DL modes for cavitation intensity recognition and cavitation detection. In this work, a novel multi-task learning framework for simultaneous cavitation detection and cavitation intensity recognition framework using 1-D double hierarchical residual networks (1-D DHRN) is proposed for analyzing valves acoustic signals. Firstly, a data augmentation method based on sliding window with fast Fourier transform (Swin-FFT) is developed to alleviate the small-sample issue confronted in this study. Secondly, a 1-D double hierarchical residual block (1-D DHRB) is constructed to capture sensitive features from the frequency domain acoustic signals of valve. Then, a new structure of 1-D DHRN is proposed. Finally, the devised 1-D DHRN is evaluated on two datasets of valve acoustic signals without noise (Dataset 1 and Dataset 2) and one dataset of valve acoustic signals with realistic surrounding noise (Dataset 3) provided by SAMSON AG (Frankfurt). Our method has achieved state-of-the-art results. The prediction accurcies of 1-D DHRN for cavitation intensitys recognition are as high as 93.75%, 94.31% and 100%, which indicates that 1-D DHRN outperforms other DL models and conventional methods. At the same time, the testing accuracies of 1-D DHRN for cavitation detection are as high as 97.02%, 97.64% and 100%. In addition, 1-D DHRN has also been tested for different frequencies of samples and shows excellent results for frequency of samples that mobile phones can accommodate.
Valves are widely used in industrial and domestic pipeline systems. However, during their operation, they may suffer from the occurrence of the cavitation, which can cause loud noise, vibration and damage to the internal components of the valve. Therefore, monitoring the flow status inside valves is significantly beneficial to prevent the additional cost induced by cavitation. In this paper, a novel acoustic signal cavitation detection framework--based on XGBoost with adaptive selection feature engineering--is proposed. Firstly, a data augmentation method with non-overlapping sliding window (NOSW) is developed to solve small-sample problem involved in this study. Then, the each segmented piece of time-domain acoustic signal is transformed by fast Fourier transform (FFT) and its statistical features are extracted to be the input to the adaptive selection feature engineering (ASFE) procedure, where the adaptive feature aggregation and feature crosses are performed. Finally, with the selected features the XGBoost algorithm is trained for cavitation detection and tested on valve acoustic signal data provided by Samson AG (Frankfurt). Our method has achieved state-of-the-art results. The prediction performance on the binary classification (cavitation and no-cavitation) and the four-class classification (cavitation choked flow, constant cavitation, incipient cavitation and no-cavitation) are satisfactory and outperform the traditional XGBoost by 4.67% and 11.11% increase of the accuracy.
Deep learning-based pronunciation scoring models highly rely on the availability of the annotated non-native data, which is costly and has scalability issues. To deal with the data scarcity problem, data augmentation is commonly used for model pretraining. In this paper, we propose a phone-level mixup, a simple yet effective data augmentation method, to improve the performance of word-level pronunciation scoring. Specifically, given a phoneme sequence from lexicon, the artificial augmented word sample can be generated by randomly sampling from the corresponding phone-level features in training data, while the word score is the average of their GOP scores. Benefit from the arbitrary phone-level combination, the mixup is able to generate any word with various pronunciation scores. Moreover, we utilize multi-source information (e.g., MFCC and deep features) to further improve the scoring system performance. The experiments conducted on the Speechocean762 show that the proposed system outperforms the baseline by adding the mixup data for pretraining, with Pearson correlation coefficients (PCC) increasing from 0.567 to 0.61. The results also indicate that proposed method achieves similar performance by using 1/10 unlabeled data of baseline. In addition, the experimental results also demonstrate the efficiency of our proposed multi-source approach.
Anomalous sound detection (ASD) is one of the most significant tasks of mechanical equipment monitoring and maintaining in complex industrial systems. In practice, it is vital to precisely identify abnormal status of the working mechanical system, which can further facilitate the failure troubleshooting. In this paper, we propose a multi-pattern adversarial learning one-class classification framework, which allows us to use both the generator and the discriminator of an adversarial model for efficient ASD. The core idea is learning to reconstruct the normal patterns of acoustic data through two different patterns of auto-encoding generators, which succeeds in extending the fundamental role of a discriminator from identifying real and fake data to distinguishing between regional and local pattern reconstructions. Furthermore, we present a global filter layer for long-term interactions in the frequency domain space, which directly learns from the original data without introducing any human priors. Extensive experiments performed on four real-world datasets from different industrial domains (three cavitation datasets provided by SAMSON AG, and one existing publicly) for anomaly detection show superior results, and outperform recent state-of-the-art ASD methods.
Chorus detection is a challenging problem in musical signal processing as the chorus often repeats more than once in popular songs, usually with rich instruments and complex rhythm forms. Most of the existing works focus on the receptiveness of chorus sections based on some explicit features such as loudness and occurrence frequency. These pre-assumptions for chorus limit the generalization capacity of these methods, causing misdetection on other repeated sections such as verse. To solve the problem, in this paper we propose an end-to-end chorus detection model DeepChorus, reducing the engineering effort and the need for prior knowledge. The proposed model includes two main structures: i) a Multi-Scale Network to derive preliminary representations of chorus segments, and ii) a Self-Attention Convolution Network to further process the features into probability curves representing chorus presence. To obtain the final results, we apply an adaptive threshold to binarize the original curve. The experimental results show that DeepChorus outperforms existing state-of-the-art methods in most cases.
Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design does not lead to human-level performance in the perception of melody information for both tone (pitch-class) and octave. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perceptions by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, that explicitly groups harmonics via a rearrangement of frequency-bins. Second, we introduce an encoder-decoder architecture that is designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. Experiments are done to verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve the singing voice extraction performance across various datasets -- with substantial gains in octave and tone accuracy.
Multi-temporal hyperspectral images can be used to detect changed information, which has gradually attracted researchers' attention. However, traditional change detection algorithms have not deeply explored the relevance of spatial and spectral changed features, which leads to low detection accuracy. To better excavate both spectral and spatial information of changed features, a joint morphology and patch-tensor change detection (JMPT) method is proposed. Initially, a patch-based tensor strategy is adopted to exploit similar property of spatial structure, where the non-overlapping local patch image is reshaped into a new tensor cube, and then three-order Tucker decompositon and image reconstruction strategies are adopted to obtain more robust multi-temporal hyperspectral datasets. Meanwhile, multiple morphological profiles including max-tree and min-tree are applied to extract different attributes of multi-temporal images. Finally, these results are fused to general a final change detection map. Experiments conducted on two real hyperspectral datasets demonstrate that the proposed detector achieves better detection performance.
Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) have recently been combined to integrate the advantages of the two solutions for better policy learning. However, in existing hybrid methods, EA is used to directly train the policy network, which will lead to sample inefficiency and unpredictable impact on the policy performance. To better integrate these two approaches and avoid the drawbacks caused by the introduction of EA, we devote ourselves to devising a more efficient and reasonable method of combining EA and DRL. In this paper, we propose Evolutionary Action Selection-Twin Delayed Deep Deterministic Policy Gradient (EAS-TD3), a novel combination of EA and DRL. In EAS, we focus on optimizing the action chosen by the policy network and attempt to obtain high-quality actions to guide policy learning through an evolutionary algorithm. We conduct several experiments on challenging continuous control tasks. The result shows that EAS-TD3 shows superior performance over other state-of-art methods.
Extracting roads from high-resolution remote sensing images (HRSIs) is vital in a wide variety of applications, such as autonomous driving, path planning, and road navigation. Due to the long and thin shape as well as the shades induced by vegetation and buildings, small-sized roads are more difficult to discern. In order to improve the reliability and accuracy of small-sized road extraction when roads of multiple sizes coexist in an HRSI, an enhanced deep neural network model termed Dual-Decoder-U-Net (DDU-Net) is proposed in this paper. Motivated by the U-Net model, a small decoder is added to form a dual-decoder structure for more detailed features. In addition, we introduce the dilated convolution attention module (DCAM) between the encoder and decoders to increase the receptive field as well as to distill multi-scale features through cascading dilated convolution and global average pooling. The convolutional block attention module (CBAM) is also embedded in the parallel dilated convolution and pooling branches to capture more attention-aware features. Extensive experiments are conducted on the Massachusetts Roads dataset with experimental results showing that the proposed model outperforms the state-of-the-art DenseUNet, DeepLabv3+ and D-LinkNet by 6.5%, 3.3%, and 2.1% in the mean Intersection over Union (mIoU), and by 4%, 4.8%, and 3.1% in the F1 score, respectively. Both ablation and heatmap analyses are presented to validate the effectiveness of the proposed model.