Most current deep learning-based approaches to speech enhancement operate only in the spectrogram domain or only in the waveform domain. Although a cross-domain transformer combining waveform- and spectrogram-domain inputs has been proposed, its performance can be further improved. In this paper, we present a novel deep complex hybrid transformer that integrates spectrogram-domain and waveform-domain approaches to improve speech enhancement performance. The proposed model consists of two parts: a complex Swin-Unet in the spectrogram domain and a dual-path transformer network (DPTnet) in the waveform domain. We first construct a complex Swin-Unet network in the spectrogram domain and perform speech enhancement on the complex audio spectrum. We then introduce an improved DPT by adding memory-compressed attention. Our model learns multi-domain features and reduces noise in the different domains in a complementary way. Experimental results on the BirdSoundsDenoising dataset and the VCTK+DEMAND dataset indicate that our method achieves better performance than state-of-the-art methods.
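The abstract does not spell out how the memory-compressed attention is built. As a minimal sketch, assuming the general idea is to shorten the key and value sequences before attention, the NumPy code below compresses them by average pooling. The pooling choice, the function names, and the `factor` parameter are illustrative assumptions and not the paper's design; a learned strided convolution could take the place of the pooling.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress(x, factor):
    """Shorten a (time, dim) sequence by averaging groups of `factor` steps."""
    t, d = x.shape
    pad = (-t) % factor
    if pad:
        # Repeat the last step so the length is a multiple of `factor`.
        x = np.concatenate([x, np.repeat(x[-1:], pad, axis=0)], axis=0)
    return x.reshape(-1, factor, d).mean(axis=1)

def memory_compressed_attention(q, k, v, factor=4):
    """Scaled dot-product attention over compressed keys and values."""
    k_c = compress(k, factor)                    # (ceil(t / factor), dim)
    v_c = compress(v, factor)
    scores = q @ k_c.T / np.sqrt(q.shape[-1])    # (t, ceil(t / factor))
    return softmax(scores, axis=-1) @ v_c        # (t, dim)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
out = memory_compressed_attention(x, x, x, factor=4)
print(out.shape)  # (16, 8)
```

With `factor=4` the score matrix has a quarter as many columns as in full self-attention, which is where the memory saving comes from.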
Recent high-performance transformer-based speech enhancement models demonstrate that time-domain methods can achieve performance similar to that of time-frequency-domain methods. However, time-domain speech enhancement systems typically receive input audio sequences consisting of a large number of time steps, which makes it challenging to model extremely long sequences and to train models that perform adequately. In this paper, we address these challenges by using smaller audio chunks as input, which makes efficient use of the audio information. We propose a dual-phase audio transformer for denoising (DPATD), a novel model that organizes transformer layers in a deep structure to learn clean audio sequences for denoising. DPATD splits the audio input into smaller chunks, where the input length can be proportional to the square root of the original sequence length. Our memory-compressed explainable attention is efficient and converges faster than the frequently used self-attention module. Extensive experiments demonstrate that our model outperforms state-of-the-art methods.
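To illustrate the square-root relationship, the sketch below splits a one-dimensional signal into non-overlapping chunks whose length is about the square root of the signal length, so that both the within-chunk and the across-chunk sequences stay short. The function name, the zero padding, and the absence of overlap are assumptions made for illustration; the abstract does not give the actual chunking scheme.

```python
import math
import numpy as np

def split_into_chunks(signal, chunk_len=None):
    """Reshape a 1-D signal into (n_chunks, chunk_len), zero-padding the tail."""
    n = len(signal)
    if chunk_len is None:
        # Chunk length close to sqrt(n), so n_chunks is also close to sqrt(n).
        chunk_len = max(1, math.isqrt(n))
    n_chunks = math.ceil(n / chunk_len)
    padded = np.zeros(n_chunks * chunk_len, dtype=signal.dtype)
    padded[:n] = signal
    return padded.reshape(n_chunks, chunk_len)

audio = np.arange(10_000, dtype=np.float32)
chunks = split_into_chunks(audio)
print(chunks.shape)  # (100, 100)
```

A sequence of 10,000 samples thus becomes 100 chunks of 100 samples, and a layer that attends within a chunk or across chunks only ever sees sequences of length 100.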
Achieving high-performance audio denoising is still a challenging task in real-world applications. Existing time-frequency methods often ignore the quality of the generated frequency-domain images. This paper converts the audio denoising problem into an image generation task. We first develop a complex image generation SwinTransformer network to capture more information from the complex Fourier domain. We then impose structural similarity and detail loss functions to generate high-quality images, and develop an SDR loss to minimize the difference between the denoised and clean audio. Extensive experiments on two benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods.
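For reference, the signal-to-distortion ratio (SDR) is commonly defined as

$$\mathrm{SDR} = 10 \log_{10} \frac{\lVert s \rVert^2}{\lVert s - \hat{s} \rVert^2},$$

where $s$ is the clean signal and $\hat{s}$ the denoised estimate. The sketch below uses this textbook definition and takes the negative SDR as a loss. The exact form of the SDR loss is not given in the abstract, so the function names and the `eps` stabilizer are assumptions.

```python
import numpy as np

def sdr_db(clean, denoised, eps=1e-8):
    """Signal-to-distortion ratio in dB (higher is better)."""
    clean = np.asarray(clean, dtype=np.float64)
    denoised = np.asarray(denoised, dtype=np.float64)
    signal_power = np.sum(clean ** 2)
    error_power = np.sum((clean - denoised) ** 2)
    # eps avoids division by zero and log(0) for silent or perfect inputs.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

def sdr_loss(clean, denoised):
    """Negative SDR, so that minimizing the loss maximizes SDR."""
    return -sdr_db(clean, denoised)

clean = np.ones(100)
print(sdr_db(clean, 0.9 * clean))  # close to 20 dB
```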
The majority of road accidents occur because of human error, including distraction, recklessness, and drunk driving. One effective way to reduce this danger is to implement self-driving technologies in vehicles. In this paper, we focus on building an efficient deep learning model for self-driving cars. We propose a new and effective convolutional neural network model called "LaksNet", consisting of four convolutional layers and two fully connected layers. We conduct extensive experiments using our LaksNet model with training data generated from the Udacity simulator. Our model outperforms many existing pre-trained ImageNet and NVIDIA models in terms of how long the car drives on the simulator without going off the track.
While unsupervised domain adaptation (UDA) has been explored to transfer knowledge from a labeled source domain to an unlabeled target domain, existing methods focus on aligning the distributions of the two domains. However, how to better align source and target features is not well addressed. In this paper, we propose a deep feature registration (DFR) model that generates registered features which preserve domain-invariant features while minimizing the domain dissimilarity between registered features and target features via histogram matching. We further employ a pseudo-label refinement process that combines probabilistic soft selection and center-based hard selection to improve the quality of pseudo labels in the target domain. Extensive experiments on multiple UDA benchmarks demonstrate the effectiveness of our DFR model, which achieves new state-of-the-art performance.
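As a rough illustration of histogram matching, the sketch below applies the classic one-dimensional quantile-mapping procedure to each feature dimension independently, so that the matched source features take on the empirical distribution of the target features. How DFR actually uses histogram matching is not specified in the abstract; the function names and the per-dimension treatment are assumptions.

```python
import numpy as np

def match_histogram(source, target):
    """Map 1-D source values onto the empirical distribution of target."""
    source = np.asarray(source, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Rank of each source value, rescaled to a quantile in [0, 1].
    ranks = np.argsort(np.argsort(source))
    quantiles = ranks / max(len(source) - 1, 1)
    # Read off the target value at the same quantile.
    return np.quantile(target, quantiles)

def match_features_per_dim(source_feats, target_feats):
    """Apply histogram matching to every column of (n_samples, n_dims) arrays."""
    cols = [match_histogram(source_feats[:, j], target_feats[:, j])
            for j in range(source_feats.shape[1])]
    return np.stack(cols, axis=1)

print(match_histogram([3.0, 1.0, 2.0], [10.0, 20.0, 30.0]))  # [30. 10. 20.]
```

The ordering of the source values is preserved; only their marginal distribution changes.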
In this paper, we present a small cow stall number dataset named CowStallNumbers, which is extracted from cow teat videos with the goal of advancing cow stall number detection. The dataset contains 1042 training images and 261 test images, with stall numbers ranging from 0 to 60. In addition, we fine-tune a ResNet34 model and augment the dataset with random crop, center crop, and random rotation. The fine-tuned model achieves 92% accuracy in stall number recognition and a 40.1% IoU score in stall number position prediction.
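The IoU (intersection over union) score for position prediction can be computed as in the sketch below, assuming axis-aligned bounding boxes given as `(x1, y1, x2, y2)`. The box format and the function name are assumptions; the abstract does not describe the annotation format.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```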
Lung cancer has emerged as a severe disease that threatens human life and health. Precise segmentation of lung regions is a crucial prerequisite for localizing tumors and provides accurate information for lung image analysis. In this work, we first propose a lung image segmentation model that uses NASNet-Large as an encoder followed by a decoder; the encoder-decoder design is one of the most commonly used architectures in deep learning for image segmentation. The proposed NASNet-Large-decoder architecture can extract high-level information and expand the feature map to recover the segmentation map. To further improve the segmentation results, we propose a post-processing layer that removes the irrelevant portion of the segmentation map. Experimental results show that the proposed model segments accurately, achieving a Dice score of 0.92 and outperforming state-of-the-art methods.
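The Dice score measures the overlap between a predicted and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal NumPy sketch for binary masks follows; the function name and the `eps` term, which makes two empty masks score 1, are assumptions.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print(dice_score(pred, target))  # about 0.667
```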
Audio denoising has been explored for decades using both traditional and deep learning-based methods. However, these methods are still limited either by their reliance on manually added artificial noise or by the low quality of the denoised audio. To overcome these challenges, we collect a large-scale bird sound dataset with natural noise. We are the first to convert the audio denoising problem into an image segmentation problem, and we propose a deep visual audio denoising (DVAD) model. With a total of 14,120 audio images, we develop an audio ImageMask tool and propose a few-shot generalization strategy to label these images. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance. We also show that our method can be easily generalized to speech denoising, audio separation, audio enhancement, and noise estimation.
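To make the connection between segmentation and denoising concrete, the sketch below applies a binary mask to a spectrogram and inverts the result back to audio. It is a simplified illustration and not the DVAD pipeline: the transform uses non-overlapping rectangular frames, and the mask is written by hand where a segmentation model would normally predict it. All names and parameters are assumptions.

```python
import numpy as np

def stft_frames(signal, frame_len):
    """Simplified STFT: non-overlapping rectangular frames, one FFT per frame."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1)       # (n_frames, frame_len // 2 + 1)

def istft_frames(spec, frame_len):
    """Inverse of stft_frames: inverse FFT per frame, then concatenate."""
    return np.fft.irfft(spec, n=frame_len, axis=1).reshape(-1)

def denoise_with_mask(signal, mask, frame_len):
    """Keep only the time-frequency bins where mask == 1."""
    return istft_frames(stft_frames(signal, frame_len) * mask, frame_len)

frame_len, n_frames = 64, 8
t = np.arange(frame_len * n_frames)
clean = np.sin(2 * np.pi * 4 * t / frame_len)        # tone in frequency bin 4
noise = 0.5 * np.sin(2 * np.pi * 20 * t / frame_len)  # tone in frequency bin 20
mask = np.zeros((n_frames, frame_len // 2 + 1))
mask[:, :10] = 1.0                                    # keep the low bins only
denoised = denoise_with_mask(clean + noise, mask, frame_len)
print(np.max(np.abs(denoised - clean)))  # close to zero
```

In this toy case the noise occupies bins that the mask removes entirely, so the clean tone is recovered almost exactly; real recordings overlap in time and frequency, which is why the mask has to be learned.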
Since ancient times, what Chinese people have pursued is very simple: nothing more than "to live and work happily, and to eat and dress comfortably". Today, more than 40 years after the reform and opening up, people have basically solved the problem of food and clothing, and the urgent problem is housing. Recently, the storm surrounding long-term rental apartment intermediary platforms such as Eggshell has increased renters' sense of insecurity; together with the urbanization of recent years and the scramble for people among major cities, this will make competition in the future real estate market more intense. In order to better track real estate prices, help consumers buy houses at reasonable prices, and provide a reference for the government in formulating policies, this paper summarizes the existing methods of house price prediction and proposes a house price prediction method based on mixed depth vision and text features.
While huge volumes of unlabeled data are generated and made available in many domains, the demand for automated understanding of visual data is higher than ever before. Most existing machine learning models rely on massive amounts of labeled training data to achieve high performance. Unfortunately, this requirement often cannot be met in real-world applications: the number of labels is limited, and manually annotating data is expensive and time-consuming. It is therefore often necessary to transfer knowledge from an existing labeled domain to a new domain. However, model performance degrades because of the differences between domains (domain shift or dataset bias). To overcome the burden of annotation, Domain Adaptation (DA) aims to mitigate the domain shift problem when transferring knowledge from one domain to another similar but different domain. Unsupervised DA (UDA) deals with a labeled source domain and an unlabeled target domain. The principal objective of UDA is to reduce the domain discrepancy between the labeled source data and the unlabeled target data and to learn domain-invariant representations across the two domains during training. In this paper, we first define the UDA problem. Second, we review state-of-the-art methods for the different categories of UDA, covering both traditional and deep learning-based methods. Finally, we collect frequently used benchmark datasets and report the results of state-of-the-art UDA methods on visual recognition problems.
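One widely used way to quantify domain discrepancy is the maximum mean discrepancy (MMD), which compares the mean embeddings of the source and target features under a kernel. The sketch below computes a biased estimate of the squared MMD with an RBF kernel. It is shown only as an example of a discrepancy measure; the kernel, the `gamma` value, and the function names are assumptions and are not tied to any specific method in the survey.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd2(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two feature sets."""
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
source = rng.standard_normal((50, 2))
near = rng.standard_normal((50, 2))   # drawn from the same distribution
far = near + 3.0                      # shifted domain
print(mmd2(source, near), mmd2(source, far))
```

The shifted domain yields a much larger value than the sample from the same distribution, which is the behavior a UDA method tries to drive toward zero in feature space.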