In speech enhancement, complex-valued neural networks have shown promising performance owing to their effectiveness in processing the complex-valued spectrum. Most recent speech enhancement approaches focus on wide-band signals with a sampling rate of 16 kHz. However, research on super-wide-band (e.g., 32 kHz) or even full-band (48 kHz) denoising is still lacking because of the difficulty of modeling more frequency bands, particularly high-frequency components. In this paper, we substantially extend our previous deep complex convolution recurrent neural network (DCCRN) to a super-wide-band version, S-DCCRN, to perform denoising on speech sampled at 32 kHz. We first employ a cascaded sub-band and full-band processing module consisting of two small-footprint DCCRNs, one operating on the sub-band signal and one on the full-band signal, to benefit from both local and global frequency information. Moreover, instead of simply adopting the STFT feature as input, we use a complex feature encoder trained in an end-to-end manner to refine the information of different frequency bands. We also use a complex feature decoder to revert the feature to the time-frequency domain. Finally, a learnable spectrum compression method is adopted to adjust the energy of different frequency bands, which is beneficial for neural network learning. The proposed model, S-DCCRN, surpasses PercepNet as well as several competitive models and achieves state-of-the-art performance in terms of speech quality and intelligibility. Ablation studies also demonstrate the effectiveness of the individual contributions.
Detailed derivations of two bounds on the minimum mean-square error (MMSE) of complex-valued multiple-input multiple-output (MIMO) systems are presented for performance evaluation. In particular, the lower bound is derived from a genie-aided MMSE estimator, whereas the upper bound is derived from a maximum-likelihood (ML) estimator. Using the well-known relationship between the mutual information (MI) and the MMSE, two bounds on the MI are also derived, based on which we discuss the asymptotic behaviour of the average MI in the high signal-to-noise ratio (SNR) regime. Theoretical analyses suggest that the average MI converges to its maximum as the SNR increases and that the diversity order equals the number of receive antennas.
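For reference, the relationship between MI and MMSE invoked above is commonly stated as follows. This is a sketch in assumed notation (the symbols $\gamma$, $\mathbf{H}$, $\mathbf{x}$, $\mathbf{n}$ are not taken from the abstract), for a complex-valued channel with unit-variance circularly symmetric Gaussian noise and MI measured in nats:

```latex
\mathbf{y} = \sqrt{\gamma}\,\mathbf{H}\mathbf{x} + \mathbf{n},
\qquad \mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}),

\frac{\mathrm{d}}{\mathrm{d}\gamma}\, I(\mathbf{x};\mathbf{y})
  = \mathrm{mmse}(\gamma)
  = \mathbb{E}\!\left[\bigl\lVert \mathbf{H}\mathbf{x}
      - \mathbf{H}\,\mathbb{E}[\mathbf{x}\mid\mathbf{y}] \bigr\rVert^{2}\right],

I(\gamma) = \int_{0}^{\gamma} \mathrm{mmse}(t)\,\mathrm{d}t .
```

Because the integral is monotone in its integrand, a lower (upper) bound on the MMSE that holds for every SNR in $[0, \gamma]$ integrates to a lower (upper) bound on the MI at $\gamma$.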
Recently, using different channels to model social semantic information and using self-supervised learning tasks to maintain the characteristics of each channel when fusing the information have proven to be a very promising direction. However, how to mine the relationships between different channels in depth and make full use of them while maintaining the uniqueness of each channel has not been well studied or resolved in this field. Against this background, this paper demonstrates experimentally the deficiency of directly constructing contrastive learning tasks on different channels and proposes a scheme of interactive modeling and matching representation across channels. This is the first such attempt in the field of recommender systems, and we believe the insight of this paper can inspire future self-supervised learning research based on multi-channel information. To solve this problem, we propose a cross-channel matching representation model based on attentive interaction, which efficiently models the relationships between cross-channel information. Building on this, we also propose a hierarchical self-supervised learning model, which performs self-supervised learning at two levels, within and between channels, and improves the ability of self-supervised tasks to autonomously mine different levels of latent information. Extensive experiments with multiple metrics on several public datasets show that the proposed method significantly improves on state-of-the-art methods in both the general and cold-start scenarios. In the model variant analysis, the benefits of the cross-channel matching representation model and the hierarchical self-supervised model are also verified.
In the context of smart grids and load balancing, daily peak load forecasting has become a critical activity for stakeholders of the energy industry. An understanding of peak magnitude and timing is paramount for the implementation of smart grid strategies such as peak shaving. The modelling approach proposed in this paper leverages high-resolution and low-resolution information to forecast daily peak demand size and timing. The resulting multi-resolution modelling framework can be adapted to different model classes. The key contributions of this paper are a) a general and formal introduction to the multi-resolution modelling approach, b) a discussion on modelling approaches at different resolutions implemented via Generalised Additive Models and Neural Networks and c) experimental results on real data from the UK electricity market. The results confirm that the predictive performance of the proposed modelling approach is competitive with that of low- and high-resolution alternatives.
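The two forecasting targets named above, daily peak magnitude and daily peak timing, can be illustrated by deriving them from a high-resolution demand series. The sketch below is not the paper's modelling framework; the function name `daily_peaks` and the half-hourly resolution (48 periods per day) are assumptions made for the example.

```python
import numpy as np

def daily_peaks(demand, periods_per_day=48):
    """Return (peak magnitude, peak period index) for each day.

    `demand` is a 1-D high-resolution series covering whole days;
    `periods_per_day=48` assumes half-hourly readings (hypothetical choice).
    """
    demand = np.asarray(demand, dtype=float)
    if demand.size % periods_per_day != 0:
        raise ValueError("demand must cover a whole number of days")
    by_day = demand.reshape(-1, periods_per_day)  # one row per day
    return by_day.max(axis=1), by_day.argmax(axis=1)
```

A low-resolution model would take these daily summaries as its targets, whereas a high-resolution model would forecast the full within-day profile and read the peak off the forecast.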
Conversational semantic role labeling (CSRL) is believed to be a crucial step towards dialogue understanding. However, handling conversational structural information remains a major challenge for existing CSRL parsers. In this paper, we present a simple and effective architecture for CSRL that aims to address this problem. Our model is based on a conversational structure-aware graph network that explicitly encodes speaker-dependent information. We also propose a multi-task learning method to further improve the model. Experimental results on benchmark datasets show that our model, trained with the proposed objectives, significantly outperforms previous baselines.
We propose an end-to-end architecture for facial expression recognition. Our model learns an optimal tree topology for facial landmarks, whose traversal generates a sequence from which we obtain an embedding to feed a sequential learner. The proposed architecture incorporates two main streams, one focusing on landmark positions to learn the structure of the face, while the other focuses on patches around the landmarks to learn texture information. Each stream is followed by an attention mechanism and the outputs are fed to a two-stream fusion component to perform the final classification. We conduct extensive experiments on two large-scale publicly available facial expression datasets, AffectNet and FER2013, to evaluate the efficacy of our approach. Our method outperforms other solutions in the area and sets new state-of-the-art expression recognition rates on these datasets.
Single-stage instance segmentation approaches have recently gained popularity due to their speed and simplicity, but are still lagging behind in accuracy, compared to two-stage methods. We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating mask prediction of an instance to different sub-regions of a detected bounding-box. Our main contribution is a novel light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for each sub-region within a bounding-box, leading to improved mask predictions. It also enables accurate delineation of spatially adjacent instances. Further, we introduce a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection. On COCO test-dev, our SipMask outperforms the existing single-stage methods. Compared to the state-of-the-art single-stage TensorMask, SipMask obtains an absolute gain of 1.0% (mask AP), while providing a four-fold speedup. In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp. We also evaluate our SipMask for real-time video instance segmentation, achieving promising results on YouTube-VIS dataset. The source code is available at https://github.com/JialeCao001/SipMask.
The rise in urbanization throughout the United States (US) in recent years has required urban planners and transportation engineers to give greater consideration to the transportation services available to residents of a metropolitan region. This compels transportation authorities to provide better and more reliable modes of public transit through improved technologies and increased service quality. These improvements can be achieved by identifying and understanding the factors that influence urban public transit demand. These factors can be internal, external, or both. Internal factors include policy measures such as transit fares, service headways, and travel times. External factors include geographic, socioeconomic, and highway facility characteristics. Because there is inherent simultaneity between transit supply and demand, a two-stage least squares (2SLS) regression modeling procedure should be used to forecast urban transit supply and demand. Accordingly, two multiple linear regression models should be developed: one to predict transit supply and a second to predict transit demand. It was found that service area density, total average cost per trip, and the average number of vehicles operated in maximum service can be used to forecast transit supply, expressed as vehicle revenue hours. Furthermore, estimated vehicle revenue hours and total average fares per trip can be used to forecast transit demand, expressed as unlinked passenger trips. Additional data, such as socioeconomic information on the areas surrounding each transit agency and travel time information for the various transit systems, would be useful for improving the models developed.
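The two-stage procedure described above can be sketched with plain least squares. This is a minimal illustration on synthetic data, not the study's fitted models: the coefficients in `simulate`, the function names, and the shared `shock` term that creates the simultaneity are all assumptions. Following standard 2SLS practice, the first stage also includes the exogenous demand regressor (fare).

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already has an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def two_stage_least_squares(instruments, exog, endog, y):
    """Stage 1: fit the endogenous regressor (supply) on instruments and exog.
    Stage 2: fit y (demand) on the stage-1 fitted values and exog."""
    ones = np.ones((len(y), 1))
    stage1_X = np.hstack([ones, instruments, exog])
    stage1_coef = ols(stage1_X, endog)
    endog_hat = stage1_X @ stage1_coef            # estimated supply
    stage2_X = np.hstack([ones, endog_hat[:, None], exog])
    return stage1_coef, ols(stage2_X, y)

def simulate(n=5000, seed=0):
    """Synthetic supply/demand data; every coefficient here is made up."""
    rng = np.random.default_rng(seed)
    density, cost, vehicles, fare = rng.normal(size=(4, n))
    shock = rng.normal(size=n)                    # unobserved, hits both equations
    supply = 2.0 + 1.5 * density - 0.7 * cost + vehicles + shock + rng.normal(size=n)
    demand = 1.0 + 0.8 * supply - 0.5 * fare + 2.0 * shock + rng.normal(size=n)
    return np.column_stack([density, cost, vehicles]), fare[:, None], supply, demand

if __name__ == "__main__":
    inst, exog, supply, demand = simulate()
    _, coef = two_stage_least_squares(inst, exog, supply, demand)
    print("2SLS [intercept, supply, fare]:", coef)
```

In this synthetic setup, regressing demand directly on observed supply overstates the supply coefficient because the shared shock moves both variables; replacing supply with its first-stage estimate removes that correlation.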
We present the design of a new passive communication method that does not rely on ambient or generated RF sources. Instead, we exploit the Johnson (thermal) noise generated by a resistor to transmit information bits wirelessly. By switching the load connected to an antenna between a resistor and an open circuit, we achieve data rates of up to 26 bps and distances of up to 7.3 meters. This communication method consumes orders of magnitude less power than conventional communication schemes and presents the opportunity to enable wireless communication in areas with a complete lack of connectivity.
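To give a sense of the signal levels involved, the standard Johnson-Nyquist formulas can be evaluated directly. The sketch below is only a back-of-the-envelope aid: the temperature, bandwidth, and resistance values are assumptions, not parameters reported in the abstract, and the function names are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K (exact SI value)

def thermal_noise_power_dbm(temp_k, bandwidth_hz):
    """Available noise power k*T*B from a matched resistor, in dBm."""
    power_watts = BOLTZMANN * temp_k * bandwidth_hz
    return 10.0 * math.log10(power_watts / 1e-3)

def thermal_noise_vrms(temp_k, resistance_ohm, bandwidth_hz):
    """Open-circuit RMS noise voltage sqrt(4*k*T*R*B) across a resistor."""
    return math.sqrt(4.0 * BOLTZMANN * temp_k * resistance_ohm * bandwidth_hz)

if __name__ == "__main__":
    # Assumed values: 290 K, 1 MHz bandwidth, 50-ohm resistor.
    print(thermal_noise_power_dbm(290.0, 1e6))
    print(thermal_noise_vrms(290.0, 50.0, 1e6))
```

At 290 K the available power is roughly -174 dBm per hertz of bandwidth, which indicates how weak the modulated noise is and why the achievable data rate and range are modest.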
Learning to optimize the area under the receiver operating characteristic curve (AUC) for imbalanced data has attracted much attention in recent years. Although several AUC optimization methods exist, scaling up AUC optimization is still an open issue because of its pairwise learning style. Maximizing AUC on a large-scale dataset can be regarded as a non-convex and expensive problem. Inspired by the characteristics of pairwise learning, a cheap AUC optimization task on a small-scale dataset sampled from the large-scale dataset is constructed to improve the AUC of the original, large-scale, and expensive AUC optimization task. This paper develops an evolutionary multitasking framework (termed EMTAUC) that makes full use of the information shared among the constructed cheap and expensive tasks to obtain higher performance. In EMTAUC, one task is to optimize AUC on the sampled dataset, and the other is to maximize AUC on the original dataset. Moreover, because the cheap task contains limited knowledge, a strategy for dynamically adjusting the data structure of the cheap task is proposed to introduce more knowledge into the multitasking AUC optimization environment. The performance of the proposed method is evaluated on a series of binary classification datasets. The experimental results demonstrate that EMTAUC is highly competitive with single-task methods and online methods. Supplementary materials and the source code of EMTAUC can be accessed at https://github.com/xiaofangxd/EMTAUC.
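The pairwise nature of AUC, and the idea of a cheaper task built from a sampled subset, can be illustrated as follows. This sketch is not the EMTAUC algorithm (it contains no evolutionary search or knowledge transfer); the function names and the stratified sampling scheme are assumptions chosen for the example.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    The cost grows with n_pos * n_neg, which is what makes the
    full-dataset task expensive. Ties count as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]            # all positive-negative pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def stratified_subsample(labels, fraction, rng):
    """Indices of a smaller dataset that keeps at least one sample per class."""
    labels = np.asarray(labels)
    picked = []
    for cls in (0, 1):
        cls_idx = np.flatnonzero(labels == cls)
        k = max(1, int(round(fraction * len(cls_idx))))
        picked.append(rng.choice(cls_idx, size=k, replace=False))
    return np.concatenate(picked)
```

Evaluating `pairwise_auc` on the indices returned by `stratified_subsample` touches far fewer pairs than the full dataset, which is the sense in which the sampled task is cheap; the price is a noisier estimate of the full-data AUC.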