Training speaker-discriminative and robust speaker verification systems without speaker labels is still challenging and worthwhile to explore. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Different from contrastive learning-based self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the latent representations of positive data pairs. We also explore the effectiveness of alternative online data augmentation strategies on both the time domain and frequency domain. With our strong online data augmentation strategy, the proposed SSReg shows the potential of self-supervised learning without using negative pairs and it can significantly improve the performance of self-supervised speaker representation learning with a simple Siamese network architecture. Comprehensive experiments on the VoxCeleb datasets demonstrate that our proposed self-supervised approach obtains a 23.4% relative improvement by adding the effective self-supervised regularization and outperforms other previous works.
The COVID-19 pandemic has affected societies and human health and well-being in various ways. In this study, we collected Reddit data from 2019 (pre-pandemic) and 2020 (pandemic) from the subreddits communities associated with 8 universities, applied natural language processing (NLP) techniques, and trained graphical neural networks with social media data, to study how the pandemic has affected people's emotions and psychological states compared to the pre-pandemic era. Specifically, we first applied a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) to learn embedding from the semantic information of Reddit messages and trained a graph attention network (GAT) for sentiment classification. The usage of GAT allows us to leverage the relational information among the messages during training. We then applied subgroup-adaptive model stacking to combine the prediction probabilities from RoBERTa and GAT to yield the final classification on sentiment. With the manually labeled and model-predicted sentiment labels on the collected data, we applied a generalized linear mixed-effects model to estimate the effects of pandemic and online teaching on people's sentiment in a statistically significant manner. The results suggest the odds of negative sentiments in 2020 is $14.6\%$ higher than the odds in 2019 ($p$-value $<0.001$), and the odds of negative sentiments are $41.6\%$ higher with in-person teaching than with online teaching in 2020 ($p$-value $=0.037$) in the studied population.
In recent years, Transformer has achieved good results in Natural Language Processing (NLP) and has also started to expand into Computer Vision (CV). Excellent models such as the Vision Transformer and Swin Transformer have emerged. At the same time, the platform for Transformer models was extended to embedded devices to meet some resource-sensitive application scenarios. However, due to the large number of parameters, the complex computational flow and the many different structural variants of Transformer models, there are a number of issues that need to be addressed in its hardware design. This is both an opportunity and a challenge. We propose Vis-TOP (Visual Transformer Overlay Processor), an overlay processor for various visual Transformer models. It differs from coarse-grained overlay processors such as CPU, GPU, NPE, and from fine-grained customized designs for a specific model. Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure that allows the model to be switched or changed freely without changing the hardware architecture. At the same time, the corresponding instruction bundle and hardware architecture are designed in three-layer and two-level transformation structure. After quantization of Swin Transformer tiny model using 8-bit fixed points (fix_8), we implemented an overlay processor on the ZCU102. Compared to GPU, the TOP throughput is 1.5x higher. Compared to the existing Transformer accelerators, our throughput per DSP is between 2.2x and 11.7x higher than others. In a word, the approach in this paper meets the requirements of real-time AI in terms of both resource consumption and inference speed. Vis-TOP provides a cost-effective and power-effective solution based on reconfigurable devices for computer vision at the edge.
We propose Noise-Augmented Privacy-Preserving Empirical Risk Minimization (NAPP-ERM) that solves ERM with differential privacy guarantees. Existing privacy-preserving ERM approaches may be subject to over-regularization with the employment of an l2 term to achieve strong convexity on top of the target regularization. NAPP-ERM improves over the current approaches and mitigates over-regularization by iteratively realizing target regularization through appropriately designed augmented data and delivering strong convexity via a single adaptively weighted dual-purpose l2 regularizer. When the target regularization is for variable selection, we propose a new regularizer that achieves both privacy and sparsity guarantees simultaneously. Finally, we propose a strategy to retrieve privacy budget when the strong convexity requirement is met, which can be returned to users such that the DP of ERM is guaranteed at a lower privacy cost than originally planned, or be recycled to the ERM optimization procedure to reduce the injected DP noise and improve the utility of DP-ERM. From an implementation perspective, NAPP-ERM can be achieved by optimizing a non-perturbed object function given noise-augmented data and can thus leverage existing tools for non-private ERM optimization. We illustrate through extensive experiments the mitigation effect of the over-regularization and private budget retrieval by NAPP-ERM on variable selection and prediction.
This report summarizes the results of Learning to Understand Aerial Images (LUAI) 2021 challenge held on ICCV 2021, which focuses on object detection and semantic segmentation in aerial images. Using DOTA-v2.0 and GID-15 datasets, this challenge proposes three tasks for oriented object detection, horizontal object detection, and semantic segmentation of common categories in aerial images. This challenge received a total of 146 registrations on the three tasks. Through the challenge, we hope to draw attention from a wide range of communities and call for more efforts on the problems of learning to understand aerial images.
Fast inference of numerical model parameters from data is an important prerequisite to generate predictive models for a wide range of applications. Use of sampling-based approaches such as Markov chain Monte Carlo may become intractable when each likelihood evaluation is computationally expensive. New approaches combining variational inference with normalizing flow are characterized by a computational cost that grows only linearly with the dimensionality of the latent variable space, and rely on gradient-based optimization instead of sampling, providing a more efficient approach for Bayesian inference about the model parameters. Moreover, the cost of frequently evaluating an expensive likelihood can be mitigated by replacing the true model with an offline trained surrogate model, such as neural networks. However, this approach might generate significant bias when the surrogate is insufficiently accurate around the posterior modes. To reduce the computational cost without sacrificing inferential accuracy, we propose Normalizing Flow with Adaptive Surrogate (NoFAS), an optimization strategy that alternatively updates the normalizing flow parameters and the weights of a neural network surrogate model. We also propose an efficient sample weighting scheme for surrogate model training that ensures some global accuracy of the surrogate while capturing the likely regions of the parameters that yield the observed data. We demonstrate the inferential and computational superiority of NoFAS against various benchmarks, including cases where the underlying model lacks identifiability. The source code and numerical experiments used for this study are available at https://github.com/cedricwangyu/NoFAS.
Computational virtual high-throughput screening (VHTS) with density functional theory (DFT) and machine-learning (ML)-acceleration is essential in rapid materials discovery. By necessity, efficient DFT-based workflows are carried out with a single density functional approximation (DFA). Nevertheless, properties evaluated with different DFAs can be expected to disagree for the cases with challenging electronic structure (e.g., open shell transition metal complexes, TMCs) for which rapid screening is most needed and accurate benchmarks are often unavailable. To quantify the effect of DFA bias, we introduce an approach to rapidly obtain property predictions from 23 representative DFAs spanning multiple families and "rungs" (e.g., semi-local to double hybrid) and basis sets on over 2,000 TMCs. Although computed properties (e.g., spin-state ordering and frontier orbital gap) naturally differ by DFA, high linear correlations persist across all DFAs. We train independent ML models for each DFA and observe convergent trends in feature importance; these features thus provide DFA-invariant, universal design rules. We devise a strategy to train ML models informed by all 23 DFAs and use them to predict properties (e.g., spin-splitting energy) of over 182k TMCs. By requiring consensus of the ANN-predicted DFA properties, we improve correspondence of these computational lead compounds with literature-mined, experimental compounds over the single-DFA approach typically employed. Both feature analysis and consensus-based ML provide efficient, alternative paths to overcome accuracy limitations of practical DFT.
As an effective technique to achieve the implementation of deep neural networks in edge devices, model quantization has been successfully applied in many practical applications. No matter the methods of quantization aware training (QAT) or post-training quantization (PTQ), they all depend on the target bit-widths. When the precision of quantization is adjusted, it is necessary to fine-tune the quantized model or minimize the quantization noise, which brings inconvenience in practical applications. In this work, we propose a method to train a model for all quantization that supports diverse bit-widths (e.g., form 8-bit to 1-bit) to satisfy the online quantization bit-width adjustment. It is hot-swappable that can provide specific quantization strategies for different candidates through multiscale quantization. We use wavelet decomposition and reconstruction to increase the diversity of weights, thus significantly improving the performance of each quantization candidate, especially at ultra-low bit-widths (e.g., 3-bit, 2-bit, and 1-bit). Experimental results on ImageNet and COCO show that our method can achieve accuracy comparable performance to dedicated models trained at the same precision.
An optical neural network is proposed and demonstrated with programmable matrix transformation and nonlinear activation function of photodetection (square-law detection). Based on discrete phase-coherent spatial modes, the dimensionality of programmable optical matrix operations is 30~37, which is implemented by spatial light modulators. With this architecture, all-optical classification tasks of handwritten digits, objects and depth images are performed on the same platform with high accuracy. Due to the parallel nature of matrix multiplication, the processing speed of our proposed architecture is potentially as high as7.4T~74T FLOPs per second (with 10~100GHz detector)
Model quantization can reduce the model size and computational latency, it has become an essential technique for the deployment of deep neural networks on resourceconstrained hardware (e.g., mobile phones and embedded devices). The existing quantization methods mainly consider the numerical elements of the weights and activation values, ignoring the relationship between elements. The decline of representation ability and information loss usually lead to the performance degradation. Inspired by the characteristics of images in the frequency domain, we propose a novel multiscale wavelet quantization (MWQ) method. This method decomposes original data into multiscale frequency components by wavelet transform, and then quantizes the components of different scales, respectively. It exploits the multiscale frequency and spatial information to alleviate the information loss caused by quantization in the spatial domain. Because of the flexibility of MWQ, we demonstrate three applications (e.g., model compression, quantized network optimization, and information enhancement) on the ImageNet and COCO datasets. Experimental results show that our method has stronger representation ability and can play an effective role in quantized neural networks.