Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to achieve maximized performance and enhanced scalability. We quantify the tradeoffs between memory and communication cost and evaluate KAISA on large models, including ResNet-50, Mask R-CNN, U-Net, and BERT, on up to 128 NVIDIA A100 GPUs. Compared to the original optimizers, KAISA converges 18.1-36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers.
One-bit analog-to-digital converter (ADC), performing signal sampling as an extreme simple comparator, is an overwhelming technology for spectrum sensing due to its low-cost, low-power consumptions and high sampling rate. In this letter, we propose a novel one-bit sensing approach based on the eigenvalue moment ratio (EMR), which has been proved to be highly efficient for conventional multi-antenna spectrum sensing in $\infty$-bit situation. Particularly, we determine the asymptotic distribution of one-bit EMR under null hypothesis via the central limited theorem (CLT), allowing us to perform spectrum sensing with one-bit samples directly. Theoretical and simulation analysis show the new approach can provide reasonably good sensing performance at a low hardware cost.
Chest X-rays are an important and accessible clinical imaging tool for the detection of many thoracic diseases. Over the past decade, deep learning, with a focus on the convolutional neural network (CNN), has become the most powerful computer-aided diagnosis technology for improving disease identification performance. However, training an effective and robust deep CNN usually requires a large amount of data with high annotation quality. For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming. Thus, existing public chest X-ray datasets usually adopt language pattern based methods to automatically mine labels from reports. However, this results in label uncertainty and inconsistency. In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods from two perspectives to improve a single model's disease identification performance, rather than focusing on an ensemble of models. MODL integrates multiple models to obtain a soft label distribution for optimizing the single target model, which can reduce the effects of original label uncertainty. Moreover, KNNS aims to enhance the robustness of the target model to provide consistent predictions on images with similar medical findings. Extensive experiments on the public NIH Chest X-ray and CheXpert datasets show that our model achieves consistent improvements over the state-of-the-art methods.
Previous studies dominantly target at self-supervised learning on real-valued networks and have achieved many promising results. However, on the more challenging binary neural networks (BNNs), this task has not yet been fully explored in the community. In this paper, we focus on this more difficult scenario: learning networks where both weights and activations are binary, meanwhile, without any human annotated labels. We observe that the commonly used contrastive objective is not satisfying on BNNs for competitive accuracy, since the backbone network contains relatively limited capacity and representation ability. Hence instead of directly applying existing self-supervised methods, which cause a severe decline in performance, we present a novel guided learning paradigm from real-valued to distill binary networks on the final prediction distribution, to minimize the loss and obtain desirable accuracy. Our proposed method can boost the simple contrastive learning baseline by an absolute gain of 5.5~15% on BNNs. We further reveal that it is difficult for BNNs to recover the similar predictive distributions as real-valued models when training without labels. Thus, how to calibrate them is key to address the degradation in performance. Extensive experiments are conducted on the large-scale ImageNet and downstream datasets. Our method achieves substantial improvement over the simple contrastive learning baseline, and is even comparable to many mainstream supervised BNN methods. Code will be made available.
The rapid growth in literature accumulates diverse and yet comprehensive biomedical knowledge hidden to be mined such as drug interactions. However, it is difficult to extract the heterogeneous knowledge to retrieve or even discover the latest and novel knowledge in an efficient manner. To address such a problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI encompasses the language model BioBERT which has been comprehensively pre-trained on biomedical corpus. In particular, we propose the multi-head attention mechanism and pack BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pre-trained language model BioGPT-2 where the generation sentences are selected based on filtering rules. We evaluated the classification part on "DDIs 2013" dataset and "DTIs" dataset, achieving the FI score of 0.842 and 0.720 respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified with the exiting growth truth to confirm the filtered sentences. The generated sentences that are not recorded in DrugBank and DDIs 2013 dataset also demonstrate the potential of EGFI to identify novel drug relationships.
Target detection is an important problem in multiple-input multiple-output (MIMO) radar. Many existing target detection algorithms were proposed without taking into consideration the quantization error caused by analog-to-digital converters (ADCs). This paper addresses the problem of target detection for MIMO radar with one-bit ADCs and derives a Rao's test-based detector. The proposed method has several appealing features: 1) it is a closed-form detector; 2) it allows us to handle sign measurements straightforwardly; 3) there are closed-form approximations of the detector's distributions, which allow us to theoretically evaluate its performance. Moreover, the closed-form distributions allow us to study the performance degradation due to the one-bit ADCs, yielding an approximate $2$ dB loss in the low-signal-to-noise-ratio (SNR) regime compared to $\infty$-bit ADCs. Simulation results are included to showcase the advantage of the proposed detector and validate the accuracy of the theoretical results.
Generative adversarial networks (GANs) have achieved remarkable progress in recent years, but the continuously growing scale of models make them challenging to deploy widely in practical applications. In particular, for real-time generation tasks, different devices require generators of different sizes due to varying computing power. In this paper, we introduce slimmable GANs (SlimGANs), which can flexibly switch the width of the generator to accommodate various quality-efficiency trade-offs at runtime. Specifically, we leverage multiple discriminators that share partial parameters to train the slimmable generator. To facilitate the consistency between generators of different widths, we present a stepwise inplace distillation technique that encourages narrow generators to learn from wide ones. As for class-conditional generation, we propose a sliceable conditional batch normalization that incorporates the label information into different widths. Our methods are validated, both quantitatively and qualitatively, by extensive experiments and a detailed ablation study.
There is now extensive evidence demonstrating that deep neural networks are vulnerable to adversarial examples, motivating the development of defenses against adversarial attacks. However, existing adversarial defenses typically improve model robustness against individual specific perturbation types. Some recent methods improve model robustness against adversarial attacks in multiple $\ell_p$ balls, but their performance against each perturbation type is still far from satisfactory. To better understand this phenomenon, we propose the \emph{multi-domain} hypothesis, stating that different types of adversarial perturbations are drawn from different domains. Guided by the multi-domain hypothesis, we propose \emph{Gated Batch Normalization (GBN)}, a novel building block for deep neural networks that improves robustness against multiple perturbation types. GBN consists of a gated sub-network and a multi-branch batch normalization (BN) layer, where the gated sub-network separates different perturbation types, and each BN branch is in charge of a single perturbation type and learns domain-specific statistics for input transformation. Then, features from different branches are aligned as domain-invariant representations for the subsequent layers. We perform extensive evaluations of our approach on MNIST, CIFAR-10, and Tiny-ImageNet, and demonstrate that GBN outperforms previous defense proposals against multiple perturbation types, i.e, $\ell_1$, $\ell_2$, and $\ell_{\infty}$ perturbations, by large margins of 10-20\%.
The nonlinearity of activation functions used in deep learning models are crucial for the success of predictive models. There are several commonly used simple nonlinear functions, including Rectified Linear Unit (ReLU) and Leaky-ReLU (L-ReLU). In practice, these functions remarkably enhance the model accuracy. However, there is limited insight into the functionality of these nonlinear activation functions in terms of why certain models perform better than others. Here, we investigate the model performance when using ReLU or L-ReLU as activation functions in different model architectures and data domains. Interestingly, we found that the application of L-ReLU is mostly effective when the number of trainable parameters in a model is relatively small. Furthermore, we found that the image classification models seem to perform well with L-ReLU in fully connected layers, especially when pre-trained models such as the VGG-16 are used for the transfer learning.