Shujian Yu

Deep Deterministic Information Bottleneck with Matrix-based Entropy Functional

Jan 31, 2021

Xi Yu, Shujian Yu, Jose C. Principe

Abstract: We introduce the matrix-based Renyi's $\alpha$-order entropy functional to parameterize the information bottleneck (IB) principle of Tishby et al. with a neural network. We term our methodology Deep Deterministic Information Bottleneck (DIB), as it avoids variational inference and distributional assumptions. We show that deep neural networks trained with DIB outperform their variational-objective counterparts and networks trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack. Code available at https://github.com/yuxi120407/DIB

* Accepted at ICASSP-21. Code available at https://github.com/yuxi120407/DIB. Extended version of the supplementary material in "Measuring the Dependence with Matrix-based Entropy Functional", AAAI-21, arXiv:2101.10160
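
As a rough illustration of the kind of estimator this abstract refers to, the sketch below computes the matrix-based Renyi's $\alpha$-order entropy and the mutual information derived from it, using NumPy. It is not the authors' released code: the function names, the Gaussian kernel, the fixed bandwidth `sigma`, and the default `alpha=2.0` are all assumptions made for illustration.

```python
import numpy as np

def gram_matrix(x, sigma=1.0):
    """Gaussian-kernel Gram matrix of the rows of x, scaled to unit trace.

    The kernel and the bandwidth sigma are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    k = np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0):
    """S_alpha(A) = log2(sum_i lambda_i(A)^alpha) / (1 - alpha), alpha != 1."""
    eig = np.clip(np.linalg.eigvalsh(a), 0.0, None)  # drop tiny negative values
    return np.log2(np.sum(eig ** alpha)) / (1.0 - alpha)

def joint_entropy(a, b, alpha=2.0):
    """Joint entropy from the trace-normalized Hadamard product of A and B."""
    h = a * b
    return renyi_entropy(h / np.trace(h), alpha)

def mutual_information(x, y, sigma=1.0, alpha=2.0):
    """I(X;Y) = S(A) + S(B) - S(A, B), computed on the same batch of samples."""
    a, b = gram_matrix(x, sigma), gram_matrix(y, sigma)
    return (renyi_entropy(a, alpha) + renyi_entropy(b, alpha)
            - joint_entropy(a, b, alpha))
```

In a DIB-style objective, a term of this form between an input batch and its bottleneck representation would presumably be added to the classification loss with a trade-off weight; training with it would require an automatic-differentiation framework rather than plain NumPy.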

Measuring Dependence with Matrix-based Entropy Functional

Jan 25, 2021

Shujian Yu, Francesco Alesiani, Xi Yu, Robert Jenssen, Jose C. Principe

Abstract: Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective via Shearer's inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation ($T_\alpha^*$) and the matrix-based normalized dual total correlation ($D_\alpha^*$), to quantify the dependence of multiple variables in spaces of arbitrary dimension, without explicit estimation of the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent ones. We also show the impact of our measures on four different machine learning problems, namely gene regulatory network inference, robust machine learning under covariate shift and non-Gaussian noise, subspace outlier detection, and understanding the learning dynamics of convolutional neural networks (CNNs), to demonstrate their utility, advantages, and implications for those problems. Code of our dependence measure is available at: https://bit.ly/AAAI-dependence

* Accepted at AAAI-21. An interpretable and differentiable dependence (or independence) measure that can be used to 1) train deep networks under covariate shift and non-Gaussian noise; 2) implement a deep deterministic information bottleneck; and 3) understand the learning dynamics of CNNs. Code available at https://bit.ly/AAAI-dependence
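
To make the idea of a distribution-free multivariate dependence measure concrete, here is a minimal NumPy sketch of an unnormalized matrix-based total correlation: the sum of the per-variable entropies minus the joint entropy obtained from the Hadamard product of all Gram matrices. It does not reproduce the normalization that defines $T_\alpha^*$ or $D_\alpha^*$ in the paper, and the function names, Gaussian kernel, bandwidth `sigma`, and `alpha=2.0` are assumptions for illustration only.

```python
import numpy as np

def unit_trace_gram(x, sigma=1.0):
    """Gaussian Gram matrix over samples (rows), scaled to trace 1 (assumed kernel)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def matrix_entropy(a, alpha=2.0):
    """Matrix-based Renyi alpha-order entropy of a unit-trace PSD matrix."""
    eig = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(eig ** alpha)) / (1.0 - alpha)

def total_correlation(variables, sigma=1.0, alpha=2.0):
    """Unnormalized total correlation of several variables observed on the same samples.

    variables: list of arrays, each of shape (n_samples,) or (n_samples, d_i).
    """
    grams = [unit_trace_gram(v, sigma) for v in variables]
    joint = grams[0]
    for g in grams[1:]:
        joint = joint * g              # Hadamard (element-wise) product
    joint = joint / np.trace(joint)    # renormalize to unit trace
    marginals = sum(matrix_entropy(g, alpha) for g in grams)
    return marginals - matrix_entropy(joint, alpha)
```

Each variable may have a different dimensionality because only its n-by-n Gram matrix enters the computation; a constant variable contributes zero entropy and leaves the joint term unchanged.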

Modular-Relatedness for Continual Learning

Nov 02, 2020

Ammar Shaker, Shujian Yu, Francesco Alesiani

Abstract: In this paper, we propose a continual learning (CL) technique that benefits sequential task learners by improving their retained accuracy and reducing catastrophic forgetting. The principal target of our approach is the automatic extraction of modular parts of the neural network, followed by estimation of the relatedness between tasks given these modular components. This technique is applicable to different families of CL methods, such as regularization-based approaches (e.g., Elastic Weight Consolidation, EWC) and rehearsal-based approaches (e.g., Gradient Episodic Memory, GEM), where episodic memory is needed. Empirical results demonstrate remarkable performance gains (in terms of robustness to forgetting) for methods such as EWC and GEM based on our technique, especially when the memory budget is very limited.

Bilevel Continual Learning

Nov 02, 2020

Ammar Shaker, Francesco Alesiani, Shujian Yu, Wenzhe Yin

Abstract: Continual learning (CL) studies the problem of learning a sequence of tasks, one at a time, such that learning each new task does not degrade performance on previously seen ones, while exploiting previously learned features. This paper presents Bilevel Continual Learning (BiCL), a general framework for continual learning that fuses bilevel optimization and recent advances in meta-learning for deep neural networks. BiCL is able to train both deep discriminative and generative models under the conservative setting of online continual learning. Experimental results show that BiCL provides competitive accuracy on the current task while reducing the effect of catastrophic forgetting.

This is concurrent work with [1]. We submitted it to AAAI 2020 and IJCAI 2020, and have now put it on arXiv for the record. Unlike [1], we also consider continual generative models. The authors are also aware of a recent proposal on bilevel-optimization-based coreset construction for continual learning [2].

[1] Q. Pham, D. Sahoo, C. Liu, and S. C. Hoi. Bilevel continual learning. arXiv preprint arXiv:2007.15553, 2020.
[2] Z. Borsos, M. Mutny, and A. Krause. Coresets via bilevel optimization for continual learning and streaming. arXiv preprint arXiv:2006.03875, 2020.

Learning an Interpretable Graph Structure in Multi-Task Learning

Sep 11, 2020

Shujian Yu, Francesco Alesiani, Ammar Shaker, Wenzhe Yin

Abstract: We present a novel methodology to jointly perform multi-task learning and infer the intrinsic relationships among tasks through an interpretable and sparse graph. Unlike existing multi-task learning methodologies, the graph structure is not assumed to be known a priori or estimated separately in a preprocessing step. Instead, our graph is learned simultaneously with the model parameters of each task, so it reflects the critical relationships among tasks in the specific prediction problem. We characterize the graph structure by its weighted adjacency matrix and show that the overall objective can be optimized alternately until convergence. We also show that our methodology can be readily extended to a nonlinear form by embedding it into a multi-head radial basis function network (RBFN). Extensive experiments against six state-of-the-art methodologies, on both synthetic data and real-world applications, suggest that our methodology is able to reduce generalization error and, at the same time, reveal a sparse graph over tasks that is much easier to interpret.

* 11 pages, 7 figures

Towards Interpretable Multi-Task Learning Using Bilevel Programming

Sep 11, 2020

Francesco Alesiani, Shujian Yu, Ammar Shaker, Wenzhe Yin

Abstract: Interpretable Multi-Task Learning can be expressed as learning a sparse graph of the task relationship based on the prediction performance of the learned models. Since many natural phenomena exhibit sparse structures, enforcing sparsity on learned models reveals the underlying task relationship. Moreover, different sparsification degrees from a fully connected graph uncover various types of structures, like cliques, trees, lines, clusters or fully disconnected graphs. In this paper, we propose a bilevel formulation of multi-task learning that induces sparse graphs, thus revealing the underlying task relationships, and an efficient method for its computation. We show empirically how the induced sparse graph improves the interpretability of the learned models and their relationship on synthetic and real data, without sacrificing generalization performance. Code at https://bit.ly/GraphGuidedMTL

* Manuscript accepted at ECML PKDD 2020

PRI-VAE: Principle-of-Relevant-Information Variational Autoencoders

Jul 13, 2020

Yanjun Li, Shujian Yu, Jose C. Principe, Xiaolin Li, Dapeng Wu

Abstract: Although substantial efforts have been made to learn disentangled representations under the variational autoencoder (VAE) framework, the fundamental properties of the learning dynamics of most VAE models remain unknown and under-investigated. In this work, we first propose a novel learning objective, termed the principle-of-relevant-information variational autoencoder (PRI-VAE), to learn disentangled representations. We then present an information-theoretic perspective to analyze existing VAE models by inspecting the evolution of some critical information-theoretic quantities across training epochs. Our observations unveil some fundamental properties associated with VAEs. Empirical results also demonstrate the effectiveness of PRI-VAE on four benchmark data sets.

Modularizing Deep Learning via Pairwise Learning With Kernels

May 12, 2020

Shiyu Duan, Shujian Yu, Jose Principe

Abstract: By redefining the conventional notions of layers, we present an alternative view on finitely wide, fully trainable deep neural networks as stacked linear models in feature spaces, leading to a kernel machine interpretation. Based on this construction, we then propose a provably optimal modular learning framework for classification, avoiding between-module backpropagation. This modular training approach brings new insights into the label requirement of deep learning: It leverages weak pairwise labels when learning the hidden modules. When training the output module, on the other hand, it requires full supervision but achieves high label efficiency, needing as few as 10 randomly selected labeled examples (one from each class) to achieve 94.88% accuracy on CIFAR-10 using a ResNet-18 backbone. Moreover, modular training enables fully modularized deep learning workflows, which then simplify the design and implementation of pipelines and improve the maintainability and reusability of models. To showcase the advantages of such a modularized workflow, we describe a simple yet reliable method for estimating reusability of pre-trained modules as well as task transferability in a transfer learning setting. At practically no computational overhead, it precisely described the task space structure of 15 binary classification tasks from CIFAR-10.

Measuring the Discrepancy between Conditional Distributions: Methods, Properties and Applications

May 05, 2020

Shujian Yu, Ammar Shaker, Francesco Alesiani, Jose C. Principe

Abstract: We propose a simple yet powerful test statistic to quantify the discrepancy between two conditional distributions. The new statistic avoids the explicit estimation of the underlying distributions in high-dimensional space, and it operates on the cone of symmetric positive semidefinite (SPS) matrices using the Bregman matrix divergence. Moreover, it inherits the merits of the correntropy function to explicitly incorporate high-order statistics in the data. We present the properties of our new statistic and illustrate its connections to prior art. We finally show the applications of our new statistic on three different machine learning problems, namely multi-task learning over graphs, concept drift detection, and information-theoretic feature selection, to demonstrate its utility and advantage. Code of our statistic is available at https://bit.ly/BregmanCorrentropy.

* Accepted at IJCAI-20. Code available at https://github.com/SJYuCNEL/Bregman-Correntropy-Conditional-Divergence
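
For readers unfamiliar with Bregman matrix divergences, the sketch below evaluates one well-known member of that family, the von Neumann divergence $D(A\|B) = \mathrm{tr}(A \log A - A \log B - A + B)$, on two symmetric positive definite matrices. It only illustrates the divergence itself: how the paper builds its matrices from correntropy and combines divergences into a conditional statistic is not reproduced here, and the function names and the small ridge `eps` are assumptions.

```python
import numpy as np

def spd_log(m, floor=1e-12):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, floor, None)        # guard against round-off near zero
    return (v * np.log(w)) @ v.T       # V diag(log w) V^T

def von_neumann_divergence(a, b, eps=1e-8):
    """D(A || B) = tr(A log A - A log B - A + B) for symmetric PSD inputs.

    A small ridge eps * I (an assumption) keeps both matrices strictly
    positive definite so that the logarithms are well defined.
    """
    n = a.shape[0]
    a = a + eps * np.eye(n)
    b = b + eps * np.eye(n)
    return float(np.trace(a @ spd_log(a) - a @ spd_log(b) - a + b))
```

The divergence is zero when the two matrices coincide and positive otherwise, and it is not symmetric in its arguments, so the order of `a` and `b` matters.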

Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels

Sep 25, 2019

Kristoffer Wickstrøm, Sigurd Løkse, Michael Kampffmeyer, Shujian Yu, Jose Principe, Robert Jenssen

Abstract: Analyzing deep neural networks (DNNs) via information plane (IP) theory has gained tremendous attention recently as a tool to gain insight into, among others, their generalization ability. However, it is by no means obvious how to estimate the mutual information (MI) between each hidden layer and the input/desired output in order to construct the IP. For instance, hidden layers with many neurons require MI estimators that are robust to the high dimensionality associated with such layers. MI estimators should also be able to naturally handle convolutional layers, while at the same time being computationally tractable enough to scale to large networks. None of the existing IP methods to date have been able to study truly deep Convolutional Neural Networks (CNNs), such as VGG-16. In this paper, we propose an IP analysis using the new matrix-based Rényi's entropy coupled with tensor kernels over convolutional layers, leveraging the power of kernel methods to represent properties of the probability distribution independently of the dimensionality of the data. The obtained results shed new light on the previous literature concerning small-scale DNNs, while using a completely new approach. Importantly, the new framework enables us to provide the first comprehensive IP analysis of contemporary large-scale DNNs and CNNs, investigating the different training phases and providing new insights into the training dynamics of large-scale neural networks.

* 15 pages, 8 figures
