For Bayesian optimization (BO) on high-dimensional data with complex structure, neural network-based kernels for Gaussian processes (GPs) have been used to learn flexible surrogate functions by the high representation power of deep learning. However, existing methods train neural networks by maximizing the marginal likelihood, which do not directly improve the BO performance. In this paper, we propose a meta-learning method for BO with neural network-based kernels that minimizes the expected gap between the true optimum value and the best value found by BO. We model a policy, which takes the current evaluated data points as input and outputs the next data point to be evaluated, by a neural network, where neural network-based kernels, GPs, and mutual information-based acquisition functions are used as its layers. With our model, the neural network-based kernel is trained to be appropriate for the acquisition function by backpropagating the gap through the acquisition function and GP. Our model is trained by a reinforcement learning framework from multiple tasks. Since the neural network is shared across different tasks, we can gather knowledge on BO from multiple training tasks, and use the knowledge for unseen test tasks. In experiments using three text document datasets, we demonstrate that the proposed method achieves better BO performance than the existing methods.
We propose a few-shot learning method for unsupervised feature selection, which is a task to select a subset of relevant features in unlabeled data. Existing methods usually require many instances for feature selection. However, sufficient instances are often unavailable in practice. The proposed method can select a subset of relevant features in a target task given a few unlabeled target instances by training with unlabeled instances in multiple source tasks. Our model consists of a feature selector and decoder. The feature selector outputs a subset of relevant features taking a few unlabeled instances as input such that the decoder can reconstruct the original features of unseen instances from the selected ones. The feature selector uses the Concrete random variables to select features via gradient descent. To encode task-specific properties from a few unlabeled instances to the model, the Concrete random variables and decoder are modeled using permutation-invariant neural networks that take a few unlabeled instances as input. Our model is trained by minimizing the expected test reconstruction error given a few unlabeled instances that is calculated with datasets in source tasks. We experimentally demonstrate that the proposed method outperforms existing feature selection methods.
The ratio of two probability densities, called a density-ratio, is a vital quantity in machine learning. In particular, a relative density-ratio, which is a bounded extension of the density-ratio, has received much attention due to its stability and has been used in various applications such as outlier detection and dataset comparison. Existing methods for (relative) density-ratio estimation (DRE) require many instances from both densities. However, sufficient instances are often unavailable in practice. In this paper, we propose a meta-learning method for relative DRE, which estimates the relative density-ratio from a few instances by using knowledge in related datasets. Specifically, given two datasets that consist of a few instances, our model extracts the datasets' information by using neural networks and uses it to obtain instance embeddings appropriate for the relative DRE. We model the relative density-ratio by a linear model on the embedded space, whose global optimum solution can be obtained as a closed-form solution. The closed-form solution enables fast and effective adaptation to a few instances, and its differentiability enables us to train our model such that the expected test error for relative DRE can be explicitly minimized after adapting to a few instances. We empirically demonstrate the effectiveness of the proposed method by using three problems: relative DRE, dataset comparison, and outlier detection.
We propose a method that meta-learns a knowledge on matrix factorization from various matrices, and uses the knowledge for factorizing unseen matrices. The proposed method uses a neural network that takes a matrix as input, and generates prior distributions of factorized matrices of the given matrix. The neural network is meta-learned such that the expected imputation error is minimized when the factorized matrices are adapted to each matrix by a maximum a posteriori (MAP) estimation. We use a gradient descent method for the MAP estimation, which enables us to backpropagate the expected imputation error through the gradient descent steps for updating neural network parameters since each gradient descent step is written in a closed form and is differentiable. The proposed method can meta-learn from matrices even when their rows and columns are not shared, and their sizes are different from each other. In our experiments with three user-item rating datasets, we demonstrate that our proposed method can impute the missing values from a limited number of observations in unseen matrices after being trained with different matrices.
Bayesian model averaging, obtained as the expectation of a likelihood function by a posterior distribution, has been widely used for prediction, evaluation of uncertainty, and model selection. Various approaches have been developed to efficiently capture the information in the posterior distribution; one such approach is the optimization of a set of models simultaneously with interaction to ensure the diversity of the individual models in the same way as ensemble learning. A representative approach is particle variational inference (PVI), which uses an ensemble of models as an empirical approximation for the posterior distribution. PVI iteratively updates each model with a repulsion force to ensure the diversity of the optimized models. However, despite its promising performance, a theoretical understanding of this repulsion and its association with the generalization ability remains unclear. In this paper, we tackle this problem in light of PAC-Bayesian analysis. First, we provide a new second-order Jensen inequality, which has the repulsion term based on the loss function. Thanks to the repulsion term, it is tighter than the standard Jensen inequality. Then, we derive a novel generalization error bound and show that it can be reduced by enhancing the diversity of models. Finally, we derive a new PVI that optimizes the generalization error bound directly. Numerical experiments demonstrate that the performance of the proposed PVI compares favorably with existing methods in the experiment.
Sequences of events including infectious disease outbreaks, social network activities, and crimes are ubiquitous and the data on such events carry essential information about the underlying diffusion processes between communities (e.g., regions, online user groups). Modeling diffusion processes and predicting future events are crucial in many applications including epidemic control, viral marketing, and predictive policing. Hawkes processes offer a central tool for modeling the diffusion processes, in which the influence from the past events is described by the triggering kernel. However, the triggering kernel parameters, which govern how each community is influenced by the past events, are assumed to be static over time. In the real world, the diffusion processes depend not only on the influences from the past, but also the current (time-evolving) states of the communities, e.g., people's awareness of the disease and people's current interests. In this paper, we propose a novel Hawkes process model that is able to capture the underlying dynamics of community states behind the diffusion processes and predict the occurrences of events based on the dynamics. Specifically, we model the latent dynamic function that encodes these hidden dynamics by a mixture of neural networks. Then we design the triggering kernel using the latent dynamic function and its integral. The proposed method, termed DHP (Dynamic Hawkes Processes), offers a flexible way to learn complex representations of the time-evolving communities' states, while at the same time it allows to computing the exact likelihood, which makes parameter learning tractable. Extensive experiments on four real-world event datasets show that DHP outperforms five widely adopted methods for event prediction.
Topic models have been successfully used for analyzing text documents. However, with existing topic models, many documents are required for training. In this paper, we propose a neural network-based few-shot learning method that can learn a topic model from just a few documents. The neural networks in our model take a small number of documents as inputs, and output topic model priors. The proposed method trains the neural networks such that the expected test likelihood is improved when topic model parameters are estimated by maximizing the posterior probability using the priors based on the EM algorithm. Since each step in the EM algorithm is differentiable, the proposed method can backpropagate the loss through the EM algorithm to train the neural networks. The expected test likelihood is maximized by a stochastic gradient descent method using a set of multiple text corpora with an episodic training framework. In our experiments, we demonstrate that the proposed method achieves better perplexity than existing methods using three real-world text document sets.
For better clustering performance, appropriate representations are critical. Although many neural network-based metric learning methods have been proposed, they do not directly train neural networks to improve clustering performance. We propose a meta-learning method that train neural networks for obtaining representations such that clustering performance improves when the representations are clustered by the variational Bayesian (VB) inference with an infinite Gaussian mixture model. The proposed method can cluster unseen unlabeled data using knowledge meta-learned with labeled data that are different from the unlabeled data. For the objective function, we propose a continuous approximation of the adjusted Rand index (ARI), by which we can evaluate the clustering performance from soft clustering assignments. Since the approximated ARI and the VB inference procedure are differentiable, we can backpropagate the objective function through the VB inference procedure to train the neural networks. With experiments using text and image data sets, we demonstrate that our proposed method has a higher adjusted Rand index than existing methods do.