Random forest is effective for prediction tasks but the randomness of tree generation hinders interpretability in feature importance analysis. To address this, we proposed DT-Sampler, a SAT-based method for measuring feature importance in tree-based model. Our method has fewer parameters than random forest and provides higher interpretability and stability for the analysis in real-world problems. An implementation of DT-Sampler is available at https://github.com/tsudalab/DT-sampler.
Predictive pattern mining is an approach used to construct prediction models when the input is represented by structured data, such as sets, graphs, and sequences. The main idea behind predictive pattern mining is to build a prediction model by considering substructures, such as subsets, subgraphs, and subsequences (referred to as patterns), present in the structured data as features of the model. The primary challenge in predictive pattern mining lies in the exponential growth of the number of patterns with the complexity of the structured data. In this study, we propose the Safe Pattern Pruning (SPP) method to address the explosion of pattern numbers in predictive pattern mining. We also discuss how it can be effectively employed throughout the entire model building process in practical data analysis. To demonstrate the effectiveness of the proposed method, we conduct numerical experiments on regression and classification problems involving sets, graphs, and sequences.
NIMS-OS (NIMS Orchestration System) is a Python library created to realize a closed loop of robotic experiments and artificial intelligence (AI) without human intervention for automated materials exploration. It uses various combinations of modules to operate autonomously. Each module acts as an AI for materials exploration or a controller for a robotic experiments. As AI techniques, Bayesian optimization (PHYSBO), boundless objective-free exploration (BLOX), phase diagram construction (PDC), and random exploration (RE) methods can be used. Moreover, a system called NIMS automated robotic electrochemical experiments (NAREE) is available as a set of robotic experimental equipment. Visualization tools for the results are also included, which allows users to check the optimization results in real time. Newly created modules for AI and robotic experiments can be added easily to extend the functionality of the system. In addition, we developed a GUI application to control NIMS-OS.To demonstrate the operation of NIMS-OS, we consider an automated exploration for new electrolytes. NIMS-OS is available at https://github.com/nimsos-dev/nimsos.
We present a framework for embedding graph structured data into a vector space, taking into account node features and topology of a graph into the optimal transport (OT) problem. Then we propose a novel distance between two graphs, named linearFGW, defined as the Euclidean distance between their embeddings. The advantages of the proposed distance are twofold: 1) it can take into account node feature and structure of graphs for measuring the similarity between graphs in a kernel-based framework, 2) it can be much faster for computing kernel matrix than pairwise OT-based distances, particularly fused Gromov-Wasserstein, making it possible to deal with large-scale data sets. After discussing theoretical properties of linearFGW, we demonstrate experimental results on classification and clustering tasks, showing the effectiveness of the proposed linearFGW.
Automated high-stake decision-making such as medical diagnosis requires models with high interpretability and reliability. As one of the interpretable and reliable models with good prediction ability, we consider Sparse High-order Interaction Model (SHIM) in this study. However, finding statistically significant high-order interactions is challenging due to the intrinsic high dimensionality of the combinatorial effects. Another problem in data-driven modeling is the effect of "cherry-picking" a.k.a. selection bias. Our main contribution is to extend the recently developed parametric programming approach for selective inference to high-order interaction models. Exhaustive search over the cherry tree (all possible interactions) can be daunting and impractical even for a small-sized problem. We introduced an efficient pruning strategy and demonstrated the computational efficiency and statistical power of the proposed method using both synthetic and real data.
Deep generative models have been shown powerful in generating novel molecules with desired chemical properties via their representations such as strings, trees or graphs. However, these models are limited in recommending synthetic routes for the generated molecules in practice. We propose a generative model to generate molecules via multi-step chemical reaction trees. Specifically, our model first propose a chemical reaction tree with predicted reaction templates and commercially available molecules (starting molecules), and then perform forward synthetic steps to obtain product molecules. Experiments show that our model can generate chemical reactions whose product molecules are with desired chemical properties. Also, the complete synthetic routes for these product molecules are provided.
A black-box optimization algorithm such as Bayesian optimization finds extremum of an unknown function by alternating inference of the underlying function and optimization of an acquisition function. In a high-dimensional space, such algorithms perform poorly due to the difficulty of acquisition function optimization. Herein, we apply quantum annealing (QA) to overcome the difficulty in the continuous black-box optimization. As QA specializes in optimization of binary problems, a continuous vector has to be encoded to binary, and the solution of QA has to be translated back. Our method has the following three parts: 1) Random subspace coding based on axis-parallel hyperrectangles from continuous vector to binary vector. 2) A quadratic unconstrained binary optimization (QUBO) defined by acquisition function based on nonnegative-weighted linear regression model which is solved by QA. 3) A penalization scheme to ensure that the QA solution can be translated back. It is shown in benchmark tests that its performance using D-Wave Advantage$^{\rm TM}$ quantum annealer is competitive with a state-of-the-art method based on the Gaussian process in high-dimensional problems. Our method may open up a new possibility of quantum annealing and other QUBO solvers including quantum approximate optimization algorithm (QAOA) using a gated-quantum computers, and expand its range of application to continuous-valued problems.
Machine learning applications in materials science are often hampered by shortage of experimental data. Integration with legacy data from past experiments is a viable way to solve the problem, but complex calibration is often necessary to use the data obtained under different conditions. In this paper, we present a novel calibration-free strategy to enhance the performance of Bayesian optimization with preference learning. The entire learning process is solely based on pairwise comparison of quantities (i.e., higher or lower) in the same dataset, and experimental design can be done without comparing quantities in different datasets. We demonstrate that Bayesian optimization is significantly enhanced via addition of legacy data for organic molecules and inorganic solid-state materials.
We solve tensor balancing, rescaling an Nth order nonnegative tensor by multiplying N tensors of order N - 1 so that every fiber sums to one. This generalizes a fundamental process of matrix balancing used to compare matrices in a wide range of applications from biology to economics. We present an efficient balancing algorithm with quadratic convergence using Newton's method and show in numerical experiments that the proposed algorithm is several orders of magnitude faster than existing ones. To theoretically prove the correctness of the algorithm, we model tensors as probability distributions in a statistical manifold and realize tensor balancing as projection onto a submanifold. The key to our algorithm is that the gradient of the manifold, used as a Jacobian matrix in Newton's method, can be analytically obtained using the Moebius inversion formula, the essential of combinatorial mathematics. Our model is not limited to tensor balancing, but has a wide applicability as it includes various statistical and machine learning models such as weighted DAGs and Boltzmann machines.
We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.