
Caihua Shan


Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads

Aug 14, 2023
Dingheng Mo, Fanchao Chen, Siqiang Luo, Caihua Shan

LSM-trees are widely adopted as the storage backend of key-value stores. However, optimizing system performance under dynamic workloads has not been sufficiently studied or evaluated in previous work. To fill this gap, we present RusKey, a key-value store with the following new features: (1) RusKey is a first attempt to orchestrate LSM-tree structures online to enable robust performance in the context of dynamic workloads; (2) RusKey is the first study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3) RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient transition between different compaction policies -- the bottleneck of dynamic key-value stores. We justify the superiority of the new design with theoretical analysis; (4) RusKey requires no prior workload knowledge for system adjustment, in contrast to state-of-the-art techniques. Experiments show that RusKey exhibits strong performance robustness across diverse workloads, achieving up to 4x better end-to-end performance than RocksDB under various settings.
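The RL-driven adaptation described above can be pictured as a policy selector that treats candidate compaction policies as actions and observed throughput as reward. The sketch below is a hypothetical epsilon-greedy bandit for intuition only; it is not RusKey's actual agent, reward design, or FLSM-tree transition mechanism.

```python
import random

class CompactionPolicyBandit:
    """Toy epsilon-greedy agent that picks an LSM-tree compaction policy.

    Hypothetical illustration only: the policy names, the reward signal
    (observed throughput), and the bandit formulation are assumptions.
    """

    def __init__(self, policies, epsilon=0.1, seed=0):
        self.policies = list(policies)
        self.epsilon = epsilon
        self.counts = {p: 0 for p in self.policies}
        self.values = {p: 0.0 for p in self.policies}
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.policies)
        return max(self.policies, key=lambda p: self.values[p])

    def update(self, policy, reward):
        # Incremental mean of the rewards observed for this policy.
        self.counts[policy] += 1
        n = self.counts[policy]
        self.values[policy] += (reward - self.values[policy]) / n
```

Under a shifting workload, the loop would periodically call `select()`, run the chosen compaction policy for a window, and feed the measured throughput back via `update()`.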

* 25 pages, 13 figures 

Biological Factor Regulatory Neural Network

Apr 11, 2023
Xinnan Dai, Caihua Shan, Jie Zheng, Xiaoxiao Li, Dongsheng Li

Genes are fundamental for analyzing biological systems, and many recent works have proposed to utilize gene expression for various biological tasks using deep learning models. Despite their promising performance, it is hard for deep neural networks to provide biological insights to humans due to their black-box nature. Recently, some works have integrated biological knowledge with neural networks to improve their transparency and performance. However, these methods can only incorporate partial biological knowledge, leading to suboptimal performance. In this paper, we propose the Biological Factor Regulatory Neural Network (BFReg-NN), a generic framework for modeling relations among biological factors in cell systems. BFReg-NN starts from gene expression data and can merge most existing biological knowledge into the model, including the regulatory relations among genes or proteins (e.g., gene regulatory networks (GRNs), protein-protein interaction (PPI) networks) and the hierarchical relations among genes, proteins and pathways (e.g., several genes/proteins are contained in a pathway). Moreover, BFReg-NN can also provide new biologically meaningful insights because of its white-box characteristics. Experimental results on different gene expression-based tasks verify the superiority of BFReg-NN over baselines. Our case studies also show that the key insights found by BFReg-NN are consistent with the biological literature.
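The knowledge-injection idea can be sketched as a linear layer whose weights are element-wise masked by a prior-knowledge adjacency (e.g., GRN or PPI edges), so that only biologically plausible connections carry signal. This is a minimal hypothetical illustration, not BFReg-NN's actual architecture or training procedure.

```python
import numpy as np

def knowledge_masked_layer(x, weight, mask, bias):
    """One forward step of a prior-masked layer.

    x:      (batch, n_genes) expression inputs
    weight: (n_genes, n_out) learnable weights
    mask:   (n_genes, n_out) 0/1 matrix from a GRN/PPI-style prior;
            zero entries delete the connection entirely.
    """
    return x @ (weight * mask) + bias
```

Because zeroed weights never contribute, each output unit's value can be traced back to the specific genes the prior allows, which is the kind of white-box readout the abstract refers to.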


SeeGera: Self-supervised Semi-implicit Graph Variational Auto-encoders with Masking

Feb 07, 2023
Xiang Li, Tiandi Ye, Caihua Shan, Dongsheng Li, Ming Gao

Generative graph self-supervised learning (SSL) aims to learn node representations by reconstructing the input graph data. However, most existing methods focus only on unsupervised learning tasks, and few works have shown their superiority over state-of-the-art graph contrastive learning (GCL) models, especially on classification tasks. While a very recent model has been proposed to bridge this gap, its performance on unsupervised learning tasks remains unknown. In this paper, to comprehensively enhance the performance of generative graph SSL against other GCL models on both unsupervised and supervised learning tasks, we propose the SeeGera model, which is based on the family of self-supervised variational graph auto-encoders (VGAEs). Specifically, SeeGera adopts the semi-implicit variational inference framework, a hierarchical variational framework, and mainly focuses on feature reconstruction and structure/feature masking. On the one hand, SeeGera co-embeds both nodes and features in the encoder and reconstructs both links and features in the decoder. Since feature embeddings contain rich semantic information on features, they can be combined with node embeddings to provide fine-grained knowledge for feature reconstruction. On the other hand, SeeGera adds an additional layer for structure/feature masking to the hierarchical variational framework, which boosts model generalizability. We conduct extensive experiments comparing SeeGera with 9 other state-of-the-art competitors. Our results show that SeeGera compares favorably with state-of-the-art GCL methods on a variety of unsupervised and supervised learning tasks.
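The structure/feature masking step can be pictured as hiding a random fraction of feature entries and edges before encoding, leaving the hidden parts as reconstruction targets. The sketch below is a hypothetical pre-processing helper, not SeeGera's actual masking layer or its semi-implicit variational machinery.

```python
import numpy as np

def mask_graph(features, edges, feat_rate=0.3, edge_rate=0.3, seed=0):
    """Randomly mask node features and drop edges before encoding.

    Returns the masked feature matrix, the surviving edge list, and the
    boolean mask marking which feature entries were hidden (these become
    the decoder's reconstruction targets).
    """
    rng = np.random.default_rng(seed)
    feat_mask = rng.random(features.shape) < feat_rate  # True = hidden
    masked_feats = np.where(feat_mask, 0.0, features)
    keep = rng.random(len(edges)) >= edge_rate
    kept_edges = [e for e, k in zip(edges, keep) if k]
    return masked_feats, kept_edges, feat_mask
```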

* Accepted by WebConf 2023 

RuDi: Explaining Behavior Sequence Models by Automatic Statistics Generation and Rule Distillation

Aug 16, 2022
Yao Zhang, Yun Xiong, Yiheng Sun, Caihua Shan, Tian Lu, Hui Song, Yangyong Zhu

Risk scoring systems have been widely deployed in many applications; they assign risk scores to users according to their behavior sequences. Though many deep learning methods with sophisticated designs have achieved promising results, their black-box nature hinders their application due to fairness, explainability, and compliance considerations. Rule-based systems are considered reliable in these sensitive scenarios, but building one is labor-intensive: experts need to find informative statistics in user behavior sequences, design rules based on those statistics, and assign a weight to each rule. In this paper, we bridge the gap between effective but black-box models and transparent rule models. We propose a two-stage method, RuDi, that distills the knowledge of black-box teacher models into rule-based student models. In the first stage, a Monte Carlo tree search-based statistics generation method provides a set of informative statistics. In the second stage, the statistics are composed into logical rules by our proposed neural logical networks, which mimic the outputs of the teacher models. We evaluate RuDi on three real-world public datasets and an industrial dataset to demonstrate its effectiveness.
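The transparent student model can be sketched as a weighted set of threshold rules over the generated statistics, where a user's score is the sum of the weights of the rules that fire. This hypothetical simplification ignores RuDi's neural logical networks and shows only the kind of interpretable artifact the distillation targets.

```python
def score_with_rules(stats, rules):
    """Score one user from a transparent rule set.

    stats: dict of statistic name -> value (what stage 1 would generate;
           the names here are made up for illustration)
    rules: list of (name, threshold, weight); a rule fires when
           stats[name] > threshold, contributing its weight to the score.
    """
    return sum(w for name, thr, w in rules if stats.get(name, 0.0) > thr)
```

An auditor can read each `(name, threshold, weight)` triple directly, which is exactly the property that makes rule systems acceptable in compliance-sensitive settings.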

* CIKM'2022. Codes: https://github.com/yzhang1918/cikm2022rudi 

RendNet: Unified 2D/3D Recognizer With Latent Space Rendering

Jun 21, 2022
Ruoxi Shi, Xinyang Jiang, Caihua Shan, Yansen Wang, Dongsheng Li

Vector graphics (VG) are ubiquitous in daily life, with vast applications in engineering, architecture, design, etc. Most existing methods recognize VG by first rendering it into raster graphics (RG) and then conducting recognition on the RG format. However, this procedure discards the structure of the geometry and loses the high resolution of VG. Recently, another category of algorithms has been proposed to recognize directly from the original VG format, but it is affected by topological errors that RG rendering would filter out. Instead of relying on one format alone, a better solution is to use VG and RG together to avoid both shortcomings. Moreover, we argue that the VG-to-RG rendering process is essential for effectively combining VG and RG information: by specifying the rules for transferring VG primitives to RG pixels, the rendering process captures the interaction and correlation between VG and RG. We therefore propose RendNet, a unified architecture for recognition in both 2D and 3D scenarios, which considers both VG and RG representations and exploits their interaction by incorporating the VG-to-RG rasterization process. Experiments show that RendNet achieves state-of-the-art performance on 2D and 3D object recognition tasks on various VG datasets.
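The primitive-to-pixel correspondence that rendering makes explicit can be pictured with a toy rasterizer for one primitive class (line segments) that records which primitive produced each pixel. RendNet performs this coupling in latent space; the sketch below is a hypothetical pixel-space analogue only.

```python
def rasterize_segments(segments, width, height):
    """Naive DDA rasterization of line segments onto a pixel grid.

    Returns a dict mapping pixel -> set of segment indices, i.e. the
    primitive-to-pixel correspondence a VG/RG fusion model could exploit.
    """
    grid = {}
    for idx, (x0, y0, x1, y1) in enumerate(segments):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for s in range(steps + 1):
            t = s / steps
            px = round(x0 + t * (x1 - x0))
            py = round(y0 + t * (y1 - y0))
            if 0 <= px < width and 0 <= py < height:
                grid.setdefault((px, py), set()).add(idx)
    return grid
```

The returned mapping is the key point: every RG pixel knows which VG primitives it came from, so information can flow in both directions.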

* CVPR 2022 Oral 

Finding Global Homophily in Graph Neural Networks When Meeting Heterophily

May 15, 2022
Xiang Li, Renyu Zhu, Yao Cheng, Caihua Shan, Siqiang Luo, Dongsheng Li, Weining Qian

We investigate graph neural networks on graphs with heterophily. Some existing methods amplify a node's neighborhood with multi-hop neighbors to include more nodes with homophily. However, setting personalized neighborhood sizes for different nodes is a significant challenge. Further, homophilous nodes excluded from the neighborhood are ignored during information aggregation. To address these problems, we propose two models, GloGNN and GloGNN++, which generate a node's embedding by aggregating information from global nodes in the graph. In each layer, both models learn a coefficient matrix that captures the correlations between nodes, based on which neighborhood aggregation is performed. The coefficient matrix allows signed values and is derived from an optimization problem with a closed-form solution. We further accelerate neighborhood aggregation and achieve linear time complexity. We theoretically explain the models' effectiveness by proving that both the coefficient matrix and the generated node embedding matrix have the desired grouping effect. We conduct extensive experiments comparing our models against 11 competitors on 15 benchmark datasets covering a wide range of domains, scales and degrees of heterophily. Experimental results show that our methods achieve superior performance and are also very efficient.
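A closed-form, signed coefficient matrix for global aggregation can be illustrated with a generic self-expressive formulation: minimizing ||H - CH||_F^2 + lam ||C||_F^2 over C yields C = H H^T (H H^T + lam I)^(-1), a dense matrix that mixes every node with every other node. GloGNN's actual objective and its linear-time acceleration differ; this is a hypothetical simplification for intuition.

```python
import numpy as np

def global_coefficient_matrix(H, lam=1.0):
    """Signed, dense node-to-node coefficients from a ridge-style
    self-expression objective  min_C ||H - C H||_F^2 + lam ||C||_F^2,
    whose closed-form solution is  C = H H^T (H H^T + lam I)^{-1}.
    """
    G = H @ H.T
    return G @ np.linalg.inv(G + lam * np.eye(G.shape[0]))

def global_aggregate(H, lam=1.0):
    """One layer of aggregation over *all* nodes, not just neighbors."""
    return global_coefficient_matrix(H, lam) @ H
```

Note the entries of C may be negative, which lets distant heterophilous nodes contribute with the opposite sign instead of being ignored.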

* To appear in ICML 2022 

CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis

Mar 30, 2022
Shifu Yan, Caihua Shan, Wenyi Yang, Bixiong Xu, Dongsheng Li, Lili Qiu, Jie Tong, Qi Zhang

In large-scale online services, crucial metrics, a.k.a. key performance indicators (KPIs), are monitored periodically to check their running status. Generally, KPIs are aggregated along multiple dimensions and derived by complex calculations over fundamental metrics from the raw data. Once abnormal KPI values are observed, root cause analysis (RCA) can be applied to identify the reasons for the anomalies, enabling quick troubleshooting. Recently, several automatic RCA techniques were proposed to localize the related dimensions (or combinations of dimensions) that explain the anomalies. However, their analyses are limited to the data of the abnormal metric and ignore the data of other metrics that may also be related to the anomalies, leading to imprecise or even incorrect root causes. To this end, we propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components: 1) relationship modeling, which uses a graph neural network (GNN) to model, from historical data, the unknown complex calculations among metrics and the aggregation functions among dimensions; 2) root cause localization, which adopts a genetic algorithm to efficiently and effectively dive into the raw data and localize the abnormal dimension(s) once KPI anomalies are detected. Experiments on synthetic datasets, public datasets and an online production environment demonstrate the superiority of CMMD over baselines. CMMD is currently running as an online service in Microsoft Azure.
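The localization step can be pictured as a genetic search over subsets of candidate dimension values, scored by how well each subset explains the anomaly. The sketch below is a generic, hypothetical GA; the candidate names and the `fitness` interface are made up, and CMMD's actual encoding and operators are not reproduced.

```python
import random

def genetic_search(candidates, fitness, pop_size=20, generations=30,
                   mut_rate=0.1, seed=0):
    """Toy genetic algorithm over subsets of candidate dimension values.

    Individuals are bitmasks over `candidates`; `fitness(subset)` scores
    how well a subset explains the KPI anomaly (hypothetical interface).
    """
    rng = random.Random(seed)
    n = len(candidates)
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]

    def decode(bits):
        return [c for c, b in zip(candidates, bits) if b]

    for _ in range(generations):
        # Elitism: keep the better half, refill with mutated crossovers.
        pop.sort(key=lambda b: fitness(decode(b)), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n) if n > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < mut_rate) for bit in child]
            children.append(child)
        pop = survivors + children
    best = max(pop, key=lambda b: fitness(decode(b)))
    return decode(best)
```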


Recognizing Vector Graphics without Rasterization

Nov 12, 2021
Xinyang Jiang, Lu Liu, Caihua Shan, Yifei Shen, Xuanyi Dong, Dongsheng Li

In this paper, we consider a different data format for images: vector graphics. In contrast to raster graphics, which are widely used in image recognition, vector graphics can be scaled up or down to any resolution without aliasing or information loss, thanks to the analytic representation of the primitives in the document. Furthermore, vector graphics provide extra structural information on how low-level elements group together to form high-level shapes or structures. These merits of vector graphics have not been fully leveraged by existing methods. To explore this data format, we target the fundamental recognition tasks of object localization and classification. We propose an efficient, CNN-free pipeline that does not render the graphic into pixels (i.e., rasterization) and instead takes the textual document of the vector graphics as input, called YOLaT (You Only Look at Text). YOLaT builds multi-graphs to model the structural and spatial information in vector graphics, and a dual-stream graph neural network is proposed to detect objects from the graph. Our experiments show that, by directly operating on vector graphics, YOLaT outperforms raster-graphics-based object detection baselines in terms of both average precision and efficiency.
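The graph-building idea can be pictured as turning primitives directly into graph structure: endpoints become nodes (merged when they coincide) and each primitive contributes an edge. This is a hypothetical simplification of YOLaT's multi-graphs, which additionally handle curve control points and spatial relations.

```python
import math

def build_primitive_graph(segments, eps=1e-6):
    """Build a graph directly from vector primitives, without rasterizing.

    Each segment (x0, y0, x1, y1) contributes one edge; coincident
    endpoints are merged into a single node, so connectivity mirrors
    the drawing's structure.
    """
    nodes = []  # list of (x, y) coordinates
    edges = []

    def node_id(p):
        for i, q in enumerate(nodes):
            if math.dist(p, q) < eps:  # merge coincident endpoints
                return i
        nodes.append(p)
        return len(nodes) - 1

    for (x0, y0, x1, y1) in segments:
        edges.append((node_id((x0, y0)), node_id((x1, y1))))
    return nodes, edges
```

Because the graph is built from exact coordinates, no resolution is ever fixed, which is precisely the property rasterization destroys.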

* Accepted by NeurIPS2021 