Xiaochen Wang

Mitigating Pooling Bias in E-commerce Search via False Negative Estimation

Nov 18, 2023
Xiaochen Wang, Xiao Xiao, Ruhan Zhang, Xuan Zhang, Taesik Na, Tejaswi Tenneti, Haixun Wang, Fenglong Ma

Efficient and accurate product relevance assessment is critical for user experience and business success. Training a proficient relevance assessment model requires high-quality query-product pairs, often obtained through negative sampling strategies. Unfortunately, current methods introduce pooling bias by mistakenly sampling false negatives, diminishing performance and business impact. To address this, we present Bias-mitigating Hard Negative Sampling (BHNS), a novel negative sampling strategy tailored to identify and adjust for false negatives, building upon our original False Negative Estimation algorithm. Our experiments in the Instacart search setting confirm BHNS as effective for practical e-commerce use. Furthermore, comparative analyses on public datasets showcase its domain-agnostic potential for diverse applications.
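
The abstract does not spell out the BHNS algorithm itself; as an illustrative sketch only, the toy function below shows the general idea of estimating which sampled "negatives" are likely false negatives and excluding them before hard-negative sampling. The scoring heuristic, threshold, and function names here are hypothetical, not the paper's actual method:

```python
import random

def sample_hard_negatives(query_pos_score, candidates, k, fn_threshold=0.9, seed=0):
    """Sample hard negatives for one query while filtering likely false negatives.

    candidates: list of (product_id, model_score) pairs scored against the query.
    A candidate whose score is close to the known positive's score is treated
    as a likely false negative (a relevant product mislabeled as negative).
    """
    rng = random.Random(seed)

    # Heuristic false-negative probability: the closer a candidate scores to
    # the known positive, the more likely it is actually relevant.
    def fn_prob(score):
        return max(0.0, 1.0 - abs(query_pos_score - score) / query_pos_score)

    # Keep candidates that are unlikely to be false negatives.
    safe = [(pid, s) for pid, s in candidates if fn_prob(s) < fn_threshold]

    # Prefer harder negatives: sample proportionally to the model score.
    total = sum(s for _, s in safe)
    weights = [s / total for _, s in safe]
    return rng.choices([pid for pid, _ in safe], weights=weights, k=k)
```

For example, a candidate scoring as high as the positive itself would be filtered out rather than used as a (probably false) negative.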

* Submitted to WWW'24 Industry Track 

Hierarchical Pretraining on Multimodal Electronic Health Records

Oct 20, 2023
Xiaochen Wang, Junyu Luo, Jiaqi Wang, Ziyi Yin, Suhan Cui, Yuan Zhong, Yaqing Wang, Fenglong Ma

Pretraining has proven to be a powerful technique in natural language processing (NLP), exhibiting remarkable success on various NLP downstream tasks. However, in the medical domain, existing pretrained models on electronic health records (EHR) fail to capture the hierarchical nature of EHR data, limiting the generalization of a single pretrained model across diverse downstream tasks. To tackle this challenge, this paper introduces a novel, general, and unified pretraining framework called MEDHMP, specifically designed for hierarchical, multimodal EHR data. The effectiveness of the proposed MEDHMP is demonstrated through experimental results on eight downstream tasks spanning three levels. Comparisons against eighteen baselines further highlight the efficacy of our approach.

* Accepted by EMNLP 2023 

MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation

Oct 05, 2023
Yuan Zhong, Suhan Cui, Jiaqi Wang, Xiaochen Wang, Ziyi Yin, Yaqing Wang, Houping Xiao, Mengdi Huai, Ting Wang, Fenglong Ma

Health risk prediction is one of the fundamental predictive-modeling tasks in the medical domain, which aims to forecast the potential health risks that patients may face in the future using their historical Electronic Health Records (EHR). Researchers have developed several risk prediction models to handle the unique challenges of EHR data, such as its sequential nature, high dimensionality, and inherent noise. These models have yielded impressive results. Nonetheless, a key issue undermining their effectiveness is data insufficiency. A variety of data generation and augmentation methods have been introduced to mitigate this issue by expanding the size of the training data set through the learning of underlying data distributions. However, the performance of these methods is often limited due to their task-unrelated design. To address these shortcomings, this paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge the sample space. Furthermore, MedDiffusion discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data. Experimental evaluation on four real-world medical datasets demonstrates that MedDiffusion outperforms 14 cutting-edge baselines in terms of PR-AUC, F1, and Cohen's Kappa. We also conduct ablation studies and benchmark our model against GAN-based alternatives to further validate the rationality and adaptability of our model design. Additionally, we analyze generated data to offer fresh insights into the model's interpretability.
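
The abstract only names the step-wise attention over patient visits; as a minimal sketch of that general idea (not MedDiffusion's actual architecture, whose details are not given here), the function below computes softmax attention weights over a sequence of visit embeddings, using the most recent visit as the query:

```python
import math

def stepwise_attention(visits):
    """Attend over a patient's visit embeddings so that downstream generation
    can focus on the most informative visits.

    visits: list of equal-length visit embedding vectors (list[float]).
    Returns (context_vector, attention_weights), using plain dot-product
    attention with the most recent visit as the query.
    """
    query = visits[-1]
    scores = [sum(q * v for q, v in zip(query, visit)) for visit in visits]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(query)
    context = [sum(w * visit[i] for w, visit in zip(weights, visits))
               for i in range(dim)]
    return context, weights
```

The attention weights sum to one, and visits most similar to the current one contribute most to the context vector.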

Abnormal Event Detection via Hypergraph Contrastive Learning

Apr 02, 2023
Bo Yan, Cheng Yang, Chuan Shi, Jiawei Liu, Xiaochen Wang

Abnormal event detection, which refers to mining unusual interactions among involved entities, plays an important role in many real-world applications. Previous works mostly over-simplify this task as detecting abnormal pair-wise interactions. However, real-world events may contain multi-typed attributed entities and complex interactions among them, which form an Attributed Heterogeneous Information Network (AHIN). With the boom of social networks, abnormal event detection in AHIN has become an important but seldom explored task. In this paper, we are the first to study the unsupervised abnormal event detection problem in AHIN. Events are treated as star-schema instances of the AHIN and are further modeled by hypergraphs. A novel hypergraph contrastive learning method, named AEHCL, is proposed to fully capture abnormal event patterns. AEHCL designs intra-event and inter-event contrastive modules to exploit self-supervised AHIN information. The intra-event contrastive module captures pair-wise and multivariate interaction anomalies within an event, while the inter-event module captures contextual anomalies among events. These two modules collaboratively boost each other's performance and improve the detection results. During the testing phase, a contrastive learning-based abnormal event score function is further proposed to measure the degree of abnormality of events. Extensive experiments on three datasets in different scenarios demonstrate the effectiveness of AEHCL, with results improving over state-of-the-art baselines by up to 12.0% in Average Precision (AP) and 4.6% in Area Under the Curve (AUC), respectively.
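
The abstract mentions an abnormal event score function without defining it; as a hedged, simplified sketch of the intra-event intuition only (not AEHCL's actual score), the function below scores an event by how poorly its entity embeddings agree pairwise, on the assumption that entities in a normal event embed similarly:

```python
import math

def abnormality_score(entity_embs):
    """Score one event's abnormality from the pairwise disagreement of its
    entity embeddings: low average cosine similarity among the entities
    involved in the event yields a high abnormality score.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    n = len(entity_embs)
    sims = [cos(entity_embs[i], entity_embs[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - sum(sims) / len(sims)
```

A coherent event (all entities embedded nearby) scores near zero, while an event mixing unrelated entities scores high.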

Statistical Dataset Evaluation: Reliability, Difficulty, and Validity

Dec 19, 2022
Chengwen Wang, Qingxiu Dong, Xiaochen Wang, Haitao Wang, Zhifang Sui

Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic framework for automatic dataset quality evaluation. Drawing on classical testing theory, we examine the statistical properties of datasets along three fundamental dimensions: reliability, difficulty, and validity. Taking Named Entity Recognition (NER) datasets as a case study, we introduce nine statistical metrics for the evaluation framework. Experimental results and human evaluation validate that our framework effectively assesses various aspects of dataset quality. Furthermore, we study how dataset scores on our statistical metrics affect model performance, and advocate dataset quality evaluation or targeted dataset improvement before training or testing models.
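
The nine metrics themselves are not listed in the abstract; purely to illustrate the flavor of such statistical dataset measurements on NER data, the toy functions below compute two simple, generic statistics (these are illustrative stand-ins, not the paper's metrics):

```python
import math
from collections import Counter

def label_entropy(sentences):
    """Entropy of the tag distribution, a simple difficulty-style statistic:
    a more uniform tag distribution is generally harder to learn.
    sentences: list of lists of (token, tag) pairs.
    """
    counts = Counter(tag for sent in sentences for _, tag in sent)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entity_density(sentences):
    """Fraction of tokens carrying a non-O tag, a coverage-style statistic."""
    tags = [tag for sent in sentences for _, tag in sent]
    return sum(1 for t in tags if t != "O") / len(tags)
```

Such model-agnostic numbers can be computed before any training, which is the core appeal of statistical dataset evaluation.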

BoXHED 2.0: Scalable boosting of functional data in survival analysis

Mar 23, 2021
Arash Pakbin, Xiaochen Wang, Bobak J. Mortazavi, Donald K. K. Lee

Modern applications of survival analysis increasingly involve time-dependent covariates, which constitute a form of functional data. Learning from functional data generally involves repeated evaluation of time integrals, which is numerically expensive. In this work we propose a lightweight data preprocessing step that transforms functional data into nonfunctional data. Boosting implementations for nonfunctional data can then be used, whereby the required numerical integration comes for free as part of the training phase. We use this to develop BoXHED 2.0, a quantum leap over the tree-boosted hazard package BoXHED 1.0. BoXHED 2.0 extends BoXHED 1.0 to Aalen's multiplicative intensity model, which covers censoring schemes far beyond right-censoring and also supports recurrent-events data. It is also massively scalable, owing both to the preprocessing step and to its reuse of core components of XGBoost. BoXHED 2.0 supports GPUs and multicore CPUs, and is available from GitHub: www.github.com/BoXHED.
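
The preprocessing idea is the concrete part of this abstract: a subject's time-dependent covariate trajectory can be split at measurement times into piecewise-constant tabular rows, so time integrals reduce to sums over row durations. The sketch below illustrates that transformation in a minimal form; the row schema and function name are hypothetical, not BoXHED 2.0's actual interface:

```python
def flatten_trajectory(subject_id, times, covariates, event_time, event_happened):
    """Split one subject's time-dependent covariate trajectory into
    piecewise-constant (start, stop, covariate, event) rows.

    times: sorted measurement times, all before event_time.
    covariates: covariate value recorded at each measurement time.
    Only the final row can carry the event indicator.
    """
    rows = []
    for i, (t, x) in enumerate(zip(times, covariates)):
        last = (i + 1 == len(times))
        stop = event_time if last else times[i + 1]
        rows.append({"id": subject_id, "start": t, "stop": stop, "x": x,
                     "event": int(event_happened and last)})
    return rows
```

After this step, an ordinary tabular boosting routine can be trained on the rows, with each row weighted by its duration where an integral over time is needed.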

* 9 pages, 2 tables, 2 figures 

BoXHED: Boosted eXact Hazard Estimator with Dynamic covariates

Jun 26, 2020
Xiaochen Wang, Arash Pakbin, Bobak J. Mortazavi, Hongyu Zhao, Donald K. K. Lee

The proliferation of medical monitoring devices makes it possible to track health vitals at high frequency, enabling the development of dynamic health risk scores that change with the underlying readings. Survival analysis, in particular hazard estimation, is well-suited to analyzing this stream of data to predict disease onset as a function of the time-varying vitals. This paper introduces the software package BoXHED (pronounced 'box-head') for nonparametrically estimating hazard functions via gradient boosting. BoXHED 1.0 is a novel tree-based implementation of the generic estimator proposed in Lee, Chen, Ishwaran (2017), which was designed for handling time-dependent covariates in a fully nonparametric manner. BoXHED is also the first publicly available software implementation for Lee, Chen, Ishwaran (2017). Applying BoXHED to cardiovascular disease onset data from the Framingham Heart Study reveals novel interaction effects among known risk factors, potentially resolving an open question in the clinical literature.

* 10 pages, 3 figures, 5 tables 

Fast Top-k Area Topics Extraction with Knowledge Base

Dec 04, 2017
Fang Zhang, Xiaochen Wang, Jingfei Han, Jie Tang, Shiyin Wang, Marie-Francine Moens

What are the most popular research topics in Artificial Intelligence (AI)? We formulate the problem as extracting the top-$k$ topics that can best represent a given area with the help of a knowledge base. We theoretically prove that the problem is NP-hard and propose an optimization model, FastKATE, that addresses it by combining both explicit and latent representations for each topic. We leverage a large-scale knowledge base (Wikipedia) to generate topic embeddings using neural networks and use these representations to help capture the representativeness of topics for given areas. We develop a fast heuristic algorithm to efficiently solve the problem with a provable error bound. We evaluate the proposed model on three real-world datasets. Experimental results demonstrate our model's effectiveness, robustness, real-time performance (returning results in $<1$s), and its superiority over several alternative methods.
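
As a greatly simplified sketch of the underlying retrieval step (FastKATE itself combines explicit and latent representations and uses a heuristic with an error bound, none of which is reproduced here), the function below ranks candidate topics by cosine similarity of their embeddings to an area embedding and returns the $k$ most representative ones:

```python
import heapq
import math

def top_k_topics(area_emb, topic_embs, k):
    """Return the k topics whose embeddings are most similar (by cosine
    similarity) to the given area embedding.

    topic_embs: dict mapping topic name -> embedding vector.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # heapq.nlargest keeps the ranking pass at O(n log k).
    return heapq.nlargest(k, topic_embs, key=lambda t: cos(area_emb, topic_embs[t]))
```

With precomputed embeddings, this kind of similarity lookup is what makes sub-second responses feasible at query time.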
