"Recommendation": models, code, and papers

Net benefit, calibration, threshold selection, and training objectives for algorithmic fairness in healthcare

Feb 03, 2022
Stephen R. Pfohl, Yizhe Xu, Agata Foryciarz, Nikolaos Ignatiadis, Julian Genkins, Nigam H. Shah

A growing body of work uses the paradigm of algorithmic fairness to frame the development of techniques to anticipate and proactively mitigate the introduction or exacerbation of health inequities that may follow from the use of model-guided decision-making. We evaluate the interplay between measures of model performance, fairness, and the expected utility of decision-making to offer practical recommendations for the operationalization of algorithmic fairness principles for the development and evaluation of predictive models in healthcare. We conduct an empirical case-study via development of models to estimate the ten-year risk of atherosclerotic cardiovascular disease to inform statin initiation in accordance with clinical practice guidelines. We demonstrate that approaches that incorporate fairness considerations into the model training objective typically do not improve model performance or confer greater net benefit for any of the studied patient populations compared to the use of standard learning paradigms followed by threshold selection concordant with patient preferences, evidence of intervention effectiveness, and model calibration. These results hold when the measured outcomes are not subject to differential measurement error across patient populations and threshold selection is unconstrained, regardless of whether differences in model performance metrics, such as in true and false positive error rates, are present. In closing, we argue for focusing model development efforts on developing calibrated models that predict outcomes well for all patient populations while emphasizing that such efforts are complementary to transparent reporting, participatory design, and reasoning about the impact of model-informed interventions in context.
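The interplay between threshold selection and net benefit described in the abstract can be illustrated with a minimal sketch. It assumes the standard decision-curve definition of net benefit, NB(t) = TP/n - (FP/n) * t/(1 - t), which may differ from the exact utility formulation used in the paper. The function names, the toy data, and the 10% threshold are hypothetical and are not taken from the paper.

```python
import numpy as np

def net_benefit(y_true, risk, threshold):
    """Decision-curve net benefit of treating everyone with risk >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    treat = np.asarray(risk, dtype=float) >= threshold
    n = y_true.size
    tp = np.sum(treat & y_true)    # treated patients who have the outcome
    fp = np.sum(treat & ~y_true)   # treated patients who do not
    # False positives are weighted by the odds implied by the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)

def net_benefit_by_group(y_true, risk, group, threshold):
    """Net benefit computed separately within each patient subgroup."""
    y_true, risk, group = (np.asarray(a) for a in (y_true, risk, group))
    return {
        str(g): net_benefit(y_true[group == g], risk[group == g], threshold)
        for g in np.unique(group)
    }

# Toy data (assumption): outcomes, predicted risks, and a group label.
y = [1, 0, 1, 0, 0, 1, 0, 0]
p = [0.30, 0.05, 0.12, 0.09, 0.20, 0.04, 0.02, 0.08]
g = ["a", "a", "a", "a", "b", "b", "b", "b"]

overall = net_benefit(y, p, threshold=0.10)
per_group = net_benefit_by_group(y, p, g, threshold=0.10)
print(overall, per_group)
```

Reporting the per-group values next to the overall value is one way to check whether a single threshold serves every subgroup, provided the model is calibrated within each of them.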

Differential Privacy for Credit Risk Model

Jun 24, 2021
Tabish Maniar, Alekhya Akkinepally, Anantha Sharma

The use of machine learning algorithms to model user behavior and drive business decisions has become increasingly commonplace, in applications ranging from intelligent recommendations to automated decision making. This has led to an increase in the use of customers' personal data to analyze customer behavior and predict their interest in a company's products. Increased use of this personal data can lead to better models, but it also raises the potential for customer data to be leaked, reverse engineered, or mishandled. In this paper, we assess differential privacy as a solution to these privacy problems by building privacy protections into the data engineering and model training stages of predictive model development. Our interest is a pragmatic implementation in an operational environment, which necessitates a general-purpose differentially private modeling framework, and we evaluate one such tool from LeapYear as applied to the credit risk modeling domain. Credit risk modeling is a major modeling methodology in banking and finance, in which user data is analyzed to determine the total Expected Loss to the bank. We examine the application of differential privacy to the credit risk model and compare the performance of a differentially private model against that of a non-differentially private model.

* 7 pages, 3 figures, 2 tables 
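As a generic illustration of what building privacy protections into the data engineering stage can mean, the sketch below releases a differentially private mean using the Laplace mechanism. This is a textbook mechanism and not the LeapYear tool evaluated in the paper; the function name `dp_mean`, the clipping bounds, and the toy loss values are all assumptions made for illustration.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """Epsilon-DP estimate of a mean via the Laplace mechanism.

    Values are clipped to [lower, upper] so that replacing one record
    changes the mean by at most (upper - lower) / n (the sensitivity).
    The record count n is treated as public.
    """
    x = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / x.size
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return x.mean() + noise

rng = np.random.default_rng(0)
losses = [120.0, 0.0, 45.5, 980.0, 10.0, 0.0, 310.0, 75.0]  # toy per-account losses

true_mean = float(np.mean(np.clip(losses, 0.0, 1000.0)))
private_mean = dp_mean(losses, lower=0.0, upper=1000.0, epsilon=1.0, rng=rng)
print(true_mean, private_mean)
```

Smaller values of `epsilon` give stronger privacy and noisier answers, which is the accuracy trade-off that a comparison between private and non-private models measures.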

Enabling Efficiency-Precision Trade-offs for Label Trees in Extreme Classification

Jun 01, 2021
Tavor Z. Baharav, Daniel L. Jiang, Kedarnath Kolluri, Sujay Sanghavi, Inderjit S. Dhillon

Extreme multi-label classification (XMC) aims to learn a model that can tag data points with a subset of relevant labels from an extremely large label set. Real world e-commerce applications like personalized recommendations and product advertising can be formulated as XMC problems, where the objective is to predict for a user a small subset of items from a catalog of several million products. For such applications, a common approach is to organize these labels into a tree, enabling training and inference times that are logarithmic in the number of labels. While training a model once a label tree is available is well studied, designing the structure of the tree is a difficult task that is not yet well understood, and can dramatically impact both model latency and statistical performance. Existing approaches to tree construction fall at an extreme point, either optimizing exclusively for statistical performance, or for latency. We propose an efficient information theory inspired algorithm to construct intermediary operating points that trade off between the benefits of both. Our algorithm enables interpolation between these objectives, which was not previously possible. We corroborate our theoretical analysis with numerical results, showing that on the Wiki-500K benchmark dataset our method can reduce a proxy for expected latency by up to 28% while maintaining the same accuracy as Parabel. On several datasets derived from e-commerce customer logs, our modified label tree is able to improve this expected latency metric by up to 20% while maintaining the same accuracy. Finally, we discuss challenges in realizing these latency improvements in deployed models.

Kernel Dependence Regularizers and Gaussian Processes with Applications to Algorithmic Fairness

Nov 11, 2019
Zhu Li, Adrian Perez-Suay, Gustau Camps-Valls, Dino Sejdinovic

Current adoption of machine learning in industrial, societal and economic activities has raised concerns about the fairness, equity and ethics of automated decisions. Predictive models are often developed using biased datasets and thus retain or even exacerbate biases in their decisions and recommendations. Removing the sensitive covariates, such as gender or race, is insufficient to remedy this issue since the biases may be retained due to other related covariates. We present a regularization approach to this problem that trades off predictive accuracy of the learned models (with respect to biased labels) for fairness in terms of statistical parity, i.e. independence of the decisions from the sensitive covariates. In particular, we consider a general framework of regularized empirical risk minimization over reproducing kernel Hilbert spaces and impose an additional regularizer of dependence between predictors and sensitive covariates using kernel-based measures of dependence, namely the Hilbert-Schmidt Independence Criterion (HSIC) and its normalized version. This approach leads to a closed-form solution in the case of squared loss, i.e. ridge regression. Moreover, we show that the dependence regularizer has an interpretation as modifying the corresponding Gaussian process (GP) prior. As a consequence, a GP model with a prior that encourages fairness to sensitive variables can be derived, allowing principled hyperparameter selection and the study of the relative relevance of covariates under fairness constraints. Experimental results on synthetic examples and on real problems of income and crime prediction illustrate the potential of the approach to improve the fairness of automated decisions.
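The dependence measure named in the abstract can be sketched as follows. This is a minimal example of the commonly used biased empirical HSIC estimator, tr(KHLH) / (n - 1)^2, with Gaussian (RBF) kernels; it computes only the penalty term and does not reproduce the paper's regularized regression or its GP interpretation. The kernel choice, the lengthscale, and the synthetic data are assumptions.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    """Gaussian kernel matrix for a 1-D or 2-D array of samples."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def hsic(f, s, lengthscale=1.0):
    """Biased empirical HSIC between predictions f and sensitive covariate s."""
    n = len(f)
    K = rbf_kernel(f, lengthscale)
    L = rbf_kernel(s, lengthscale)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=200)                     # binary sensitive covariate
f_dependent = s + 0.1 * rng.standard_normal(200)     # predictions that track s
f_independent = rng.standard_normal(200)             # predictions unrelated to s

hsic_dep = hsic(f_dependent, s)
hsic_ind = hsic(f_independent, s)
print(hsic_dep, hsic_ind)
```

A regularizer of this kind adds a multiple of the HSIC value to the training loss, so predictions that carry information about `s` are penalized more heavily.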

Graph Neural Networks for User Identity Linkage

Mar 06, 2019
Wen Zhang, Kai Shu, Huan Liu, Yalin Wang

The increasing popularity and diversity of social media sites has encouraged more and more people to participate in multiple online social networks to enjoy their services. Each user may create a user identity to represent his or her unique public figure in every social network. User identity linkage across online social networks is an emerging task that has attracted increasing attention and could potentially impact various domains such as recommendations and link predictions. The majority of existing work focuses on mining network proximity or user profile data to discover user identity linkages. Recent advancements in graph neural networks (GNNs) provide great potential to advance user identity linkage, since users are connected in social graphs and learning latent factors of users and items is the key. However, predicting user identity linkages based on GNNs faces challenges. For example, user social graphs encode both local structure, such as users' neighborhood signals, and global structure with community properties. To address these challenges simultaneously, in this paper we present a novel graph neural network framework for user identity linkage. In particular, we provide a principled approach to jointly capture local and global information in the user-user social graph and propose a framework that jointly learns user representations for user identity linkage. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework.

* 7 pages, 3 figures 

Localized Algorithm of Community Detection on Large-Scale Decentralized Social Networks

Dec 27, 2012
Pili Hu, Wing Cheong Lau

Despite the overwhelming success of the existing Social Networking Services (SNS), their centralized ownership and control have led to serious concerns about user privacy, censorship vulnerability and operational robustness of these services. To overcome these limitations, Distributed Social Networks (DSN) have recently been proposed and implemented. Under these new DSN architectures, no single party possesses full knowledge of the entire social network. While this approach solves the above problems, the lack of global knowledge at the DSN nodes makes it much more challenging to support some common but critical SNS services like friend discovery and community detection. In this paper, we tackle the problem of community detection for a given user under the constraint of limited local topology information as imposed by common DSN architectures. By considering the Personalized PageRank (PPR) approach as an ink spilling process, we justify its applicability for decentralized community detection using limited local topology information. Our proposed PPR-based solution has a wide range of applications such as friend recommendation, targeted advertisement, automated social relationship labeling and sybil defense. Using data collected from a large-scale SNS in practice, we demonstrate that our adapted version of PPR can significantly outperform the basic PageRank (PR) as well as two other commonly used heuristics. The inclusion of a few manually labeled friends in the Escape Vector (EV) can boost the performance considerably (64.97% relative improvement in terms of Area Under the ROC Curve (AUC)).
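The "ink spilling" view of Personalized PageRank can be sketched with a plain power iteration: at each step a fraction of the ink returns to the seed user and the rest spreads evenly to neighbors. This is the basic PPR recurrence only; it iterates over the whole toy graph and does not implement the paper's adapted version, its Escape Vector, or its handling of limited local topology. The graph, the restart probability `alpha`, and the function name are assumptions.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Basic Personalized PageRank by power iteration.

    adj maps each node to a list of its neighbors; alpha is the
    probability of restarting at the seed node on each step.
    """
    nodes = list(adj)
    rank = {v: 0.0 for v in nodes}
    rank[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        nxt[seed] += alpha                       # ink returned to the seed
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:                         # dangling node: send its ink to the seed
                nxt[seed] += (1.0 - alpha) * rank[u]
                continue
            share = (1.0 - alpha) * rank[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share                  # ink spilled to each neighbor
        rank = nxt
    return rank

# Toy graph (assumption): two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
scores = personalized_pagerank(graph, seed=0)
community = sorted(graph, key=scores.get, reverse=True)[:3]
print(scores, community)
```

Ranking nodes by score and keeping the top few gives a simple candidate community around the seed user.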

Two heads are better than one: Enhancing medical representations by pre-training over structured and unstructured electronic health records

Jan 25, 2022
Sicen Liu, Xiaolong Wang, Yongshuai Hou, Ge Li, Hui Wang, Hui Xu, Yang Xiang, Buzhou Tang

The massive context of electronic health records (EHRs) has created enormous potential for improving healthcare, and structured (coded) data and unstructured (text) data are two important textual modalities within them. They do not exist in isolation and can complement each other in most real-life clinical scenarios. Most existing research in medical informatics, however, either focuses on only one modality or straightforwardly concatenates the information from different modalities, which ignores the interaction and information sharing between them. To address these issues, we propose a unified deep learning-based medical pre-trained language model, named UMM-PLM, to automatically learn representative features from multimodal EHRs that consist of both structured and unstructured data. Specifically, we first developed parallel unimodal information representation modules to capture the characteristics specific to each modality, where unimodal representations were learned from each data source separately. A cross-modal module was further introduced to model the interactions between different modalities. We pre-trained the model on a large EHR dataset containing both structured and unstructured data and verified its effectiveness on three downstream clinical tasks, i.e., medication recommendation, 30-day readmission and ICD coding, through extensive experiments. The results demonstrate the power of UMM-PLM compared with benchmark methods and state-of-the-art baselines. Analyses show that UMM-PLM can effectively handle multimodal textual information and has the potential to provide more comprehensive interpretations for clinical decision making.

* 31 pages, 5 figures 

Outlier Detection using AI: A Survey

Dec 01, 2021
Md Nazmul Kabir Sikder, Feras A. Batarseh

An outlier is an event or observation that is defined as an unusual activity, intrusion, or a suspicious data point that lies at an irregular distance from a population. The definition of an outlier event, however, is subjective and depends on the application and the domain (Energy, Health, Wireless Network, etc.). It is important to detect outlier events as carefully as possible to avoid infrastructure failures, because anomalous events can cause minor to severe damage to infrastructure. For instance, an attack on a cyber-physical system such as a microgrid may initiate voltage or frequency instability, thereby damaging a smart inverter, which involves very expensive repairs. Unusual activities in microgrids can be mechanical faults, behavior changes in the system, human or instrument errors, or a malicious attack. Accordingly, and due to its variability, Outlier Detection (OD) is an ever-growing research field. In this chapter, we discuss the progress of OD methods using AI techniques. To that end, the fundamental concepts of each OD model are introduced via multiple categories. A broad range of OD methods is grouped into six major categories: Statistical-based, Distance-based, Density-based, Clustering-based, Learning-based, and Ensemble methods. For every category, we discuss recent state-of-the-art approaches, their application areas, and performances. After that, a brief discussion of the advantages, disadvantages, and challenges of each technique is provided, with recommendations on future research directions. This survey aims to guide the reader to better understand recent progress of OD methods for the assurance of AI.

* Chapter 7 in book: AI Assurance, by Elsevier Academic Press. Edited by: Feras A. Batarseh and Laura Freeman Publication year: 2022 

Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression

Aug 31, 2021
Nathan TeBlunthuis

Organizing complex peer production projects and advancing scientific knowledge of open collaboration each depend on the ability to measure quality. Article quality ratings on English language Wikipedia have been widely used by both Wikipedia community members and academic researchers for purposes like tracking knowledge gaps and studying how political polarization shapes collaboration. Even so, measuring quality presents many methodological challenges. The most widely used systems use labels on discrete ordinal scales when assessing quality, but such labels can be inconvenient for statistics and machine learning. Prior work handles this by assuming that different levels of quality are "evenly spaced" from one another. This assumption runs counter to intuitions about the relative degrees of effort needed to raise Wikipedia encyclopedia articles to different quality levels. Furthermore, models from prior work are fit to datasets that oversample high-quality articles. This limits their accuracy for representative samples of articles or revisions. I describe a technique extending the Wikimedia Foundation's ORES article quality model to address these limitations. My method uses weighted ordinal regression models to construct one-dimensional continuous measures of quality. While scores from my technique and from prior approaches are correlated, my approach improves accuracy for research datasets and provides evidence that the "evenly spaced" assumption is unfounded in practice on English Wikipedia. I conclude with recommendations for using quality scores in future research and include the full code, data, and models.

* 15 pages, 4 figures, Accepted to OpenSym 2021 
