Aparna Balagopalan

The Role of Relevance in Fair Ranking

May 09, 2023
Aparna Balagopalan, Abigail Z. Jacobs, Asia Biega

Online platforms mediate access to opportunity: relevance-based rankings create and constrain options by allocating exposure to job openings and job candidates in hiring platforms, or sellers in a marketplace. In order to do so responsibly, these socially consequential systems employ various fairness measures and interventions, many of which seek to allocate exposure based on worthiness. Because these constructs are typically not directly observable, platforms must instead resort to using proxy scores such as relevance and infer them from behavioral signals such as searcher clicks. Yet, it remains an open question whether relevance fulfills its role as such a worthiness score in high-stakes fair rankings. In this paper, we combine perspectives and tools from the social sciences, information retrieval, and fairness in machine learning to derive a set of desired criteria that relevance scores should satisfy in order to meaningfully guide fairness interventions. We then empirically show that not all of these criteria are met in a case study of relevance inferred from biased user click data. We assess the impact of these violations on the estimated system fairness and analyze whether existing fairness interventions may mitigate the identified issues. Our analyses and results surface the pressing need for new approaches to relevance collection and generation that are suitable for use in fair ranking.
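
As an illustrative sketch of the kind of exposure accounting the paper reasons about (not the paper's own metric), the following compares the exposure a ranking allocates under a geometric position-bias model against an allocation proportional to relevance; the function names and decay parameter are assumptions for illustration.

```python
import numpy as np

def position_exposure(n_positions, gamma=0.5):
    # Geometric position-bias model: attention decays with rank
    return gamma ** np.arange(n_positions)

def exposure_relevance_disparity(ranking, relevance):
    # Exposure each item actually receives under this ranking
    exposure = position_exposure(len(ranking))
    allocated = np.zeros(len(relevance))
    allocated[ranking] = exposure
    # Target: exposure proportional to (possibly noisy) relevance
    target = relevance / relevance.sum() * exposure.sum()
    return np.abs(allocated - target).sum()  # L1 disparity

relevance = np.array([0.9, 0.7, 0.4, 0.2])  # e.g. inferred from clicks
ranking = np.argsort(-relevance)            # rank by inferred relevance
print(exposure_relevance_disparity(ranking, relevance))
```

Note that if the inferred relevance is itself biased, the target allocation inherits that bias, which is exactly the failure mode the paper investigates.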

* Published in SIGIR 2023 

The Road to Explainability is Paved with Bias: Measuring the Fairness of Explanations

May 06, 2022
Aparna Balagopalan, Haoran Zhang, Kimia Hamidieh, Thomas Hartvigsen, Frank Rudzicz, Marzyeh Ghassemi

Machine learning models in safety-critical settings like healthcare are often black boxes: they contain a large number of parameters which are not transparent to users. Post-hoc explainability methods, in which a simple, human-interpretable model imitates the behavior of these black-box models, are often proposed to help users trust model predictions. In this work, we audit the quality of such explanations for different protected subgroups using real data from four settings in finance, healthcare, college admissions, and the US justice system. Across two different black-box model architectures and four popular explainability methods, we find that the approximation quality of explanation models, also known as fidelity, differs significantly between subgroups. We also demonstrate that pairing explainability methods with recent advances in robust machine learning can improve explanation fairness in some settings. However, we highlight the importance of communicating details of non-zero fidelity gaps to users, since a single solution might not exist across all settings. Finally, we discuss the implications of unfair explanation models as a challenging and understudied problem facing the machine learning community.
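
As a minimal sketch of one way to measure such a gap (the paper's exact fidelity metrics may differ; the helper names here are hypothetical), fidelity can be computed as agreement between black-box and explanation-model predictions, then compared across subgroups:

```python
import numpy as np

def fidelity(blackbox_preds, explainer_preds):
    # How often the explanation model reproduces the black-box prediction
    return np.mean(blackbox_preds == explainer_preds)

def fidelity_gap(blackbox_preds, explainer_preds, group):
    # Largest difference in fidelity across protected subgroups
    scores = [fidelity(blackbox_preds[group == g], explainer_preds[group == g])
              for g in np.unique(group)]
    return max(scores) - min(scores)

# Toy usage with random predictions and a binary protected attribute
rng = np.random.default_rng(0)
bb, ex = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
print(fidelity_gap(bb, ex, group))
```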

* Published in FAccT 2022 

Quantifying the Task-Specific Information in Text-Based Classifications

Oct 17, 2021
Zining Zhu, Aparna Balagopalan, Marzyeh Ghassemi, Frank Rudzicz

Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but this high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, as "shortcuts" inherent in the datasets, do not contribute to the *task-specific information* (TSI) of the classification tasks. While it is essential to look at model performance, it is also important to understand the datasets. In this paper, we consider the question: apart from the information introduced by shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. While this quantity is hard to compute exactly, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge, modulo a set of predefined shortcuts, that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets: apart from a set of "shortcut features", classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pairs task.
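
The paper's exact estimator is not reproduced here, but as a hedged sketch of the underlying idea, TSI can be approximated as a difference of conditional entropies, each estimated by the cross-entropy loss of a model trained on the corresponding inputs (the function names are assumptions):

```python
import numpy as np

def cross_entropy_nats(probs, labels):
    # Average negative log-likelihood in nats: an estimate of H(Y | X)
    # under the model's predictive distribution
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def tsi_estimate(probs_shortcut, probs_full, labels):
    # TSI ~ H(Y | shortcuts) - H(Y | full text): the extra information,
    # in nats, that the full text carries beyond the shortcut features
    return (cross_entropy_nats(probs_shortcut, labels)
            - cross_entropy_nats(probs_full, labels))
```

Here probs_shortcut would come from a model that sees only the predefined shortcut features, and probs_full from a model that sees the full input text.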

Comparing Acoustic-based Approaches for Alzheimer's Disease Detection

Jun 03, 2021
Aparna Balagopalan, Jekaterina Novikova

In this paper, we study the performance and generalizability of three approaches for AD detection from speech on the recent ADReSSo challenge dataset: 1) using conventional acoustic features, 2) using novel pre-trained acoustic embeddings, and 3) combining acoustic features and embeddings. We find that while feature-based approaches have higher precision, classification approaches relying on the combination of embeddings and features achieve higher and more balanced performance across multiple metrics. Our best model, using such a combined approach, outperforms the acoustic baseline in the challenge by 2.8%.
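
A minimal sketch of the combined approach, assuming pre-extracted feature matrices (the dimensions, toolkit, and classifier below are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholders; in practice these come from an acoustic feature toolkit
# and a pre-trained speech encoder, respectively
acoustic_feats = np.random.randn(100, 88)  # conventional acoustic features
embeddings = np.random.randn(100, 512)     # pre-trained acoustic embeddings
labels = np.random.randint(0, 2, 100)      # AD vs. non-AD

# Combine by simple concatenation, then classify
X_combined = np.hstack([acoustic_feats, embeddings])
clf = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(clf, X_combined, labels, cv=5).mean())
```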

* Accepted to INTERSPEECH 2021 

Augmenting BERT Carefully with Underrepresented Linguistic Features

Nov 12, 2020
Aparna Balagopalan, Jekaterina Novikova

Fine-tuned Bidirectional Encoder Representations from Transformers (BERT)-based sequence classification models have proven to be effective for detecting Alzheimer's Disease (AD) from transcripts of human speech. However, previous research shows it is possible to improve BERT's performance on various tasks by augmenting the model with additional information. In this work, we use probing tasks as introspection techniques to identify linguistic information that is not well represented in various layers of BERT but is important for the AD detection task. We then externally supplement BERT with hand-crafted features covering the linguistic information for which its representations are found to be insufficient, and show that jointly fine-tuning BERT in combination with these features improves AD classification performance by up to 5% over fine-tuned BERT alone.
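
One common way to realize such joint fine-tuning is sketched below, under the assumption that the hand-crafted features are fused by concatenation with the pooled [CLS] output; the class name and fusion details are hypothetical, and the paper's architecture may differ:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertWithFeatures(nn.Module):
    # Jointly fine-tune BERT with externally supplied linguistic features
    # by concatenating them with the [CLS] representation
    def __init__(self, n_features, n_classes=2, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Linear(hidden + n_features, n_classes)

    def forward(self, input_ids, attention_mask, features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # pooled [CLS] token
        return self.classifier(torch.cat([cls, features], dim=-1))
```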

* Machine Learning for Health (ML4H) at NeurIPS 2020 - Extended Abstract 

Fantastic Features and Where to Find Them: Detecting Cognitive Impairment with a Subsequence Classification Guided Approach

Oct 13, 2020
Benjamin Eyre, Aparna Balagopalan, Jekaterina Novikova

Despite the widely reported success of embedding-based machine learning methods on natural language processing tasks, the use of more easily interpreted engineered features remains common in fields such as cognitive impairment (CI) detection. Manually engineering features from noisy text is time- and resource-consuming, and can potentially result in features that do not enhance model performance. To combat this, we describe a new approach to feature engineering that leverages sequential machine learning models and domain knowledge to predict which features help enhance performance. We provide a concrete example of this method on a standard dataset of CI speech and demonstrate that CI classification accuracy improves by 2.3% over a strong baseline when using features produced by this method. This demonstration provides an example of how this method can be used to assist classification in fields where interpretability is important, such as health care.
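
A rough, heavily simplified sketch of the subsequence-scoring idea follows; score_window stands in for a trained subsequence classifier and is an assumption, as are the window size and function names:

```python
def select_informative_windows(transcripts, score_window, size=5, top_k=20):
    # Slide a fixed-length window over each tokenized transcript, score
    # each window with a trained subsequence classifier
    # (score_window: list of tokens -> P(CI)), and keep the most
    # CI-indicative windows as candidates for engineered features,
    # e.g. per-transcript occurrence counts
    scored = [(score_window(t[i:i + size]), t[i:i + size])
              for t in transcripts for i in range(len(t) - size + 1)]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [window for _, window in scored[:top_k]]
```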

* EMNLP Workshop on Noisy User-generated Text (W-NUT 2020) 

To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection

Jul 26, 2020
Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova

Research related to automatically detecting Alzheimer's disease (AD) is important, given the high prevalence of AD and the high cost of traditional diagnostic methods. Since AD significantly affects the content and acoustics of spontaneous speech, natural language processing and machine learning provide promising techniques for reliably detecting AD. We compare and contrast the performance of two such approaches for AD detection on the recent ADReSS challenge dataset: 1) using domain knowledge-based hand-crafted features that capture linguistic and acoustic phenomena, and 2) fine-tuning Bidirectional Encoder Representations from Transformers (BERT)-based sequence classification models. We also compare multiple feature-based regression models for a neuropsychological score task in the challenge. We observe that fine-tuned BERT models, given the relative importance of linguistics in cognitive impairment detection, outperform feature-based approaches on the AD detection task.
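
For the neuropsychological score task, a feature-based regression baseline could look like the following sketch; Ridge regression is one illustrative choice, and the feature dimensions and data are placeholders rather than the challenge data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder hand-crafted linguistic/acoustic features and
# neuropsychological scores (e.g. MMSE-like, 0-30 range)
X = np.random.randn(100, 40)
y = np.random.uniform(0, 30, 100)

reg = Ridge(alpha=1.0)
rmse = -cross_val_score(reg, X, y, cv=5,
                        scoring="neg_root_mean_squared_error").mean()
print(rmse)
```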

* Accepted to INTERSPEECH 2020

Cross-Language Aphasia Detection using Optimal Transport Domain Adaptation

Dec 04, 2019
Aparna Balagopalan, Jekaterina Novikova, Matthew B. A. McDermott, Bret Nestor, Tristan Naumann, Marzyeh Ghassemi

Multi-language speech datasets are scarce and often have small sample sizes in the medical domain. Robust transfer of linguistic features across languages could improve rates of early diagnosis and therapy for speakers of low-resource languages when detecting health conditions from speech. We utilize out-of-domain, unpaired, single-speaker, healthy speech data for training multiple Optimal Transport (OT) domain adaptation systems. We learn mappings from other languages to English and detect aphasia from linguistic characteristics of speech, and show that OT domain adaptation improves aphasia detection over unilingual baselines for French (6% increased F1) and Mandarin (5% increased F1). Further, we show that adding aphasic data to the domain adaptation system significantly increases performance for both French and Mandarin, increasing the F1 scores further (10% and 8% increase in F1 scores for French and Mandarin, respectively, over unilingual baselines).
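
A minimal sketch of OT domain adaptation with the POT library; the feature dimensions, regularization strength, and downstream classifier are assumptions, and the paper additionally trains its mappings on out-of-domain healthy speech:

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from sklearn.linear_model import LogisticRegression

Xs = np.random.randn(80, 30)      # source-language features (e.g. French)
ys = np.random.randint(0, 2, 80)  # aphasia labels for the source side
Xt = np.random.randn(120, 30)     # target-language (English) features

# Learn an entropic-OT mapping from the source domain to the target domain
transport = ot.da.SinkhornTransport(reg_e=1.0)
transport.fit(Xs=Xs, Xt=Xt)
Xs_mapped = transport.transform(Xs=Xs)

# Train on transported source features; the classifier can then be
# applied to English-domain data
clf = LogisticRegression(max_iter=1000).fit(Xs_mapped, ys)
```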

* Accepted to ML4H at NeurIPS 2019 

Lexical Features Are More Vulnerable, Syntactic Features Have More Predictive Power

Sep 30, 2019
Jekaterina Novikova, Aparna Balagopalan, Ksenia Shkaruta, Frank Rudzicz

Understanding the vulnerability of linguistic features extracted from noisy text is important for both developing better health text classification models and for interpreting vulnerabilities of natural language models. In this paper, we investigate how generic language characteristics, such as syntax or the lexicon, are impacted by artificial text alterations. The vulnerability of features is analysed from two perspectives: (1) the level of feature value change, and (2) the level of change of feature predictive power as a result of text modifications. We show that lexical features are more sensitive to text modifications than syntactic ones. However, we also demonstrate that these smaller changes of syntactic features have a stronger influence on classification performance downstream, compared to the impact of changes to lexical features. Results are validated across three datasets representing different text-classification tasks, with different levels of lexical and syntactic complexity of both conversational and written language.
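
The two perspectives might be operationalized along these lines (a sketch; the function names and the accuracy-based notion of predictive power are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def value_change(feats_clean, feats_noisy):
    # Perspective 1: mean relative change in feature values
    denom = np.abs(feats_clean) + 1e-8
    return np.mean(np.abs(feats_noisy - feats_clean) / denom)

def predictive_power_change(feats_clean, feats_noisy, labels):
    # Perspective 2: drop in downstream classification accuracy when
    # the same feature set is extracted from altered text
    clf = LogisticRegression(max_iter=1000)
    acc_clean = cross_val_score(clf, feats_clean, labels, cv=5).mean()
    acc_noisy = cross_val_score(clf, feats_noisy, labels, cv=5).mean()
    return acc_clean - acc_noisy
```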

* EMNLP Workshop on Noisy User-generated Text (W-NUT 2019) 

Impact of ASR on Alzheimer's Disease Detection: All Errors are Equal, but Deletions are More Equal than Others

Apr 08, 2019
Aparna Balagopalan, Ksenia Shkaruta, Jekaterina Novikova

Automatic Speech Recognition (ASR) is a critical component of any fully automated speech-based Alzheimer's disease (AD) detection model. However, despite years of speech recognition research, little is known about the impact of ASR performance on AD detection. In this paper, we experiment with controlled amounts of artificially generated ASR errors and investigate their influence on AD detection. We find that deletion errors affect AD detection performance the most, due to their impact on features of syntactic complexity and discourse representation in speech. We show this trend to be generalisable across two different datasets and two different speech-related tasks. We therefore propose changing ASR optimization functions to penalize deletion errors more heavily when ASR is used for AD detection.
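
Controlled deletion errors of the kind studied here can be simulated along these lines (a sketch; the deletion model and rate are assumptions):

```python
import random

def inject_deletions(words, deletion_rate, seed=0):
    # Simulate ASR deletion errors: drop each word independently with a
    # fixed probability, giving a controlled artificial error rate
    rng = random.Random(seed)
    return [w for w in words if rng.random() > deletion_rate]

transcript = "the boy is reaching up to take a cookie from the jar".split()
print(inject_deletions(transcript, deletion_rate=0.2))
```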

* Submitted to INTERSPEECH 