Chris Welty

Metrology for AI: From Benchmarks to Instruments

Nov 05, 2019
Chris Welty, Praveen Paritosh, Lora Aroyo

In this paper we present the first steps towards hardening the science of measuring AI systems, by adopting metrology, the science of measurement and its application, and applying it to human (crowd) powered evaluations. We begin with the intuitive observation that evaluating the performance of an AI system is a form of measurement. In all other science and engineering disciplines, the devices used to measure are called instruments, and all measurements are recorded with respect to the characteristics of the instruments used. One does not report the mass, speed, or length of a studied object, for example, without disclosing the precision (measurement variance) and resolution (smallest detectable change) of the instrument used. It is extremely common in the AI literature to compare the performance of two systems using a crowd-sourced dataset as an instrument, while failing to report whether the performance difference lies within that instrument's capability to measure. To illustrate the adoption of metrology for benchmark datasets we use the word similarity benchmark WS353 and several previously published experiments that use it for evaluation.
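A minimal sketch of what reporting an instrument's precision could look like in practice, assuming per-item scores are available for each system: estimate the benchmark's measurement variance by bootstrap resampling of its items, then ask whether the gap between two systems exceeds it. The data and function names below are hypothetical; this illustrates the idea, not the paper's procedure.

```python
import numpy as np

def instrument_precision(per_item_scores, n_boot=10_000, seed=0):
    """Bootstrap estimate of a benchmark's measurement variance: the spread
    of the aggregate score under resampling of the benchmark's items."""
    rng = np.random.default_rng(seed)
    n = len(per_item_scores)
    means = [rng.choice(per_item_scores, size=n, replace=True).mean()
             for _ in range(n_boot)]
    return float(np.std(means))

# Hypothetical per-item scores of two systems on the same benchmark items.
system_a = np.array([0.61, 0.72, 0.55, 0.80, 0.67, 0.59, 0.74])
system_b = np.array([0.63, 0.70, 0.60, 0.78, 0.71, 0.58, 0.77])

sigma = max(instrument_precision(system_a), instrument_precision(system_b))
gap = abs(system_a.mean() - system_b.mean())
print(f"score gap = {gap:.3f}, instrument precision ~ {sigma:.3f}")
print("within instrument noise" if gap < sigma else "exceeds instrument precision")
```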


What is Fair? Exploring Pareto-Efficiency for Fairness Constrained Classifiers

Oct 30, 2019
Ananth Balashankar, Alyssa Lees, Chris Welty, Lakshminarayanan Subramanian

The potential for learned models to amplify existing societal biases has been broadly recognized. Fairness-aware classifier constraints, which apply equality metrics of performance across subgroups defined on sensitive attributes such as race and gender, seek to rectify inequity but can yield non-uniform degradation in performance for skewed datasets. In certain domains, this imbalanced degradation of performance can itself become another form of unintentional bias. In the spirit of constructing fairness-aware algorithms as a societal imperative, we explore an alternative: Pareto-Efficient Fairness (PEF). Theoretically, we prove that PEF identifies the operating point on the Pareto curve of subgroup performances closest to the fairness hyperplane, maximizing accuracy across subgroups. Empirically, we demonstrate on several UCI datasets that PEF outperforms strict fairness constraints by achieving Pareto-efficient accuracy levels for all subgroups.
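One way to read the PEF construction: among candidate classifiers, keep the Pareto-efficient ones (no other candidate is at least as accurate for every subgroup and strictly better for some), then pick the point closest to the hyperplane where all subgroup accuracies are equal. The sketch below works from that reading with made-up candidate points; it is not the paper's optimization procedure.

```python
import numpy as np

def pareto_front(points):
    """Keep points not dominated by any other (higher is better in every dimension)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(np.all(q >= p) and np.any(q > p)
                        for j, q in enumerate(points) if j != i)
        if not dominated:
            front.append(p)
    return np.array(front)

def distance_to_fairness_hyperplane(p):
    """Distance from a vector of subgroup accuracies to the locus where all
    subgroup accuracies are equal (perfect parity)."""
    return float(np.linalg.norm(p - p.mean()))

# Hypothetical (subgroup_1_accuracy, subgroup_2_accuracy) for candidate classifiers.
candidates = np.array([[0.90, 0.70], [0.85, 0.80], [0.78, 0.84], [0.70, 0.72]])

front = pareto_front(candidates)
pef_point = min(front, key=distance_to_fairness_hyperplane)
print("Pareto front:", front.tolist())
print("Pareto-Efficient Fairness point:", pef_point.tolist())
```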


A Crowdsourced Frame Disambiguation Corpus with Ambiguity

Apr 12, 2019
Anca Dumitrache, Lora Aroyo, Chris Welty

We present a resource for the task of FrameNet semantic frame disambiguation of over 5,000 word-sentence pairs from the Wikipedia corpus. The annotations were collected using a novel crowdsourcing approach with multiple workers per sentence to capture inter-annotator disagreement. In contrast to the typical approach of attributing the best single frame to each word, we provide a list of frames with disagreement-based scores that express the confidence with which each frame applies to the word. This is based on the idea that inter-annotator disagreement is at least partly caused by ambiguity that is inherent to the text and frames. We have found many examples where the semantics of individual frames overlap sufficiently to make them acceptable alternatives for interpreting a sentence. We have argued that ignoring this ambiguity creates an overly arbitrary target for training and evaluating natural language processing systems - if humans cannot agree, why would we expect the correct answer from a machine to be any different? To process this data we also utilized an expanded lemma-set provided by the Framester system, which merges FrameNet with WordNet to enhance coverage. Our dataset includes annotations of 1,000 word-sentence pairs whose lemmas are not part of FrameNet. Finally, we present metrics for evaluating frame disambiguation systems that account for ambiguity.

* Accepted to NAACL-HLT2019 
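At its simplest, the disagreement-based scoring described above amounts to letting several frames carry non-zero confidence for the same word-sentence pair rather than forcing a single winner. A toy aggregation along those lines is sketched below; the corpus itself uses CrowdTruth-style metrics, and the data layout and frame names here are invented.

```python
from collections import Counter

def frame_scores(worker_annotations):
    """Turn per-worker frame choices for one word-sentence pair into a
    confidence score per frame, instead of a single 'best' frame."""
    counts = Counter()
    for frames in worker_annotations:  # each worker may select more than one frame
        counts.update(frames)
    n_workers = len(worker_annotations)
    return {frame: count / n_workers for frame, count in counts.items()}

# Hypothetical annotations for the word "run" in one sentence.
annotations = [
    ["Self_motion"],
    ["Self_motion"],
    ["Operating_a_system"],
    ["Self_motion", "Operating_a_system"],
    ["Self_motion"],
]
print(frame_scores(annotations))
# {'Self_motion': 0.8, 'Operating_a_system': 0.4}
```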

Crowdsourcing Semantic Label Propagation in Relation Classification

Sep 03, 2018
Anca Dumitrache, Lora Aroyo, Chris Welty

Distant supervision is a popular method for performing relation extraction from text that is known to produce noisy labels. Most progress in relation extraction and classification has been made with crowdsourced corrections to distantly supervised labels, and there is evidence that more such corrections would yield further gains. In this paper, we explore the problem of propagating human annotation signals gathered for open-domain relation classification through the CrowdTruth crowdsourcing methodology, which captures ambiguity in annotations by measuring inter-annotator disagreement. Our approach propagates annotations to sentences that are similar in a low-dimensional embedding space, expanding the number of labels by two orders of magnitude. Our experiments show significant improvement in a sentence-level multi-class relation classifier.

* In publication at the First Workshop on Fact Extraction and VERification (FEVER) at EMNLP 2018 
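A toy version of the propagation step, assuming sentence embeddings and crowd relation scores are already available: copy a labeled sentence's scores to unlabeled sentences whose embeddings are sufficiently close under cosine similarity. The embedding vectors, relation names, and threshold below are illustrative, not the paper's actual setup.

```python
import numpy as np

def propagate_labels(labeled_vecs, labeled_scores, unlabeled_vecs, threshold=0.9):
    """Assign each unlabeled sentence the relation scores of its most similar
    labeled sentence, if the cosine similarity exceeds the threshold."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    sims = normalize(unlabeled_vecs) @ normalize(labeled_vecs).T
    propagated = {}
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            propagated[i] = labeled_scores[j]
    return propagated

# Hypothetical sentence embeddings and crowd relation scores.
labeled_vecs = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.6]])
labeled_scores = [{"place_of_birth": 0.9}, {"spouse_of": 0.7}]
unlabeled_vecs = np.array([[0.85, 0.15, 0.05], [0.1, 0.2, 0.95]])

print(propagate_labels(labeled_vecs, labeled_scores, unlabeled_vecs))
# only the first unlabeled sentence is close enough to inherit a label
```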

Capturing Ambiguity in Crowdsourcing Frame Disambiguation

May 01, 2018
Anca Dumitrache, Lora Aroyo, Chris Welty

FrameNet is a computational linguistics resource composed of semantic frames, high-level concepts that represent the meanings of words. In this paper, we present a crowdsourcing approach to gathering frame disambiguation annotations in sentences, using multiple workers per sentence to capture inter-annotator disagreement. We perform an experiment over a set of 433 sentences annotated with frames from the FrameNet corpus, and show that the aggregated crowd annotations achieve an F1 score greater than 0.67 compared to expert linguists. We highlight cases where the crowd annotation was correct even though the expert disagreed, arguing for the need to have multiple annotators per sentence. Most importantly, we examine cases in which crowd workers could not agree, and demonstrate that these cases exhibit ambiguity, either in the sentence, the frame, or the task itself, and argue that collapsing such cases to a single, discrete truth value (i.e. correct or incorrect) is inappropriate, creating arbitrary targets for machine learning.

* in publication at the sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP) 2018 
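One simple stand-in for "the crowd could not agree" is the normalized entropy of the workers' frame choices for a sentence: 0 when everyone picks the same frame, 1 when the vote is maximally split. The sketch below uses that stand-in (it is not the paper's disagreement metric) to flag sentences that should keep a graded rather than discrete label; the frame names are invented.

```python
import math
from collections import Counter

def ambiguity_score(worker_choices):
    """Normalized entropy of worker frame choices for one sentence:
    0.0 = full agreement, 1.0 = maximally split vote."""
    counts = Counter(worker_choices)
    n = len(worker_choices)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy

clear = ["Cause_harm"] * 10
ambiguous = ["Cause_harm"] * 5 + ["Hostile_encounter"] * 5

print(ambiguity_score(clear))      # 0.0: workers fully agree
print(ambiguity_score(ambiguous))  # 1.0: evidence of genuine ambiguity
```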

False Positive and Cross-relation Signals in Distant Supervision Data

Nov 29, 2017
Anca Dumitrache, Lora Aroyo, Chris Welty

Distant supervision (DS) is a well-established method for relation extraction from text, based on the assumption that when a knowledge base contains a relation between a term pair, then sentences that contain that pair are likely to express the relation. In this paper, we use the results of a crowdsourced relation extraction task to identify two problems with DS data quality: the widely varying degree of false positives across different relations, and the observed causal connections between relations, which the DS method does not consider. The crowdsourced data aggregation is performed using ambiguity-aware CrowdTruth metrics, which capture and interpret inter-annotator disagreement. We also present preliminary results of using the crowd to enhance DS training data for a relation classification model, without requiring the crowd to annotate the entire set.

* in proceedings of the 6th Workshop on Automated Knowledge Base Construction (AKBC) at NIPS 2017 
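The per-relation false-positive analysis can be sketched as follows: for each sentence the distant supervision method labels positive, compare the crowd's sentence-relation score against a confirmation threshold, and report the rejected fraction per relation. The threshold, relation names, and scores below are hypothetical, and the actual analysis uses CrowdTruth metrics rather than raw fractions.

```python
from collections import defaultdict

def ds_false_positive_rate(examples, threshold=0.5):
    """Fraction of distant-supervision positives per relation that the crowd
    does not confirm (crowd sentence-relation score below the threshold)."""
    totals, rejected = defaultdict(int), defaultdict(int)
    for relation, crowd_score in examples:
        totals[relation] += 1
        if crowd_score < threshold:
            rejected[relation] += 1
    return {relation: rejected[relation] / totals[relation] for relation in totals}

# Hypothetical (DS relation label, crowd score) pairs for DS-positive sentences.
examples = [
    ("place_of_birth", 0.9), ("place_of_birth", 0.2), ("place_of_birth", 0.8),
    ("children", 0.1), ("children", 0.3), ("children", 0.7),
]
print(ds_false_positive_rate(examples))
# false-positive rate varies widely across relations
```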

Crowdsourcing Ground Truth for Medical Relation Extraction

Oct 03, 2017
Anca Dumitrache, Lora Aroyo, Chris Welty

Cognitive computing systems require human-labeled data for evaluation, and often for training. The standard practice used in gathering this data minimizes disagreement between annotators, and we have found this results in data that fails to account for the ambiguity inherent in language. We have proposed the CrowdTruth method for collecting ground truth through crowdsourcing, which reconsiders the role of people in machine learning based on the observation that disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text. We report on using this method to build an annotated data set for medical relation extraction for the "cause" and "treat" relations, and how this data performed in a supervised training experiment. We demonstrate that by modeling ambiguity, labeled data gathered from crowd workers can (1) reach the level of quality of domain experts for this task while reducing the cost, and (2) provide better training data at scale than distant supervision. We further propose and validate new weighted measures for precision, recall, and F-measure that account for ambiguity in both human and machine performance on this task.

* ACM Transactions on Interactive Intelligent Systems (TiiS), Volume 8, Issue 2, July 2018; Special Issue on Human-Centered Machine Learning 
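A sketch of what ambiguity-weighted precision and recall can look like, under the assumption that each sentence carries a crowd score in [0, 1] for the relation: a predicted positive contributes its score to the true positives and its complement to the false positives, while a predicted negative contributes its score to the false negatives. These are illustrative definitions in the spirit of the paper, not its exact measures.

```python
def weighted_prf(examples):
    """Ambiguity-weighted precision/recall/F-measure: each example contributes
    its crowd score (a confidence in [0, 1]) instead of a hard 0/1 truth value.

    examples: list of (crowd_score, predicted) pairs, where predicted is a bool
    for whether the classifier asserts the relation on that sentence.
    """
    tp = sum(score for score, predicted in examples if predicted)
    fp = sum(1 - score for score, predicted in examples if predicted)
    fn = sum(score for score, predicted in examples if not predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical (crowd score for the 'treat' relation, classifier prediction) pairs.
examples = [(0.9, True), (0.8, True), (0.3, True), (0.7, False), (0.1, False)]
print(weighted_prf(examples))
```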