With the aim to provide teachers with more specific, frequent, and actionable feedback about their teaching, we explore how Large Language Models (LLMs) can be used to estimate ``Instructional Support'' domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol. We design a machine learning architecture that uses either zero-shot prompting of Meta's Llama2, and/or a classic Bag of Words (BoW) model, to classify individual utterances of teachers' speech (transcribed automatically using OpenAI's Whisper) for the presence of 11 behavioral indicators of Instructional Support. Then, these utterance-level judgments are aggregated over an entire 15-min observation session to estimate a global CLASS score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate that (1) automatic CLASS Instructional Support estimation accuracy using the proposed method (Pearson $R$ up to $0.46$) approaches human inter-rater reliability (up to $R=0.55$); (2) LLMs yield slightly greater accuracy than BoW for this task; and (3) the best models often combined features extracted from both LLM and BoW. Finally, (4) we illustrate how the model's outputs can be visualized at the utterance level to provide teachers with explainable feedback on which utterances were most positively or negatively correlated with specific CLASS dimensions.
We consider a new kind of clustering problem in which clusters need not be independent of each other, but rather can have compositional relationships with other clusters (e.g., an image set consists of rectangles, circles, as well as combinations of rectangles and circles). This task is motivated by recent work in few-shot learning on compositional embedding models that structure the embedding space to distinguish the label sets, not just the individual labels, assigned to the examples. To tackle this clustering problem, we propose a new algorithm called Compositional Affinity Propagation (CAP). In contrast to standard Affinity Propagation as well as other algorithms for multi-view and hierarchical clustering, CAP can deduce compositionality among clusters automatically. We show promising results, compared to several existing clustering algorithms, on the MultiMNIST, OmniGlot, and LibriSpeech datasets. Our work has applications to multi-object image recognition and speaker diarization with simultaneous speech from multiple speakers.
We explore the utility of harnessing auxiliary labels (e.g., facial expression) to impose geometric structure when training embedding models for one-shot learning (e.g., for face verification). We introduce novel geometric constraints on the embedding space learned by a deep model using either manually annotated or automatically detected auxiliary labels. We contrast their performances (AUC) on four different face datasets(CK+, VGGFace-2, Tufts Face, and PubFig). Due to the additional structure encoded in the embedding space, our methods provide a higher verification accuracy (99.7, 86.2, 99.4, and 79.3% with our proposed TL+PDP+FBV loss, versus 97.5, 72.6, 93.1, and 70.5% using a standard Triplet Loss on the four datasets, respectively). Our method is implemented purely in terms of the loss function. It does not require any changes to the backbone of the embedding functions.
We propose a new method for speaker diarization that can handle overlapping speech with 2+ people. Our method is based on compositional embeddings [1]: Like standard speaker embedding methods such as x-vector [2], compositional embedding models contain a function f that separates speech from different speakers. In addition, they include a composition function g to compute set-union operations in the embedding space so as to infer the set of speakers within the input audio. In an experiment on multi-person speaker identification using synthesized LibriSpeech data, the proposed method outperforms traditional embedding methods that are only trained to separate single speakers (not speaker sets). In a speaker diarization experiment on the AMI Headset Mix corpus, we achieve state-of-the-art accuracy (DER=22.93%), slightly higher than the previous best result (23.82% from [3]).
In this work we present a multi-modal machine learning-based system, which we call ACORN, to analyze videos of school classrooms for the Positive Climate (PC) and Negative Climate (NC) dimensions of the CLASS observation protocol that is widely used in educational research. ACORN uses convolutional neural networks to analyze spectral audio features, the faces of teachers and students, and the pixels of each image frame, and then integrates this information over time using Temporal Convolutional Networks. The audiovisual ACORN's PC and NC predictions have Pearson correlations of $0.55$ and $0.63$ with ground-truth scores provided by expert CLASS coders on the UVA Toddler dataset (cross-validation on $n=300$ 15-min video segments), and a purely auditory ACORN predicts PC and NC with correlations of $0.36$ and $0.41$ on the MET dataset (test set of $n=2000$ videos segments). These numbers are similar to inter-coder reliability of human coders. Finally, using Graph Convolutional Networks we make early strides (AUC=$0.70$) toward predicting the specific moments (45-90sec clips) when the PC is particularly weak/strong. Our findings inform the design of automatic classroom observation and also more general video activity recognition and summary recognition systems.
In the context of building an intelligent tutoring system (ITS), which improves student learning outcomes by intervention, we set out to improve prediction of student problem outcome. In essence, we want to predict the outcome of a student answering a problem in an ITS from a video feed by analyzing their face and gestures. For this, we present a novel transfer learning facial affect representation and a user-personalized training scheme that unlocks the potential of this representation. We model the temporal structure of video sequences of students solving math problems using a recurrent neural network architecture. Additionally, we extend the largest dataset of student interactions with an intelligent online math tutor by a factor of two. Our final model, coined ATL-BP (Affect Transfer Learning for Behavior Prediction) achieves an increase in mean F-score over state-of-the-art of 45% on this new dataset in the general case and 50% in a more challenging leave-users-out experimental setting when we use a user-personalized training scheme.
We explore the idea of compositional set embeddings that can be used to infer not just a single class per input (e.g., image, video, audio signal), but a collection of classes, in the setting of one-shot learning. Class compositionality is useful in tasks such as multi-object detection in images and multi-speaker diarization in audio. Specifically, we devise and implement two novel models consisting of (1) an embedding function f trained jointly with a "composite" function g that computes set union operations between the classes encoded in two embedding vectors; and (2) embedding f trained jointly with a "query" function h that computes whether the classes encoded in one embedding subsume the classes encoded in another embedding. In contrast to previously developed methods, these models must both determine the classes associated with the input examples and encode the relationships between different class label sets. In experiments conducted on simulated data, OmniGlot, LibriSpeech and Open Images datasets, the proposed composite embedding models outperform baselines based on traditional embedding methods.
Automatic detectors of facial expression, gesture, affect, etc., can serve as scientific instruments to measure many behavioral and social phenomena (e.g., emotion, empathy, stress, engagement, etc.), and this has great potential to advance basic science. However, when a detector $d$ is trained to approximate an existing measurement tool (e.g., observation protocol, questionnaire), then care must be taken when interpreting measurements collected using $d$ since they are one step further removed from the underlying construct. We examine how the accuracy of $d$, as quantified by the correlation $q$ of $d$'s outputs with the ground-truth construct $U$, impacts the estimated correlation between $U$ (e.g., stress) and some other phenomenon $V$ (e.g., academic performance). In particular: (1) We show that if the true correlation between $U$ and $V$ is $r$, then the expected sample correlation, over all vectors $\mathcal{T}^n$ whose correlation with $U$ is $q$, is $qr$. (2) We derive a formula to compute the probability that the sample correlation (over $n$ subjects) using $d$ is positive, given that the true correlation between $U$ and $V$ is negative (and vice-versa). We show that this probability is non-negligible (around $10-15\%$) for values of $n$ and $q$ that have been used in recent affective computing studies. (3) With the goal to reduce the variance of correlations estimated by an automatic detector, we show empirically that training multiple neural networks $d^{(1)},\ldots,d^{(m)}$ using different training configurations (e.g., architectures, hyperparameters) for the same detection task provides only limited `coverage' of $\mathcal{T}^n$.
Recent work on privacy-preserving machine learning has considered how data-mining competitions such as Kaggle could potentially be "hacked", either intentionally or inadvertently, by using information from an oracle that reports a classifier's accuracy on the test set. For binary classification tasks in particular, one of the most common accuracy metrics is the Area Under the ROC Curve (AUC), and in this paper we explore the mathematical structure of how the AUC is computed from an n-vector of real-valued "guesses" with respect to the ground-truth labels. We show how knowledge of a classifier's AUC on the test set can constrain the set of possible ground-truth labelings, and we derive an algorithm both to compute the exact number of such labelings and to enumerate efficiently over them. Finally, we provide empirical evidence that, surprisingly, the number of compatible labelings can actually decrease as n grows, until a test set-dependent threshold is reached.
In the context of data-mining competitions (e.g., Kaggle, KDDCup, ILSVRC Challenge), we show how access to an oracle that reports a contestant's log-loss score on the test set can be exploited to deduce the ground-truth of some of the test examples. By applying this technique iteratively to batches of $m$ examples (for small $m$), all of the test labels can eventually be inferred. In this paper, (1) We demonstrate this attack on the first stage of a recent Kaggle competition (Intel & MobileODT Cancer Screening) and use it to achieve a log-loss of $0.00000$ (and thus attain a rank of #4 out of 848 contestants), without ever training a classifier to solve the actual task. (2) We prove an upper bound on the batch size $m$ as a function of the floating-point resolution of the probability estimates that the contestant submits for the labels. (3) We derive, and demonstrate in simulation, a more flexible attack that can be used even when the oracle reports the accuracy on an unknown (but fixed) subset of the test set's labels. These results underline the importance of evaluating contestants based only on test data that the oracle does not examine.