We study a general clustering setting in which we have $n$ elements to be clustered, and we aim to perform as few queries as possible to an oracle that returns a noisy sample of the similarity between two elements. Our setting encompasses many application domains in which the similarity function is costly to compute and inherently noisy. We propose two novel formulations of online learning problems rooted in the paradigm of Pure Exploration in Combinatorial Multi-Armed Bandits (PE-CMAB): fixed confidence and fixed budget settings. For both settings, we design algorithms that combine a sampling strategy with a classic approximation algorithm for correlation clustering and study their theoretical guarantees. Our results are the first examples of polynomial-time algorithms that work for the case of PE-CMAB in which the underlying offline optimization problem is NP-hard.
Algorithmic recourse -- providing recommendations to those affected negatively by the outcome of an algorithmic system on how they can take action and change that outcome -- has gained attention as a means of giving persons agency in their interactions with artificial intelligence (AI) systems. Recent work has shown that even if an AI decision-making classifier is ``fair'' (according to some reasonable criteria), recourse itself may be unfair due to differences in the initial circumstances of individuals, compounding disparities for marginalized populations and requiring them to exert more effort than others. There is a need to define more methods and metrics for evaluating fairness in recourse that span a range of normative views of the world, and specifically those that take into account time. Time is a critical element in recourse because the longer it takes an individual to act, the more the setting may change due to model or data drift. This paper seeks to close this research gap by proposing two notions of fairness in recourse that are in normative alignment with substantive equality of opportunity, and that consider time. The first considers the (often repeated) effort individuals exert per successful recourse event, and the second considers time per successful recourse event. Building upon an agent-based framework for simulating recourse, this paper demonstrates how much effort is needed to overcome disparities in initial circumstances. We then proposes an intervention to improve the fairness of recourse by rewarding effort, and compare it to existing strategies.
Despite the large research effort devoted to learning dependencies between time series, the state of the art still faces a major limitation: existing methods learn partial correlations but fail to discriminate across distinct frequency bands. Motivated by many applications in which this differentiation is pivotal, we overcome this limitation by learning a block-sparse, frequency-dependent, partial correlation graph, in which layers correspond to different frequency bands, and partial correlations can occur over just a few layers. To this aim, we formulate and solve two nonconvex learning problems: the first has a closed-form solution and is suitable when there is prior knowledge about the number of partial correlations; the second hinges on an iterative solution based on successive convex approximation, and is effective for the general case where no prior knowledge is available. Numerical results on synthetic data show that the proposed methods outperform the current state of the art. Finally, the analysis of financial time series confirms that partial correlations exist only within a few frequency bands, underscoring how our methods enable the gaining of valuable insights that would be undetected without discriminating along the frequency domain.
The bulk of the research effort on brain connectivity revolves around statistical associations among brain regions, which do not directly relate to the causal mechanisms governing brain dynamics. Here we propose the multiscale causal backbone (MCB) of brain dynamics shared by a set of individuals across multiple temporal scales, and devise a principled methodology to extract it. Our approach leverages recent advances in multiscale causal structure learning and optimizes the trade-off between the model fitting and its complexity. Empirical assessment on synthetic data shows the superiority of our methodology over a baseline based on canonical functional connectivity networks. When applied to resting-state fMRI data, we find sparse MCBs for both the left and right brain hemispheres. Thanks to its multiscale nature, our approach shows that at low-frequency bands, causal dynamics are driven by brain regions associated with high-level cognitive functions; at higher frequencies instead, nodes related to sensory processing play a crucial role. Finally, our analysis of individual multiscale causal structures confirms the existence of a causal fingerprint of brain connectivity, thus supporting from a causal perspective the existing extensive research in brain connectivity fingerprinting.
Algorithmic systems are often called upon to assist in high-stakes decision making. In light of this, algorithmic recourse, the principle wherein individuals should be able to take action against an undesirable outcome made by an algorithmic system, is receiving growing attention. The bulk of the literature on algorithmic recourse to-date focuses primarily on how to provide recourse to a single individual, overlooking a critical element: the effects of a continuously changing context. Disregarding these effects on recourse is a significant oversight, since, in almost all cases, recourse consists of an individual making a first, unfavorable attempt, and then being given an opportunity to make one or several attempts at a later date - when the context might have changed. This can create false expectations, as initial recourse recommendations may become less reliable over time due to model drift and competition for access to the favorable outcome between individuals. In this work we propose an agent-based simulation framework for studying the effects of a continuously changing environment on algorithmic recourse. In particular, we identify two main effects that can alter the reliability of recourse for individuals represented by the agents: (1) competition with other agents acting upon recourse, and (2) competition with new agents entering the environment. Our findings highlight that only a small set of specific parameterizations result in algorithmic recourse that is reliable for agents over time. Consequently, we argue that substantial additional work is needed to understand recourse reliability over time, and to develop recourse methods that reward agents' effort.
Social media have great potential for enabling public discourse on important societal issues. However, adverse effects, such as polarization and echo chambers, greatly impact the benefits of social media and call for algorithms that mitigate these effects. In this paper, we propose a novel problem formulation aimed at slightly nudging users' social feeds in order to strike a balance between relevance and diversity, thus mitigating the emergence of polarization, without lowering the quality of the feed. Our approach is based on re-weighting the relative importance of the accounts that a user follows, so as to calibrate the frequency with which the content produced by various accounts is shown to the user. We analyze the convexity properties of the problem, demonstrating the non-matrix convexity of the objective function and the convexity of the feasible set. To efficiently address the problem, we develop a scalable algorithm based on projected gradient descent. We also prove that our problem statement is a proper generalization of the undirected-case problem so that our method can also be adopted for undirected social networks. As a baseline for comparison in the undirected case, we develop a semidefinite programming approach, which provides the optimal solution. Through extensive experiments on synthetic and real-world datasets, we validate the effectiveness of our approach, which outperforms non-trivial baselines, underscoring its ability to foster healthier and more cohesive online communities.
Counterfactual examples have emerged as an effective approach to produce simple and understandable post-hoc explanations. In the context of graph classification, previous work has focused on generating counterfactual explanations by manipulating the most elementary units of a graph, i.e., removing an existing edge, or adding a non-existing one. In this paper, we claim that such language of explanation might be too fine-grained, and turn our attention to some of the main characterizing features of real-world complex networks, such as the tendency to close triangles, the existence of recurring motifs, and the organization into dense modules. We thus define a general density-based counterfactual search framework to generate instance-level counterfactual explanations for graph classifiers, which can be instantiated with different notions of dense substructures. In particular, we show two specific instantiations of this general framework: a method that searches for counterfactual graphs by opening or closing triangles, and a method driven by maximal cliques. We also discuss how the general method can be instantiated to exploit any other notion of dense substructures, including, for instance, a given taxonomy of nodes. We evaluate the effectiveness of our approaches in 7 brain network datasets and compare the counterfactual statements generated according to several widely-used metrics. Results confirm that adopting a semantic-relevant unit of change like density is essential to define versatile and interpretable counterfactual explanation methods.
We present a new effective and scalable framework for training GNNs in supervised node classification tasks, given graph-structured data. Our approach increasingly refines the weight update operations on a sequence of path graphs obtained by linearizing random spanning trees extracted from the input network. The path graphs are designed to retain essential topological and node information of the original graph. At the same time, the sparsity of path graphs enables a much lighter GNN training which, besides scalability, helps in mitigating classical training issues, like over-squashing and over-smoothing. We carry out an extensive experimental investigation on a number of real-world graph benchmarks, where we apply our framework to graph convolutional networks, showing simultaneous improvement of both training speed and test accuracy, as compared to well-known baselines.
The Densest Subgraph Problem requires to find, in a given graph, a subset of vertices whose induced subgraph maximizes a measure of density. The problem has received a great deal of attention in the algorithmic literature over the last five decades, with many variants proposed and many applications built on top of this basic definition. Recent years have witnessed a revival of research interest on this problem with several interesting contributions, including some groundbreaking results, published in 2022 and 2023. This survey provides a deep overview of the fundamental results and an exhaustive coverage of the many variants proposed in the literature, with a special attention on the most recent results. The survey also presents a comprehensive overview of applications and discusses some interesting open problems for this evergreen research topic.
Submodular function maximization is central in numerous data science applications, including data summarization, influence maximization, and recommendation. In many of these problems, our goal is to find a solution that maximizes the \emph{average} of the utilities for all users, each measured by a monotone submodular function. When the population of users is composed of several demographic groups, another critical problem is whether the utility is fairly distributed across groups. In the context of submodular optimization, we seek to improve the welfare of the \emph{least well-off} group, i.e., to maximize the minimum utility for any group, to ensure fairness. Although the \emph{utility} and \emph{fairness} objectives are both desirable, they might contradict each other, and, to our knowledge, little attention has been paid to optimizing them jointly. In this paper, we propose a novel problem called \emph{Bicriteria Submodular Maximization} (BSM) to strike a balance between utility and fairness. Specifically, it requires finding a fixed-size solution to maximize the utility function, subject to the value of the fairness function not being below a threshold. Since BSM is inapproximable within any constant factor in general, we propose efficient data-dependent approximation algorithms for BSM by converting it into other submodular optimization problems and utilizing existing algorithms for the converted problems to obtain solutions to BSM. Using real-world and synthetic datasets, we showcase applications of our framework in three submodular maximization problems, namely maximum coverage, influence maximization, and facility location.