In this paper, we investigate the problem of deciding whether two standard normal random vectors $\mathsf{X}\in\mathbb{R}^{n}$ and $\mathsf{Y}\in\mathbb{R}^{n}$ are correlated or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these vectors are statistically independent, while under the alternative, $\mathsf{X}$ and a uniformly random permutation of $\mathsf{Y}$ are correlated with correlation $\rho$. We analyze the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$ and $\rho$. To derive our information-theoretic lower bounds, we develop a novel technique for evaluating the second moment of the likelihood ratio using an orthogonal polynomial expansion, which, among other things, reveals a surprising connection to integer partition functions. We also study a multi-dimensional generalization of the above setting, where rather than two vectors we observe two databases/matrices, and furthermore allow for partial correlations between the two.
In this paper, we investigate the problem of deciding whether two random databases $\mathsf{X}\in\mathcal{X}^{n\times d}$ and $\mathsf{Y}\in\mathcal{Y}^{n\times d}$ are statistically dependent or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these two databases are statistically independent, while under the alternative, there exists an unknown row permutation $\sigma$, such that $\mathsf{X}$ and $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$, are statistically dependent with some known joint distribution, but have the same marginal distributions as under the null. We characterize the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$, $d$, and some spectral properties of the generative distributions of the datasets. For example, we prove that if a certain function of $d$ and the eigenvalues of the likelihood function is below a certain threshold, as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, regardless of the value of $n$. This mimics the performance of an efficient test that thresholds a centered version of the log-likelihood function of the observed matrices. We also analyze the case where $d$ is fixed, for which we derive strong (vanishing error) and weak detection lower and upper bounds.
Modern social media platforms play an important role in facilitating rapid dissemination of information through their massive user networks. Fake news, misinformation, and unverifiable facts on social media platforms propagate disharmony and affect society. In this paper, we consider the problem of online auditing of information flow/propagation with the goal of classifying news items as fake or genuine. Specifically, driven by experiential studies on real-world social media platforms, we propose a probabilistic Markovian information spread model over networks modeled by graphs. We then formulate our inference task as a sequential detection problem with the goal of minimizing a combination of the error probability and the time it takes to reach a correct decision. For this model, we find the optimal detection algorithm minimizing the aforementioned risk and prove several statistical guarantees. We then test our algorithm on real-world datasets. To that end, we first construct an offline algorithm for learning the probabilistic information spreading model, and then apply our optimal detection algorithm. Experimental results show that our algorithm outperforms state-of-the-art misinformation detection algorithms in terms of accuracy and detection time.
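The trade-off between error probability and decision time can be illustrated with a classical stand-in. The sketch below is Wald's sequential probability ratio test (SPRT) on binary observations, not the optimal algorithm derived in the paper. The Bernoulli observation model, the parameters `p0` and `p1`, and the function name `sprt` are all assumptions made for illustration.

```python
import math


def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for Bernoulli(p0) (H0) versus Bernoulli(p1) (H1).

    alpha and beta are the target false-alarm and miss probabilities.
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1.0 - alpha))   # cross below: accept H0
    llr = 0.0                                # running log-likelihood ratio
    for t, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "H1", t
        if llr <= lower:
            return "H0", t
    return "undecided", len(observations)
```

Tightening `alpha` and `beta` widens the two thresholds, so fewer errors are bought with a longer expected stopping time; this is the kind of combined risk the abstract refers to.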
Sparse arrays enable resolving more directions of arrival (DoAs) than antenna elements using non-uniform arrays. This is typically achieved by reconstructing the covariance of a large virtual uniform linear array (ULA), which is then processed by subspace DoA estimators. However, these methods assume that the signals are non-coherent and the array is calibrated; the latter is often challenging to achieve in sparse arrays, where one cannot access the virtual array elements. In this work, we propose Sparse-SubspaceNet, which leverages deep learning to enable subspace-based DoA recovery from sparse miscalibrated arrays with coherent sources. Sparse-SubspaceNet utilizes a dedicated deep network to learn from data how to compute a surrogate virtual array covariance that is divisible into distinguishable subspaces. By doing so, we learn to cope with coherent sources and miscalibrated sparse arrays, while preserving the interpretability and the suitability of model-based subspace DoA estimators.
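The idea of a virtual ULA can be made concrete through the difference co-array: the set of pairwise spacings between physical sensors. The sketch below is a minimal illustration of that standard concept and is not part of Sparse-SubspaceNet. The sensor positions (in units of half a wavelength) and the function names are hypothetical.

```python
def difference_coarray(positions):
    """Return the sorted non-negative pairwise differences of sensor positions."""
    return sorted({abs(a - b) for a in positions for b in positions})


def contiguous_virtual_ula(positions):
    """Number of consecutive lags 0, 1, 2, ... present in the difference co-array."""
    lags = set(difference_coarray(positions))
    length = 0
    while length in lags:
        length += 1
    return length
```

For example, the four-element array at positions `[0, 1, 4, 6]` produces every lag from 0 to 6, so `contiguous_virtual_ula([0, 1, 4, 6])` returns 7: four physical sensors span a hole-free seven-element virtual ULA.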
We consider the task of detecting a hidden bipartite subgraph in a given random graph. Specifically, under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph over $n$ vertices with edge density $q$. Under the alternative, there exists a planted $k_{\mathsf{R}} \times k_{\mathsf{L}}$ bipartite subgraph with edge density $p>q$. We derive asymptotically tight upper and lower bounds for this detection problem in both the dense regime, where $q,p = \Theta\left(1\right)$, and the sparse regime where $q,p = \Theta\left(n^{-\alpha}\right), \alpha \in \left(0,2\right]$. Moreover, we consider a variant of the above problem, where one can only observe a relatively small part of the graph, by using at most $\mathsf{Q}$ edge queries. For this problem, we derive upper and lower bounds in both the dense and sparse regimes.
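The two hypotheses can be sketched as a sampler. This is a minimal illustration under stated assumptions: the planted vertices are chosen uniformly at random, and edges inside each side of the bipartite structure keep the background density $q$ (the abstract does not specify either point). The function name `sample_graph` and its parameters are hypothetical.

```python
import itertools
import random


def sample_graph(n, q, k_r=0, k_l=0, p=None, rng=None):
    """Sample an undirected graph on n vertices as a set of edges (u, v), u < v.

    With k_r = k_l = 0 this is the null model G(n, q). Otherwise a random
    k_r x k_l bipartite structure is planted: pairs crossing between the two
    sides appear with probability p, all other pairs with probability q.
    """
    rng = rng or random.Random()
    chosen = rng.sample(range(n), k_r + k_l)
    right, left = set(chosen[:k_r]), set(chosen[k_r:])
    edges = set()
    for u, v in itertools.combinations(range(n), 2):
        crosses = (u in right and v in left) or (u in left and v in right)
        prob = p if crosses else q
        if rng.random() < prob:
            edges.add((u, v))
    return edges, right, left
```

Calling `sample_graph(n, q)` draws from the null, while `sample_graph(n, q, k_r, k_l, p)` draws from the alternative and also returns the planted vertex sets for inspection.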
We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}\in\mathbb{R}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation $\sigma$ over the set of $n$ users (or, row permutation), such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2d\to0$, as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespective of the value of $n$. This complements the performance of a simple test that thresholds the sum of all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2d\to0$. These results close significant gaps in recent related studies.
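The simple sum test can be sketched as follows. The statistic is read here as the sum of inner products over all pairs of rows, $\sum_{i,j}\langle \mathsf{X}_i,\mathsf{Y}_j\rangle$, which is an assumption about the intended statistic, chosen because it does not depend on the hidden permutation $\sigma$. The sampler, the threshold used in the usage note, and all function names are hypothetical.

```python
import math
import random


def sample_databases(n, d, rho, correlated, rng):
    """Draw (X, Y), each a list of n rows with d standard normal features."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    if correlated:
        c = math.sqrt(1.0 - rho * rho)
        # Each entry is rho-correlated with the matching entry of X.
        Y = [[rho * x + c * rng.gauss(0.0, 1.0) for x in row] for row in X]
        rng.shuffle(Y)  # hide the row alignment behind a uniform permutation
    else:
        Y = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return X, Y


def sum_statistic(X, Y):
    """Sum over all row pairs (i, j) of <X_i, Y_j>.

    Equals the inner product of the column sums of X and Y, so it is
    unchanged by any reordering of the rows of Y.
    """
    d = len(X[0])
    col_x = [sum(row[k] for row in X) for k in range(d)]
    col_y = [sum(row[k] for row in Y) for k in range(d)]
    return sum(a * b for a, b in zip(col_x, col_y))
```

Under the null the statistic has mean $0$ and standard deviation $n\sqrt{d}$, while under the alternative its mean is $n\rho d$, so the separation scales like $\rho\sqrt{d}$. A threshold such as $n\rho d/2$ (one illustrative choice) therefore separates the hypotheses once $\rho^2 d$ is large, in line with the $\rho^2d\to0$ impossibility regime stated above.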
We propose an efficient method to learn both unstructured and structured sparse neural networks during training, using a novel generalization of the sparse envelope function (SEF) used as a regularizer, termed {\itshape{group sparse envelope function}} (GSEF). The GSEF acts as a neuron group selector, which we leverage to induce structured pruning. Our method yields a hardware-friendly structured sparsity pattern for a deep neural network (DNN), which efficiently accelerates the DNN's evaluation. This method is flexible in the sense that it allows any hardware to dictate the definition of a group, such as a filter, channel, filter shape, layer depth, a single parameter (unstructured), etc. By the nature of the GSEF, the proposed method is the first to allow a pre-defined sparsity level to be reached at training convergence, while maintaining negligible network accuracy degradation. We propose an efficient method to calculate the exact value of the GSEF along with its proximal operator, in a worst-case complexity of $O(n)$, where $n$ is the total number of group variables. In addition, we propose a proximal-gradient-based optimization method to train the model, that is, the non-convex minimization of the sum of the neural network loss and the GSEF. Finally, we conduct an experiment and illustrate the efficiency of our proposed technique in terms of the compression ratio, accuracy, and inference latency.
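The structure of a proximal-gradient update with a group-sparsity regularizer can be sketched as below. The proximal operator of the GSEF is not given in the abstract, so this sketch swaps in the well-known group soft-thresholding operator (the prox of the group-lasso penalty $\tau\sum_g\|w_g\|_2$) purely as a stand-in. The function names, `step`, and `lam` are hypothetical.

```python
import math


def group_soft_threshold(groups, tau):
    """Prox of tau * sum_g ||w_g||_2, applied group by group.

    Each group's norm is shrunk by tau; groups with norm <= tau become zero,
    which is what removes whole neurons/filters at once.
    """
    out = []
    for w in groups:
        norm = math.sqrt(sum(x * x for x in w))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0.0 else 0.0
        out.append([scale * x for x in w])
    return out


def proximal_gradient_step(groups, grads, step, lam):
    """One update: gradient step on the loss, then the prox of the regularizer."""
    moved = [[x - step * g for x, g in zip(w, gw)] for w, gw in zip(groups, grads)]
    return group_soft_threshold(moved, step * lam)
```

Replacing `group_soft_threshold` with the GSEF proximal operator would give the scheme described in the abstract; the loop structure (gradient step on the loss, then a prox step) stays the same.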
Social media platforms (SMPs) leverage algorithmic filtering (AF) as a means of selecting the content that constitutes a user's feed with the aim of maximizing their rewards. Selectively choosing the content shown on the user's feed may exert a certain degree of influence, either minor or major, on the user's decision-making, compared to what it would have been under a natural/fair content selection. As we have witnessed over the past decade, algorithmic filtering can cause detrimental side effects, ranging from biasing individual decisions to shaping those of society as a whole, for example, diverting users' attention from whether to get the COVID-19 vaccine or inducing the public to choose a presidential candidate. The government's constant attempts to regulate the adverse effects of AF are often complicated by bureaucracy, legal affairs, and financial considerations. On the other hand, SMPs seek to monitor their own algorithmic activities to avoid being fined for exceeding the allowable threshold. In this paper, we mathematically formalize this framework and utilize it to construct a data-driven statistical algorithm that keeps AF from deflecting users' beliefs over time, along with sample and complexity guarantees. We show that our algorithm is robust against potential adversarial users. This state-of-the-art algorithm can be used either by authorities acting as external regulators or by SMPs for self-regulation.
The planted densest subgraph detection problem refers to the task of testing whether in a given (random) graph there is a subgraph that is unusually dense. Specifically, we observe an undirected and unweighted graph on $n$ nodes. Under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi graph with edge probability (or, density) $q$. Under the alternative, there is a subgraph on $k$ vertices with edge probability $p>q$. The statistical as well as the computational barriers of this problem are well-understood for a wide range of the edge parameters $p$ and $q$. In this paper, we consider a natural variant of the above problem, where one can only observe a small part of the graph using adaptive edge queries. For this model, we determine the number of queries necessary and sufficient for detecting the presence of the planted subgraph. Specifically, we show that any (possibly randomized) algorithm must make $\mathsf{Q} = \Omega(\frac{n^2}{k^2\chi^4(p||q)}\log^2n)$ adaptive queries (in expectation) to the adjacency matrix of the graph to detect the planted subgraph with probability more than $1/2$, where $\chi^2(p||q)$ is the chi-square distance. On the other hand, we devise a quasi-polynomial-time algorithm that detects the planted subgraph with high probability by making $\mathsf{Q} = O(\frac{n^2}{k^2\chi^4(p||q)}\log^2n)$ non-adaptive queries. We then propose a polynomial-time algorithm that detects the planted subgraph using $\mathsf{Q} = O(\frac{n^3}{k^3\chi^2(p||q)}\log^3 n)$ queries. We conjecture that in the leftover regime, where $\frac{n^2}{k^2}\ll\mathsf{Q}\ll \frac{n^3}{k^3}$, no polynomial-time algorithms exist. Our results resolve two questions posed in \cite{racz2020finding}, where the special case of adaptive detection and recovery of a planted clique was considered.
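To get a feel for the two query-complexity scalings, the sketch below evaluates them numerically. It assumes $\chi^2(p||q)$ denotes the chi-square divergence between Bernoulli($p$) and Bernoulli($q$), which equals $(p-q)^2/(q(1-q))$, reads $\chi^4$ as $(\chi^2)^2$, uses natural logarithms, and drops the constants hidden in the $\Omega(\cdot)$ and $O(\cdot)$ notation. The function names are hypothetical.

```python
import math


def chi_square(p, q):
    """Chi-square divergence between Bernoulli(p) and Bernoulli(q)."""
    return (p - q) ** 2 / (q * (1.0 - q))


def query_scaling(n, k, p, q):
    """Order-of-magnitude query counts, with all hidden constants set to 1."""
    chi2 = chi_square(p, q)
    ratio = n / k
    log_n = math.log(n)
    return {
        # n^2 / (k^2 chi^4) * log^2 n: lower bound and quasi-polynomial algorithm
        "information_theoretic": ratio ** 2 / chi2 ** 2 * log_n ** 2,
        # n^3 / (k^3 chi^2) * log^3 n: polynomial-time algorithm
        "polynomial_time": ratio ** 3 / chi2 * log_n ** 3,
    }
```

For the planted clique case ($p=1$, $q=1/2$) the divergence equals 1, so the two quantities differ by a factor of $(n/k)\log n$, which is the gap covered by the conjecture.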
We study the two inference problems of detecting and recovering an isolated community of \emph{general} structure planted in a random graph. The detection problem is formalized as a hypothesis testing problem, where under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph $\mathcal{G}(n,q)$ with edge density $q\in(0,1)$; under the alternative, there is an unknown structure $\Gamma_k$ on $k$ nodes, planted in $\mathcal{G}(n,q)$, such that it appears as an \emph{induced subgraph}. In case of a successful detection, we are concerned with the task of recovering the corresponding structure. For these problems, we investigate the fundamental limits from both the statistical and computational perspectives. Specifically, we derive lower bounds for detecting/recovering the structure $\Gamma_k$ in terms of the parameters $(n,k,q)$, as well as certain properties of $\Gamma_k$, and exhibit computationally unbounded optimal algorithms that achieve these lower bounds. We also consider the problem of testing in polynomial time. As is customary in many similar structured high-dimensional problems, our model undergoes an "easy-hard-impossible" phase transition and computational constraints can severely penalize the statistical performance. To provide evidence for this phenomenon, we show that the class of low-degree polynomial algorithms matches the statistical performance of the polynomial-time algorithms we develop.