Abstract:Hypergraphs are a representation of complex systems involving interactions among more than two entities and allow to investigation of higher-order structure and dynamics in real-world complex systems. Community structure is a common property observed in empirical networks in various domains. Stochastic block models have been employed to investigate community structure in networks. Node attribute data, often accompanying network data, has been found to potentially enhance the learning of community structure in dyadic networks. In this study, we develop a statistical framework that incorporates node attribute data into the learning of community structure in a hypergraph, employing a stochastic block model. We demonstrate that our model, which we refer to as HyperNEO, enhances the learning of community structure in synthetic and empirical hypergraphs when node attributes are sufficiently associated with the communities. Furthermore, we found that applying a dimensionality reduction method, UMAP, to the learned representations obtained using stochastic block models, including our model, maps nodes into a two-dimensional vector space while largely preserving community structure in empirical hypergraphs. We expect that our framework will broaden the investigation and understanding of higher-order community structure in real-world complex systems.




Abstract:Neighborhood graphs are gaining popularity as a concise data representation in machine learning. However, naive graph construction by pairwise distance calculation takes $O(n^2)$ runtime for $n$ data points and this is prohibitively slow for millions of data points. For strings of equal length, the multiple sorting method (Uno, 2008) can construct an $\epsilon$-neighbor graph in $O(n+m)$ time, where $m$ is the number of $\epsilon$-neighbor pairs in the data. To introduce this remarkably efficient algorithm to continuous domains such as images, signals and texts, we employ a random projection method to convert vectors to strings. Theoretical results are presented to elucidate the trade-off between approximation quality and computation time. Empirical results show the efficiency of our method in comparison to fast nearest neighbor alternatives.