The rapid and continuous growth of data has increased the need for scalable mining algorithms in unsupervised learning and knowledge discovery. In this paper, we focus on Sequential Pattern Mining (SPM), a fundamental topic in knowledge discovery that faces a well-known memory bottleneck. We examine generic dataset modeling techniques and show how they can be used to improve the time and memory usage of SPM algorithms. In particular, we develop trie-based dataset models and associated mining algorithms that can represent and effectively mine datasets that are orders of magnitude larger than those handled by the state of the art. Numerical results on large real-life test instances show that our algorithms are also faster and more memory efficient in practice.
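To illustrate why a trie can reduce the memory footprint of a sequence database, the following minimal sketch stores sequences so that shared prefixes are represented once, and counts pattern support directly on the trie. It is an illustration under simplifying assumptions, not the dataset model developed in the paper: each event is a single item rather than an itemset, and the names `TrieNode`, `build_trie`, `num_nodes`, and `support` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps an item to the child node reached by appending that item.
    children: dict = field(default_factory=dict)
    # Number of database sequences whose path passes through this node.
    count: int = 0


def build_trie(sequences):
    """Insert every sequence; sequences sharing a prefix share nodes."""
    root = TrieNode()
    for seq in sequences:
        node = root
        node.count += 1
        for item in seq:
            node = node.children.setdefault(item, TrieNode())
            node.count += 1
    return root


def num_nodes(node):
    """Total number of nodes in the trie, including the root."""
    return 1 + sum(num_nodes(child) for child in node.children.values())


def support(node, pattern, i=0):
    """Count sequences that contain `pattern` as a subsequence.

    Each sequence follows exactly one root-to-node path, so matching the
    pattern greedily along every path counts each sequence at most once.
    """
    if i == len(pattern):
        # Every sequence passing through this node contains the pattern.
        return node.count
    total = 0
    for item, child in node.children.items():
        total += support(child, pattern, i + 1 if item == pattern[i] else i)
    return total


# Small hypothetical database of item sequences.
sequences = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"], ["b", "c"]]
trie = build_trie(sequences)
total_items = sum(len(s) for s in sequences)
```

In this toy database the four sequences hold ten items in total, while the trie needs eight nodes including the root. The saving grows with the amount of prefix sharing, which depends on the data and on how the sequences are ordered.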
Constrained sequential pattern mining aims to identify frequent patterns in a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential pattern mining that rely on a multi-valued decision diagram (MDD) representation of the database. Specifically, our representation can accommodate multiple item attributes and various constraint types, including a number of non-monotone constraints. To evaluate the applicability of our approach, we develop an MDD-based prefix-projection algorithm and compare its performance against a typical generate-and-check variant, as well as a state-of-the-art constraint-based sequential pattern mining algorithm. Results show that our approach is competitive with or superior to these other methods in terms of scalability and efficiency.
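As a concrete picture of constrained prefix projection, the sketch below mines frequent patterns from sequences of (item, time) events while enforcing a maximum time gap between consecutive pattern items. It works on a plain list-based database and does not implement the MDD representation described above. The gap constraint, the time attribute, and the names `mine`, `min_support`, and `max_gap` are assumptions chosen for illustration, and event times are assumed to be non-decreasing within each sequence.

```python
from collections import defaultdict


def mine(db, min_support, max_gap):
    """Return {pattern: support} for patterns satisfying the gap constraint.

    db: list of sequences, each a list of (item, time) events sorted by time.
    """
    results = {}

    def extend(pattern, projected):
        # projected maps a sequence id to the set of positions at which a
        # gap-feasible embedding of `pattern` can end in that sequence.
        candidates = defaultdict(lambda: defaultdict(set))
        for sid, ends in projected.items():
            seq = db[sid]
            for p in ends:
                for j in range(p + 1, len(seq)):
                    item, t = seq[j]
                    if t - seq[p][1] > max_gap:
                        break  # times are sorted, so later events fail too
                    candidates[item][sid].add(j)
        for item, proj in sorted(candidates.items()):
            if len(proj) >= min_support:
                new_pattern = pattern + (item,)
                results[new_pattern] = len(proj)
                extend(new_pattern, proj)

    # Seed the search with every occurrence of every single item.
    first = defaultdict(lambda: defaultdict(set))
    for sid, seq in enumerate(db):
        for j, (item, _) in enumerate(seq):
            first[item][sid].add(j)
    for item, proj in sorted(first.items()):
        if len(proj) >= min_support:
            results[(item,)] = len(proj)
            extend((item,), proj)
    return results


# Hypothetical database: three sequences of (item, time) events.
db = [
    [("a", 1), ("b", 2), ("c", 10)],
    [("a", 1), ("c", 3), ("b", 4)],
    [("a", 5), ("b", 6), ("c", 7)],
]
constrained = mine(db, min_support=2, max_gap=3)
unconstrained = mine(db, min_support=2, max_gap=100)
```

The sketch keeps every feasible end position instead of only the first occurrence, because under a gap constraint a later occurrence of the last pattern item may admit extensions that the earliest one does not. With `max_gap=3` the pattern `("a", "b", "c")` is pruned, since only the third sequence embeds it within the gap, whereas the relaxed run with `max_gap=100` reports it with support 2.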