Real-life events, behaviors and interactions produce sequential data. An important but rarely explored problem is to analyze those nonoccurring (also called negative) yet important sequences, forming negative sequence analysis (NSA). A typical NSA task is to discover negative sequential patterns (NSPs) consisting of important non-occurring and occurring elements and patterns. The limited existing work on NSP mining relies on frequentist and downward closure property-based pattern selection, producing large and highly redundant sets of NSPs that are nonactionable for business decision-making. This work makes the first attempt at actionable NSP discovery. It builds an NSP graph representation, quantifies both explicit occurrence-based and implicit non-occurrence-based element and pattern relations, and then discovers significant, diverse and informative NSPs in the NSP graph that represent the entire NSP set, enabling actionable NSP discovery. The resulting DPP-based NSP representation and actionable NSP discovery method EINSP makes novel and significant contributions to NSA and sequence analysis: (1) it represents NSPs by a determinantal point process (DPP) based graph; (2) it quantifies actionable NSPs in terms of their statistical significance, diversity, and strength of explicit/implicit element/pattern relations; and (3) it models and measures both explicit and implicit element/pattern relations in the DPP-based NSP graph to represent direct and indirect couplings between NSP items, elements and patterns. We extensively analyze the effectiveness of EINSP in terms of various theoretical and empirical aspects, including complexity, item/pattern coverage, pattern size and diversity, implicit pattern relation strength, and data factors.
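To make the DPP-based selection idea concrete, here is a minimal, hypothetical sketch (not the EINSP algorithm itself): it builds a toy DPP kernel that multiplies per-pattern quality by pairwise similarity, then greedily picks a diverse subset by maximizing the log-determinant, the standard greedy MAP approximation for DPPs. All names and numbers below are illustrative assumptions.

```python
import numpy as np

def greedy_dpp_map(L, k):
    """Greedy MAP for a DPP: pick k items approximately maximizing
    log det(L[S, S]), trading off quality against redundancy."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(S, S)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        selected.append(best)
    return selected

# Toy kernel over 5 hypothetical patterns: quality x pairwise similarity.
rng = np.random.default_rng(0)
quality = rng.uniform(0.5, 1.5, size=5)
feats = rng.normal(size=(5, 3))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
L = np.outer(quality, quality) * (feats @ feats.T)  # PSD DPP kernel
chosen = greedy_dpp_map(L, 3)
print(chosen)  # indices of a significant-yet-diverse subset
```

Diagonal entries of the kernel reward individually significant patterns, while off-diagonal similarity penalizes redundant selections, which is why DPPs fit the significance-plus-diversity objective described above.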
The distributions of real-life data streams are usually nonstationary, and one interesting setting is that a stream can be decomposed into several offline intervals with a fixed time horizon but different distributions, plus an out-of-distribution online interval. We call such data multi-distributional data streams. On such streams, learning an on-the-fly expert that generalizes well to unseen samples is demanding yet highly challenging owing to their multi-distributional streaming nature, particularly when only limited data is initially available for the online interval. To address these challenges, this work introduces a novel optimization method named coupling online-offline learning (CO$_2$) with theoretical guarantees on knowledge transfer, regret, and generalization error. CO$_2$ extracts knowledge by training an offline expert for each offline interval and updates an online expert with an off-the-shelf online optimization method in the online interval. CO$_2$ outputs a hypothesis for each sample by adaptively coupling the offline experts and the underlying online expert through an expert-tracking strategy so as to adapt to the dynamic environment. To study the generalization performance of the output hypothesis, we propose a general theory that analyzes its excess risk bound in relation to the loss function properties, the hypothesis class, the data distribution, and the regret.
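The expert-tracking coupling can be illustrated with a minimal exponential-weights (Hedge-style) sketch; this is an assumed stand-in for CO$_2$'s actual strategy, not the paper's algorithm. Experts that accumulate smaller loss receive larger weights in the combined hypothesis.

```python
import numpy as np

def hedge_coupling(expert_losses, eta=0.5):
    """Exponentially weighted expert coupling (Hedge).
    expert_losses: (T, K) array of per-round losses for K experts.
    Returns the (T, K) weights used at each round."""
    T, K = expert_losses.shape
    w = np.ones(K) / K
    history = []
    for t in range(T):
        history.append(w.copy())            # weights used for round t
        w = w * np.exp(-eta * expert_losses[t])
        w /= w.sum()                        # multiplicative update
    return np.array(history)

# Toy run: expert 1 (say, an "offline" expert) is consistently better.
losses = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
W = hedge_coupling(losses)
print(W[-1])  # weight has shifted toward the better expert
```

In a CO$_2$-like setting, each column would correspond to one offline expert or the running online expert, and the per-sample hypothesis would be the weight-averaged prediction.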
Enterprise data typically involves multiple heterogeneous data sources and external data that respectively record business activities, transactions, customer demographics, status, behaviors, interactions and communications with the enterprise, and the consumption of and feedback on its products, services, production, marketing, operations, management, etc. A critical challenge in enterprise data science is to enable effective whole-of-enterprise data understanding and data-driven discovery and decision-making on all-round enterprise DNA. We introduce a neural encoder Table2Vec for automated universal representation learning of entities such as customers from all-round enterprise DNA, with automated data characteristics analysis and data quality augmentation. The learned universal representations serve as representative and benchmarkable enterprise data genomes and can be used for enterprise-wide and domain-specific learning tasks. Table2Vec integrates automated universal representation learning on low-quality enterprise data with downstream learning tasks. We illustrate Table2Vec in characterizing all-round customer data DNA in an enterprise on complex heterogeneous multi-relational big tables to build universal customer vector representations. The learned universal representation of each customer is all-round, representative and benchmarkable, supporting both enterprise-wide and domain-specific learning goals and tasks in enterprise data science. Table2Vec significantly outperforms the existing shallow, boosting and deep learning methods typically used for enterprise analytics. We further discuss the research opportunities, directions and applications of automated universal enterprise representation learning and of the learned enterprise data DNA for automated, all-purpose, whole-of-enterprise and ethical machine learning and data science.
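As an illustrative sketch only (not the Table2Vec encoder), one can picture a universal customer representation as an aggregation of learned embeddings of that customer's rows across heterogeneous tables. The table names, embedding matrices and mean-pooling below are hypothetical.

```python
import numpy as np

def entity_representation(table_rows, embed_tables):
    """Pool a customer's category embeddings within each table, then
    concatenate across tables into one universal vector (toy sketch)."""
    parts = []
    for name, rows in table_rows.items():
        E = embed_tables[name]          # (n_categories, d) embedding table
        parts.append(E[rows].mean(axis=0))
    return np.concatenate(parts)

# Hypothetical tables: transactions ("tx") and demographics ("demo").
rng = np.random.default_rng(1)
embed = {"tx": rng.normal(size=(10, 4)), "demo": rng.normal(size=(6, 4))}
rows = {"tx": np.array([0, 3, 7]), "demo": np.array([2])}
v = entity_representation(rows, embed)
print(v.shape)  # (8,): one concatenated customer vector
```

In a real system the embedding matrices would be learned end-to-end with the downstream tasks, which is the integration Table2Vec targets.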
The novel coronavirus disease 2019 (COVID-19) presents unique and unknown problem complexities and modeling challenges, where an imperative task is to model both its process and data uncertainties, represented in a high proportion of implicit, undocumented infections, asymptomatic contagion, social reinforcement of infections, and various quality issues in the reported data. These uncertainties become even more pronounced in the overwhelming mutation-dominated resurgences with vaccinated but still susceptible populations. Here we introduce a novel hybrid approach to (1) characterizing and distinguishing Undocumented (U) and Documented (D) infections commonly seen during COVID-19 incubation periods and asymptomatic infections by expanding the foundational compartmental epidemic Susceptible-Infected-Recovered (SIR) model with two compartments, resulting in a new Susceptible-Undocumented infected-Documented infected-Recovered (SUDR) model; (2) characterizing the probabilistic density of infections by empowering SUDR to capture exogenous processes such as clustering contagion interactions, superspreading and social reinforcement; and (3) approximating the density likelihood of COVID-19 prevalence over time by incorporating Bayesian inference into SUDR. Different from existing COVID-19 models, SUDR characterizes the undocumented infections during unknown transmission processes. To capture the uncertainties of temporal transmission and social reinforcement during the COVID-19 contagion, the transmission rate is modeled by a time-varying density function of undocumented infectious cases. We solve the model by sampling from the mean-field posterior distribution with reasonable priors, making SUDR suitable for handling the randomness, noise and sparsity of COVID-19 observations widely seen in the public COVID-19 case data.
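The compartmental expansion can be sketched as a toy discrete-time SUDR update in which infection is driven by the undocumented cases. The constant rates and Euler stepping below are illustrative assumptions; the paper's transmission rate is a time-varying density inferred by Bayesian sampling.

```python
def sudr_step(S, U, D, R, beta, alpha, gu, gd, N, dt=1.0):
    """One Euler step of a toy SUDR model: S->U infection driven by
    undocumented cases U; U->D documentation; U->R and D->R recovery."""
    new_inf = beta * S * U / N     # infections caused by undocumented cases
    doc = alpha * U                # undocumented cases becoming documented
    rec_u, rec_d = gu * U, gd * D  # recoveries from U and D
    return (S - dt * new_inf,
            U + dt * (new_inf - doc - rec_u),
            D + dt * (doc - rec_d),
            R + dt * (rec_u + rec_d))

# Illustrative run with assumed rates over 25 time units.
S, U, D, R = 990.0, 10.0, 0.0, 0.0
for _ in range(50):
    S, U, D, R = sudr_step(S, U, D, R, 0.4, 0.1, 0.05, 0.1, 1000.0, dt=0.5)
print(round(S + U + D + R, 6))  # population is conserved: 1000.0
```

Every outflow term reappears as an inflow elsewhere, so the total population is conserved exactly, a basic sanity check for any compartmental extension of SIR.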
An explicit discriminator trained on observable in-distribution (ID) samples can make high-confidence predictions on out-of-distribution (OOD) samples due to its distributional vulnerability. This is primarily caused by the limited ID samples observable for training discriminators when OOD samples are unavailable. To address this issue, state-of-the-art methods train the discriminator with OOD samples generated under general assumptions, without considering the data and network characteristics. However, different network architectures and training ID datasets may cause diverse vulnerabilities, and the generated OOD samples thus usually fail to address the specific distributional vulnerability of the explicit discriminator. To reveal and patch the distributional vulnerabilities, we propose a novel method of \textit{fine-tuning explicit discriminators by implicit generators} (FIG). Based on Shannon entropy, an explicit discriminator can construct its corresponding implicit generator to generate specific OOD samples without extra training costs. A Langevin dynamics sampler then draws high-quality OOD samples from the generator to reveal the vulnerability. Finally, a regularizer, constructed according to the design principle of the implicit generator, patches the distributional vulnerability by encouraging high entropy on those generated OOD samples. Our experiments on four networks, four ID datasets and seven OOD datasets demonstrate that FIG achieves state-of-the-art OOD detection performance while maintaining competitive classification capability.
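A minimal sketch of Langevin-dynamics sampling from an energy built on a discriminator's logits follows. The toy linear "discriminator" and the quadratic term (added to keep the energy proper) are assumptions for illustration, not FIG's actual construction.

```python
import numpy as np

def energy_grad(x, W):
    """Gradient of E(x) = ||x||^2/2 - logsumexp(W x): a toy energy whose
    low-E regions are where the linear 'discriminator' W is confident."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # softmax over classes
    return x - W.T @ p

def langevin_sample(W, steps=200, eps=0.05, seed=0):
    """Unadjusted Langevin dynamics:
    x <- x - (eps/2) * grad E(x) + sqrt(eps) * noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=W.shape[1])
    for _ in range(steps):
        x = (x - 0.5 * eps * energy_grad(x, W)
             + np.sqrt(eps) * rng.normal(size=x.shape))
    return x

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # toy 3-class logits
x = langevin_sample(W)
print(x.shape)  # a 2-d sample drawn from exp(-E)
```

Samples drawn this way concentrate where the discriminator is overconfident, which is exactly the kind of vulnerability-revealing OOD sample the abstract describes.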
Automated next-best action recommendation for each customer in a sequential, dynamic and interactive context is widely needed in natural, social and business decision-making. Personalized next-best action recommendation must involve past, current and future customer demographics and circumstances (states) and behaviors, long-range sequential interactions between customers and decision-makers, multi-sequence interactions between states, behaviors and actions, and their reactions to their counterpart's actions. No existing modeling theories or tools, including Markovian decision processes, user and behavior modeling, deep sequential modeling, and personalized sequential recommendation, can quantify such complex decision-making at a personal level. We take a data-driven approach and learn next-best actions for personalized decision-making with a reinforced coupled recurrent neural network (CRN). CRN represents multiple coupled dynamic sequences of a customer's historical and current states, responses to decision-makers' actions, and decision rewards for actions, and learns long-term multi-sequence interactions between the parties (customer and decision-maker). Next-best actions are then recommended for each customer at a given time point to change their state toward an optimal decision-making objective. Our study demonstrates the potential of personalized deep learning of multi-sequence interactions and automated dynamic intervention for personalized decision-making in complex systems.
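A single step of two coupled recurrent states, one per party, can be sketched as follows. The tanh cells, weight shapes and cross-terms are illustrative assumptions, not the CRN architecture itself; the point is that each party's state update conditions on the other's previous state.

```python
import numpy as np

def coupled_rnn_step(h_c, h_d, x_c, x_d, params):
    """One step of two coupled recurrent states: the customer state h_c and
    decision-maker state h_d each condition on the other's previous state."""
    Wc, Uc, Cc, Wd, Ud, Cd = params
    h_c_new = np.tanh(Wc @ x_c + Uc @ h_c + Cc @ h_d)  # customer update
    h_d_new = np.tanh(Wd @ x_d + Ud @ h_d + Cd @ h_c)  # decision-maker update
    return h_c_new, h_d_new

# Toy dimensions and random weights.
rng = np.random.default_rng(0)
d = 4
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
h_c, h_d = np.zeros(d), np.zeros(d)
h_c, h_d = coupled_rnn_step(h_c, h_d,
                            rng.normal(size=d), rng.normal(size=d), params)
print(h_c.shape, h_d.shape)  # (4,) (4,)
```

The cross-coupling matrices (here `Cc` and `Cd`) are what let the model capture the customer-decision-maker interactions described above; dropping them recovers two independent RNNs.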
The prediction of express delivery sequences, i.e., modeling and estimating the volumes of daily incoming and outgoing parcels for delivery, is critical for online business, logistics, and positive customer experience, and specifically for resource allocation optimization and promotional activity arrangement. A precise estimate of consumer delivery requests has to involve sequential factors such as shopping behaviors, weather conditions, events, business campaigns, and their couplings. Moreover, conventional sequence prediction assumes a stable sequence evolution and fails to address complex nonlinear sequences and the various feature effects in the above multi-source data. Although deep networks and attention mechanisms demonstrate the potential of complex sequence modeling, extant networks ignore the heterogeneity of and coupling between features and sequences, resulting in weak prediction accuracy. To address these issues, we propose DeepExpress, a deep-learning based express delivery sequence prediction model, which extends the classic seq2seq framework to learn the complex coupling between sequences and features. DeepExpress leverages express delivery seq2seq learning, a carefully designed heterogeneous feature representation, and a novel joint-training attention mechanism to adaptively map heterogeneous data and capture sequence-feature coupling for precise estimation. Experimental results on real-world data demonstrate that the proposed method outperforms both shallow and deep baseline models.
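The kind of sequence-feature coupling described above can be illustrated with a toy dot-product attention over heterogeneous feature vectors, conditioned on a sequence hidden state. This is a generic attention sketch, not DeepExpress's joint-training mechanism; all values are hypothetical.

```python
import numpy as np

def feature_attention(seq_hidden, feat_vecs):
    """Toy dot-product attention: score each heterogeneous feature vector
    (weather, campaign, event, ...) against the current sequence hidden
    state and return the attention-weighted context."""
    scores = feat_vecs @ seq_hidden
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax attention weights
    return w @ feat_vecs, w

h = np.array([1.0, 0.0])                             # sequence hidden state
F = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]])   # 3 toy feature vectors
ctx, w = feature_attention(h, F)
print(w.argmax())  # 0: the feature most aligned with the sequence state
```

The context vector `ctx` would then be fed into the seq2seq decoder, letting each prediction step attend to the most relevant exogenous features.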
AI in finance broadly refers to the application of AI techniques in financial businesses. The area has evolved over decades, with both classic and modern AI techniques applied to increasingly broad areas of finance, economy and society. In contrast to either discussing the problems, aspects and opportunities of finance that have benefited from specific AI techniques, in particular some new-generation AI and data science (AIDS) areas, or reviewing the progress of applying specific techniques to resolving certain financial problems, this review offers a comprehensive and dense roadmap of the overwhelming challenges, techniques and opportunities of AI research in finance over the past decades. The landscapes and challenges of financial businesses and data are first outlined, followed by a comprehensive categorization and a dense overview of the decades of AI research in finance. We then structure and illustrate the data-driven analytics and learning of financial businesses and data, and compare, critique and discuss classic vs. modern AI techniques for finance. Lastly, open issues and opportunities address future AI-empowered finance and finance-motivated AI research.
Abundant sequential documents such as online archives, social media and news feeds are updated in a streaming manner, where each chunk of documents carries smoothly evolving yet dependent topics. Such digital texts have attracted extensive research on dynamic topic modeling to infer hidden evolving topics and their temporal dependencies. However, most existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. In addition, these approaches incur an intractable inference problem when inferring latent parameters, resulting in high computational cost and performance degradation. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming multi-topic-thread evolution. Our method models the dependencies between evolving topics and thoroughly encodes their complex multi-couplings across time steps. To conquer the intractable inference challenge, a new solution with a set of novel data augmentation techniques is proposed, which successfully decomposes the multi-couplings between evolving topics. A fully conjugate model is thus obtained to guarantee the effectiveness and efficiency of the inference technique. A novel Gibbs sampler with a backward-forward filter algorithm efficiently learns the latent time-evolving parameters in closed form. In addition, the latent Indian Buffet Process (IBP) compound distribution is exploited to automatically infer the overall topic number and customize the sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against competitive baselines, demonstrating its superiority in terms of low per-word perplexity, highly coherent topics, and better document time prediction.
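The multi-topic-thread assumption, that each current topic is a coupling-weighted mixture of all prior topics, can be sketched in one line of linear algebra. The matrices below are toy stand-ins for the model's inferred topic-word distributions and coupling weights.

```python
import numpy as np

def evolve_topics(prior_topics, coupling):
    """Multi-topic-thread evolution sketch: each current topic is a
    coupling-weighted mixture of ALL prior topic-word distributions.
    prior_topics: (K, V), rows sum to 1; coupling: (K, K), rows sum to 1."""
    return coupling @ prior_topics

K, V = 3, 5
prior = np.full((K, V), 1.0 / V)                 # uniform toy topics
C = 0.5 * np.eye(K) + np.full((K, K), 0.5 / K)   # self-weight + spillover
current = evolve_topics(prior, C)
print(np.allclose(current.sum(axis=1), 1.0))  # True: still distributions
```

A single-topic-thread model corresponds to `C` being the identity; the nonzero off-diagonal weights are what encode coupling with multiple relevant prior topics.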
Recent years have witnessed the fast development of the emerging topic of Graph Learning based Recommender Systems (GLRS). GLRS employ advanced graph learning approaches to model users' preferences and intentions as well as items' characteristics for recommendation. Different from other RS approaches, including content-based filtering and collaborative filtering, GLRS are built on graphs where the important objects, e.g., users, items, and attributes, are either explicitly or implicitly connected. With the rapid development of graph learning techniques, exploring and exploiting homogeneous or heterogeneous relations in graphs is a promising direction for building more effective RS. In this paper, we provide a systematic review of GLRS by discussing how they extract important knowledge from graph-based representations to improve the accuracy, reliability and explainability of recommendations. First, we characterize and formalize GLRS, then summarize and categorize the key challenges and main progress in this novel research area. Finally, we share some new research directions in this vibrant area.