Highly specific datasets of scientific literature are important for both research and education. However, it is difficult to build such datasets at scale. A common approach is to build these datasets reductively by applying topic modeling on an established corpus and selecting specific topics. A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert (SME) handpicking documents. This method does not scale and is prone to error as the dataset grows. Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature. Given a small initial "core" corpus of papers, we build a citation network of documents. At each step of the citation network, we generate text embeddings and visualize the embeddings through dimensionality reduction. Papers are kept in the dataset if they are "similar" to the core; otherwise, they are pruned through human-in-the-loop selection. Additional insight into the papers is gained through sub-topic modeling using SeNMFk. We demonstrate our new tool for literature review by applying it to two different fields in machine learning.
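The "similar to the core" test described above can be illustrated with a minimal sketch. It assumes that similarity is measured as cosine similarity between a candidate's embedding and the centroid of the core embeddings, and that a fixed threshold stands in for the human-in-the-loop decision. The function names, the threshold value, and the toy two-dimensional embeddings are all hypothetical and do not come from the tool itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def prune_candidates(core_embeddings, candidates, threshold=0.8):
    """Return ids of candidates whose embedding is close to the core centroid.

    `threshold` is an assumed stand-in for the human selection step.
    """
    center = centroid(core_embeddings)
    return [paper_id for paper_id, embedding in candidates.items()
            if cosine_similarity(embedding, center) >= threshold]

# Toy 2-D embeddings (hypothetical values, for illustration only).
core = [[1.0, 0.0], [0.9, 0.1]]
candidates = {"paper_a": [0.8, 0.2], "paper_b": [0.0, 1.0]}
kept = prune_candidates(core, candidates)
```

In this toy case `paper_a` points in nearly the same direction as the core and is kept, while `paper_b` is orthogonal to it and is pruned. In practice the threshold would be replaced or supplemented by an expert inspecting the reduced-dimension visualization.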
Malware is one of the most dangerous and costly cyber threats to national security and a crucial factor in modern cyber-space. However, the adoption of machine learning (ML) based solutions against malware threats has been relatively slow. Shortcomings in the existing ML approaches are likely contributing to this problem. The majority of current ML approaches ignore real-world challenges such as the detection of novel malware. In addition, proposed ML approaches are often designed either for malware/benign-ware classification or malware family classification. Here we introduce and showcase preliminary capabilities of a new method that can perform precise identification of novel malware families, while also unifying the capability for malware/benign-ware classification and malware family classification into a single framework.
As machine learning techniques become increasingly prevalent in data analysis, the threat of adversarial attacks has surged, necessitating robust defense mechanisms. Among these defenses, methods exploiting low-rank approximations for input data preprocessing and neural network (NN) parameter factorization have shown potential. Our work advances this field further by integrating the tensorization of input data with low-rank decomposition and tensorization of NN parameters to enhance adversarial defense. The proposed approach demonstrates significant defense capabilities, maintaining robust accuracy even when subjected to the strongest known auto-attacks. Evaluations against leading-edge robust performance benchmarks reveal that our results not only hold their ground against the best defensive methods available but also exceed all current defense strategies that rely on tensor factorizations. This study underscores the potential of integrating tensorization and low-rank decomposition as a robust defense against adversarial attacks in machine learning.
Matrix diagonalization is a cornerstone of numerous fields of scientific computing. Diagonalizing a matrix to solve an eigenvalue problem requires a sequential path of iterations that eventually reaches a sufficiently converged and accurate solution for all the eigenvalues and eigenvectors. This typically translates into a high computational cost. Here we demonstrate how reinforcement learning, using the AlphaZero framework, can accelerate Jacobi matrix diagonalizations by viewing the selection of the fastest path to solution as a board game. To demonstrate the viability of our approach we apply the Jacobi diagonalization algorithm to symmetric Hamiltonian matrices that appear in quantum chemistry calculations. We find that a significant acceleration can often be achieved. Our findings highlight the opportunity to use machine learning as a promising tool to improve the performance of numerical linear algebra.
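For readers unfamiliar with the Jacobi method, the following is a minimal sketch of the classical algorithm for a real symmetric matrix, written in plain Python. It is not the reinforcement learning approach itself. The `select_pivot` argument is a hypothetical hook showing where a different pivot-ordering strategy (for example a learned policy) could be substituted for the classical largest-off-diagonal rule, which is presumably the "path" the abstract refers to.

```python
import math

def largest_offdiag(a):
    """Classical pivot rule: indices (p, q), p < q, of the largest |a[p][q]|."""
    n = len(a)
    best, pivot = -1.0, (0, 1)
    for p in range(n):
        for q in range(p + 1, n):
            if abs(a[p][q]) > best:
                best, pivot = abs(a[p][q]), (p, q)
    return pivot

def jacobi_rotate(a, p, q):
    """Apply one Jacobi rotation in place so that a[p][q] and a[q][p] vanish."""
    if a[p][q] == 0.0:
        return
    # Angle chosen so the rotated (p, q) entry vanishes.
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
    c, s = math.cos(theta), math.sin(theta)
    n = len(a)
    for k in range(n):  # A <- A J (updates columns p and q)
        akp, akq = a[k][p], a[k][q]
        a[k][p] = c * akp - s * akq
        a[k][q] = s * akp + c * akq
    for k in range(n):  # A <- J^T A (updates rows p and q)
        apk, aqk = a[p][k], a[q][k]
        a[p][k] = c * apk - s * aqk
        a[q][k] = s * apk + c * aqk

def off_norm(a):
    """Frobenius norm of the off-diagonal part."""
    n = len(a)
    return math.sqrt(sum(a[i][j] ** 2
                         for i in range(n) for j in range(n) if i != j))

def jacobi_eigenvalues(matrix, select_pivot=largest_offdiag,
                       tol=1e-12, max_rotations=10_000):
    """Return (sorted eigenvalues, number of rotations used)."""
    a = [list(row) for row in matrix]  # work on a copy
    rotations = 0
    while off_norm(a) > tol and rotations < max_rotations:
        p, q = select_pivot(a)
        jacobi_rotate(a, p, q)
        rotations += 1
    return sorted(a[i][i] for i in range(len(a))), rotations

eigenvalues, n_rotations = jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]])
```

The number of rotations returned is the quantity a smarter pivot ordering would try to reduce; for the 2x2 example a single rotation suffices and the eigenvalues are 1 and 3.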
As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. In contrast to SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded on arXiv.
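The word-context matrix mentioned above can be illustrated with a short sketch that counts co-occurrences in a sliding window. The window definition used here (each token paired with the next `window` tokens, with pairs treated as unordered) is an assumption, since the abstract does not specify it; the function name and the example sentence are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Pair the current token with each of the next `window` tokens.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

tokens = "topic models extract topic structure".split()
counts = cooccurrence_counts(tokens, window=2)
```

Each key of `counts` corresponds to one symmetric entry of the word-context matrix. For a real corpus the counts would typically be stored in a sparse matrix indexed by vocabulary ids rather than in a dictionary of string pairs.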
Non-negative matrix factorization (NMF) with missing-value completion is a well-known effective Collaborative Filtering (CF) method used to provide personalized user recommendations. However, traditional CF relies on the privacy-invasive collection of users' explicit and implicit feedback to build a central recommender model. One-shot federated learning has recently emerged as a method to mitigate the privacy problem while addressing the traditional communication bottleneck of federated learning. In this paper, we present the first unsupervised one-shot federated CF implementation, named FedSPLIT, based on NMF joint factorization. In our solution, the clients first apply local CF in parallel to build distinct client-specific recommenders. Then, the privacy-preserving local item patterns and biases from each client are shared with the processor to perform joint factorization in order to extract the global item patterns. Extracted patterns are then aggregated to each client to build the local models via knowledge distillation. In our experiments, we demonstrate the feasibility of our approach with standard recommendation datasets. FedSPLIT can obtain results similar to the state of the art (and even outperform it in certain situations) with a substantial decrease in the number of communications.
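As background for the first sentence above, the following is a minimal single-machine sketch of NMF with missing-value completion. It uses the standard multiplicative updates restricted to observed entries through a 0/1 mask. It is not FedSPLIT and involves no federation or joint factorization. The function names, the toy ratings matrix, and the parameter defaults are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def hadamard(a, b):
    """Element-wise product; used to zero out unobserved entries."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def multiply_ratio(factor, num, den, eps):
    """Multiplicative update: factor * num / (den + eps), element-wise."""
    return [[f * n / (d + eps) for f, n, d in zip(fr, nr, dr)]
            for fr, nr, dr in zip(factor, num, den)]

def masked_nmf(x, mask, rank, iters=200, seed=0, eps=1e-9):
    """Factor x ~ W H using only entries where mask == 1."""
    rng = random.Random(seed)
    n_rows, n_cols = len(x), len(x[0])
    # Strictly positive random initialization keeps the factors non-negative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    mx = hadamard(mask, x)
    for _ in range(iters):
        wt = transpose(w)
        h = multiply_ratio(h, matmul(wt, mx),
                           matmul(wt, hadamard(mask, matmul(w, h))), eps)
        ht = transpose(h)
        w = multiply_ratio(w, matmul(mx, ht),
                           matmul(hadamard(mask, matmul(w, h)), ht), eps)
    return w, h

# Toy rank-1 "ratings" matrix; the entry at row 1, column 1 is held out.
ratings = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12]]
mask = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 1, 1]]
w, h = masked_nmf(ratings, mask, rank=1)
predicted = matmul(w, h)[1][1]
```

Because the toy matrix is exactly rank 1, the held-out entry should be recovered as approximately 4. Real rating data are noisy and higher rank, so the rank and any regularization would need to be chosen with care.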
The need for efficient and scalable big-data analytics methods is more essential than ever due to the exploding size and complexity of globally emerging datasets. Nonnegative Matrix Factorization (NMF) is a well-known explainable unsupervised learning method for dimensionality reduction, latent feature extraction, blind source separation, data mining, and machine learning. In this paper, we introduce a new distributed out-of-memory NMF method, named pyDNMF-GPU, designed for modern heterogeneous CPU/GPU architectures that is capable of factoring exascale-sized dense and sparse matrices. Our method reduces the latency associated with local data transfer between the GPU and host using CUDA streams, and reduces the latency associated with collective communications (both intra-node and inter-node) via NCCL primitives. In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores, resulting in good scalability. We set new benchmarks for the size of the data being analyzed: in experiments, we measure up to 76x improvement on a single GPU over running on a single 18-core CPU and we show good weak scaling on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs, when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10e-6. Finally, we integrate our method with an automatic model selection method. With this integration, we introduce a new tool that is capable of analyzing, compressing, and discovering explainable latent structures in extremely large sparse and dense data.
We present, for the first time in the literature, a new four-pronged approach to building firefighters' situational awareness. We construct a series of deep learning frameworks built on top of one another to enhance the safety, efficiency, and successful completion of rescue missions conducted by firefighters in emergency first response settings. First, we used a deep Convolutional Neural Network (CNN) system to classify and identify objects of interest from thermal imagery in real-time. Next, we extended this CNN framework for object detection, tracking, segmentation with a Mask RCNN framework, and scene description with a multimodal natural language processing (NLP) framework. Third, we built a deep Q-learning-based agent, immune to stress-induced disorientation and anxiety, capable of making clear navigation decisions based on the observed and stored facts in live-fire environments. Finally, we used a computationally inexpensive unsupervised learning technique called tensor decomposition to perform meaningful feature extraction for anomaly detection in real-time. With these ad-hoc deep learning structures, we built the artificial intelligence system's backbone for firefighters' situational awareness. To bring the designed system into usage by firefighters, we designed a physical structure where the processed results are used as inputs in the creation of an augmented reality capable of advising firefighters of their location and key features around them, which are vital to the rescue operation at hand, as well as a path planning feature that acts as a virtual guide to assist disoriented first responders in getting back to safety. When combined, these four components form a novel approach to information understanding, transfer, and synthesis that could dramatically improve firefighter response and efficacy and reduce loss of life.
Live fire creates a dynamic, rapidly changing environment that presents a worthy challenge for deep learning and artificial intelligence methodologies to assist firefighters with scene comprehension in maintaining their situational awareness, and with tracking and relaying important features necessary for key decisions as they tackle these catastrophic events. We propose a deep Q-learning based agent that is immune to stress-induced disorientation and anxiety and thus able to make clear decisions for navigation based on the observed and stored facts in live fire environments. As a proof of concept, we simulate a structural fire in the Unreal Engine game engine, which enables the interaction of the agent with the environment. The agent is trained with a deep Q-learning algorithm based on a set of rewards and penalties for its actions in the environment. We exploit experience replay to accelerate the learning process and augment the learning of the agent with human-derived experiences. The agent trained under this deep Q-learning approach outperforms agents trained through alternative path planning systems and demonstrates this methodology as a promising foundation on which to build a path planning navigation assistant capable of safely guiding firefighters through live fire environments.
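The training loop described above can be illustrated with a deliberately simplified sketch. It replaces the deep network with a Q-table and the simulated fire scene with a hypothetical five-cell corridor whose goal is the rightmost cell, so it shows only the roles of rewards, penalties, and experience replay. The optional `demonstrations` argument pre-fills the replay buffer and is one assumed way of adding human-derived experiences; the abstract does not say how this is done.

```python
import random
from collections import deque

N_STATES, GOAL = 5, 4   # corridor cells 0..4, exit at cell 4
ACTIONS = (-1, +1)      # action 0 = move left, action 1 = move right

def step(state, action):
    """Deterministic toy environment: small step penalty, reward at the goal."""
    nxt = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = nxt == GOAL
    return nxt, (1.0 if done else -0.01), done

def train(episodes=200, gamma=0.9, alpha=0.5, epsilon=0.2,
          batch_size=8, seed=0, demonstrations=()):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    replay = deque(demonstrations, maxlen=500)  # optional human transitions
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max((0, 1), key=lambda a: q[state][a])
            nxt, reward, done = step(state, action)
            replay.append((state, action, reward, nxt, done))
            # Experience replay: learn from a random batch of past transitions.
            batch = rng.sample(list(replay), min(batch_size, len(replay)))
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += alpha * (target - q[s][a])
            state, steps = nxt, steps + 1
    return q

q_table = train()
policy = ["right" if q_table[s][1] > q_table[s][0] else "left"
          for s in range(GOAL)]
```

After training, the greedy policy should move right in every non-goal cell. In the deep variant the table lookup `q[s][a]` is replaced by a neural network evaluated on the observed scene, and the update becomes a gradient step toward the same target.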
Firefighting is a dynamic activity, in which numerous operations occur simultaneously. Maintaining situational awareness (i.e., knowledge of current conditions and activities at the scene) is critical to the accurate decision-making necessary for the safe and successful navigation of a fire environment by firefighters. Conversely, the disorientation caused by hazards such as smoke and extreme heat can lead to injury or even fatality. This research implements recent advancements in technology such as deep learning, point cloud and thermal imaging, and augmented reality platforms to improve a firefighter's situational awareness and scene navigation through improved interpretation of that scene. We have designed and built a prototype embedded system that can leverage data streamed from cameras built into a firefighter's personal protective equipment (PPE) to capture thermal, RGB color, and depth imagery and then deploy already developed deep learning models to analyze the input data in real time. The embedded system analyzes and returns the processed images via wireless streaming, where they can be viewed remotely and relayed back to the firefighter using an augmented reality platform that visualizes the results of the analyzed inputs and draws the firefighter's attention to objects of interest, such as doors and windows otherwise invisible through smoke and flames.