Abstract: This paper presents a cross-domain analysis of four popular review datasets: Amazon, Yelp, Steam, and IMDb. The analysis is performed using Hadoop and Spark, which allow efficient and scalable processing of large datasets. By examining close to 12 million reviews from these four platforms, we aim to uncover trends in sales and customer sentiment over the years. Our analysis includes a study of the number of reviews and their distribution over time, as well as an examination of the relationships among review attributes such as upvotes, creation time, rating, and sentiment. By comparing reviews across domains, we hope to gain insight into the factors that drive customer satisfaction and engagement in different product categories.
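To make the per-domain, per-year aggregation described in the abstract above concrete, here is a minimal single-machine sketch of the map and reduce steps in plain Python. It is not the authors' Hadoop or Spark code. The record layout `(domain, unix_timestamp, rating, upvotes)`, the sample values, and the function names `map_review` and `reduce_by_key` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical records: (domain, unix_timestamp, rating, upvotes).
# The values are made up for illustration and do not come from the datasets.
REVIEWS = [
    ("amazon", 1262304000, 5.0, 3),   # 2010-01-01 UTC
    ("amazon", 1293840000, 4.0, 1),   # 2011-01-01 UTC
    ("yelp",   1293840000, 2.0, 7),   # 2011-01-01 UTC
    ("yelp",   1300000000, 4.0, 0),   # 2011-03-13 UTC
    ("steam",  1420070400, 1.0, 12),  # 2015-01-01 UTC
]

def map_review(record):
    """Map step: emit ((domain, year), (count, rating_sum)) for one review."""
    domain, ts, rating, _upvotes = record
    year = datetime.fromtimestamp(ts, tz=timezone.utc).year
    return (domain, year), (1, rating)

def reduce_by_key(pairs):
    """Reduce step: sum counts and rating totals per (domain, year) key."""
    totals = defaultdict(lambda: [0, 0.0])
    for key, (count, rating_sum) in pairs:
        totals[key][0] += count
        totals[key][1] += rating_sum
    # Convert running totals into (review count, mean rating).
    return {key: (n, s / n) for key, (n, s) in totals.items()}

summary = reduce_by_key(map_review(r) for r in REVIEWS)
for (domain, year), (n, mean_rating) in sorted(summary.items()):
    print(f"{domain} {year}: {n} reviews, mean rating {mean_rating:.2f}")
```

The same key-value shape carries over to a distributed setting, where the reduce step would run per key across partitions instead of in one dictionary.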
Abstract: This paper presents a cross-domain trend analysis that aims to identify and analyze the relationships between stock prices, stock news on Twitter, and user behavior on e-commerce websites. The analysis is based on three datasets: a US stock dataset, a stock tweets dataset, and an e-commerce behavior dataset. It is performed using Hadoop, Hive, and Tableau, which allow efficient and scalable processing and visualization of large datasets. The work covers trend analysis of Twitter sentiment (positive and negative tweets) and correlation analysis, including the correlation between tweet sentiment and stock prices and the correlation between stock trends and shopping behavior, and it examines the data across different slices of time. By comparing features from the datasets over time, we hope to gain insight into the factors that drive user behavior and the market in different categories. The results can help businesses and investors make better-informed decisions. We believe that our analysis can serve as a starting point for further research into these topics.
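As a rough illustration of the correlation analysis mentioned in the abstract above, the following sketch computes a Pearson correlation between a daily net-sentiment score and the daily price return. The abstract does not say which correlation measure or sentiment score the authors used, so the choice of Pearson correlation, the `net_sentiment` definition, and the sample numbers are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

def net_sentiment(positive, negative):
    """(positive - negative) / total tweets for one day, in [-1, 1]."""
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total

# Hypothetical daily data: (positive tweets, negative tweets, closing price).
DAYS = [
    (120, 80, 100.0),
    (150, 50, 102.0),
    (60, 140, 99.0),
    (100, 100, 99.5),
    (170, 30, 103.0),
]

# Pair each day's sentiment with that day's return over the previous close.
sentiment = [net_sentiment(p, n) for p, n, _ in DAYS[1:]]
returns = [(cur[2] - prev[2]) / prev[2] for prev, cur in zip(DAYS, DAYS[1:])]

r = pearson(sentiment, returns)
print(f"correlation between net sentiment and daily return: {r:.3f}")
```

A value near 1 or -1 would indicate a strong linear relationship in the sample. Correlation alone does not show that sentiment drives prices or the reverse.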
Abstract: This project explores the process of deploying machine learning models on Kubernetes using Kubeflow [1], an open-source, end-to-end ML stack orchestration toolkit. We build end-to-end machine learning models on Kubeflow in the form of pipelines and evaluate the tool's ease of setup, deployment models, performance, limitations, and features. We hope that this project can serve as an introductory report that helps cloud and Kubernetes users with no prior knowledge of Kubeflow use it to deploy ML models. From setup on different clouds to serving our trained model over the internet, we provide details and metrics on the performance of Kubeflow.
Abstract: In the world of fake news and deepfakes, there has been an alarmingly large number of cases of images being tampered with and then published in newspapers, used in court, or posted on social media for defamation purposes. Detecting these tampered images is an important task, and one we try to tackle. In this paper, we focus on methods for detecting whether an image has been tampered with, using both deep learning and image transformation methods, and we compare the performance and robustness of each method. We then attempt to identify the tampered area of the image and predict the corresponding mask. Based on the results, we provide suggestions and approaches for achieving a more robust framework to detect and identify forgeries.
Abstract: A large amount of work has been done on the KDD 99 dataset, most of which uses hybrid models that run anomaly detection and misuse detection in parallel. In order to further classify the intrusions, our approach to network intrusion detection uses two different anomaly detection models, followed by misuse detection applied to the combined output of the previous step. The end goal is to verify the anomalies found by the anomaly detection algorithms and to clarify whether they are actual intrusions or random outliers relative to the learned normal behavior, and thus to try to reduce the number of false positives. We aim to detect patterns in the novel intrusion techniques themselves, not to handle such intrusions. The intrusions were detected with a very high degree of accuracy.
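The staged design described in the abstract above (two anomaly detectors, then misuse detection on their combined output) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' models. The z-score and median-absolute-deviation detectors, the union rule for combining their flags, the `password_guessing` signature, and all sample records are hypothetical.

```python
from statistics import mean, median, pstdev

# Hypothetical "normal" training durations in seconds; not taken from KDD 99.
NORMAL_DURATIONS = [1, 2, 2, 3, 2, 1, 3, 2]

def fit_zscore(values, threshold=3.0):
    """Anomaly model 1: flag values far from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return lambda v: sigma > 0 and abs(v - mu) / sigma > threshold

def fit_mad(values, threshold=5.0):
    """Anomaly model 2: flag values far from the median, in MAD units."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return lambda v: abs(v - med) > threshold * max(mad, 1e-9)

# Hypothetical misuse signatures: name -> rule over a connection record.
SIGNATURES = {"password_guessing": lambda rec: rec["failed_logins"] >= 5}

def classify(record, detectors):
    """Stage 1: union of anomaly flags (assumed). Stage 2: misuse check."""
    if not any(detect(record["duration"]) for detect in detectors):
        return "normal"
    for name, rule in SIGNATURES.items():
        if rule(record):
            return f"intrusion:{name}"
    # Flagged as anomalous but matching no known signature.
    return "outlier"

DETECTORS = [fit_zscore(NORMAL_DURATIONS), fit_mad(NORMAL_DURATIONS)]

for rec in (
    {"duration": 2, "failed_logins": 0},
    {"duration": 40, "failed_logins": 6},
    {"duration": 30, "failed_logins": 0},
):
    print(rec, "->", classify(rec, DETECTORS))
```

Running the misuse stage only on flagged records is what lets it separate confirmed intrusions from plain outliers, which is the false-positive reduction the abstract aims for.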