Jacob Danovitch

Temporal Graph Benchmark for Machine Learning on Temporal Graphs

Jul 03, 2023
Shenyang Huang, Farimah Poursafaei, Jacob Danovitch, Matthias Fey, Weihua Hu, Emanuele Rossi, Jure Leskovec, Michael Bronstein, Guillaume Rabusseau, Reihaneh Rabbany

We present the Temporal Graph Benchmark (TGB), a collection of challenging and diverse benchmark datasets for realistic, reproducible, and robust evaluation of machine learning models on temporal graphs. TGB datasets are large-scale, span years in duration, incorporate both node- and edge-level prediction tasks, and cover a diverse set of domains including social, trade, transaction, and transportation networks. For both tasks, we design evaluation protocols based on realistic use cases. We extensively benchmark each dataset and find that the performance of common models can vary drastically across datasets. In addition, on dynamic node property prediction tasks, we show that simple methods often achieve superior performance compared to existing temporal graph models. We believe that these findings open up opportunities for future research on temporal graphs. Finally, TGB provides an automated machine learning pipeline for reproducible and accessible temporal graph research, including data loading, experiment setup, and performance evaluation. TGB will be maintained and updated on a regular basis and welcomes community feedback. TGB datasets, data loaders, example code, evaluation setup, and leaderboards are publicly available at https://tgb.complexdatalab.com/ .

* 16 pages, 4 figures, 5 tables, preprint 

Fast and Attributed Change Detection on Dynamic Graphs with Density of States

May 15, 2023
Shenyang Huang, Jacob Danovitch, Guillaume Rabusseau, Reihaneh Rabbany

How can we detect traffic disturbances from international flight transportation logs or changes to collaboration dynamics in academic networks? These problems can be formulated as detecting anomalous change points in a dynamic graph. Current solutions do not scale well to large real-world graphs, lack robustness to large amounts of node additions/deletions, and overlook changes in node attributes. To address these limitations, we propose a novel spectral method: Scalable Change Point Detection (SCPD). SCPD generates an embedding for each graph snapshot by efficiently approximating the distribution of the Laplacian spectrum at each step. SCPD can also capture shifts in node attributes by tracking correlations between attributes and eigenvectors. Through extensive experiments using synthetic and real-world data, we show that SCPD (a) achieves state-of-the-art performance, (b) is significantly faster than the state-of-the-art methods and can easily process millions of edges in a few CPU minutes, (c) can effectively tackle a large quantity of node attributes, additions, or deletions, and (d) discovers interesting events in large real-world graphs. The code is publicly available at https://github.com/shenyangHuang/SCPD.git

* in PAKDD 2023, 18 pages, 12 figures 
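
As a rough illustration of the idea in the abstract above (summarize each snapshot by the distribution of its Laplacian spectrum, then compare consecutive summaries), the sketch below uses the first few spectral moments, tr(L^k)/n, as a stand-in embedding. This is not the SCPD algorithm: the paper approximates the spectral distribution efficiently, whereas this toy version uses dense matrix powers and is only practical for tiny graphs. The function names, the choice of moments, and the toy snapshots are all assumptions made for illustration.

```python
from math import dist, sqrt


def normalized_laplacian(n, edges):
    """Dense symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = 1.0 if deg[i] > 0 else 0.0
            elif adj[i][j]:
                lap[i][j] = -1.0 / sqrt(deg[i] * deg[j])
    return lap


def matmul(a, b):
    """Plain O(n^3) product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(n, edges, k=4):
    """Hypothetical embedding: the first k moments of the eigenvalue
    distribution of L, computed as tr(L^p) / n for p = 1..k."""
    lap = normalized_laplacian(n, edges)
    power = lap
    moments = []
    for _ in range(k):
        moments.append(sum(power[i][i] for i in range(n)) / n)
        power = matmul(power, lap)
    return moments


def change_scores(n, snapshots, k=4):
    """Euclidean distance between the embeddings of consecutive snapshots."""
    embeddings = [spectral_moments(n, edges, k) for edges in snapshots]
    return [dist(embeddings[t - 1], embeddings[t])
            for t in range(1, len(embeddings))]


# Toy data (made up): a 4-node path repeated twice, then a complete graph.
path = [(0, 1), (1, 2), (2, 3)]
complete = [(i, j) for i in range(4) for j in range(i + 1, 4)]
scores = change_scores(4, [path, path, complete])
print(scores)
```

The first score compares two identical snapshots and the second compares the path with the complete graph, so a large second score relative to the first is what would flag the structural change.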

The Surprising Performance of Simple Baselines for Misinformation Detection

Apr 14, 2021
Kellin Pelrine, Jacob Danovitch, Reihaneh Rabbany

As social media becomes increasingly prominent in our day-to-day lives, it is increasingly important to detect informative content and prevent the spread of disinformation and unverified rumours. While many sophisticated and successful models have been proposed in the literature, they are often compared with older NLP baselines such as SVMs, CNNs, and LSTMs. In this paper, we examine the performance of a broad set of modern transformer-based language models and show that with basic fine-tuning, these models are competitive with and can even significantly outperform recently proposed state-of-the-art methods. We present our framework as a baseline for creating and evaluating new methods for misinformation detection. We further study a comprehensive set of benchmark datasets, and discuss potential data leakage and the need for careful design of the experiments and understanding of datasets to account for confounding variables. As an extreme example, we show that classifying based only on the first three digits of tweet IDs, which contain information on the date, gives state-of-the-art performance on Twitter16, a commonly used benchmark dataset for fake news detection. We provide a simple tool to detect this problem and suggest steps to mitigate it in future datasets.
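
The tweet-ID leakage check described above can be sketched in a few lines. The following is a minimal illustration, not the tool the authors provide: it predicts the majority training label for each ID prefix and reports test accuracy, so an unexpectedly high score suggests that IDs leak label information. The function name `prefix_baseline_accuracy`, the label strings, and the toy IDs are all hypothetical.

```python
from collections import Counter, defaultdict


def prefix_baseline_accuracy(train, test, n_digits=3):
    """Accuracy of a classifier that only looks at the first n_digits of an ID.

    train and test are lists of (tweet_id, label) pairs. Each prefix is mapped
    to its majority training label; unseen prefixes fall back to the overall
    majority label of the training set.
    """
    by_prefix = defaultdict(Counter)
    for tweet_id, label in train:
        by_prefix[str(tweet_id)[:n_digits]][label] += 1
    overall = Counter(label for _, label in train).most_common(1)[0][0]

    correct = 0
    for tweet_id, label in test:
        counts = by_prefix.get(str(tweet_id)[:n_digits])
        pred = counts.most_common(1)[0][0] if counts else overall
        correct += pred == label
    return correct / len(test)


# Made-up IDs in which the prefix fully determines the label.
train = [(500111, "rumour"), (500222, "rumour"),
         (731000, "non-rumour"), (731555, "non-rumour")]
test = [(500999, "rumour"), (731222, "non-rumour")]
accuracy = prefix_baseline_accuracy(train, test)
print(f"prefix-only accuracy: {accuracy:.2f}")
```

On a real dataset, comparing this number against the accuracy of the actual model gives a quick sense of how much of the reported performance could be explained by collection date alone.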

Linking Social Media Posts to News with Siamese Transformers

Jan 10, 2020
Jacob Danovitch

Many computational social science projects examine online discourse surrounding a specific trending topic. These works often involve the acquisition of large-scale corpora relevant to the event in question to analyze aspects of the response to the event. Keyword searches present a precision-recall trade-off, and crowd-sourced annotations, while effective, are costly. This work aims to enable automatic and accurate ad-hoc retrieval of comments discussing a trending topic from a large corpus, using only a handful of seed news articles.

Trouble with the Curve: Predicting Future MLB Players Using Scouting Reports

Oct 21, 2019
Jacob Danovitch

In baseball, a scouting report profiles a player's characteristics and traits, usually intended for use in player valuation. This work presents a first-of-its-kind dataset of almost 10,000 scouting reports for minor league, international, and draft prospects. Compiled from articles posted to MLB.com and FanGraphs.com, each report consists of a written description of the player, numerical grades for several skills, and unique IDs to reference their profiles on popular resources like MLB.com, FanGraphs, and Baseball-Reference. With this dataset, we employ several deep neural networks to predict whether minor league players will reach the MLB given their scouting report. We open-source this data to share with the community, and present a web application demonstrating language variations in the reports of successful and unsuccessful prospects.

* Carnegie Mellon Sports Analytics Conference 2019 