In this repository we make available the Friendster dataset used in our paper:
Leonardo Teixeira, Brian Jalaian, Bruno Ribeiro. (2019). Are Graph Neural Networks Miscalibrated? ICML Workshop on Learning and Reasoning with Graph-Structured Representations.
If you use the data or code from this repository in your own work, please cite our paper:
@InProceedings{teixeira2019GNNmiscalibrated,
title={Are Graph Neural Networks Miscalibrated?},
author={Leonardo Teixeira and Brian Jalaian and Bruno Ribeiro},
booktitle={ICML Workshop on Learning and Reasoning with Graph-Structured Representations},
url={https://arxiv.org/abs/1905.02296},
year={2019}
}
The Friendster dataset used in our paper is available in the data folder. We also provide the Train, Validation and Test split used in the paper, as well as a Python class to facilitate usage with the PyTorch Geometric library.
We provide the dataset in HDF5 format and the data split as a NumPy NPZ file. We also provide a Python class that is compatible with the PyTorch Geometric framework and automatically downloads the data and the split.
If you use the PyTorch Geometric library, we provide a Python class that can be used to access our Friendster dataset. It can automatically download and provide access to the Friendster graph (and the data split used in the paper) as a Dataset class from PyTorch Geometric.
The necessary libraries are:
- NumPy (numpy)
- PyTorch (torch)
- PyTorch Geometric (torch_geometric)
- HDF5 for Python (h5py)
Please refer to their documentation for installation instructions (in particular for PyTorch and PyTorch Geometric). This code was tested with PyTorch 1.0.1, PyTorch Geometric 1.0.2, NumPy 1.15 and h5py 2.9.
Using the provided class is illustrated in the following snippet. The class takes care of downloading the data automatically.
from friendster import Friendster
# Download the dataset to the folder: './Friendster-25K'
dataset = Friendster(root="./Friendster-25K/")
# This dataset has a single graph
graph = dataset[0]
print(f"Friendster dataset: {graph.num_nodes} nodes")
# The data splits can be accessed as:
train_mask = graph.train_mask
validation_mask = graph.validation_mask
test_mask = graph.test_mask
A full example is given in the file example.py, where we run a GCN model on the Friendster dataset.
The dataset is available in the HDF5 format in the file friendster_25K.h5.
This file has the following HDF5 Datasets:
- adjacency: The adjacency list, with n_nodes rows. Each entry u is an array with the neighbors of node u.
- features: The feature matrix, of shape (n_nodes, n_features). Each entry u has the features of node u.
- target: The target labels of the nodes, of shape (n_nodes,). Each entry u is the integer that represents the label of node u.
- feature_names: The names of each of the features. Has n_features entries.
- target_names: The names of each label.
Using the h5py library, the data can be loaded as:
from h5py import File
dataset = File("./friendster_25K.h5", "r")  # Open the file read-only
A = dataset["adjacency"][:] # Adjacency list
X = dataset["features"][:] # Feature matrix
y = dataset["target"][:] # Node labels
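Once loaded, the per-node neighbor arrays may need to be converted to an edge list, for example the (2, n_edges) layout that PyTorch Geometric uses for edge_index. The following is a minimal sketch of one way to do this with NumPy. The helper adjacency_to_edge_index and the toy 3-node graph are hypothetical and are not part of this repository; the toy list only stands in for the array A loaded above.

```python
import numpy as np

def adjacency_to_edge_index(adjacency):
    """Convert per-node neighbor arrays into a (2, n_edges) array of (source, target) pairs.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Number of neighbors of each node
    degrees = np.array([len(neighbors) for neighbors in adjacency], dtype=np.int64)
    # Repeat each node id once per neighbor to obtain the source of every edge
    sources = np.repeat(np.arange(len(adjacency), dtype=np.int64), degrees)
    # Concatenate the neighbor arrays to obtain the target of every edge
    if degrees.sum() > 0:
        targets = np.concatenate([np.asarray(n, dtype=np.int64) for n in adjacency])
    else:
        targets = np.empty(0, dtype=np.int64)
    return np.stack([sources, targets])

# Toy stand-in for the loaded adjacency data (assumed layout: one neighbor array per node)
toy_adjacency = [np.array([1, 2]), np.array([0]), np.array([0])]
edge_index = adjacency_to_edge_index(toy_adjacency)
print(edge_index)
```

With the real file, the same function would be called on A, assuming each entry of A is an array of integer node ids as described above.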
The data split is available in the file friendster_25K.split.npz. This can be loaded with:
import numpy as np
data = np.load("friendster_25K.split.npz")
train_nodes = data["arr_0"][0]
validation_nodes = data["arr_0"][1]
test_nodes = data["arr_0"][2]
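After loading the split, it can be useful to turn each part into a boolean mask and to check that no node appears in more than one part. The sketch below assumes each split entry is either an array of node indices or already a boolean mask of length n_nodes; the file format does not make this explicit, so inspect the dtype first. The helper to_mask and the toy arrays are hypothetical stand-ins, not part of this repository.

```python
import numpy as np

def to_mask(nodes, n_nodes):
    """Return a boolean mask of length n_nodes for one split (hypothetical helper)."""
    nodes = np.asarray(nodes)
    if nodes.dtype == bool:
        return nodes  # Already a mask, nothing to convert
    mask = np.zeros(n_nodes, dtype=bool)
    mask[nodes] = True  # Mark the listed node indices
    return mask

# Toy stand-ins for train_nodes, validation_nodes and test_nodes (assumed to be index arrays)
n_nodes = 6
train_nodes = np.array([0, 1, 2])
validation_nodes = np.array([3])
test_nodes = np.array([4, 5])

masks = [to_mask(split, n_nodes) for split in (train_nodes, validation_nodes, test_nodes)]
# A node is overlapping if it is marked in more than one split
overlap = np.sum(masks, axis=0) > 1
print("overlapping nodes:", int(overlap.sum()))
```

For the real data, n_nodes would be the number of rows of the feature matrix loaded from friendster_25K.h5.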