From Real Data

These datasets are produced from real-world data sources.

Some are produced by repeatedly sampling from a large real-world graph, and others are simply reformatted from an existing many-graph dataset.
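
As a rough illustration of producing many small graphs from one large graph, the sketch below draws random-walk samples and keeps the induced subgraph of each. It is a generic random-walk sampler written for this example, not the ESWR procedure used by the datasets on this page; the names `sample_subgraph` and `sample_many` and the adjacency-dict representation are hypothetical.

```python
import random

def sample_subgraph(adj, num_nodes, rng):
    """Random-walk sample: returns (nodes, edges) of the induced subgraph.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    current = rng.choice(sorted(adj))
    visited = {current}
    # Cap the walk length so tiny components cannot loop forever.
    for _ in range(50 * num_nodes):
        if len(visited) >= num_nodes:
            break
        neighbours = sorted(adj[current])
        if not neighbours:
            break
        current = rng.choice(neighbours)
        visited.add(current)
    # Keep every original edge whose endpoints were both visited.
    edges = {(u, v) for u in visited for v in adj[u] if v in visited and u < v}
    return visited, edges

def sample_many(adj, num, num_nodes, seed=0):
    """Draw `num` independent samples from the same large graph."""
    rng = random.Random(seed)
    return [sample_subgraph(adj, num_nodes, rng) for _ in range(num)]

# Toy "large" graph: a ring of 20 nodes.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
samples = sample_many(ring, num=5, num_nodes=6)
```

Here `num` plays the same role as the `num` parameter documented below: it sets how many small graphs are drawn.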

class bgd.real.CoraDataset(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

Academic citation graphs from the ML community, sampled from a large original graph using ESWR. The original graph is sourced from:

Yang, Zhilin, William Cohen, and Ruslan Salakhutdinov. “Revisiting semi-supervised learning with graph embeddings.” International conference on machine learning. PMLR, 2016.

The original data has a binary (one-hot) bag-of-words vector over each paper's abstract as node features.

The task is node classification for the category of each paper, one-hot encoded for seven categories.

  • Task: Node classification

  • Num node features: 2879

  • Num edge features: None

  • Num target values: 7

  • Target shape: N Nodes

  • Num graphs: Parameterised by num

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. (default: 2000).
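
The three callables act at different points in a dataset's life. The sketch below mimics that timing with a pure-Python stand-in; `ToyGraph` and `ToyDataset` are hypothetical names, not part of bgd or torch_geometric, and applying `pre_filter` before `pre_transform` is an assumption based on the usual torch_geometric convention.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToyGraph:
    num_nodes: int
    label: int = 0

class ToyDataset:
    """Stand-in that mimics when each callable is applied."""

    def __init__(self, raw, transform=None, pre_transform=None, pre_filter=None):
        self.transform = transform
        # "Processing" step: filter first, then pre-transform, then store.
        kept = [g for g in raw if pre_filter is None or pre_filter(g)]
        if pre_transform is not None:
            kept = [pre_transform(g) for g in kept]
        self._stored = kept

    def __len__(self):
        return len(self._stored)

    def __getitem__(self, idx):
        g = self._stored[idx]
        # Applied on every access; the stored object is left unchanged.
        return g if self.transform is None else self.transform(g)

raw = [ToyGraph(3), ToyGraph(12), ToyGraph(40)]
ds = ToyDataset(
    raw,
    pre_filter=lambda g: g.num_nodes >= 10,                     # drop tiny graphs
    pre_transform=lambda g: replace(g, label=1),                # baked into stored data
    transform=lambda g: replace(g, num_nodes=g.num_nodes + 1),  # per access
)
```

In this toy setup the 3-node graph is never stored, the label change is saved once, and the node-count change is recomputed each time an item is read.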

__init__(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.FacebookDataset(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

Facebook page-to-page interaction graphs, sampled from a large original graph using ESWR. The original graph is sourced from:

Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-Scale Attributed Node Embedding. Journal of Complex Networks 2021

The original data has node features, but as they are of varying length, we don’t include them here.

The task is node classification for the category of each Facebook page in a given graph, one-hot encoded for four categories.

  • Task: Node classification

  • Num node features: None

  • Num edge features: None

  • Num target values: 4

  • Target shape: N Nodes

  • Num graphs: Parameterised by num

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. (default: 2000).

__init__(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.FromOGBDataset(root, ogb_dataset, stage='train', num=-1, transform=None, pre_transform=None, pre_filter=None)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

Converts an Open Graph Benchmark dataset into a torch_geometric.data.InMemoryDataset. This allows standard dataset operations like concatenation with other datasets.

The Open Graph Benchmark project is available here:

Hu, Weihua, et al. “Open graph benchmark: Datasets for machine learning on graphs.” Advances in neural information processing systems 33 (2020): 22118-22133.

We convert atom and bond features into one-hot encodings. The resulting shapes are:

  • Node (atom) features: (174, N Atoms)

  • Edge (bond) features: (13, N Bonds)
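
To make the one-hot conversion concrete, the sketch below concatenates one one-hot block per categorical column. The per-column sizes in `ATOM_DIMS` and `BOND_DIMS` are assumptions chosen so that they sum to the 174 and 13 quoted above; the helper `one_hot_concat` is hypothetical and is not the library's implementation.

```python
ATOM_DIMS = [119, 5, 12, 12, 10, 6, 6, 2, 2]  # assumed column sizes; sum to 174
BOND_DIMS = [5, 6, 2]                          # assumed column sizes; sum to 13

def one_hot_concat(indices, dims):
    """Concatenate one one-hot block per categorical column."""
    if len(indices) != len(dims):
        raise ValueError("one index is required per categorical column")
    out = []
    for idx, size in zip(indices, dims):
        if not 0 <= idx < size:
            raise ValueError(f"index {idx} out of range for size {size}")
        block = [0] * size
        block[idx] = 1
        out.extend(block)
    return out

# One atom and one bond, each given as integer category indices (toy values).
atom = one_hot_concat([5, 0, 4, 5, 0, 0, 2, 0, 0], ATOM_DIMS)
bond = one_hot_concat([0, 0, 1], BOND_DIMS)
```

Each encoded vector has exactly one active entry per original column, so an atom vector sums to 9 and a bond vector sums to 3.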

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • ogb_dataset (list) – a PygGraphPropPredDataset to be converted back to InMemoryDataset.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. -1 takes all available samples for that stage. (default: -1).

__init__(root, ogb_dataset, stage='train', num=-1, transform=None, pre_transform=None, pre_filter=None)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.FromTUDataset(root, ogb_dataset, stage='train', num=-1, transform=None, pre_transform=None, pre_filter=None)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies (minimal alterations to existing code from PyG)

Contributor email: alexander.davies@bristol.ac.uk

Returns a torch_geometric.data.InMemoryDataset for each TUDataset. This allows standard dataset operations like concatenation with other datasets.

The datasets were originally collected in this paper:

Morris, Christopher, et al. “TUDataset: A collection of benchmark datasets for learning with graphs.” (2020).

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • ogb_dataset (list) – a TUDataset to be converted back to InMemoryDataset.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”, or None. If None, returns the whole original dataset. Otherwise returns one part of an (80, 10, 10) train/val/test split. This split is reshuffled each time the torch_geometric data objects are produced. (default: None)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. -1 takes all available samples for that stage. Ignored if stage is not None. (default: -1).
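
The stage behaviour described above can be sketched as a shuffled 80/10/10 index split. This is a minimal illustration, not the library's code: the helper `split_indices`, its rounding rule, and the `seed` argument (added only to make the example reproducible) are all assumptions.

```python
import random

def split_indices(n, stage=None, seed=None):
    """Return dataset indices for `stage` under a shuffled 80/10/10 split."""
    indices = list(range(n))
    if stage is None:
        return indices  # whole dataset, original order
    random.Random(seed).shuffle(indices)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    parts = {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }
    return parts[stage]

# MUTAG-sized example: 188 graphs.
train = split_indices(188, "train", seed=0)
val = split_indices(188, "val", seed=0)
test = split_indices(188, "test", seed=0)
```

Without a fixed seed the three stages would be drawn from different shuffles, so graphs could appear in more than one stage; that is worth keeping in mind given that the split is reshuffled on each production.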

STATS:

Name           #graphs  #nodes  #edges   #features  #classes
-------------  -------  ------  -------  ---------  --------
MUTAG          188      ~17.9   ~39.6    7          2
ENZYMES        600      ~32.6   ~124.3   3          6
PROTEINS       1,113    ~39.1   ~145.6   3          2
COLLAB         5,000    ~74.5   ~4914.4  0          3
IMDB-BINARY    1,000    ~19.8   ~193.1   0          2
REDDIT-BINARY  2,000    ~429.6  ~995.5   0          2

__init__(root, ogb_dataset, stage='train', num=-1, transform=None, pre_transform=None, pre_filter=None)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.LivejournalDataset(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

LiveJournal is a free online community with almost 10 million members, a significant fraction of whom are highly active (for example, roughly 300,000 update their content in any given 24-hour period). LiveJournal allows members to maintain journals and individual and group blogs, and to declare which other members are their friends.

The original graph is sourced from:

L. Backstrom, D. Huttenlocher, J. Kleinberg, X. Lan. Group Formation in Large Social Networks: Membership, Growth, and Evolution. KDD, 2006.

There are no node or edge features.

There is also no set task, although edge prediction is a valid option.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. (default: 2000).

__init__(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.NeuralDataset(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

A dataset of the connectome of a fruit fly larva. The original graph is sourced from:

Michael Winding et al., The connectome of an insect brain. Science 379, eadd9330 (2023). DOI: 10.1126/science.add9330

We process the original multigraph into ESWR samples of this neural network. The target is the strength of the connection (number of synapses) between two neurons.

  • Task: Edge regression

  • Num node features: 0

  • Num edge features: 0

  • Num target values: 1

  • Target shape: N Edges

  • Num graphs: Parameterised by num
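
One way to read "number of synapses as the target" is that parallel edges in the multigraph are merged and their count becomes the per-edge regression value. The sketch below shows that reading on toy data; the helper `collapse_multigraph`, the directed-edge convention, and the one-tuple-per-synapse input format are assumptions, not the dataset's actual processing code.

```python
from collections import Counter

def collapse_multigraph(multi_edges):
    """Merge parallel directed edges; the count becomes the edge target."""
    counts = Counter(multi_edges)
    edges = sorted(counts)
    targets = [counts[e] for e in edges]
    return edges, targets

# Each tuple is one synapse from a pre- to a post-synaptic neuron (toy data).
synapses = [("a", "b"), ("a", "b"), ("b", "c"), ("a", "b"), ("c", "a")]
edges, targets = collapse_multigraph(synapses)
```

The result has one target per remaining edge, which matches the "N Edges" target shape listed above.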

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. (default: 2000).

__init__(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.PennsylvaniaRoadDataset(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

NOTE: This is a big graph (1M nodes) so subsampling many small graphs from it can be very slow. To alleviate this we pre-sample a graph of 100k nodes from the original.

Road graphs from Pennsylvania, sampled from a large original graph using ESWR. The original graph is sourced from:

J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6(1) 29–123, 2009.

The task is predicting whether a given graph is planar (can be laid out with no crossing edges).

  • Task: Graph classification

  • Num node features: None

  • Num edge features: None

  • Num target values: 1

  • Target shape: 1

  • Num graphs: Parameterised by num
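
For intuition about the planarity target, the sketch below applies a standard edge-count bound: a simple planar graph with V ≥ 3 nodes has at most 3V − 6 edges. This is only a necessary condition and is not how the dataset's labels are claimed to be computed; the helper name `may_be_planar` is hypothetical.

```python
def may_be_planar(num_nodes, num_edges):
    """Necessary (not sufficient) planarity check for a simple graph.

    Any simple planar graph with at least 3 nodes has at most 3V - 6 edges,
    so exceeding that bound proves the graph is non-planar.
    """
    if num_nodes < 3:
        return True
    return num_edges <= 3 * num_nodes - 6

k5 = may_be_planar(5, 10)    # complete graph on 5 nodes
k33 = may_be_planar(6, 9)    # complete bipartite graph K(3,3)
grid = may_be_planar(9, 12)  # 3x3 grid
```

K5 fails the bound and is therefore non-planar, while K(3,3) passes it despite being non-planar, so a full planarity test (for example `networkx.check_planarity`) is still needed to produce exact labels.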

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. (default: 2000).

__init__(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.RedditDataset(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

Reddit hyperlink graphs, i.e. graphs of subreddits interacting with one another. The original graph is sourced from:

Kumar, Srijan, et al. “Community interaction and conflict on the web.” Proceedings of the 2018 world wide web conference. 2018.

We produce this dataset of small graphs using ESWR. The data has text embeddings as node features for each subreddit and text features for the cross-post edges.

The task is edge classification for the sentiment of the interaction between subreddits.

  • Task: Edge classification

  • Num node features: 300

  • Num edge features: 86

  • Num target values: 1

  • Target shape: N Edges

  • Num graphs: Parameterised by num

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. (default: 2000).

__init__(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=2000)[source]
process()[source]
property processed_file_names
property raw_file_names
class bgd.real.TwitchEgoDataset(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=5000)[source]

Bases: InMemoryDataset

Contributor: Alex O. Davies

Contributor email: alexander.davies@bristol.ac.uk

Ego networks from the streaming platform Twitch. The original graph is sourced from:

B. Rozemberczki, O. Kiss, R. Sarkar: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs 2019.

The task is predicting whether a given streamer plays multiple different games.

  • Task: Graph classification

  • Num node features: None

  • Num edge features: None

  • Num target values: 1

  • Target shape: 1

  • Num graphs: 127094

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • stage (str) – The stage of the dataset to load. One of “train”, “val”, “test”. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • num (int) – The number of samples to take from the original dataset. (default: 5000).

__init__(root, stage='train', transform=None, pre_transform=None, pre_filter=None, num=5000)[source]
process()[source]
property processed_file_names
property raw_file_names
bgd.real.from_ogb_dataset(root, stage='train', num=-1)[source]

Load a dataset from the Open Graph Benchmark (OGB) and convert it to the Big Graph Dataset format.

Parameters:
  • name (str) – The name of the OGB dataset. (Classification: “ogbg-molpcba”, “ogbg-molhiv”, “ogbg-moltox21”, “ogbg-molbace”, “ogbg-molbbbp”, “ogbg-molclintox”, “ogbg-molmuv”, “ogbg-molsider”, “ogbg-moltoxcast”, “ogbg-ppa”) (Regression: “ogbg-molesol”, “ogbg-molfreesolv”, “ogbg-mollipo”)

  • stage (str, optional) – The stage of the dataset to load (e.g., “train”, “valid”, “test”). Defaults to “train”.

  • num (int, optional) – The number of samples to load. Set to -1 to load all samples. Defaults to -1.

Returns:

The converted dataset in the Big Graph Dataset format.

Return type:

FromOGBDataset

bgd.real.from_tu_dataset(root, stage='train', num=-1)[source]

Load a TU dataset and convert it to the Big Graph Dataset format.

Parameters:
  • name (str) – The name of the TU dataset. (“MUTAG”, “ENZYMES”, “PROTEINS”, “COLLAB”, “IMDB-BINARY”, “REDDIT-BINARY”)

  • stage (str, optional) – The stage of the dataset to load (e.g., “train”, “valid”, “test”, None). Defaults to None, which returns the whole dataset. Otherwise returns one of a (80/10/10) train/val/test split.

  • num (int, optional) – The number of samples to load. Set to -1 to load all samples. Defaults to -1. Ignored if stage is not None.

Returns:

The converted dataset in the Big Graph Dataset format.

Return type:

FromTUDataset