Reddit Example Dataset

A walkthrough of the dataset code for the Big Graph Dataset project

Alex Davies, University of Bristol, 2024

In this notebook we’ll write code to:

  • download a large Reddit graph from an online repository

  • sample that graph to produce a dataset of smaller graphs

  • process that dataset into a Pytorch Geometric InMemoryDataset


Getting the graph

First we need to download the graph, here from the Stanford Network Analysis Project

We first find the links for the graph and node features (here node features are text embeddings):

[1]:
graph_url = "https://snap.stanford.edu/data/soc-redditHyperlinks-title.tsv"
embedding_url = "http://snap.stanford.edu/data/web-redditEmbeddings-subreddits.csv"

Then play with directories and download the data:

[2]:
import os
import pickle
import wget
import pandas as pd

# Swap into dataset directory
print(os.getcwd())
start_dir = os.getcwd()
os.chdir("bgd_files")

# We won't actually need this part in the final function!
if "reddit" not in os.listdir():
    os.mkdir("reddit")

os.chdir("reddit")

#  Download raw files if we don't already have them
if "soc-redditHyperlinks-title.tsv" not in os.listdir():
    graph_data = wget.download(graph_url) # Edgelist and edge features
if "web-redditEmbeddings-subreddits.csv" not in os.listdir():
    embedding_data = wget.download(embedding_url) # Node features

# We know that there are 300 components in the node feature vectors
embedding_column_names = ["COMPONENT", *[i for i in range(300)]]
embeddings = pd.read_csv("web-redditEmbeddings-subreddits.csv", names=embedding_column_names).transpose()
graph_data = pd.read_csv("soc-redditHyperlinks-title.tsv", sep = "\t")


# Avoids weird directory problems
os.chdir(start_dir)


/home/alex/Projects/big-graph-dataset

Let’s have a look at the node embeddings:

[3]:
embeddings.head()
[3]:
0 1 2 3 4 5 6 7 8 9 ... 51268 51269 51270 51271 51272 51273 51274 51275 51276 51277
COMPONENT spiders askreddit globaloffensivetrade fireteams funny the_donald videos news leagueoflegends rocketleagueexchange ... motleyfool govtjobsrchinindia snoopdogg fortean whatcanidoforbernie 33rd bestofvic2015 aberystwyth mail_forwarding cover
0 0.158972 -0.499114 -0.023145 2.492506 -0.81937 -0.123265 0.131896 0.132825 -2.785298 0.553341 ... 0.004494 0.001908 -0.000534 0.000615 0.000906 -0.000076 -0.000203 -0.001563 0.009269 0.00457
1 0.285813 0.323983 -1.199374 -2.529917 -0.865261 -0.610208 0.866419 1.505527 -0.166391 -3.283354 ... 0.052268 0.042618 0.023619 0.023847 0.017816 -0.001643 0.012698 0.004733 0.024779 0.012403
2 0.226329 -0.424809 1.661484 -0.448484 0.301753 0.361495 0.919025 0.730393 1.592624 -3.091485 ... -0.027792 -0.021329 -0.003317 -0.018297 -0.004231 -0.002896 -0.007575 -0.000082 -0.017018 -0.000363
3 -0.183338 -0.222705 -1.025296 -3.543441 0.018787 -1.171773 -0.765584 -0.505759 -1.269829 0.877085 ... 0.013468 0.012 0.005566 -0.004873 -0.010438 0.000581 0.006486 -0.000982 -0.007228 -0.002496

5 rows × 51278 columns

And now the edges:

[4]:
graph_data.head()
[4]:
SOURCE_SUBREDDIT TARGET_SUBREDDIT POST_ID TIMESTAMP LINK_SENTIMENT PROPERTIES
0 rddtgaming rddtrust 1u4pzzs 2013-12-31 16:39:18 1 25.0,23.0,0.76,0.0,0.44,0.12,0.12,4.0,4.0,0.0,...
1 xboxone battlefield_4 1u4tmfs 2013-12-31 17:59:11 1 100.0,88.0,0.78,0.02,0.08,0.13,0.07,16.0,16.0,...
2 ps4 battlefield_4 1u4tmos 2013-12-31 17:59:40 1 100.0,88.0,0.78,0.02,0.08,0.13,0.07,16.0,16.0,...
3 fitnesscirclejerk leangains 1u50xfs 2013-12-31 19:01:56 1 49.0,43.0,0.775510204082,0.0,0.265306122449,0....
4 fitnesscirclejerk lifeprotips 1u51nps 2013-12-31 21:02:28 1 14.0,14.0,0.785714285714,0.0,0.428571428571,0....

Turn the data into a graph

Now we need to turn the data into a graph. Our edges come from graph_data (SOURCE and TARGET), including categories for each edge (LINK_SENTIMENT), as well as edge features (PROPERTIES).

The first step is making a networkx.Graph object, which is a useful graph class, then we add the nodes one by one. We’ll include the text embedding for the subreddit as a node attribute, taken from the embedding dataframe.

[5]:
# networkx is a Python library for graph structures
import networkx as nx
# tqdm for loading bars
from tqdm import tqdm
# cheeky function to visualise a networkx graph
from bgd.utils import vis_networkx

embeddings.columns = embeddings.iloc[0]
embeddings = embeddings.drop(["COMPONENT"], axis = 0)

graph = nx.Graph()

# Add a node for each id in the embedding data
for col in tqdm(embeddings.columns, desc = "Adding nodes"):
    # attrs here is taken from the embedding data, with the node id the column (col)
    graph.add_node(col, attrs=embeddings[col].to_numpy().astype(float))
Adding nodes: 100%|██████████| 51278/51278 [00:01<00:00, 32961.66it/s]

Now we add the edges (the actual graph stuff).

We need to include two properties:

  • Type of edge, negative or positive interaction, -1 or 1. In dataframe as LINK_SENTIMENT

  • Properties of the edge, text embedding of the reddit post content

[6]:
import numpy as np

def fix_property_string(input_string):
    input_string = input_string.split(',')
    input_string = [float(item) for item in input_string]

    return np.array(input_string)

sources = graph_data["SOURCE_SUBREDDIT"].to_numpy()
targets = graph_data["TARGET_SUBREDDIT"].to_numpy()

# This line can take a while!
attrs = [fix_property_string(properties) for properties in tqdm(graph_data["PROPERTIES"].tolist(), desc = "Wrangling edge features")]
Wrangling edge features: 100%|██████████| 571927/571927 [00:05<00:00, 103705.47it/s]

Now iterate over the edges, only adding them if their nodes have data:

[7]:

labels = graph_data["LINK_SENTIMENT"].to_numpy() all_nodes = list(graph.nodes()) for i in tqdm(range(sources.shape[0]), desc = "Adding edges"): # Check that the edge has data if sources[i] in all_nodes and targets[i] in all_nodes: graph.add_edge(sources[i], targets[i], labels = labels[i], attr = attrs[i])
Adding edges: 100%|██████████| 571927/571927 [00:37<00:00, 15434.45it/s]
[8]:
print(graph)

# Last tidying bits
graph = nx.convert_node_labels_to_integers(graph)
CGs = [graph.subgraph(c) for c in nx.connected_components(graph)]
CGs = sorted(CGs, key=lambda x: x.number_of_nodes(), reverse=True)
graph = CGs[0]
graph = nx.convert_node_labels_to_integers(graph)

# Save the graph!
with open("bgd_files/reddit/reddit-graph.npz", "wb") as f:
    pickle.dump(graph, f)
Graph with 51278 nodes and 178143 edges

Nice! Graph achieved. Spot that in the last section we had to deal with some missing data - we’re including edges ONLY if their nodes also have data.


Sample to make a dataset of smaller graphs

This is not too hard, I’ve written some code (Exploration Sampling With Replacement, ESWR) that does the sampling for you.

This will produce a big list of small networkx graphs sampled from that original graph.

[9]:
from bgd.utils import ESWR

# Sample 1000 graphs of max 96 nodes from the big reddit graph
nx_graph_list = ESWR(graph, 1000, 96)

Sampling 1000 in 5 chunks with size 200
Sampling from large graph: 100%|██████████| 6/6 [00:07<00:00,  1.31s/it]
Done sampling!


Next we need to convert it to a Pytorch Geometric format. This will be specific to your data - here we have node labels, edge labels, edge attributes.

[10]:
import torch
from torch_geometric.data import Data

def specific_from_networkx(graph):
    # Turns a graph into a pytorch geometric object
    # Mostly by unpacking dictionaries on nodes and edges
    # Here node labels are the target
    # One of these functions for each dataset ideally - they are unlikely to transfer across datasets
    node_attrs = []
    edge_indices = []
    edge_labels = []
    edge_attrs = []

    # Collect node labels and attributes
    for n in list(graph.nodes(data=True)):
        # list(graph.nodes(data=True)) returns [(node_id1, {attribute dictionary}), (node_id2, ...), (node_id3, ...)]
        node_attrs.append(torch.Tensor(n[1]["attrs"]))

    # Collect edge indices and attributes
    for e in graph.edges(data=True):
        # graph.edges(data=True) is a generator producing (node_id1, node_id2, {attribute dictionary})
        edge_indices.append((e[0], e[1]))

        edge_attrs.append(torch.Tensor(e[2]["attr"]))
        edge_labels.append(e[2]["labels"])


    # Specific to classification on edges! This is a binary edge classification (pos/neg) task
    edge_labels = ((torch.Tensor(edge_labels) + 1)/2).reshape(-1,1)

    edge_attrs = torch.stack(edge_attrs)
    node_attrs = torch.stack(node_attrs)
    edge_indices = torch.tensor(edge_indices, dtype=torch.long).t().contiguous()

    # Create PyG Data object
    # Can pass:
    # x:            node features, shape (n nodes x n features)
    # edge_index:   the list of edges in the graph, shape (2, n_edges). Entries edge_index[i, :] are [node_id1, node_id2].
    # edge_attr:    edge features, shape (n_edges, n_features), same order as edgelist
    # y:            targets. Graph regression shape (n_variables), graph classification (n_classes), node classification (n_nodes, n_classes), edge classification (n_edges, n_classes)
    data = Data(x=node_attrs, edge_index=edge_indices, edge_attr = edge_attrs,  y=edge_labels)

    return data

print("Data in torch geometric format:")
specific_from_networkx(nx_graph_list[0])
Data in torch geometric format:
/home/alex/anaconda3/envs/adgcl/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[10]:
Data(x=[54, 300], edge_index=[2, 249], edge_attr=[249, 86], y=[249, 1])

The final dataset

Finally we place all that data into an InMemoryDataset!

Please note that in-code this whole notebook will be in functions - see datasets/example_dataset.py.

This means that the datalist argument wouldn’t actually exist below - instead you’d call something like get_reddit_dataset() within your .process method.

[13]:
from torch_geometric.data import InMemoryDataset
from bgd.utils import vis_grid

class RedditDataset(InMemoryDataset):
    r"""
    Reddit hyperlink graphs - ie graphs of subreddits interacting with one another.
    The original graph is sourced from:

    `Kumar, Srijan, et al. "Community interaction and conflict on the web." Proceedings of the 2018 world wide web conference. 2018.`

    The data has text embeddings as node features for each subreddit and text features for the cross-post edges.

    The task is edge classification for the sentiment of the interaction between subreddits.

    - Task: Edge classification
    - Num node features: 300
    - Num edge features: 86
    - Num target values: 1
    - Target shape: N Edges
    - Num graphs: Parameterised by `num`

    Args:
    root (str): Root directory where the dataset should be saved.
    datalist (list): A list of pytorch geometric data objects. Only obtained from an argument here, not in actual code!
    stage (str): The stage of the dataset to load. One of "train", "val", "test". (default: :obj:`"train"`)
    transform (callable, optional): A function/transform that takes in an :obj:`torch_geometric.data.Data` object and returns a transformed version. The data object will be transformed before every access. (default: :obj:`None`)
    pre_transform (callable, optional): A function/transform that takes in an :obj:`torch_geometric.data.Data` object and returns a transformed version. The data object will be transformed before being saved to disk. (default: :obj:`None`)
    pre_filter (callable, optional): A function that takes in an :obj:`torch_geometric.data.Data` object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: :obj:`None`)
    num (int): The number of samples to take from the original dataset. (default: :obj:`2000`).
    """

    # datalist is only an argument for this notebook example - see existing datasets under bgd/.. for how it actually works
    def __init__(self, root, datalist, stage = "train", transform=None, pre_transform=None, pre_filter=None, num = 2000):
        self.num = num
        self.stage = stage
        self.stage_to_index = {"train":0,
                               "val":1,
                               "test":2}
        self.datalist = datalist

        # Options are node-classification, node-regression, graph-classification, graph-regression, edge-regression, edge-classification
        # Graph-level tasks are preferred! (graph-classification and graph-regression)
        # edge-prediction is another option if you can't think of a good task
        self.task = "node-classification"

        super().__init__(root, transform, pre_transform, pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[self.stage_to_index[self.stage]])


    @property
    def raw_file_names(self):
        # Replace with your saved raw file name
        return ['reddit-graph.npz']

    @property
    def processed_file_names(self):
        return ['train.pt',
                'val.pt',
                'test.pt']


    def process(self):
        # Read data into huge `Data` list.

        if os.path.isfile(self.processed_paths[self.stage_to_index[self.stage]]):
            print(f"Cora files exist")
            return

        # Get a list of num pytorch_geometric.data.Data objects
        data_list = self.datalist # get_example_dataset(num=self.num)

        # Torch geometric stuff
        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        # Save data
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[self.stage_to_index[self.stage]])

pyg_graphs = [specific_from_networkx(g) for g in tqdm(nx_graph_list, desc = "Converting data to PyG data objects")]

# The visualisation doesn't work in documentation :( uncomment to see the graphs
vis_grid(pyg_graphs[:16], os.getcwd() + "/bgd_files/reddit/example.png", show_plot = True)

train_dataset = RedditDataset(os.getcwd() + "/bgd_files/reddit",
                              pyg_graphs)
Converting data to PyG data objects: 100%|██████████| 1000/1000 [00:01<00:00, 991.15it/s]
/home/alex/Projects/big-graph-dataset/utils.py:398: UserWarning: The figure layout has changed to tight
  plt.tight_layout()
_images/reddit-dataset-example_22_1.png
Cora files exist
Processing...
Done!

Other datsets

There are already some datasets we can look at under bgd/DATASET.py.

[18]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (10,10))
im = plt.imread("docs/source/_static/datasets.png")
plt.imshow(im)
plt.axis('off')
plt.tight_layout()
plt.show()
_images/reddit-dataset-example_24_0.png
[ ]: