Welcome to the project! We’re excited to have you on board. We’ll be collaborating through GitHub, with everyone working in their own branch.
There are a few rules for the datasets, demonstrated in bgd/real/example_dataset.py:
- Please cite your sources for data in the documentation (see the existing datasets for examples)
- Where possible, start from existing datasets that have been used in the literature (to avoid ethics paperwork)
- If using generators, use generators that are well understood (for example Erdős-Rényi graphs)
- The datasets need at least a train/val/test split
- Datasets should consist of many small graphs (fewer than 400 nodes each)
- Ideally the number of graphs in each dataset should be controllable
- Data should be downloaded in-code to keep the repo small. If this isn’t possible let me know.
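As a rough illustration of the generator rules above, here is a minimal sketch using only the standard library. It samples Erdős-Rényi graphs, keeps each under 400 nodes, takes the number of graphs as an argument, and produces a train/val/test split. The function names, the dictionary layout and the default fractions are assumptions made for this example; they are not part of the bgd code base, and a real dataset should follow bgd/real/example_dataset.py.

```python
import random


def erdos_renyi_edges(num_nodes, edge_prob, rng):
    """Sample the edge list of an Erdős-Rényi G(n, p) graph."""
    return [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if rng.random() < edge_prob
    ]


def make_graphs(num_graphs, min_nodes=10, max_nodes=399, edge_prob=0.05, seed=0):
    """Generate `num_graphs` small graphs; the count is a plain argument."""
    rng = random.Random(seed)
    graphs = []
    for _ in range(num_graphs):
        num_nodes = rng.randint(min_nodes, max_nodes)
        edges = erdos_renyi_edges(num_nodes, edge_prob, rng)
        graphs.append({"num_nodes": num_nodes, "edges": edges})
    return graphs


def split_graphs(graphs, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut into disjoint train/val/test lists."""
    order = list(range(len(graphs)))
    random.Random(seed).shuffle(order)
    num_val = int(len(graphs) * val_fraction)
    num_test = int(len(graphs) * test_fraction)
    val = [graphs[i] for i in order[:num_val]]
    test = [graphs[i] for i in order[num_val:num_val + num_test]]
    train = [graphs[i] for i in order[num_val + num_test:]]
    return {"train": train, "val": val, "test": test}


# Hypothetical usage: 50 graphs with at most 60 nodes each
splits = split_graphs(make_graphs(num_graphs=50, max_nodes=60))
print({stage: len(graphs) for stage, graphs in splits.items()})
# prints {'train': 40, 'val': 5, 'test': 5}
```

Fixing the seed keeps the generated graphs and the split reproducible between runs, which makes it easier to compare results across contributors.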
Set Up & Contributing
1. Clone the Repository
Open your terminal and run the following commands to clone the main repository and move into it:
git clone https://github.com/alexodavies/big-graph-dataset.git
cd big-graph-dataset
2. Create a new branch:
git checkout -b your-name
Replace your-name with your name or a descriptive name for your data.
3. Work your magic:
- Copy bgd/real/example_dataset.py into the relevant sub-directory (bgd/real/ or bgd/synthetic/, depending on your data)
- Re-tool it for your data (NAME_dataset.py or something similar)
- Add your dataset to the relevant __init__.py files (bgd/X/__init__.py and bgd/__init__.py)
4. Stage your changes:
Add the files you modified or created to the staging area:
git add NAME_dataset.py
5. Commit your changes:
Commit your changes with a descriptive message:
git commit -m "A very detailed and useful commit message that everyone likes to read."
6. Push Your Branch to GitHub
Push your branch to the main repository on GitHub:
git push origin your-name
7. Create a Pull Request
Go to the repository on GitHub <https://github.com/alexodavies/big-graph-dataset.git>.
Click on the “Pull Requests” tab.
Click the “New pull request” button.
Select the branch you just pushed from the “compare” drop-down menu.
Provide a title and description for your pull request.
Click “Create pull request”.
8. Merge the pull request:
After your code is reviewed, the pull request will be merged into the main branch by the project maintainer (Alex O. Davies <alexander.davies@bristol.ac.uk>).
Summary of Git Commands
# Clone the repository
git clone https://github.com/alexodavies/big-graph-dataset.git
cd big-graph-dataset
# Create a new branch
git checkout -b your-name
# Make changes, stage, and commit
git add NAME_dataset.py
git commit -m "Add detailed description of changes"
# Push your branch to GitHub
git push origin your-name
Python Environments
Please use Python 3.11.X
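If it helps, a small guard like the one below can warn when a script is run under a different interpreter. This is only a sketch: the function name and the REQUIRED constant are made up for this example and nothing in the repository depends on them.

```python
import sys

REQUIRED = (3, 11)  # major, minor; any 3.11.x patch release is accepted


def is_supported_python(version_info=None):
    """Return True when the interpreter is a 3.11.x release."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED


if not is_supported_python():
    print(f"Warning: expected Python 3.11.x, found {sys.version.split()[0]}")
```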
A generic environment is specified in requirements_dev.txt. Run pip install -r requirements_dev.txt within a virtual environment (Conda etc.) to get everything installed:
conda create -n big-graph-dataset python=3.11 "numpy<=2.0.0" "pytorch<=2.4.1" -c pytorch
conda activate big-graph-dataset
pip install -r requirements_dev.txt
There is also a packaged conda environment in environment.yml. You can run conda env create -f environment.yml to try this instead of the pip installs.
If this doesn’t work, create and activate a new environment (here with Conda):
conda create -n big-graph-dataset python=3.11 numpy=1.26.4 pytorch=2.4.1
conda activate big-graph-dataset
then these commands should install (most of) the relevant libraries:
pip install torch torchvision torchaudio
pip install torch_geometric torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cpu.html
pip install matplotlib pandas wget umap-learn
Environment Issues
The packaged environment uses CPU versions of PyTorch and PyTorch Geometric. If you have a GPU, you can install the GPU versions by changing the torch and torch_geometric lines in requirements_dev.txt to the GPU versions from the PyTorch website.
Some packages don’t play well with others. If you have significant issues, please get in touch.
I sometimes run into an OSError: [Errno 66] Directory not empty error from littleballoffur during pip install -r requirements_dev.txt. This is a numpy versioning issue; try pip install numpy==1.26.4.
torch_sparse can take a long time to install
If your code requires new package installs please tell me (email, but this can also be detailed in the pull request)
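When reporting an environment problem, it can help to include the versions that actually ended up installed. The sketch below uses the standard library's importlib.metadata to look them up. The package list is taken from the install commands in this section, and the helper name is an assumption made for this example.

```python
from importlib import metadata

# Packages named in the install commands in this section
PACKAGES = ["numpy", "torch", "torch_geometric", "torch_sparse"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Pasting this output into an email or pull request makes version mismatches, such as the numpy issue mentioned above, quicker to spot.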
Testing Code
In your if __name__ == "__main__": section you can use some prepared code:
import os

from ..utils import describe_one_dataset, vis_grid

# ... rest of your code ...

if __name__ == "__main__":
    # Please set the last part of the path to your dataset name!
    dataset = NAMEDataset(os.getcwd() + "/bgd_files/" + "NAME", stage="train")
    describe_one_dataset(dataset)
    vis_grid(dataset[:16], os.getcwd() + "/bgd_files/NAME/train.png")
    # Option to instead show the plot in a GUI
    vis_grid(dataset[:16], os.getcwd() + "/bgd_files/NAME/train.png", show_plot=True)
You’d run this from the root big-graph-dataset directory with (swap real for synthetic if that is where your dataset lives):
python -m bgd.real.NAME_dataset
assuming that you’ve added your dataset to the relevant __init__.py files.
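Alongside the prepared helpers, a quick check against the size rule (fewer than 400 nodes per graph) can catch problems early. The sketch below assumes each graph exposes a num_nodes attribute, as PyTorch Geometric Data objects do. The function name is hypothetical and the toy objects merely stand in for a real dataset.

```python
from types import SimpleNamespace


def oversized_graphs(graphs, max_nodes=400):
    """Return the indices of graphs with `max_nodes` or more nodes."""
    return [i for i, graph in enumerate(graphs) if graph.num_nodes >= max_nodes]


# Stand-in objects; with a real dataset you would pass the dataset itself
toy_graphs = [SimpleNamespace(num_nodes=n) for n in (12, 399, 400, 1024)]
print(oversized_graphs(toy_graphs))
# prints [2, 3]
```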
Please don’t make changes to any other files!
Feel free to reach out if you have any questions or need further assistance. Happy coding!
Citing
@misc{big-graph-dataset,
title = {{Big Graph Dataset} Documentation},
howpublished = {https://big-graph-dataset.readthedocs.io/}}