Introduction to the Heterogeneous Graph Transformer (HGT) Project
The Heterogeneous Graph Transformer (HGT) is a graph neural network architecture designed to handle large-scale, heterogeneous, and dynamic graphs effectively. This project implements HGT with the PyTorch Geometric API and offers a robust solution for working with the complex graph data found in many real-world applications.
Project Components
- conv.py: The heart of the HGT model, implementing a transformer-like layer for heterogeneous graph convolution. It is designed to capture complex patterns across different types of nodes and edges.
- model.py: Assembles the components of the model, structured to work in unison to process and learn from the graph data.
- data.py: Provides the data interface and management, featuring:
  - Graph class: Manages the data structure for heterogeneous graphs, with node features stored as pandas DataFrames and edge relationships as a dictionary.
  - sample_subgraph function: Samples a fixed-size subgraph layer by layer, keeping processing efficient and focused by limiting how many nodes are drawn at each step.
- train_*.py: Scripts for training and validating the model on specific tasks. Key functions include:
  - Sample functions: Task-specific sampling methods, crucial for proper learning and for preventing data leakage.
  - prepare_data function: Runs data sampling in parallel, overlapping the data preparation phase with model training for efficiency.
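To make the Graph and sample_subgraph interfaces above concrete, here is a minimal, self-contained sketch. Note the simplifications: it stores features in plain dictionaries rather than the pandas DataFrames used by data.py, and it samples neighbours uniformly rather than with the importance-weighted scheme of the real sample_subgraph; all names and defaults here are illustrative.

```python
from collections import defaultdict

class Graph:
    """Toy heterogeneous graph. Node features are kept per node type, and
    edges are keyed by (target_type, source_type, relation_type) -- a
    simplified stand-in for the Graph class in data.py."""

    def __init__(self):
        self.node_feature = defaultdict(list)  # node_type -> list of feature dicts
        self.edge_list = defaultdict(list)     # (tgt_type, src_type, rel) -> [(tgt_id, src_id)]

    def add_node(self, node_type, feature):
        self.node_feature[node_type].append(feature)
        return len(self.node_feature[node_type]) - 1

    def add_edge(self, target, source, relation):
        # target and source are (node_type, node_id) pairs
        self.edge_list[(target[0], source[0], relation)].append((target[1], source[1]))

def sample_subgraph(graph, seeds, sample_depth=2, sample_width=8):
    """Starting from seed nodes, pull in at most `sample_width` neighbours
    per edge type for `sample_depth` layers (uniformly, for simplicity)."""
    sampled = {t: set(ids) for t, ids in seeds.items()}
    frontier = {t: set(ids) for t, ids in seeds.items()}
    for _ in range(sample_depth):
        next_frontier = defaultdict(set)
        for (tgt_type, src_type, _), pairs in graph.edge_list.items():
            if tgt_type not in frontier:
                continue
            neighbours = [s for t, s in pairs if t in frontier[tgt_type]]
            for src_id in neighbours[:sample_width]:
                if src_id not in sampled.get(src_type, set()):
                    next_frontier[src_type].add(src_id)
        for node_type, ids in next_frontier.items():
            sampled.setdefault(node_type, set()).update(ids)
        frontier = next_frontier
    return sampled
```

Seeding from a single paper node with sample_depth=1, for example, returns the paper plus its directly connected authors.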
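The overlap between sampling and training that prepare_data provides can be sketched as follows. This is not the project's actual implementation: it uses a thread pool so the example stays self-contained, whereas the real scripts use a pool of worker processes, and sample_batch is a dummy stand-in for the task-specific sample functions.

```python
from multiprocessing.pool import ThreadPool

def sample_batch(seed):
    # Dummy stand-in for a task-specific sample function in train_*.py,
    # which would call sample_subgraph and convert the result to tensors.
    return {'seed': seed, 'nodes': list(range(seed, seed + 3))}

def prepare_data(pool, batch_seeds):
    # Launch sampling jobs asynchronously so the next batches are being
    # prepared while the current ones are used for training.
    return [pool.apply_async(sample_batch, (s,)) for s in batch_seeds]

pool = ThreadPool(processes=2)
jobs = prepare_data(pool, [0, 10])     # sampling starts in the background
# ... a training step on previously prepared batches would run here ...
batches = [job.get() for job in jobs]  # collect once the model needs them
pool.close()
pool.join()
```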
Setup and Installation
To set up this project, you need the following dependencies installed:
- PyTorch 1.3.0
- pytorch_geometric 1.3.2, along with its subdependencies:
  - torch-cluster==1.4.5
  - torch-scatter==1.3.2
  - torch-sparse==0.4.3
- Other required libraries: gensim, sklearn, tqdm, dill, and pandas.
You can install all necessary packages with: pip install -r requirements.txt
Experimentation and Data
The current experiments with HGT use the Open Academic Graph (OAG). After preprocessing, the dataset is split into subsets such as all Computer Science, Machine Learning, and Neural Networks papers from 1900-2020. These can be downloaded and used directly for experiments, or new data can be processed with the provided tools and scripts, such as preprocess_OAG.py.
Using the HGT Project
To utilize the HGT model for a specific task, such as paper-field classification, you can execute the training script with appropriate parameters. For instance:
python3 train_paper_field.py --data_dir PATH_OF_DATASET --model_dir PATH_OF_SAVED_MODEL --conv_name hgt
Key command-line options include:
- conv_name specifies the graph convolution to use, defaulting to hgt.
- sample_depth and sample_width control the depth and width of graph sampling.
- n_pool sets the number of processes used for parallel data sampling.
- repeat determines how many times the same sampled batch is reused during training.
The script also includes several other optional hyperparameters for fine-tuning.
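As a sketch of how such a script might expose these options, the argparse fragment below mirrors the flags described above; the defaults shown are illustrative, not necessarily those of the actual train_*.py scripts.

```python
import argparse

# Hypothetical reconstruction of the option parsing in a train_*.py script.
# Flag names follow the options described above; defaults are illustrative.
parser = argparse.ArgumentParser(description='Train HGT on a downstream task')
parser.add_argument('--data_dir', required=True, help='path of the dataset')
parser.add_argument('--model_dir', required=True, help='path for the saved model')
parser.add_argument('--conv_name', default='hgt', help='graph convolution to use')
parser.add_argument('--sample_depth', type=int, default=6, help='sampling depth')
parser.add_argument('--sample_width', type=int, default=128, help='nodes sampled per layer')
parser.add_argument('--n_pool', type=int, default=4, help='processes for parallel sampling')
parser.add_argument('--repeat', type=int, default=2, help='times each batch is reused')

args = parser.parse_args(['--data_dir', './dataset', '--model_dir', './saved_model'])
```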
Contribution and Citation
If you find the HGT project useful in your work, the developers kindly ask you to consider citing their research paper:
@inproceedings{hgt,
  author    = {Ziniu Hu and Yuxiao Dong and Kuansan Wang and Yizhou Sun},
  title     = {Heterogeneous Graph Transformer},
  booktitle = {Proceedings of The Web Conference 2020 ({WWW} '20)},
  year      = {2020},
  url       = {https://doi.org/10.1145/3366423.3380027},
  doi       = {10.1145/3366423.3380027},
  publisher = {{ACM} / {IW3C2}},
}
This citation acknowledges their contribution to advancing research in handling complex graph structures with the Heterogeneous Graph Transformer.