Project Introduction: picoGPT
Among the many implementations of GPT-2 available today, picoGPT stands out as a quirky, minimalist take. It re-implements the GPT-2 model using only NumPy, the popular numerical computing library for Python, and condenses the entire GPT-2 forward pass into just 40 lines of code. That makes it an intriguing project for anyone who wants to understand the core mechanics of language models without navigating an extensive codebase.
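To give a sense of what those 40 lines look like, here is a minimal sketch of causal self-attention in NumPy. It follows the same spirit as picoGPT's code but is an illustrative rewrite, not a verbatim excerpt:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def causal_self_attention(q, k, v):
    # q, k, v: [seq_len, head_dim]; the additive mask prevents each
    # position from attending to tokens that come after it
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = (1 - np.tri(q.shape[0])) * -1e10
    return softmax(scores + mask) @ v

# demo on random data: a sequence of 5 tokens with an 8-dim head
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((5, 8))
print(causal_self_attention(q, k, v).shape)  # (5, 8)
```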
Key Features of picoGPT
- Speed and Complexity: Despite its concise nature, picoGPT is not fast; it makes no attempt at optimization and runs noticeably slowly. It also includes no training code, underscoring its focus on minimalism rather than production use.
- Inference Methodology: The model supports only simple, unbatched inference, generating one token at a time for a single prompt. Sampling is strictly greedy; more sophisticated strategies such as top-p or top-k sampling are not implemented (see the sketch after this list).
- Readability and Size: Created to be both readable and extremely compact, the project includes two main code files: `gpt2.py` and `gpt2_pico.py`. While `gpt2.py` is already concise, `gpt2_pico.py` strips the code down to its bare minimum.
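The greedy sampling mentioned above amounts to a short loop: run the forward pass, take the argmax of the final logits, append it, and repeat. The sketch below illustrates the idea; `forward_fn` and the toy model are placeholders, not picoGPT's actual forward pass:

```python
import numpy as np

def generate(inputs, n_tokens, forward_fn):
    # forward_fn stands in for the model: it takes the token ids seen
    # so far and returns logits of shape [seq_len, vocab_size]
    for _ in range(n_tokens):
        logits = forward_fn(inputs)
        next_id = int(np.argmax(logits[-1]))  # greedy: no top-k/top-p/temperature
        inputs.append(next_id)
    return inputs

# toy forward pass so the sketch runs end to end: the "model" simply
# prefers the id that follows the last token, modulo a vocab of 10
toy_forward = lambda ids: np.eye(10)[[(i + 1) % 10 for i in ids]]
print(generate([3, 4], 5, toy_forward))  # [3, 4, 5, 6, 7, 8, 9]
```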
Breakdown of Files
- `encoder.py`: Contains the code for OpenAI's Byte Pair Encoding (BPE) tokenizer, taken directly from OpenAI's original GPT-2 repository. It transforms input text into the token ids the model can process.
- `utils.py`: Provides utilities for downloading and loading the GPT-2 model weights, tokenizer, and hyperparameters required to set up the model.
- `gpt2.py`: Encapsulates the full GPT model and its generation logic, and can be run as a standalone Python script for text generation.
- `gpt2_pico.py`: A distilled version of `gpt2.py` that further reduces the line count while retaining the same basic functionality.
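Putting `utils.py` and `encoder.py` together, setup and tokenization might look like the following. The loader name follows the picoGPT repository, but treat the exact signature as an assumption rather than documented API:

```python
from utils import load_encoder_hparams_and_params  # picoGPT helper (assumed signature)

# downloads the 124M checkpoint into models/ on first run, then loads
# the tokenizer, hyperparameters, and weights from disk
encoder, hparams, params = load_encoder_hparams_and_params("124M", "models")

ids = encoder.encode("Alan Turing theorized that computers would")
print(ids)                  # BPE token ids
print(encoder.decode(ids))  # round-trips back to the original text
```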
Dependencies and Requirements
picoGPT requires Python 3.9.10, along with the dependencies specified in a `requirements.txt` file. These can be installed via:

```bash
pip install -r requirements.txt
```
Usage
To run the model and generate text, execute the following command:

```bash
python gpt2.py "Alan Turing theorized that computers would one day become"
```
Upon execution, the program generates a continuation of the input text, for example:

```
the most powerful machines on the planet.

The computer is a machine that can perform complex calculations, and it can perform these calculations in a way that is very similar to the human brain.
```
Users can customize the number of tokens to generate, the model size, and the directory where model weights are stored with additional flags:

```bash
python gpt2.py \
    "Alan Turing theorized that computers would one day become" \
    --n_tokens_to_generate 40 \
    --model_size "124M" \
    --models_dir "models"
```
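Since `gpt2.py` can be run as a script, the same generation can also be driven from Python, assuming its CLI entry point is exposed as a function (here called `main`) whose keyword arguments mirror the flags above:

```python
from gpt2 import main  # assumed entry point; name and signature mirror the CLI

text = main(
    "Alan Turing theorized that computers would one day become",
    n_tokens_to_generate=40,
    model_size="124M",
    models_dir="models",
)
print(text)
```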
In summary, picoGPT is an educational, minimalist tool for exploring the internals of the GPT-2 model without unnecessary complexity or excess code. It is a testament to how elegantly a large-scale model can be distilled into a form suited to curiosity-driven exploration.