Introduction to the Transformer-from-Scratch Project
The Transformer-from-Scratch project offers a concise demonstration of training a Large Language Model (LLM) based on the Transformer architecture in only around 240 lines of code. Inspired by the nanoGPT project, it aims to provide an educational resource for newcomers to LLM training with PyTorch. The project stands out for its simplicity and serves as a foundational guide to the process of training an LLM.
Project Overview
Objective
The objective of this project is to demystify the complexities of training a Transformer-based large language model through an approachable, easy-to-follow codebase. This allows individuals with minimal experience in artificial intelligence to grasp the core concepts and methods involved in building an LLM from scratch.
Dataset and Model Specifications
The model is trained on a 450KB sample textbook dataset downloaded from a public repository. Remarkably, the entire model, comprising approximately 51 million parameters, can be trained on a single i7 CPU in roughly 20 minutes, showcasing how efficiently a compact Transformer can learn from a small dataset.
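Parameter counts like the one quoted above are easy to verify directly in PyTorch by summing the sizes of a model's trainable tensors. The helper below is a generic sketch; the demo module is a stand-in for illustration, not the project's actual model:

import torch

def count_parameters(model: torch.nn.Module) -> int:
    # Sum the number of elements in every trainable tensor of the model.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-in module for demonstration; applied to the model built in model.py,
# this would report the total parameter count quoted above.
demo = torch.nn.Linear(512, 512)
print(f"{count_parameters(demo):,} trainable parameters")  # -> 262,656 for this demo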
How to Get Started
Installation
Begin by installing the necessary dependencies using the following command:
pip install numpy requests torch tiktoken
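To confirm the environment is ready, it can help to import the packages and round-trip a short string through a tiktoken tokenizer. This is just a sanity check, not part of the project's code, and cl100k_base is used purely as an example encoding:

import torch
import tiktoken

print("PyTorch version:", torch.__version__)

# Encode and decode a short string with a tiktoken BPE tokenizer.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello, Transformer!")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # -> "Hello, Transformer!"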
Execution
- Run the Model Script: Execute model.py to initiate the training process. The script first downloads the dataset and saves it in the data folder. Training and validation losses are reported on the console throughout training, reflecting the model's learning progress. For instance:
Step: 0 Training Loss: 11.68 Validation Loss: 11.681 ...
Over 5000 iterations, the training loss converges to around 2.807, and the trained model is saved as model-ckpt.pt (a generic sketch of this training pattern follows the list).
- Model Output: After training, the model generates sample text based on the learned patterns, offering a glimpse of its ability to produce coherent text, such as:
The salesperson to identify the other cost savings...
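For readers who want to relate the console output above to code, the pattern model.py follows is the standard PyTorch training loop: periodically estimate the training and validation loss, print both, and save the weights when training finishes. The sketch below is a generic illustration under assumed interfaces (a model whose forward pass returns a (logits, loss) pair and a get_batch(split) data loader, in the style of nanoGPT), not the project's exact code:

import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters=20):
    # Average the loss over a few random batches for each data split.
    model.eval()
    out = {}
    for split in ("train", "valid"):
        vals = torch.zeros(eval_iters)
        for i in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)  # assumes the model returns (logits, loss)
            vals[i] = loss.item()
        out[split] = vals.mean().item()
    model.train()
    return out

def train(model, optimizer, get_batch, max_iters=5000, eval_interval=500):
    # Periodically report losses in the same format as the console output above,
    # then save the trained weights to a checkpoint file.
    for step in range(max_iters + 1):
        if step % eval_interval == 0:
            losses = estimate_loss(model, get_batch)
            print(f"Step: {step} Training Loss: {losses['train']:.3f} "
                  f"Validation Loss: {losses['valid']:.3f}")
        x, y = get_batch("train")
        _, loss = model(x, y)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    torch.save(model.state_dict(), "model-ckpt.pt")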
Experimentation
Users can experiment by modifying the hyperparameters at the top of the model.py file and observing how the training outcome changes. This encourages active learning and builds intuition for how each parameter affects the model's performance.
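The exact names and defaults vary, but the block of settings at the top of model.py typically looks something like the following; the values here are illustrative assumptions, not the project's confirmed configuration. Changing, say, context_length or learning_rate and re-running the script makes their effect on the reported losses easy to observe:

import torch

# Illustrative hyperparameters only; the real names and default values are
# defined at the top of model.py and may differ.
batch_size = 4          # sequences processed in parallel per step
context_length = 16     # tokens of context the model attends over
d_model = 64            # embedding / hidden dimension
num_blocks = 8          # number of Transformer decoder blocks
num_heads = 4           # attention heads per block
dropout = 0.1
learning_rate = 1e-3
max_iters = 5000        # matches the 5000 iterations reported above
eval_interval = 500     # how often to print training/validation loss
device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU works fine here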
Learning with Jupyter Notebook
For a more in-depth understanding of the project architecture, a detailed Jupyter Notebook (step-by-step.ipynb) is available. It includes visual representations and intermediate results at each stage of the Transformer's operations. To use the notebook, additional installations are needed:
pip install matplotlib pandas
The notebook covers:
- Input embeddings
- Positional encoding
- Attention mechanisms and their visual depictions
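The notebook walks through these pieces interactively. As a taste of the kind of visualization it builds, the generic sketch below plots sinusoidal positional encodings (as in the original Transformer paper, which may differ from the scheme used in model.py) and a scaled dot-product attention map; it is an independent illustration, not the notebook's own code:

import math
import torch
import matplotlib.pyplot as plt

def sinusoidal_positional_encoding(context_length: int, d_model: int) -> torch.Tensor:
    # Sine on even dimensions, cosine on odd, as in "Attention Is All You Need".
    position = torch.arange(context_length).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(context_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

pe = sinusoidal_positional_encoding(context_length=16, d_model=64)
q = k = v = torch.randn(16, 64)
_, attn = scaled_dot_product_attention(q, k, v)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(pe, aspect="auto", cmap="viridis")
axes[0].set_title("Positional encoding (position x dimension)")
axes[1].imshow(attn, cmap="viridis")
axes[1].set_title("Attention weights (query x key)")
plt.tight_layout()
plt.show()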
Advanced Topics and Exploration
For those interested in further exploration, the /GPT2 directory contains sample code for fine-tuning a pre-trained GPT-2 model and running inference with it. Additionally, the author's blog post, Transformer Architecture: LLM From Zero-to-Hero, provides in-depth insights into the Transformer architecture and is well suited to readers who are new to LLMs.
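The contents of /GPT2 are not reproduced here, but as a rough indication of what GPT-2 inference involves, the sketch below uses the Hugging Face transformers library (which requires an extra pip install transformers); the directory itself may load and run the model differently:

# Minimal GPT-2 inference sketch using Hugging Face transformers.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The transformer architecture is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a short continuation of the prompt.
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))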
References for Expanded Learning
- nanoGPT by Andrej Karpathy
- Transformers from Scratch by Mat Miller
- Attention is All You Need by Vaswani et al.
This blend of educational resources and practical implementation offers a comprehensive entry point into the field of large language model training using Transformers.