Pretraining and Finetuning LLMs from the Ground Up
Overview
The "Pretraining and Finetuning LLMs from the Ground Up" project is an insightful and practical tutorial designed for coders who are eager to delve into the mechanics and coding processes behind large language models (LLMs) using PyTorch. The project begins by introducing the foundational concepts of LLMs, highlighting significant breakthroughs and their applications in various fields. Participants will gain hands-on experience by developing a small GPT-like LLM, starting from the data input pipeline to its core architectural components and pretraining procedures. Once the foundational understanding is established, the tutorial guides learners through the process of loading pretrained weights and finetuning LLMs with the help of open-source libraries.
The instructional content is inspired by the book Build a Large Language Model From Scratch and employs the LitGPT library for implementation.
Setup Instructions
To facilitate learning, a pre-configured cloud environment is provided, complete with the necessary code examples and dependencies, so participants can run the code on a GPU, which is particularly useful for the pretraining and finetuning sections. The environment can be accessed through this link.
Moreover, detailed setup instructions are provided in the setup folder, enabling users to configure their local machines to run the code independently.
Outline
Here is a detailed outline of what each section of the project covers:
Introduction to LLMs
This introductory section presents an overview of LLMs, outlines the subjects to be covered throughout the workshop, and provides setup instructions for participants.
Folder: 01_intro
Understanding LLM Input Data
Participants will learn how to create a text input pipeline by developing a text tokenizer and a custom PyTorch DataLoader designed for the LLM.
Folder: 02_data
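To make this concrete, here is a minimal sketch of such an input pipeline, not the workshop's exact code: it uses tiktoken's GPT-2 tokenizer and a sliding window to turn raw text into input/target token pairs for next-token prediction. The class name, context length, and stride values are illustrative assumptions.

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Slices a token stream into fixed-length input/target pairs for next-token prediction."""
    def __init__(self, text, max_length=32, stride=32):
        tokenizer = tiktoken.get_encoding("gpt2")      # GPT-2 byte-pair-encoding tokenizer
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            chunk = token_ids[i:i + max_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))   # targets are inputs shifted by one token

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

loader = DataLoader(TextDataset("Some public domain text ... " * 200), batch_size=4, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # e.g. torch.Size([4, 32]) torch.Size([4, 32])
```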
Coding an LLM Architecture
This section explores the individual components of LLMs, focusing on how to integrate them into a coherent GPT-like model. It emphasizes understanding the big picture rather than delving into the minutiae of each module.
Folder: 03_architecture
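For orientation, the sketch below shows the overall shape of such a model. Unlike the workshop code, which implements the attention and transformer blocks from scratch, this sketch leans on PyTorch's built-in TransformerEncoderLayer with a causal mask, and every hyperparameter (vocabulary size, context length, embedding width, heads, layers) is a placeholder assumption.

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """A deliberately small GPT-style model: token and position embeddings,
    a stack of causally masked transformer blocks, and a linear output head."""
    def __init__(self, vocab_size=50257, context_len=32, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(context_len, d_model)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,   # pre-LayerNorm, as in GPT-style models
        )
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        b, t = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(t, device=idx.device))
        # Causal mask: each position may only attend to itself and earlier positions.
        causal_mask = torch.triu(torch.full((t, t), float("-inf"), device=idx.device), diagonal=1)
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(self.ln_f(x))   # logits of shape (batch, seq_len, vocab_size)

model = MiniGPT()
logits = model(torch.randint(0, 50257, (4, 32)))
print(logits.shape)  # torch.Size([4, 32, 50257])
```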
Pretraining LLMs
This part covers the pretraining process for LLMs. Participants write the code to pretrain the model architecture built in the earlier sections. To keep resource requirements manageable, pretraining is done on a small public-domain text sample, which is enough for the LLM to learn to generate simple sentences.
Folder: 04_pretraining
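As a rough illustration of what this involves, the loop below pretrains the toy model from the previous sketch on the toy DataLoader with a plain next-token cross-entropy loss; the optimizer choice, learning rate, and epoch count are arbitrary assumptions, not the workshop's actual configuration.

```python
import torch
import torch.nn.functional as F

# Assumes `model` and `loader` from the earlier sketches.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

for epoch in range(3):
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)                  # (batch, seq_len, vocab_size)
        loss = F.cross_entropy(                 # next-token prediction loss
            logits.flatten(0, 1), targets.flatten()
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: last-batch loss {loss.item():.3f}")
```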
Loading Pretrained Weights
Because pretraining is resource-intensive, this section shows how to load pretrained weights into the custom-built architecture. Participants are also introduced to the LitGPT library, which provides more advanced code for training and finetuning LLMs, and learn how to use it to load pretrained weights for models such as Llama, Phi, Gemma, and Mistral.
Folder: 05_weightloading
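As a taste of what this looks like, recent LitGPT versions expose a small Python API for loading pretrained checkpoints. The snippet below is a sketch based on that API; the model identifier is just an example, and method names and available checkpoints may differ depending on the installed LitGPT version.

```python
from litgpt import LLM

# Load pretrained weights via LitGPT (the checkpoint is downloaded on first use).
# "microsoft/phi-2" is only an example identifier; Llama, Gemma, and Mistral
# checkpoints are selected the same way (some require accepting a license first).
llm = LLM.load("microsoft/phi-2")

print(llm.generate("Large language models are", max_new_tokens=30))
```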
Finetuning LLMs
This final section covers the techniques for finetuning LLMs. Participants prepare a small dataset for instruction finetuning and use it to further finetune an LLM within the LitGPT framework.
Folder: 06_finetuning
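To illustrate the data-preparation step, the sketch below formats a single instruction record into an Alpaca-style prompt, a common convention for instruction finetuning; the field names, prompt template, and output file name are assumptions rather than the workshop's exact format. A dataset of such records is then handed to LitGPT's finetuning utilities to further train the loaded model.

```python
import json

def format_example(entry):
    """Turn one instruction record into an Alpaca-style prompt/response string."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{entry['instruction']}\n\n"
    )
    if entry.get("input"):  # the optional input field carries extra context
        prompt += f"### Input:\n{entry['input']}\n\n"
    return prompt + f"### Response:\n{entry['output']}"

sample = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The team finished the project.",
    "output": "The project was finished by the team.",
}
print(format_example(sample))

# Save the (toy) dataset so it can be passed to a LitGPT finetuning run.
with open("instruction_data.json", "w") as f:
    json.dump([sample], f, indent=2)
```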
This project integrates the principles from the Build a Large Language Model From Scratch book and utilizes the LitGPT library to deliver a comprehensive educational experience in constructing and optimizing large language models.