Introduction to nano-llama31
The nano-llama31 project is a simplified and lightweight implementation of the Llama 3.1 architecture. Much like how nanoGPT provides a minimal version of GPT-2, nano-llama31 aims to offer a straightforward and dependency-free approach for training, fine-tuning, and performing inference with the Llama 3.1 model. Unlike the official implementation from Meta or the more feature-rich versions from Hugging Face, nano-llama31 focuses on reducing dependencies and simplifying code structure.
Core Focus
Currently, the project is centered on the 8B base model of Llama 3.1. This is a work-in-progress project, actively developed but not yet fully ready for prime time use.
Reference from Meta's Official Llama 3.1 Code
To understand nano-llama31, it's useful to first look at Meta's official Llama 3.1 code. Unfortunately, the official repository lacks comprehensive documentation or instructions on how to utilize the models post-download. Here's a basic guide on how to set up and generate text using the official code:
- Clone the official llama-models repository:

  ```bash
  git clone https://github.com/meta-llama/llama-models.git
  ```

- Download the Llama 3.1 8B base model:

  ```bash
  cd llama-models/models/llama3_1
  chmod u+x download.sh
  ./download.sh
  ```

  Access needs to be requested from Meta first, and the model download is about 16 GB.

- Set up a Python environment:

  ```bash
  conda create -n llama31 python=3.10
  conda activate llama31
  ```

- Install the necessary packages and run the generation script:

  ```bash
  pip install -r requirements.txt
  pip install -e .
  cd ../../../   # back to the repo root, where reference.py lives
  pip install fire
  torchrun --nnodes 1 --nproc_per_node 1 reference.py \
      --ckpt_dir llama-models/models/llama3_1/Meta-Llama-3.1-8B \
      --tokenizer_path llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model
  ```
One caveat: Meta marks this code as "deprecated," which raises some concern about how faithfully it handles the Llama 3.1 models. Nevertheless, it produces plausible text completions, which serves as an initial sanity check.
Nano-llama31's Unique Approach
The nano-llama31 code takes the official implementation and strips out heavyweight dependencies such as torchrun and fairscale, aiming for a simpler single-process codebase while keeping the outputs consistent with the trusted reference.
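For a sense of what this buys in practice, here is a hypothetical single-process usage sketch; the `Llama.build` and `text_completion` names mirror Meta's reference API and are assumptions about nano-llama31's actual interface:

```python
# Hypothetical sketch: assumes llama31.py exposes a Llama.build helper
# mirroring Meta's reference API; exact names in nano-llama31 may differ.
from llama31 import Llama

llama = Llama.build(
    ckpt_dir="llama-models/models/llama3_1/Meta-Llama-3.1-8B",
    tokenizer_path="llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model",
    max_seq_len=256,
    max_batch_size=1,
)
out = llama.text_completion(["Clearly, the meaning of life is"], max_gen_len=64)
print(out[0]["generation"])  # runs with plain `python`, no torchrun needed
```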
Run the tests with:

```bash
python test_llama31.py
```

This simplified, single-file PyTorch adaptation matches the reference outputs, which serves as the main correctness check.
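As a rough illustration of what such a check involves (the actual test_llama31.py may be organized differently), one can compare the logits of the two implementations on the same batch of tokens:

```python
# Illustrative parity check, not the actual contents of test_llama31.py.
# Both models are assumed to follow Meta's forward(tokens, start_pos) signature.
import torch

def logits_match(ref_model, nano_model, tokens: torch.Tensor) -> bool:
    """Compare logits from two implementations on the same token batch."""
    with torch.no_grad():
        ref = ref_model(tokens, start_pos=0)
        nano = nano_model(tokens, start_pos=0)
    # Tolerate tiny floating-point drift rather than demanding bit-exact equality.
    return torch.allclose(ref, nano, rtol=1e-4, atol=1e-4)
```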
Fine-Tuning Capabilities
An early draft of fine-tuning on the Tiny Stories dataset ships with nano-llama31, although it currently requires a significant amount of VRAM, so a powerful GPU is needed. For now, the draft focuses on training only a few components, such as the RMSNorm parameters.
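For reference, RMSNorm in the Llama family is a tiny module whose only learnable parameter is a per-channel scale, which is what makes it a cheap first target for fine-tuning. The sketch below is the standard formulation, not necessarily nano-llama31's exact code:

```python
# Standard Llama-style RMSNorm, shown for reference; nano-llama31's code may differ.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # the only trainable tensor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in float32 for stability, then cast back to the input dtype.
        h = x.float()
        h = h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + self.eps)
        return h.type_as(x) * self.weight
```

Note that even when only these scales are trained, activations for the full model still have to be kept around for backpropagation, which is one reason the VRAM requirement stays high.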
Future Plans and Improvements
There are several objectives planned for nano-llama31's development:
- Remove unnecessary code and enhance efficiency.
- Expand fine-tuning features to match nanoGPT, including mixed precision, distributed data parallel training, and more (a mixed-precision sketch follows this list).
- Add support for Chat model inference and fine-tuning, not only the Base model.
- Consider extending support to Llama 3 models larger than 8B.
- Address deprecated functions such as `set_default_tensor_type`.
- Fix the fine-tuning process to respect how Llama 3 was trained, particularly with regard to attention masking during inference.
- Optimize memory use by managing the KV cache more effectively (a KV-cache sketch follows this list).
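On the mixed-precision item, a nanoGPT-style training step might look roughly like the following; `model`, `optimizer`, and the batch tensors are placeholders, not nano-llama31 APIs:

```python
# Hedged sketch of a bfloat16 mixed-precision training step, in the spirit of
# nanoGPT; `model` and `optimizer` are placeholders, not project APIs.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens, targets, device_type="cuda"):
    # Run the forward pass and loss in bfloat16 autocast to save memory.
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits = model(tokens)  # (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```

With bfloat16 autocast no gradient scaler is needed, which keeps the loop short; float16 would additionally require torch.cuda.amp.GradScaler.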
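On the KV-cache item, here is a minimal sketch of a preallocated cache in the style of Meta's reference attention code; the names, shapes, and preallocation strategy are illustrative assumptions:

```python
# Minimal KV-cache sketch in the style of Meta's reference attention code.
# Names, shapes, and the preallocation strategy are illustrative assumptions.
import torch

class KVCache:
    def __init__(self, max_batch: int, max_seq: int, n_kv_heads: int,
                 head_dim: int, dtype=torch.bfloat16, device: str = "cuda"):
        shape = (max_batch, max_seq, n_kv_heads, head_dim)
        self.k = torch.zeros(shape, dtype=dtype, device=device)
        self.v = torch.zeros(shape, dtype=dtype, device=device)

    def update(self, start_pos: int, xk: torch.Tensor, xv: torch.Tensor):
        # Write this step's keys/values, then return the full prefix so far.
        bsz, seqlen = xk.shape[0], xk.shape[1]
        self.k[:bsz, start_pos:start_pos + seqlen] = xk
        self.v[:bsz, start_pos:start_pos + seqlen] = xv
        return self.k[:bsz, :start_pos + seqlen], self.v[:bsz, :start_pos + seqlen]
```

Preallocating for the maximum sequence length is simple but memory-hungry; allocating the cache lazily, or skipping it entirely during training where it is not needed, are the kinds of optimizations this item likely refers to.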
As it evolves to address these limitations and expand its functionality, nano-llama31 offers a promising, minimalist way to work with the Llama 3.1 models.