Introduction to the Cramming Language Model Project
The Cramming project is an innovative approach to training language models using significantly limited computational resources. The project stems from the research paper titled "Cramming: Training a Language Model on a Single GPU in One Day," which explores the potential of training a language model with constrained resources. The essential question driving this research is: "How far can we get with a single GPU in just one day?"
Project Overview
The project's core aim is to investigate the performance limits of a transformer-based language model, akin to BERT, within a constrained environment. This involves training a model from scratch using a technique called masked language modeling, restricted to just one GPU for 24 hours. The study not only re-evaluates almost every aspect of the pretraining pipeline but also proposes modifications to approach BERT-level performance, even with limited computational power. Intriguingly, it examines why scaling down the model is a challenge and which changes genuinely enhance performance in such settings.
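To make the pretraining objective concrete, the following is a minimal sketch of masked language modeling in plain PyTorch: a fraction of the input tokens is hidden and the model is trained to recover them. Everything here (function name, masking rate, token ids) is illustrative and is not taken from the cramming codebase.

    import torch

    def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                    mlm_probability: float = 0.15):
        """Return (masked_inputs, labels) for one MLM training step."""
        labels = input_ids.clone()
        mask = torch.rand(input_ids.shape) < mlm_probability   # choose positions to hide
        labels[~mask] = -100                 # unmasked positions are ignored by the loss
        masked_inputs = input_ids.clone()
        masked_inputs[mask] = mask_token_id  # replace chosen tokens with [MASK]
        return masked_inputs, labels

    # Example: a batch of two sequences of 8 token ids each.
    batch = torch.randint(5, 1000, (2, 8))
    inputs, labels = mask_tokens(batch, mask_token_id=4)
    # The model is then trained with cross-entropy between its predictions at the
    # masked positions and `labels`. (Full BERT-style masking additionally replaces
    # some chosen tokens with random tokens or leaves them unchanged.)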
Updating the Cramming Framework
The latest version of the framework requires PyTorch 2.0 or newer. Models trained with the updated codebase score roughly 1-2% higher on the GLUE benchmark than models trained with earlier versions. The data preprocessing step has also been improved: users can now stream data directly from a hosted dataset, which simplifies setup.
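As an illustration of that streaming workflow, the sketch below uses the Hugging Face datasets library in streaming mode. The dataset identifier is a placeholder, not the name of the dataset actually hosted for this project.

    from datasets import load_dataset

    # "username/pretokenized-corpus" is a placeholder; the dataset hosted for
    # this project has a different name.
    stream = load_dataset("username/pretokenized-corpus", split="train", streaming=True)

    for i, example in enumerate(stream):
        print(example.keys())   # e.g. pre-tokenized fields such as "input_ids"
        if i == 0:
            break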
Key Rules for Cramming
The Cramming project adheres to a set of stringent rules:
- The model, irrespective of its size, is to be trained from scratch without relying on existing pretrained models.
- Only raw text may be included in training; downstream task data is excluded.
- Data downloading and pre-processing are excluded from the total compute budget.
- Training takes place on a single GPU over a 24-hour period (a conceptual sketch of such a wall-clock budget follows this list).
- The model's downstream performance is evaluated on the GLUE benchmark, using only a restricted finetuning budget per task.
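As a conceptual illustration of the 24-hour rule, the sketch below enforces a wall-clock budget around an abstract training step. It is not the repository's implementation; it only shows how such a limit can be enforced.

    import time

    BUDGET_SECONDS = 24 * 60 * 60   # one day of wall-clock time

    def train_with_budget(train_step, budget_seconds: float = BUDGET_SECONDS) -> int:
        """Run training steps until the wall-clock budget is used up."""
        start = time.time()
        step = 0
        while time.time() - start < budget_seconds:
            train_step(step)    # one optimizer update on the single GPU
            step += 1
        return step             # number of steps that fit into the budget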
Running the Code
Executing the Cramming project's code involves several steps:
- Dependencies Installation: All required packages can be installed with a single command, pip install . (run from the repository root).
- Data Handling: Preprocessed data can be conveniently managed using datasets hosted on the Hugging Face platform, allowing efficient storage and retrieval of pre-tokenized datasets stored in a structured format.
- General Usage: The pretrain.py script runs pretraining under the limited compute budget. Configuration options can be overridden directly from the command line to adapt to different experimental requirements.
- Model Evaluation: Evaluation on tasks like GLUE is done through the eval.py script, which locates the latest model checkpoint and assesses its performance (a conceptual sketch of such a lookup follows this list).
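Since eval.py is described as locating the latest checkpoint automatically, here is a conceptual sketch of what such a lookup could look like. The directory layout and helper name are assumptions for illustration, not the script's actual logic.

    from pathlib import Path

    def latest_checkpoint(output_dir: str) -> Path:
        """Return the most recently modified checkpoint directory under output_dir."""
        candidates = [p for p in Path(output_dir).iterdir() if p.is_dir()]
        if not candidates:
            raise FileNotFoundError(f"no checkpoints found under {output_dir}")
        return max(candidates, key=lambda p: p.stat().st_mtime)

    # Hypothetical usage:
    # ckpt = latest_checkpoint("outputs/my_run/checkpoints")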
Example Recipes and Additional Tools
Several ready-made recipes are available to help replicate the paper's findings or to explore new training setups. For the best performance with newer PyTorch versions, specific inductor settings are suggested, as sketched below.
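The repository's exact inductor recommendations are not reproduced here; the sketch below only shows generic PyTorch 2.0 switches (TF32 matmul precision and torch.compile, which uses the inductor backend) that such recipes typically build on.

    import torch
    from torch import nn

    # Generic PyTorch 2.0 speed settings; not the repository's exact inductor flags.
    torch.backends.cuda.matmul.allow_tf32 = True   # allow TF32 matmuls on Ampere+ GPUs
    torch.backends.cudnn.allow_tf32 = True
    torch.set_float32_matmul_precision("high")

    model = nn.Linear(128, 128)
    if torch.cuda.is_available():
        model = model.cuda()
    compiled_model = torch.compile(model)          # compiles with the inductor backend
    # Any project-specific inductor flags would be set via torch._inductor.config
    # before calling torch.compile.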
Support and Contribution
The project welcomes contributions and inquiries from the community. Interested individuals can engage with the development process, troubleshoot issues, or explore additional options provided in the repository.
In conclusion, the Cramming project offers a compelling exploration of how effectively a language model can be trained with minimal computational resources, challenging the prevailing notion that substantial computational power is a prerequisite for strong performance.