datablations
This repository explores strategies for scaling language models when training data is limited. It includes experiments on repeating data under varying compute budgets, training on up to 900 billion tokens with models of up to 9 billion parameters. It proposes a scaling law for compute optimality that accounts for the diminishing value of repeated tokens and of excess parameters. It also covers methods for mitigating data scarcity, such as augmenting training data with code and filtering with perplexity- and deduplication-based techniques. Over 400 trained models and the associated datasets are released to support language model development in data-constrained settings.
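To make the "diminishing value of repeated tokens" idea concrete, here is a minimal sketch of how an effective token count can be modeled when the same unique data is seen for multiple epochs. The exponential decay form and the `half_value_epochs` constant below are illustrative assumptions, not the repository's fitted scaling law:

```python
import math

def effective_data(unique_tokens: float, epochs: float,
                   half_value_epochs: float = 15.0) -> float:
    """Effective token count when unique data is repeated over several epochs.

    The first pass over the data counts fully; each additional repetition
    contributes less, with marginal value decaying exponentially in the
    number of repetitions. `half_value_epochs` (the decay constant) is an
    illustrative placeholder, not a value fitted by this repository.
    """
    repeated = max(epochs - 1.0, 0.0)
    # Integrating exp(-r / R*) over r in [0, repeated] gives the
    # discounted contribution of the repeated passes.
    discounted = half_value_epochs * (1.0 - math.exp(-repeated / half_value_epochs))
    return unique_tokens * (1.0 + discounted)
```

Under this toy model, one epoch over 100B unique tokens yields exactly 100B effective tokens, while four epochs yield noticeably fewer than the naive 400B, capturing why repeating data eventually stops substituting for fresh data.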