SmallLanguageModel Project Introduction
The SmallLanguageModel project is designed for those who want to create their own Large Language Model (LLM) from scratch. Drawing inspiration from Karpathy's nanoGPT and a Shakespeare generator, the project encompasses everything needed, from data collection to model architecture, tokenization, and training. Here's a breakdown of the project for anyone interested in diving into the world of LLMs:
Repository Structure
The repository is organized into several key components:
- Data Collector: This section includes a web-scraper directory, which is perfect for users aiming to gather data from scratch instead of relying on pre-existing datasets.
- Data Processing: Here, users will find code that pre-processes certain file types, including converting parquet files into .txt and .csv formats, as well as code for appending files together.
- Models: This component contains all the scripts needed to train a personal model. It includes examples of a BERT model, a GPT model, and a Seq-2-Seq model, alongside the tokenizer and run files needed for training.
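As a concrete illustration of the Data Processing component, here is a minimal sketch (not the repository's actual code) of one task it mentions: appending several .txt files into a single corpus file. The function name and paths are hypothetical.

```python
from pathlib import Path

def append_text_files(input_paths, output_path):
    """Concatenate the given text files into one output corpus file."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            # Read each source file fully and separate files with a newline.
            out.write(Path(path).read_text(encoding="utf-8"))
            out.write("\n")
```

The parquet-to-.csv conversion step would typically be handled with a library such as pandas (`pd.read_parquet(...).to_csv(...)`), which the repository presumably depends on for that purpose.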
Prerequisites
To ensure a smooth setup of the SmallLanguageModel, make sure you have these prerequisites ready:
- Python 3.8 or a higher version
- pip, the Python package installer
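You can confirm both prerequisites from a terminal before setup (commands assume a Unix-like shell where Python 3 is invoked as `python3`):

```shell
# Check the Python version; it should report 3.8 or newer.
python3 --version

# Confirm pip is available for this interpreter.
python3 -m pip --version
```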
How to Use
Training your own tokenizer or producing outputs from a trained model involves a few steps, as outlined below:
- Clone the Repository: Open your terminal and execute the following commands:
  git clone https://github.com/shivendrra/SmallLanguageModel-project
  cd SLM-clone
- Install Dependencies: Use pip to install the required packages:
  pip install -r requirements.txt
- Train: For detailed instructions on training, refer to the training.md available in the repository. It provides comprehensive guidance on the training process.
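Before training a model, the project has you train a tokenizer on your corpus. As a hedged illustration only, here is a minimal character-level tokenizer in the spirit of nanoGPT's Shakespeare example; the repository ships its own tokenizer, and this class is not taken from it:

```python
class CharTokenizer:
    """Toy character-level tokenizer: maps each unique character to an id."""

    def __init__(self, text):
        # Build the vocabulary from the characters seen in the corpus.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, s):
        # Convert a string into a list of integer token ids.
        return [self.stoi[ch] for ch in s]

    def decode(self, ids):
        # Convert token ids back into the original string.
        return "".join(self.itos[i] for i in ids)
```

The encode/decode round trip shown here is the basic contract any tokenizer used for training must satisfy; the repository's real tokenizer will be more sophisticated (e.g. subword-based).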
Star History
The repository embeds a star history chart, allowing users to track its popularity and engagement over time.
Contributing
The project warmly welcomes contributions. If you wish to propose significant changes, it is advised to open an issue first to discuss the intended modifications, and to update tests as appropriate.
License
The project is distributed under the MIT License. For more detailed information, refer to the License.md included in the repository.
The SmallLanguageModel project is an excellent resource for developers keen on building custom language models, offering clear instructions and a structured approach to model development.