dolma
Dolma is a 3-trillion-token dataset built by AI2 for language model training, drawn from diverse sources such as web content and academic materials. The dataset is available on HuggingFace. Dolma also includes a high-speed toolkit for processing large datasets, with parallel workflows, cross-platform portability, and efficient deduplication using Bloom filters implemented in Rust. Researchers can use the built-in taggers and customize settings for AWS S3, which makes the toolkit adaptable to a range of AI and machine learning projects.
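
To make the Bloom filter deduplication idea concrete, here is a minimal sketch in plain Python. It is not Dolma's Rust implementation and does not use the Dolma API. The names `BloomFilter` and `deduplicate`, the filter size, the hash count, and the whitespace/lowercase normalization are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array probed by k hash positions."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen" (false positives are possible);
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def deduplicate(paragraphs, bloom=None):
    """Yield each paragraph whose normalized text has not been seen before."""
    bloom = bloom if bloom is not None else BloomFilter()
    for text in paragraphs:
        # Hypothetical normalization: collapse whitespace and lowercase.
        key = " ".join(text.split()).lower()
        if key in bloom:
            continue
        bloom.add(key)
        yield text


docs = ["Hello world.", "hello   WORLD.", "Something else."]
print(list(deduplicate(docs)))
```

The filter stores only bits, never the text itself, so memory use stays fixed regardless of how many paragraphs pass through. The trade-off is a small false-positive rate: a paragraph that was never seen can occasionally be dropped as a duplicate, and that rate rises as the filter fills.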