datatrove
DataTrove is a library for processing, filtering, and deduplicating text data at very large scale. It runs on local machines and on Slurm clusters, which makes it well suited to heavy workloads such as preparing training datasets for large language models. It keeps memory usage low, allows its processing steps to be customized, and supports a range of file systems through fsspec, so pipelines can scale and adapt to different environments.
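To make the idea of a streaming filter-and-deduplicate pipeline concrete, here is a minimal sketch in plain Python. It does not use DataTrove's API: the names `min_length_filter`, `exact_dedup`, and `run_pipeline`, and the `{"id", "text"}` record shape, are all hypothetical and chosen only for illustration. Each step is a generator, so documents pass through one at a time rather than being loaded all at once.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical record shape for this sketch: {"id": str, "text": str}.
Document = Dict[str, str]
Step = Callable[[Iterable[Document]], Iterator[Document]]


def min_length_filter(min_chars: int) -> Step:
    """Build a step that keeps only documents with at least min_chars characters."""
    def step(docs: Iterable[Document]) -> Iterator[Document]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return step


def exact_dedup() -> Step:
    """Build a step that drops documents whose text was already seen (exact match)."""
    def step(docs: Iterable[Document]) -> Iterator[Document]:
        seen = set()  # stores one fixed-size digest per unique text, not the text itself
        for doc in docs:
            digest = hashlib.sha1(doc["text"].encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield doc
    return step


def run_pipeline(docs: Iterable[Document], steps: List[Step]) -> Iterator[Document]:
    """Chain the steps lazily; nothing is processed until the result is consumed."""
    stream: Iterator[Document] = iter(docs)
    for step in steps:
        stream = step(stream)
    return stream


docs = [
    {"id": "a", "text": "The quick brown fox jumps."},
    {"id": "b", "text": "too short"},
    {"id": "c", "text": "The quick brown fox jumps."},
    {"id": "d", "text": "A different, longer document."},
]

kept = list(run_pipeline(docs, [min_length_filter(10), exact_dedup()]))
kept_ids = [doc["id"] for doc in kept]
print(kept_ids)
```

In this sketch, memory grows only with the number of unique digests held by the deduplication step, not with the total size of the input. Real large-scale deduplication typically also has to handle near-duplicates and shard work across many tasks, which this example does not attempt.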