datatrove

Efficiently Manage Large Text Data with Adaptive Low-Memory Pipelines

Product Description

DataTrove processes, filters, and deduplicates text data at very large scale. The same pipeline runs on a local machine or on a Slurm cluster, which suits it to heavy workloads such as preparing training datasets for large language models. It keeps memory usage low, accepts custom processing steps, and reads from and writes to many file systems through fsspec.
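
To make the idea of a low-memory filter-and-deduplicate pipeline concrete, here is a minimal sketch in plain Python. It does not use DataTrove's API; the function names (`read_docs`, `keep_if`, `drop_exact_duplicates`, `run_pipeline`) and the `min_chars` threshold are hypothetical and chosen only for illustration. Each stage is a generator, so documents stream through one at a time rather than being loaded all at once.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def read_docs(lines: Iterable[str]) -> Iterator[str]:
    """Yield non-empty documents one at a time (one document per line)."""
    for line in lines:
        text = line.strip()
        if text:
            yield text


def keep_if(docs: Iterable[str], predicate: Callable[[str], bool]) -> Iterator[str]:
    """Pass through only the documents for which predicate returns True."""
    for doc in docs:
        if predicate(doc):
            yield doc


def drop_exact_duplicates(docs: Iterable[str]) -> Iterator[str]:
    """Drop exact repeats, remembering a 20-byte hash per unique document."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(lines: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Chain the stages lazily: read, filter by length, then deduplicate."""
    docs = read_docs(lines)
    docs = keep_if(docs, lambda d: len(d) >= min_chars)
    return drop_exact_duplicates(docs)
```

Because `run_pipeline` accepts any iterable of strings, an open file handle can be passed directly and is consumed line by line. The only state that grows is the set of hashes, one per unique document; this sketch handles exact duplicates only, not near-duplicates.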
Project Details