NeMo-Curator
NeMo Curator is an open-source, GPU-accelerated library that speeds up dataset preparation for generative AI models. Built on Dask and RAPIDS, it provides modules for curating multilingual text and images efficiently. Its features include language identification, filtering, and deduplication, and the curated data can be used for both pretraining and fine-tuning. Its modular design lets users assemble these steps into custom data workflows.
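
To make the idea of a modular curation workflow concrete, here is a minimal sketch in plain Python using only the standard library. It is not the NeMo Curator API and does not use Dask or RAPIDS; the names `word_count_filter`, `exact_dedup`, and `run_pipeline`, as well as the `{"id", "text"}` document schema, are hypothetical and chosen only for illustration. It shows a filtering stage and an exact-deduplication stage composed into one pipeline.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical schema: each document is a dict with "id" and "text" keys.
Doc = Dict[str, object]
Stage = Callable[[List[Doc]], List[Doc]]


def word_count_filter(min_words: int) -> Stage:
    """Build a stage that keeps documents with at least `min_words` words."""
    def stage(docs: List[Doc]) -> List[Doc]:
        return [d for d in docs if len(str(d["text"]).split()) >= min_words]
    return stage


def exact_dedup(docs: List[Doc]) -> List[Doc]:
    """Keep only the first document seen for each distinct text hash."""
    seen = set()
    kept = []
    for d in docs:
        digest = hashlib.sha256(str(d["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(d)
    return kept


def run_pipeline(docs: List[Doc], stages: List[Stage]) -> List[Doc]:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        docs = stage(docs)
    return docs


docs = [
    {"id": 1, "text": "GPU accelerated data curation for language models"},
    {"id": 2, "text": "too short"},
    {"id": 3, "text": "GPU accelerated data curation for language models"},
]

# Stages are plain callables, so they can be reordered, removed, or replaced.
curated = run_pipeline(docs, [word_count_filter(min_words=3), exact_dedup])
print([d["id"] for d in curated])
```

Because every stage has the same signature (a list of documents in, a list of documents out), adding a further step such as language identification would only require writing another callable and placing it in the list.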