MetaCLIP
MetaCLIP is a method for curating CLIP training data that prioritizes data quality over quantity. Its curation algorithm is transparent and scalable, operating on a raw pool of 300B+ image-text pairs from CommonCrawl without relying on prior models as filters. By preserving signal and reducing noise in this pool, it yields higher-quality training data than comparable open-source efforts. MetaCLIP reuses OpenAI CLIP's training setup so that model comparisons isolate the effect of the data, and it releases the metadata and the training data distribution so the pretraining dataset can be fully understood and reproduced in your own data pipeline.
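At a high level, the curation matches each raw image-text pair against a set of metadata entries and then balances the resulting distribution so that over-represented (head) entries are sub-sampled while rare (tail) entries are kept. The sketch below illustrates only that balancing idea under assumed inputs; the function and variable names (`curate`, `entry_counts`, `pairs_matched_entries`, the cap `t`) are illustrative and are not the project's actual API.

```python
import numpy as np

def curate(pairs_matched_entries, entry_counts, t=20_000, seed=0):
    """Illustrative balancing step: sub-sample pairs so that each metadata
    entry contributes at most ~t pairs in expectation. Head entries are
    down-sampled; tail entries are kept in full."""
    rng = np.random.default_rng(seed)
    # Probability of keeping a pair through a given matched entry.
    keep_prob = {e: min(1.0, t / c) for e, c in entry_counts.items()}
    kept = []
    for idx, entries in enumerate(pairs_matched_entries):
        # A pair survives if any one of its matched entries samples it.
        if any(rng.random() < keep_prob[e] for e in entries):
            kept.append(idx)
    return kept

# Toy example: entry 0 is a heavily matched head entry, entry 1 is a tail entry.
entry_counts = {0: 100_000, 1: 500}
pairs_matched_entries = [[0]] * 10 + [[1]] * 5 + [[0, 1]] * 3
print(curate(pairs_matched_entries, entry_counts, t=20_000))
```

In this toy run, pairs matching only the tail entry are always kept, while pairs matching only the head entry survive with probability roughly t divided by that entry's match count, which is how balancing flattens the head of the distribution without discarding rare concepts.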