wit - A Comprehensive Multimodal Multilingual Dataset Derived from Wikipedia

WIT: Wikipedia-based Image Text Dataset

The Wikipedia-based Image Text (WIT) Dataset is a large multimodal, multilingual dataset built from Wikipedia. It contains 37.6 million image-text examples covering 11.5 million unique images, drawn from Wikipedia in 108 languages. Its size and diversity make it well suited to pretraining multimodal machine learning models.

Key Advantages

The WIT dataset offers several unique benefits:

Largest Multimodal Dataset: At the time of its release, it was the largest publicly available multimodal dataset by number of image-text examples.
Multilingual Coverage: It is the first dataset of its kind, with image-text examples in 108 languages.
Page Level Metadata: Offers detailed metadata and contextual information at the page level.
Diverse Concepts and Entities: Represents a wide range of real-world concepts and entities.
Challenging Test Sets: Provides difficult real-world test sets for evaluating how robust a model is.

Latest Updates

WIT has achieved significant milestones over the years:

April 2021: The WIT paper was accepted at the SIGIR Conference.
September 2021: The WIT Image-Text Competition launched on Kaggle, accompanied by a significant release of image data for research.
April 2022: Awarded the Wikimedia Foundation's Research Award of the Year.
May 2022: Released the validation and test sets.
October 2022 to May 2023: Various proposals and datasets were released and accepted at conferences, contributing to ongoing research in the field.

Example Use Case

The Wikipedia page for Half Dome, in Yosemite National Park, California, illustrates how WIT examples are built. The images, text snippets, and metadata extracted from the page are combined into high-quality image-text examples suitable for multimodal modeling.
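To make the shape of one extracted example more concrete, the following is a minimal sketch of a single record as a Python dataclass. The field names are assumptions modeled on the kind of columns such a dataset exposes (page title, captions, context text) and should be checked against the released files. Every value in angle brackets is a placeholder, not text actually extracted from the page.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WitExample:
    """One image paired with the texts found around it (field names are assumed)."""
    language: str
    page_url: str
    image_url: str
    page_title: str
    section_title: Optional[str]
    caption_reference_description: Optional[str]
    caption_attribution_description: Optional[str]
    caption_alt_text_description: Optional[str]
    context_page_description: Optional[str]

    def texts(self) -> Dict[str, str]:
        """Return only the text fields that are present for this image."""
        candidates = {
            "page_title": self.page_title,
            "section_title": self.section_title,
            "caption_reference_description": self.caption_reference_description,
            "caption_attribution_description": self.caption_attribution_description,
            "caption_alt_text_description": self.caption_alt_text_description,
            "context_page_description": self.context_page_description,
        }
        return {name: value for name, value in candidates.items() if value}


# Placeholder values only; nothing here is real extracted data.
example = WitExample(
    language="en",
    page_url="https://en.wikipedia.org/wiki/Half_Dome",
    image_url="<image-url>",
    page_title="Half Dome",
    section_title=None,
    caption_reference_description="<caption shown under the image>",
    caption_attribution_description="<description from the image's own page>",
    caption_alt_text_description=None,
    context_page_description="<opening paragraph of the page>",
)

for name, value in example.texts().items():
    print(f"{name}: {value}")
```

Keeping the optional texts as separate fields, rather than merging them into one caption, lets a training pipeline choose which text to pair with the image.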

Motivation

WIT was developed to address two gaps in visio-linguistic modeling: the need for a large, rich dataset for learning image-text relationships, and the lack of multilingual image-text datasets. By covering both, WIT enables research in multilingual multimodal learning.

Dataset Composition

The WIT Dataset contains 37.6 million image-text examples, which at release made it the largest publicly available multimodal dataset, and it has broad multilingual coverage. It provides 12,000+ examples in each of 108 languages, with many languages exceeding 100,000 image-text pairs.

Detailed Dataset Numbers

Rows/Tuples: 37.6 million in total, including training, validation, and test sets.
Unique Images: 11.5 million
Reference Texts: 17.2 million in total
Attribution Texts: 35.2 million in total
Alt Texts: 5.4 million in total
Context Texts: 119.8 million
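
As a quick sanity check on the figures above, the short calculation below derives two averages from the listed counts. It is only arithmetic on the rounded numbers in this section, so the results are approximate.

```python
# Counts in millions, copied from the list above (rounded figures).
rows = 37.6
unique_images = 11.5
context_texts = 119.8

# Average number of rows in which each unique image appears.
rows_per_image = rows / unique_images

# Average number of context texts per row.
context_per_row = context_texts / rows

print(f"{rows_per_image:.2f} rows per unique image")
print(f"{context_per_row:.2f} context texts per row")
```

On average, each unique image appears in a little over three rows, which is what one would expect if the same image is reused across pages or languages.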

Availability

Researchers can download WIT to enhance multimodal and multilingual model training. The dataset provides a foundation for improved learning and representation techniques in real-world visio-linguistic tasks.
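
Once a file has been downloaded, a first step is often to count how many examples each language contributes. The sketch below assumes the data arrives as a gzip-compressed, tab-separated file with a header row and a column named "language". The file format, the column name, and the path in the usage comment are all assumptions to verify against the actual download. Only the Python standard library is used.

```python
import csv
import gzip
from collections import Counter
from typing import Dict, Iterator

# Context fields can be long, so raise the csv module's per-field size limit.
csv.field_size_limit(10**7)


def iter_wit_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row from a gzipped, tab-separated file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # Default csv quoting is assumed here; adjust if the files differ.
        reader = csv.DictReader(handle, delimiter="\t")
        yield from reader


def count_by_language(path: str, column: str = "language") -> Counter:
    """Count rows per language code; `column` is an assumed header name."""
    return Counter(row[column] for row in iter_wit_rows(path))


# Usage (hypothetical file name):
#   counts = count_by_language("wit_shard.tsv.gz")
#   for language, n in counts.most_common(10):
#       print(language, n)
```

Streaming rows one at a time keeps memory use flat, which matters for a dataset with tens of millions of rows.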

Citing WIT

Researchers who use the WIT dataset are asked to cite the WIT paper, published at the SIGIR 2021 conference.

Licensing

The WIT Dataset is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License, supporting its use in research and development.

Projects Utilizing WIT

The dataset has been influential in various projects and research studies, demonstrating its applications in diverse multimodal and multilingual learning scenarios.

For any additional information or inquiries, interested parties are encouraged to reach out to the project team via the contact information provided. The WIT dataset is an invaluable resource for researchers in the field, promoting advancements in understanding across languages and media through intelligent machine learning models.