WanJuan 1.0 - Versatile Multimodal Dataset Facilitating Enhanced AI Model Training

Introduction to Intern · WanJuan 1.0

Intern · WanJuan 1.0 is an open-source multimodal corpus released by the Shanghai AI Lab. It consists of a text dataset, an image-text dataset, and a video dataset, which together exceed 2TB of data. The corpus builds on the work of the large model data alliance and has been carefully processed and refined. Intern · WanJuan 1.0 is distinguished by four key characteristics: multiple integration, fine processing, value alignment, and ease of use and efficiency.

Multiple Integration

Intern · WanJuan 1.0 is a collection of multimodal data comprising text, images, and videos. The data spans domains such as science and technology, literature, media, education, and law. Integrating these diverse data types significantly improves the knowledge content, logical reasoning, and generalization capabilities of models trained on the corpus, making it a robust resource for future research and development.

Fine Processing

The data in Intern · WanJuan 1.0 has been carefully processed to ensure quality and reliability. Processing steps include language screening, text extraction, format standardization, and rule-based and model-based data filtering and cleaning, followed by multi-scale deduplication and data quality assessment. The result is a corpus ready to meet the demands of subsequent model training.
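
To make the filtering and deduplication steps more concrete, here is a minimal sketch in Python. It is not the pipeline used to build Intern · WanJuan 1.0: the thresholds, the normalization, and the function names (normalize, passes_rules, clean_and_deduplicate) are all assumptions chosen for illustration. It shows only one simple rule-based filter and exact deduplication by hashing, not the model-based filtering or multi-scale deduplication described above.

```python
import hashlib
import re

# Assumed thresholds for illustration only; not taken from the dataset's documentation.
MIN_CHARS = 20
MIN_ALNUM_RATIO = 0.5


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_rules(text: str) -> bool:
    """Toy rule-based filter: reject very short or mostly non-alphanumeric text."""
    if len(text) < MIN_CHARS:
        return False
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) >= MIN_ALNUM_RATIO


def clean_and_deduplicate(docs):
    """Yield documents that pass the rules and are not exact duplicates."""
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        if not passes_rules(norm):
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized content was already emitted
        seen.add(digest)
        yield doc
```

Hashing the normalized text catches only exact and near-trivial duplicates. Detecting near-duplicates at paragraph or document scale would need a different technique, such as MinHash, which this sketch does not attempt.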

Value Alignment

Value alignment is a core aspect of the Intern · WanJuan 1.0 project. Throughout its creation, researchers made an extra effort to align the content with mainstream Chinese values. The purity and integrity of the corpus were enhanced through a combination of algorithmic processes and manual evaluation. This ensures that the dataset not only retains quality but is also culturally and contextually appropriate.

Ease of Use and Efficiency

Designed with usability in mind, Intern · WanJuan 1.0 adopts a unified format complemented by detailed field descriptions and tool guidance. This makes the dataset easy to navigate and efficient to use, so it can be applied quickly to training multimodal large language models (MLLMs) or conventional large language models (LLMs).

Application and Performance

Intern · WanJuan 1.0 has already been used to train large-scale models such as Intern Multimodal and Intern Puyu. Trained on this corpus, the Intern series of models has demonstrated strong performance on generative tasks, including semantic understanding, knowledge question answering, visual understanding, and visual question answering.

Intern · WanJuan 1.0 - Text Dataset

The text component of Intern · WanJuan 1.0 consists of preprocessed text data derived from a wide range of sources: web pages, encyclopedias, books, patents, textbooks, and exam questions. It contains more than 500 million documents, over 1TB in total, stored in a uniform jsonl format. Each document has been cleaned, deduplicated, and value-aligned, resulting in a reliable, high-quality pre-training dataset.
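
Because the text dataset is distributed as jsonl (one JSON object per line), it can be streamed without loading a whole file into memory. The following is a minimal sketch using only the Python standard library. The function name iter_jsonl is hypothetical, and the sketch makes no assumption about which fields a record contains; consult the dataset's field descriptions for the actual schema.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the location so a corrupt line is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
```

A caller would iterate with a loop such as `for record in iter_jsonl(path): ...`, where path points at a downloaded shard. Streaming line by line keeps memory use flat regardless of file size, which matters for a corpus of this scale.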

Intern · WanJuan 1.0 - Image-Text Dataset

The image-text dataset comprises over 22 million documents and more than 140GB of data (excluding images), sourced mainly from public web pages. Its documents interleave images with text and cover areas such as news, people, landscapes, and social life. The images are referenced by URL and can be downloaded for further analysis and training.
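
Since the images are provided as URLs rather than files, a training setup typically needs a local cache. The sketch below shows one possible approach using only the Python standard library. The function names (local_path_for, download_image), the cache layout, and the list of recognized extensions are assumptions for illustration and are not part of the dataset's tooling. How the URLs are stored inside each record is not shown here; see the dataset's field descriptions.

```python
import hashlib
import os
import urllib.request
from urllib.parse import urlparse

# Assumed set of extensions to preserve; anything else gets a generic suffix.
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def local_path_for(url: str, root: str) -> str:
    """Map an image URL to a stable local file path under root."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = ".img"  # unknown type; inspect the bytes later if needed
    # Two-character subdirectories keep any single directory from growing too large.
    return os.path.join(root, digest[:2], digest + ext)


def download_image(url: str, root: str, timeout: float = 10.0) -> str:
    """Fetch one image unless it is already cached; return the local path."""
    path = local_path_for(url, root)
    if os.path.exists(path):
        return path
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)
    return path
```

Naming files by a hash of the URL makes downloads idempotent, so an interrupted run can be resumed safely. Some URLs in a web-sourced corpus may no longer resolve, so a real downloader would also need error handling, retries, and rate limiting, which are omitted here.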

In summary, Intern · WanJuan 1.0 is a versatile corpus designed to support AI research and development. With its integration of diverse data, careful processing, value alignment, and user-friendly design, it serves as a foundation for developing capable language and multimodal models.