MNBVC - Extensive Chinese Corpus from Multiple Cultural Origins

Introduction to the MNBVC Project

Overview

MNBVC (Massive Never-ending BT Vast Chinese corpus) is a project to build the largest Chinese internet text dataset. It was announced on January 1, 2023, by the long-established MOP Liwu Community, and it aims to collect and continuously update a corpus of Chinese text that represents both mainstream and niche cultures.

Project Goals

The primary objective of MNBVC is to gather a 40-terabyte dataset, 95.86% of which (about 38.3 TB) has already been completed. The corpus covers many forms of text, such as news articles, novels, lyrics, chat logs, and even arcane scripts.

Data Collection and Formatting

The data is sourced from the internet and distributed as password-protected compressed archives, in txt, json, jsonl, and parquet formats. Some data sources are listed as URLs in the links.txt file, but the project concentrates on the dataset itself rather than on its sources, to avoid legal issues around data copyright. The team has performed preliminary data cleaning and desensitized the data, with particular attention to long number strings.
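As an illustration of what desensitizing long number strings could look like, here is a minimal Python sketch. It is not the project's actual cleaning code: the function name mask_long_numbers, the 8-digit threshold, and the choice to replace digits with asterisks are all assumptions made for this example.

```python
import re


def mask_long_numbers(text: str, min_len: int = 8, mask_char: str = "*") -> str:
    """Replace every run of at least `min_len` ASCII digits with mask characters.

    Assumption: a "long number string" is 8 or more consecutive digits.
    The threshold and masking rule used by the MNBVC team are not stated
    in this document, so both are hypothetical here.
    """
    # [0-9] is used instead of \d so that only ASCII digits are matched.
    pattern = re.compile(r"[0-9]{%d,}" % min_len)
    # Keep the original length so surrounding text offsets stay the same.
    return pattern.sub(lambda m: mask_char * len(m.group()), text)


if __name__ == "__main__":
    sample = "电话 13800138000, 年份 2023"
    print(mask_long_numbers(sample))
```

Short numbers such as years are left alone under this rule, while longer runs such as phone or ID numbers are hidden.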

Usage and Availability

The MNBVC dataset has been organized and categorized, although the details of its indexing and categorization are kept private. The team asks users to use the data modestly and responsibly. The cleaned data will be uploaded progressively to Hugging Face under the dataset ID “liwu/MNBVC”.

Involvement and Community

The project welcomes collaborators, particularly people with Python skills and time to spare, to join its working groups, such as the OCR, question-answer, and data enhancement teams.

Tools and Technologies

The project has developed several specialized tools for processing Chinese text, including tools for encoding detection, bulk text-to-jsonl conversion, and directory sample extraction, among others. It also provides crawlers and scrapers for code repositories on platforms such as GitHub and Bitbucket, which feed the code-related part of the corpus.
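To make the idea of encoding detection followed by bulk text-to-jsonl conversion concrete, here is a minimal Python sketch that uses only the standard library. It does not reproduce the project's own tools: the function names, the candidate encoding list, and the record fields ("file", "encoding", "text") are assumptions chosen for illustration.

```python
import json
from pathlib import Path

# Assumption: these three encodings cover the input files. The order matters,
# because the first encoding that decodes without error is accepted.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5")


def detect_encoding(raw: bytes, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate encoding that decodes `raw`, or None."""
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


def txt_dir_to_jsonl(src_dir, out_path):
    """Write one JSON object per .txt file under `src_dir` to `out_path`.

    Files that match no candidate encoding are skipped.
    Returns the number of records written.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).rglob("*.txt")):
            raw = path.read_bytes()
            enc = detect_encoding(raw)
            if enc is None:
                continue
            record = {"file": path.name, "encoding": enc, "text": raw.decode(enc)}
            # ensure_ascii=False keeps Chinese characters readable in the output.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Trying decoders in order is only a heuristic: a permissive codec can accept bytes that were actually written in a different encoding, so a statistical detector is usually a better choice for noisy web text.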

Future and Contribution

Even users with very little time can take part by submitting documents through the “Corpus Spirit Bomb” project, which helps the corpus grow. The team is committed to providing continuous updates and to sharing these resources with researchers worldwide.

Access and Downloads

The project offers compressed dataset downloads through P2P Weili (微力, literally “micro force”) sync and through Baidu Netdisk. The downloads are updated as data cleaning progresses. For detailed download instructions, or to contribute, users can reach the project through the provided links.

Citation

If you use MNBVC data or tools in your work, the project asks that you cite it to acknowledge the team's efforts:

@misc{mnbvc,
  author = {{MOP-LIWU Community} and {MNBVC Team}},
  title = {MNBVC: Massive Never-ending BT Vast Chinese corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/esbatmop/MNBVC}},
}

MNBVC shows what a community-driven open-source project can achieve in AI and data science, and it underlines the importance of careful, comprehensive large-scale data collection and handling.