Introduction to DiffusionDB
DiffusionDB represents a groundbreaking development in the realm of AI and machine learning. It is the first large-scale text-to-image prompt dataset, boasting a staggering collection of 14 million images. These images are generated by Stable Diffusion—a cutting-edge AI model—using prompts and hyperparameters specified by real users. The vast scale and diversity of DiffusionDB open up exciting avenues for research into how prompts interact with generative models, the detection of deepfakes, and the creation of intuitive human-AI interaction tools that make these models more accessible.
Getting Started
The dataset is available on 🤗 Hugging Face Datasets, which simplifies access for researchers and developers alike.
Subsets of DiffusionDB
DiffusionDB is divided into two subsets that cater to different user needs: DiffusionDB 2M and DiffusionDB Large.
- DiffusionDB 2M: This subset contains 2 million images and approximately 1.5 million unique prompts, with a total size of 1.6TB. Images are stored in the png format.
- DiffusionDB Large: This more comprehensive subset includes 14 million images and 1.8 million unique prompts, with a total size of 6.5TB. Images here are preserved in a lossless webp format.
Dataset Structure
The dataset is organized in a modular file structure to manage the enormous quantity of data effectively:
- Each subset consists of multiple folders, each containing 1,000 images and a JSON file. This JSON file links every image to its corresponding prompts and hyperparameters.
- Images in DiffusionDB 2M are in 2,000 folders, while those in DiffusionDB Large are distributed across 14,000 folders.
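The folder layout described above can be sketched in code. The snippet below builds a miniature stand-in for one part folder (a JSON file mapping each image name to its prompt and hyperparameters) and reads it back. The folder name, file name, and field names ("p" for prompt, "se" for seed, "st" for steps) are illustrative assumptions, not the dataset's confirmed schema, so check the dataset card for the exact keys.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Build a miniature stand-in for one DiffusionDB part folder.
# Field names ("p" = prompt, "se" = seed, "st" = steps) are
# illustrative assumptions -- verify against the dataset card.
with TemporaryDirectory() as tmp:
    part = Path(tmp) / "part-000001"
    part.mkdir()
    metadata = {
        "example-image.png": {"p": "a watercolor fox", "se": 42, "st": 50},
    }
    (part / "part-000001.json").write_text(json.dumps(metadata))

    # Reading it back: look up the prompt for each image in the folder.
    records = json.loads((part / "part-000001.json").read_text())
    for image_name, params in records.items():
        print(image_name, "->", params["p"])
```

Because every part folder is self-describing in this way, each of the thousands of folders can be downloaded and processed independently.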
Understanding the Metadata
Metadata for DiffusionDB is provided so that prompts and other image attributes can be accessed without downloading the massive image files. It offers comprehensive data fields, including unique image names, prompts, and various hyperparameters, which makes the dataset easier to analyze and use. Each entry also includes NSFW scores produced by detection models.
Methods for Loading DiffusionDB
Given its immense size, users are offered several ways to load DiffusionDB:
- Using the Hugging Face Datasets loader: This method uses the Hugging Face Datasets library to load specific subsets of DiffusionDB.
- Downloading via script: Users have access to a Python script to download their preferred parts of the dataset, with added options to manage download destinations and unzip files post-download.
- Accessing metadata: For tasks only involving text data, users can load the comprehensive metadata.parquet file directly for analysis.
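For text-only work, the third route is the lightest: load the metadata table and query it with pandas. The sketch below uses a tiny in-memory frame as a stand-in for metadata.parquet, and the column names (image_name, prompt, cfg) are assumptions to be verified against the dataset card rather than the confirmed schema.

```python
import pandas as pd

# In practice you would load the downloaded file:
#   df = pd.read_parquet("metadata.parquet")
# Here, a tiny in-memory stand-in with assumed column names
# (image_name, prompt, cfg) -- verify against the dataset card.
df = pd.DataFrame(
    {
        "image_name": ["a.png", "b.png", "c.png"],
        "prompt": ["a castle at dusk", "a castle at dawn", "portrait of a cat"],
        "cfg": [7.0, 7.5, 9.0],
    }
)

# Text-only analysis: e.g., find all prompts mentioning "castle".
castle_prompts = df[df["prompt"].str.contains("castle")]["prompt"].tolist()
print(castle_prompts)  # ['a castle at dusk', 'a castle at dawn']
```

If you do want the images as well, the Hugging Face Datasets library route handles the download for you; the repository identifier and available subset configurations are listed on the dataset card.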
Dataset Creation and Maintenance
DiffusionDB's collection is derived from the Stable Diffusion Discord server, ensuring an authentic dataset. Should users find objectionable content, there is a transparent process to report and request the removal of certain images or prompts.
Contribution and Licensing
This dataset is the work of a talented team including Jay Wang, Evan Montoya, David Munechika, Alex Yang, Ben Hoover, and Polo Chau. DiffusionDB is licensed under the CC0 1.0 License, while the associated code is available under the MIT License, promoting open-access use and further adaptation.
In summary, DiffusionDB provides an invaluable resource for anyone studying generative models, AI prompt engineering, or digital creativity. Its extensive image database and corresponding metadata support a broad spectrum of research and development initiatives.