ydata-synthetic - Synthetic Data Generation with Improved Privacy and Performance

Introduction to YData-Synthetic

YData-Synthetic is an open-source package first developed in 2020, primarily to teach users about generative models for creating synthetic data. It was designed for exploratory research and learning, and was not optimized for the performance and scalability that organizations typically require.

What is Synthetic Data?

Synthetic data consists of artificially generated information that mirrors the statistical properties of real data without containing any identifiable information. This approach helps in maintaining individuals’ privacy while still providing valuable data for analysis.
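
As a rough illustration of what "mirrors the statistical properties" means, the sketch below fits a mean and standard deviation to a single numeric column and then samples new values from a Gaussian with those parameters. This is a deliberately simple stand-in, not how ydata-synthetic or ydata-sdk generate data (they use generative models such as GANs), and it does not by itself guarantee privacy. The column values and the function names fit_gaussian and sample_synthetic are hypothetical.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(mean, stdev, n, seed=None):
    """Draw n synthetic values from a Gaussian with the fitted parameters."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" column (ages); no actual dataset is implied.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31, 44, 36]

mean, stdev = fit_gaussian(real_ages)
synthetic_ages = sample_synthetic(mean, stdev, n=1000, seed=0)

# The synthetic column contains none of the original records,
# yet its summary statistics stay close to those of the real column.
print(f"real:      mean={mean:.2f} stdev={stdev:.2f}")
print(
    f"synthetic: mean={statistics.mean(synthetic_ages):.2f} "
    f"stdev={statistics.stdev(synthetic_ages):.2f}"
)
```

A real generator also has to preserve relationships between columns, which a single per-column Gaussian cannot capture. That is the gap the generative models listed later in this document are meant to fill.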

Why Use Synthetic Data?

Synthetic data serves multiple purposes, such as:

Ensuring privacy compliance for data-sharing and machine learning development.
Removing bias from datasets.
Balancing datasets for uniform representation.
Augmenting datasets to enhance the size and diversity of data available.

YData also offers a product called YData Fabric, which is a comprehensive solution for generating high-quality synthetic datasets. It provides a full UI experience, from data preparation to evaluation.

Transition from ydata-synthetic to ydata-sdk

The evolution of ydata-synthetic into ydata-sdk marks a significant step forward in synthetic data generation. The ydata-sdk offers users a single API that automatically chooses the most suitable generative model for the user's data. This advancement removes the necessity for manual selection from a list of models, which have historically included:

GAN (Generative Adversarial Network)
CGAN (Conditional GAN)
WGAN (Wasserstein GAN)
WGAN-GP (Wasserstein GAN with Gradient Penalty)
DRAGAN (Deep Regret Analytic GAN)
Cramer GAN (Cramer Distance Solution to Biased Wasserstein Gradients)
CWGAN-GP (Conditional Wasserstein GAN with Gradient Penalty)
CTGAN (Conditional Tabular GAN)
TimeGAN (for time-series data)
DoppelGANger (for time-series data)

With the ydata-sdk, model selection is automated, so users no longer need to choose a model manually or tune hyperparameters by hand to get good-quality output.

Quickstart

To start using ydata-sdk, simply install it via the Python Package Index (PyPI):

pip install ydata-sdk
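
After installing, one way to confirm that the package is visible to the current Python environment is to query its installed version through the standard library. The sketch below uses only importlib.metadata and does not call any ydata-sdk API. The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "ydata-sdk" is the distribution name used in the pip command above.
sdk_version = installed_version("ydata-sdk")
if sdk_version is None:
    print("ydata-sdk is not installed; run: pip install ydata-sdk")
else:
    print(f"ydata-sdk {sdk_version} is installed")
```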

UI Guide and Examples

YData Fabric provides a user interface to guide users through the steps and required inputs for generating structured synthetic data. Users can get started with YData Fabric by registering for the Community version.

For usage examples, YData offers several scenarios, such as generating synthetic data from the Titanic Kaggle dataset or creating time-series synthetic data. More examples are continuously added to the examples directory.

Datasets for Experimentation

The project provides several datasets for practical experimentation, including:

Tabular datasets:
- Adult Census Income
- Credit Card Fraud
- Cardiovascular Disease
Sequential datasets:
- Stock data
- FCC MBA data

Community and Support

Users seeking assistance with the YData tools can join the project’s active Discord community for support and discussion. This community is known for providing quick responses and fostering a collaborative environment.

For further resources, or to reach the YData team directly, users are encouraged to visit the Frequently Asked Questions page or schedule a chat.

Licensing

The project is available under the MIT License, ensuring that users can freely use, modify, and distribute the software according to their needs.

Join YData in exploring the future of synthetic data generation with ydata-sdk and immerse yourself in the advancements of Generative AI.