Introduction to YData-Synthetic
YData-Synthetic is an open-source package initially developed in 2020, with a primary focus on educating users about generative models for creating synthetic data. The tool was designed to facilitate exploratory research and learning. However, it was not specifically optimized for the performance and scalability that organizations typically require.
What is Synthetic Data?
Synthetic data consists of artificially generated information that mirrors the statistical properties of real data without containing any identifiable information. This approach helps in maintaining individuals’ privacy while still providing valuable data for analysis.
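To make "mirrors the statistical properties" concrete, here is a minimal sketch that uses only the Python standard library. It is a toy illustration, not how ydata-synthetic or ydata-sdk work internally: it assumes a single numeric column that is roughly Gaussian, and the `real_ages` values and helper names are hypothetical.

```python
import random
import statistics


def fit_gaussian(values):
    """Summarise a numeric column by its mean and sample standard deviation."""
    return statistics.fmean(values), statistics.stdev(values)


def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n new values from a normal distribution with the fitted parameters."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" column (assumption: made-up ages, not from any dataset).
real_ages = [23, 35, 41, 29, 52, 47, 38, 31, 60, 44]

mean, stdev = fit_gaussian(real_ages)
synthetic_ages = sample_synthetic(mean, stdev, n=5000)

# The synthetic column has similar summary statistics, but none of its
# values is copied from a real record.
print("real mean/stdev:     ", round(mean, 2), round(stdev, 2))
print("synthetic mean/stdev:", round(statistics.fmean(synthetic_ages), 2),
      round(statistics.stdev(synthetic_ages), 2))
```

Real generative models go much further than this: they capture relationships between columns, categorical values, and temporal structure instead of one column's mean and spread.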
Why Use Synthetic Data?
Synthetic data serves multiple purposes, such as:
- Ensuring privacy compliance for data-sharing and machine learning development.
- Removing bias from datasets.
- Balancing datasets for uniform representation.
- Augmenting datasets to enhance the size and diversity of data available.
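As a rough illustration of the balancing use case, the sketch below evens out class counts with naive random oversampling. This is a deliberately simple stand-in: a generative model would create new, previously unseen rows instead of duplicating existing ones. The `rows` data, the `fraud` label, and the `oversample_minority` helper are hypothetical and not part of any YData API.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Duplicate rows of under-represented classes until all classes match the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Top up smaller classes with randomly chosen copies of their own rows.
        balanced.extend(dict(rng.choice(group)) for _ in range(target - len(group)))
    return balanced


# Hypothetical imbalanced dataset: 8 legitimate transactions, 2 fraudulent ones.
rows = [{"amount": 10 * i, "fraud": 0} for i in range(1, 9)]
rows += [{"amount": 950, "fraud": 1}, {"amount": 1200, "fraud": 1}]

balanced = oversample_minority(rows, label_key="fraud")
counts = Counter(row["fraud"] for row in balanced)
print(counts)
```

Duplicated rows add no new information, which is the main reason synthetic generation is attractive for this task.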
YData also offers a product called YData Fabric, which is a comprehensive solution for generating high-quality synthetic datasets. It provides a full UI experience, from data preparation to evaluation.
Transition from ydata-synthetic to ydata-sdk
The evolution of ydata-synthetic into ydata-sdk marks a significant step forward in synthetic data generation. The ydata-sdk offers a single API that automatically chooses the most suitable generative model for the user's data. This removes the need to select a model manually from a list that has historically included:
- GAN (Generative Adversarial Network)
- CGAN (Conditional GAN)
- WGAN (Wasserstein GAN)
- WGAN-GP (Wasserstein GAN with Gradient Penalty)
- DRAGAN (Deep Regret Analytic GAN)
- Cramer GAN (Cramer Distance Solution to Biased Wasserstein Gradients)
- CWGAN-GP (Conditional Wasserstein GAN with Gradient Penalty)
- CTGAN (Conditional Tabular GAN)
- TimeGAN (for time-series data)
- DoppelGANger (for time-series data)
With the ydata-sdk, model selection is automated, aiming for high-quality output without requiring manual intervention or tedious hyperparameter tuning.
Quickstart
To start using ydata-sdk, install it from the Python Package Index (PyPI):
pip install ydata-sdk
UI Guide and Examples
YData Fabric provides a user interface to guide users through the steps and required inputs for generating structured synthetic data. Users can get started with YData Fabric by registering for the Community version.
For usage examples, YData offers several scenarios, such as generating synthetic data from the Titanic Kaggle dataset or creating time-series synthetic data. More examples are continuously added to the examples directory.
Datasets for Experimentation
The project provides several datasets for practical experimentation, including:
- Tabular datasets:
  - Adult Census Income
  - Credit Card Fraud
  - Cardiovascular Disease
- Sequential datasets:
  - Stock data
  - FCC MBA data
Community and Support
Users seeking assistance with the YData tools can join the project’s active Discord community for support and discussion. This community is known for providing quick responses and fostering a collaborative environment.
For more resources, or to engage directly with the YData team, users are encouraged to visit the Frequently Asked Questions page or schedule a chat.
Licensing
The project is available under the MIT License, ensuring that users can freely use, modify, and distribute the software according to their needs.
Join YData in exploring the future of synthetic data generation with ydata-sdk and immerse yourself in the advancements of Generative AI.