gretel-synthetics - Explore Advanced Synthetic Data Creation with Neural Networks

Gretel Synthetics

Gretel Synthetics is a library designed by Gretel.ai to create synthetic data that users can freely use and experiment with. This library is open-source and caters to developers and data scientists who are keen on using machine learning models to generate synthetic data. It offers tools for transforming real datasets into synthetic counterparts, preserving privacy while maintaining utility.

Documentation and Getting Started

To help users get started, Gretel Synthetics provides comprehensive documentation. This includes guides on configuration, training models, and generating synthetic records. Beginners can explore the library through interactive tutorials available in Jupyter notebooks, easily accessible via Google Colab.

Installation

Installation takes two steps. First, install the framework dependencies yourself: TensorFlow, SDV (Synthetic Data Vault), and PyTorch are not pulled in automatically, so install only the ones required by the models you plan to use. With pip, the commands look like this:

pip install tensorflow==2.12.1        # Required for LSTM model
pip install "sdv<0.18"                 # Required for ACTGAN model (quoted so the shell does not treat < as a redirect)
pip install torch==2.0                 # Required for Timeseries DGAN
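
Because each model depends on a different framework, it can help to check which of these optional dependencies are importable before choosing a model. The following is a minimal sketch using only the Python standard library. The helper name available_backends and the model-to-module mapping are illustrative assumptions derived from the comments above, not part of the gretel-synthetics API.

```python
from importlib.util import find_spec

# Top-level module names for the frameworks listed above (assumed mapping).
MODEL_BACKENDS = {
    "LSTM": "tensorflow",
    "ACTGAN": "sdv",
    "Timeseries DGAN": "torch",
}


def available_backends(backends=MODEL_BACKENDS):
    """Return {model name: bool}, True when the backend module can be found."""
    return {
        model: find_spec(module) is not None
        for model, module in backends.items()
    }


if __name__ == "__main__":
    for model, ok in available_backends().items():
        print(f"{model}: {'installed' if ok else 'missing'}")
```

The check only confirms that a module can be located; it does not verify that the installed version matches the pins shown above.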

To install the Gretel Synthetics package itself, one can clone the repository and use:

pip install -U .

or simply:

pip install gretel-synthetics

Suggested Development Environment

For users seeking to use a GPU for faster processing, setting up a virtual environment with Conda is recommended. The library also provides a setup script to ease the installation of necessary software packages for GPU usage, particularly suitable for Ubuntu 18.04 environments.

Model Overviews

Timeseries DGAN

Gretel Synthetics provides a timeseries DoppelGANger (DGAN) module, implemented in PyTorch. This module is specifically optimized for handling timeseries data.

ACTGAN

ACTGAN is an extension of the classic CTGAN model, boasting improvements in memory usage and data transformation capabilities. It leverages the functionalities of the SDV library, designed for creating high-quality synthetic data.

LSTM

This is a user-friendly module for generating synthetic data with neural networks. It uses TensorFlow as its backend, so users do not have to work with the framework directly. It provides two primary modes:

Simple Mode: Tailored for line-by-line training on text files.
DataFrame Mode: Suitable for CSV or DataFrames, organizing data in batches of columns for model training.
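
The column batching used in DataFrame Mode can be pictured as splitting a wide table into narrower groups of columns and rendering each group as delimited text lines for training. The sketch below illustrates that idea with the standard library only. The function names, the batch size, and the sample rows are assumptions made for illustration, not the library's implementation.

```python
def split_into_column_batches(columns, batch_size):
    """Group column names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [columns[i:i + batch_size] for i in range(0, len(columns), batch_size)]


def batch_to_lines(rows, batch, delimiter=","):
    """Render one column batch as delimited text lines, one line per record."""
    return [delimiter.join(str(row[col]) for col in batch) for row in rows]


# Made-up records standing in for a CSV file or DataFrame.
rows = [
    {"id": 1, "city": "Austin", "age": 34, "plan": "basic", "spend": 12.5},
    {"id": 2, "city": "Boston", "age": 51, "plan": "pro", "spend": 40.0},
]
columns = list(rows[0])

batches = split_into_column_batches(columns, batch_size=2)

for batch in batches:
    print(batch, batch_to_lines(rows, batch))
```

Smaller batches give each model fewer columns to learn at once, at the cost of training more models for the same table.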

Core Components

Gretel Synthetics consists of essential components crucial for synthetic data generation:

Configurations: Define parameters necessary for model training and data generation, with options to save and archive models.
Tokenizers: Convert text data into integer IDs for processing by underlying machine learning engines.
Training: Involves setting up configurations and tokenizers to build models that can generate data.
Generation: Utilizes trained models to produce new data records, ensuring they comply with given constraints through optional validation.
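
Two of the components above, tokenization and validated generation, can be illustrated with a toy example. This is a sketch under stated assumptions rather than the library's code: CharTokenizer, is_valid_record, and generate_valid are hypothetical names, the tokenizer is a simple character-level one, and the candidate lines stand in for the output of a trained model.

```python
class CharTokenizer:
    """Toy character-level tokenizer: maps each distinct character to an integer ID."""

    def __init__(self, lines):
        vocab = sorted(set("".join(lines)))
        self.char_to_id = {ch: i for i, ch in enumerate(vocab)}
        self.id_to_char = {i: ch for ch, i in self.char_to_id.items()}

    def encode(self, text):
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)


def is_valid_record(line, expected_fields=2):
    """Hypothetical validator: expects 'name,age' where age is all digits."""
    parts = line.split(",")
    return len(parts) == expected_fields and parts[1].isdigit()


def generate_valid(candidates, validator, num_lines):
    """Collect candidate lines that pass the validator, up to num_lines."""
    kept = []
    for line in candidates:
        if len(kept) >= num_lines:
            break
        if validator(line):
            kept.append(line)
    return kept


training_lines = ["ann,34", "bob,51"]
tokenizer = CharTokenizer(training_lines)
ids = tokenizer.encode("bob,51")  # list of integer IDs, one per character

# Stand-in for model output; a real model would sample these lines.
candidates = ["ann,51", "bob", "ann,3x", "bob,34"]
print(generate_valid(candidates, is_valid_record, num_lines=2))
```

Lines that fail validation are simply discarded here, so generation keeps drawing candidates until enough valid records have been collected or the candidates run out.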

Utilities and Privacy

The library includes utilities for advanced model training and dataset evaluation, available by installing the optional utils extra (pip install "gretel-synthetics[utils]"). It also supports differential privacy in TensorFlow mode, building on the TensorFlow Privacy library. The privacy protection is applied while the model is trained, which limits how much the generated data can reveal about individual training records.
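
TensorFlow Privacy implements differentially private training through DP-SGD, which clips each example's gradient and adds Gaussian noise before the model update. The following pure-Python sketch shows only that clip-and-noise step. It is an illustration of the general technique, not code from gretel-synthetics or TensorFlow Privacy, and the function names, sample gradients, and parameter values are assumptions.

```python
import math
import random


def clip_gradient(grad, l2_norm_clip):
    """Scale a per-example gradient so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, l2_norm_clip) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    stddev = noise_multiplier * l2_norm_clip  # noise scales with the clip bound
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(clipped) for n in noisy]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
grads = [[3.0, 4.0], [0.3, 0.4]]  # made-up per-example gradients
print(private_mean_gradient(grads, l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds the influence any single record can have on an update, and the noise masks what remains. A larger noise multiplier gives stronger privacy at the cost of model accuracy.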

Overall, Gretel Synthetics provides a flexible and comprehensive solution for generating synthetic data, making it easier for developers and researchers to access and experiment with machine learning models safely and efficiently.