Introduction to REaLTabFormer
REaLTabFormer is a tool for generating synthetic data. It uses transformer models to handle both relational and non-relational tabular data within a single framework, producing realistic datasets for data analysis and machine learning applications.
What is REaLTabFormer?
REaLTabFormer stands for "Realistic Relational and Tabular Data using Transformers." For non-relational tabular data, it uses a GPT-2 model, which can handle any tabular data with independent observations out of the box. For relational datasets, it uses a sequence-to-sequence (Seq2Seq) model. This makes it applicable to both single-table datasets and datasets spread across related tables.
Installation
REaLTabFormer is available on PyPI, the Python Package Index, and requires Python 3.7 or later. To install it, run:
pip install realtabformer
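To confirm that the environment meets the stated requirement, a quick check along the following lines can help. This is a minimal sketch using only the standard library; the helper names python_is_supported and installed_version are illustrative and are not part of REaLTabFormer.

```python
import sys

try:
    # importlib.metadata is in the standard library from Python 3.8 onward
    from importlib import metadata
except ImportError:  # Python 3.7 does not ship importlib.metadata
    metadata = None


def python_is_supported(version_info=sys.version_info, minimum=(3, 7)):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def installed_version(package="realtabformer"):
    """Return the installed version string of a package, or None if unavailable."""
    if metadata is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python version supported:", python_is_supported())
print("realtabformer version:", installed_version())
```

If installed_version() returns None, the package is not visible to the current interpreter, which often means pip installed it into a different environment.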
Usage
REaLTabFormer can be used to train a model on real data and then generate synthetic data from the trained model. It supports both regular (non-relational) tabular data and relational datasets (tables that are linked to other tables).
Generating Non-Relational Tabular Data
For non-relational tabular data, REaLTabFormer trains a model on your data, fitting it to mimic the real data distribution as closely as possible. Once the model is trained, it can generate as many synthetic samples as requested; the example here generates the same number of observations as the original dataset.
import pandas as pd
from realtabformer import REaLTabFormer
# Load data
df = pd.read_csv("foo.csv")
# Initialize and fit model
rtf_model = REaLTabFormer(model_type="tabular")
rtf_model.fit(df)
# Save the fitted model to disk
rtf_model.save("rtf_model/")
# Generate synthetic data with as many rows as the real dataset
samples = rtf_model.sample(n_samples=len(df))
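One simple way to check how closely synthetic data mimics the real distribution is to compare category frequencies column by column. The sketch below computes the total variation distance between two categorical samples. It is an illustrative check and not part of REaLTabFormer; the function name and the toy values are assumptions made for the example.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Total variation distance between two empirical categorical distributions.

    Returns a value in [0, 1]: 0 means identical category frequencies,
    1 means the two samples share no categories.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = sum(real_counts.values())
    n_synth = sum(synth_counts.values())
    if n_real == 0 or n_synth == 0:
        raise ValueError("both inputs must contain at least one value")
    # Compare relative frequencies over every category seen in either sample
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Toy example with made-up values (not output from REaLTabFormer)
real = ["a", "a", "b", "b"]
synthetic = ["a", "b", "b", "b"]
print(total_variation_distance(real, synthetic))
```

The function accepts any iterable, so a categorical column from the real DataFrame and the same column from the generated samples could be passed directly. Numeric columns would need binning or a different measure.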
Generating Relational Data
For relational data, where records in one table depend on records in another, REaLTabFormer first models the "parent" table and then trains a "child" model that generates related records based on the parent observations.
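import os
from pathlib import Path
import pandas as pd
from realtabformer import REaLTabFormer
# Load parent and child data
parent_df = pd.read_csv("foo.csv")
child_df = pd.read_csv("bar.csv")
join_on = "unique_id"
# Ensure both tables share the key column
assert (join_on in parent_df.columns) and (join_on in child_df.columns)
# Fit the parent model without the key column
parent_model = REaLTabFormer(model_type="tabular")
parent_model.fit(parent_df.drop(join_on, axis=1))
# Save the parent model so the child model can load it; save() writes to an
# experiment subdirectory (id...) inside the target directory
pdir = Path("rtf_parent/")
parent_model.save(pdir)
# Use the most recently saved parent experiment directory
parent_model_path = sorted(
    [p for p in pdir.glob("id*") if p.is_dir()],
    key=os.path.getmtime,
)[-1]
# Fit the child model on the child table, linked to the parent table
child_model = REaLTabFormer(
    model_type="relational",
    parent_realtabformer_path=parent_model_path,
    train_size=0.8,
)
child_model.fit(df=child_df, in_df=parent_df, join_on=join_on)
# Generate parent samples
parent_samples = parent_model.sample(len(parent_df))
# The parent model was trained without the key, so create ids from the index
parent_samples.index.name = join_on
parent_samples = parent_samples.reset_index()
# Generate child samples linked to the synthetic parents
child_samples = child_model.sample(
    input_unique_ids=parent_samples[join_on],
    input_df=parent_samples.drop(join_on, axis=1),
)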
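After generating relational samples, it is worth confirming that every synthetic child record points at a synthetic parent that actually exists. The following is a minimal sketch of such a referential integrity check. The function name orphan_child_ids is hypothetical and not part of REaLTabFormer, and how the key is exposed on the generated child table (as a column or as the index) should be checked against the library's actual output.

```python
from typing import Hashable, Iterable, List


def orphan_child_ids(
    parent_ids: Iterable[Hashable], child_ids: Iterable[Hashable]
) -> List[Hashable]:
    """Return the distinct child keys that have no matching parent key.

    The result preserves the order in which orphan keys first appear.
    """
    parents = set(parent_ids)
    seen = set()
    orphans = []
    for cid in child_ids:
        if cid not in parents and cid not in seen:
            seen.add(cid)
            orphans.append(cid)
    return orphans


# Toy example with made-up keys
print(orphan_child_ids([1, 2, 3], [1, 1, 4, 2, 5, 4]))
```

An empty result means the parent-child relationship is intact. Any iterable of keys works, including a pandas column or index holding the key values.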
Validators for Synthetic Samples
REaLTabFormer includes validators for checking synthetic samples. A validator such as GeoValidator filters out unrealistic data points so that generated data stays plausible, for example by keeping geographic coordinates within an expected boundary.
from realtabformer import rtf_validators as rtf_val
# geo_boundary_polygon must be defined beforehand as the boundary geometry,
# and rtf_model must be a fitted tabular model
# Define geographic validator with a polygon boundary
obs_validator = rtf_val.ObservationValidator()
obs_validator.add_validator(
    "geo_validator",
    rtf_val.GeoValidator(geo_boundary_polygon),
    # Columns passed to the validator
    ("Longitude", "Latitude"),
)
# Sample with validation
samples_validated = rtf_model.sample(
    n_samples=10240,
    validator=obs_validator,
)
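To make the idea of a geographic validator concrete, the sketch below shows the kind of test such a filter performs: a ray-casting point-in-polygon check applied to rows with Longitude and Latitude fields. This is an illustration in plain Python under simplifying assumptions (a single polygon without holes, planar coordinates). It is not GeoValidator's implementation, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vertex = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(lon: float, lat: float, polygon: Sequence[Vertex]) -> bool:
    """Ray-casting test: count edge crossings of a ray going east from the point.

    Points lying exactly on the boundary may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def keep_valid_rows(rows: List[Dict[str, float]], polygon: Sequence[Vertex]):
    """Keep only rows whose coordinates fall inside the polygon."""
    return [
        row for row in rows
        if point_in_polygon(row["Longitude"], row["Latitude"], polygon)
    ]


# Toy boundary and rows with made-up coordinates
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
rows = [
    {"Longitude": 5.0, "Latitude": 5.0},
    {"Longitude": 15.0, "Latitude": 5.0},
]
print(keep_valid_rows(rows, square))
```

Real boundaries such as country or state borders are usually multi-part geometries, which is where a dedicated geometry library is the better choice over a hand-written test like this one.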
Conclusion
REaLTabFormer provides a single framework for generating synthetic tabular and relational data with transformer models. Its simple interface and features such as sample validators make it useful for researchers and organizations working on data analysis, privacy, and machine learning. Because the synthetic data is designed to resemble real-world datasets, it can support data exploration in settings where sharing the original data would raise privacy concerns.
Acknowledgments
The development of REaLTabFormer was supported by the World Bank-UNHCR Joint Data Center on Forced Displacement. A portion of the funding was geared towards researching disclosure risk and the mosaic effect in synthetic data generation. Special thanks are extended to HuggingFace and the numerous open-source contributors for their resources and inspiration.