data-validation - Efficiently Explore and Validate Machine Learning Data at Scale

An Introduction to TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is part of the TensorFlow Extended (TFX) ecosystem and is meant to catch data quality problems before model training begins.

Key Features of TFDV

TFDV covers the following data handling needs in machine learning:

Scalable Summary Statistics: The library can efficiently generate summary statistics for both training and test datasets, which helps in understanding the underlying data distributions.
Data Distribution Visualization: Integration with Facets allows users to visually inspect data distributions and compare pairs of features to better understand their relationships or discrepancies within the data.
Automated Data Schema Generation: TFDV can automatically generate a data schema. This schema outlines the expected structure and values of the dataset, such as required fields, ranges of numerical values, and permissible vocabularies.
Anomaly Detection: TFDV can automatically detect anomalies such as missing features, out-of-range values, or mismatched feature types. Catching these early reduces the risk of training a model on faulty data.
Anomalies Viewer: This visualization tool allows users to inspect and address anomalies, providing detailed insights to correct data issues effectively.
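
The statistics, schema, and anomaly detection features listed above fit together roughly as follows. This is a minimal sketch, not a reference workflow: the toy rows, the file names, and the helper names write_csv, find_anomalous_features, and run_demo are made up for illustration. The TFDV calls used are generate_statistics_from_csv, infer_schema, and validate_statistics.

```python
import csv
import os
import tempfile


def write_csv(path, rows):
    """Write a list of dicts to `path` with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def find_anomalous_features(train_csv, eval_csv):
    """Infer a schema from training data and list eval features that violate it."""
    # Imported inside the function so the CSV helper stays usable without TFDV.
    import tensorflow_data_validation as tfdv

    # Summary statistics for the training split (runs a local Beam pipeline).
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    # The schema records expectations such as types and string vocabularies.
    schema = tfdv.infer_schema(statistics=train_stats)
    # Statistics for the evaluation split, checked against the training schema.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    # `anomaly_info` maps each offending feature name to a description.
    return sorted(anomalies.anomaly_info.keys())


# Made-up toy data: the eval split has a country code never seen in training.
TRAIN_ROWS = [
    {"age": 34, "country": "US"},
    {"age": 51, "country": "CA"},
    {"age": 27, "country": "US"},
]
EVAL_ROWS = [
    {"age": 45, "country": "US"},
    {"age": 30, "country": "XX"},
]


def run_demo():
    """Write the toy splits to temporary CSV files and validate them."""
    with tempfile.TemporaryDirectory() as tmp:
        train_csv = os.path.join(tmp, "train.csv")
        eval_csv = os.path.join(tmp, "eval.csv")
        write_csv(train_csv, TRAIN_ROWS)
        write_csv(eval_csv, EVAL_ROWS)
        return find_anomalous_features(train_csv, eval_csv)
```

Calling run_demo() with TFDV installed should flag the country feature, since the value XX is absent from the vocabulary inferred from the training split. In a notebook, tfdv.visualize_statistics, tfdv.display_schema, and tfdv.display_anomalies render the same objects with the Facets-based viewers.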

Getting Started with TFDV

To start using TensorFlow Data Validation:

Follow the get started guide to understand its basic functionalities.
An example notebook is available to showcase how TFDV works in practice.

The techniques implemented in TFDV are described in a technical paper published at the SysML'19 conference.

Installation and Build Options

Installing from PyPI

The simplest method for installation is via PyPI. Use the command:

pip install tensorflow-data-validation

Nightly Packages

For those looking to stay on the cutting edge with the latest updates, TFDV provides nightly packages. These can be installed using:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

Keep in mind that nightly packages are experimental and may have stability issues.

Building with Docker

Docker is the recommended way to build TFDV under Linux. The process involves:

Installing Docker and Docker Compose.
Cloning the TFDV repository.
Building the pip package using Docker commands.
Installing the resultant package.

Building from Source

Building from source requires:

Installing NumPy and Bazel.
Cloning the TFDV repository.
Building the pip package with the Python binary for the Python version you are targeting (the wheel is Python-version dependent), then installing it.

Platform Support and Dependencies

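TFDV is tested on 64-bit macOS (12.5 or later) and Ubuntu (20.04 or later). It requires TensorFlow, Apache Beam, and Apache Arrow. Beam provides the distributed processing: it runs in local mode by default and can also run on distributed runners such as Google Cloud Dataflow. Arrow is used to represent data internally so that statistics can be computed with vectorized NumPy functions.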
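
The platform requirements above can be turned into a quick preflight check. The sketch below is an illustration only: the function names are hypothetical, the thresholds are copied from this section, and other platforms may work even though the check reports False for them.

```python
import platform
import sys


def parse_version(text):
    """Turn '12.5.1' into (12, 5, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_supported(system, os_id, version, is_64bit):
    """Apply the documented minimums: macOS 12.5+ or Ubuntu 20.04+, 64-bit only."""
    if not is_64bit:
        return False
    if system == "Darwin":
        return parse_version(version) >= (12, 5)
    if system == "Linux" and os_id == "ubuntu":
        return parse_version(version) >= (20, 4)
    return False


def describe_host():
    """Collect (system, os_id, version, is_64bit) for the running interpreter."""
    system = platform.system()
    is_64bit = sys.maxsize > 2**32
    if system == "Darwin":
        return system, "macos", platform.mac_ver()[0], is_64bit
    if system == "Linux":
        try:
            # Reads /etc/os-release; only available on Python 3.10 or later.
            info = platform.freedesktop_os_release()
        except (OSError, AttributeError):
            info = {}
        return system, info.get("ID", ""), info.get("VERSION_ID", ""), is_64bit
    return system, "", "", is_64bit


print("Host matches the documented platforms:", is_supported(*describe_host()))
```

Keeping is_supported free of system calls makes the rule easy to check against known version strings, while describe_host isolates the platform-specific lookups.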

Compatibility

Each TFDV release is compatible with specific versions of its main dependencies, including Apache Beam, TensorFlow, and TensorFlow Metadata. The project publishes a compatibility table listing the versions that work together.
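
To compare an environment against the published compatibility information, it helps to list the versions that are actually installed. The following is a small sketch using only the Python standard library; the choice of distributions to report is an assumption and can be adjusted.

```python
from importlib import metadata

# PyPI distribution names to report; adjust this tuple as needed.
DISTRIBUTIONS = (
    "tensorflow-data-validation",
    "tensorflow",
    "tensorflow-metadata",
    "apache-beam",
    "pyarrow",
)


def installed_versions(names=DISTRIBUTIONS):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be checked by hand against the row for the installed TFDV release.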

In short, TensorFlow Data Validation covers data exploration and validation in a machine learning workflow, helping catch data quality problems from the outset, before they affect model training.