Petastorm - Efficient Data Access for Deep Learning

Introduction to Petastorm

Petastorm is an open-source data access library developed by Uber ATG. It enables single-machine and distributed training and evaluation of deep learning models directly from datasets stored in the Apache Parquet format. It supports Python-based frameworks such as TensorFlow, PyTorch, and PySpark, and it can also be used from pure Python code.

Installation

To get started with Petastorm, users can install it via pip:

pip install petastorm

Additionally, there are optional dependencies for frameworks like TensorFlow with GPU support, PyTorch, and more. For example, to include TensorFlow GPU support and OpenCV, one can install it as follows:

pip install "petastorm[opencv,tf_gpu]"

Generating a Dataset

Petastorm stores data in Apache Parquet and layers a higher-level schema, the Unischema, on top of it so that multidimensional arrays become a native part of the dataset. Each field declares a codec that controls how its values are encoded: standard image compression such as JPEG or PNG is available, and users can also define custom codecs. The following is an example of generating a dataset with PySpark:

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField

# The schema describes each field: name, dtype, shape, codec, and nullability.
HelloWorldSchema = Unischema('HelloWorldSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),
    UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),
])

def row_generator(x):
    """Return one dataset row filled with random values."""
    return {'id': x,
            'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),
            'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))}

def generate_petastorm_dataset(output_url='file:///tmp/hello_world_dataset'):
    rowgroup_size_mb = 256
    rows_count = 10

    spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()
    sc = spark.sparkContext

    # materialize_dataset configures Spark and writes the Petastorm-specific metadata.
    with materialize_dataset(spark, output_url, HelloWorldSchema, rowgroup_size_mb):
        rows_rdd = sc.parallelize(range(rows_count)) \
            .map(row_generator) \
            .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))
        spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \
            .coalesce(10) \
            .write \
            .mode('overwrite') \
            .parquet(output_url)
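
In the schema above, a None entry in a field's shape marks a dimension whose size may vary from row to row, which is why the (4, 128, 30, 3) array produced by row_generator fits the declared (None, 128, 30, None). The following is a minimal sketch of that matching rule in plain Python. The helper shape_matches is hypothetical and written only for illustration; it is not part of the Petastorm API.

```python
from typing import Optional, Sequence


def shape_matches(declared: Sequence[Optional[int]], actual: Sequence[int]) -> bool:
    """Return True if `actual` fits `declared`; a None entry accepts any size."""
    if len(declared) != len(actual):
        return False  # the number of dimensions must agree
    return all(d is None or d == a for d, a in zip(declared, actual))


declared = (None, 128, 30, None)  # shape of 'array_4d' in HelloWorldSchema
print(shape_matches(declared, (4, 128, 30, 3)))  # fixed dimensions agree
print(shape_matches(declared, (4, 64, 30, 3)))   # second dimension differs
```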

Reading a Dataset

Petastorm offers a straightforward API for reading datasets. The primary class is petastorm.reader.Reader, which is created through the petastorm.make_reader factory function. It supports column selection, shuffling, partitioning for multi-GPU training, and local caching. Here’s a quick example using the reader:

from petastorm import make_reader

with make_reader('hdfs://myhadoop/some_dataset') as reader:
    for row in reader:
        print(row)
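
For multi-GPU training, make_reader accepts cur_shard and shard_count arguments so that each worker reads a disjoint part of the dataset. The sketch below illustrates the idea with a simple round-robin split of row-group indexes. The function row_groups_for_shard is hypothetical, and the round-robin rule is an assumption made for illustration; the assignment Petastorm performs internally may differ.

```python
from typing import List


def row_groups_for_shard(num_row_groups: int, cur_shard: int, shard_count: int) -> List[int]:
    """Return the row-group indexes that one shard would read (round-robin)."""
    if not 0 <= cur_shard < shard_count:
        raise ValueError('cur_shard must be in the range [0, shard_count)')
    return [i for i in range(num_row_groups) if i % shard_count == cur_shard]


# Ten row groups split across four workers: every index lands in exactly one shard.
shards = [row_groups_for_shard(10, shard, 4) for shard in range(4)]
for shard, indexes in enumerate(shards):
    print(shard, indexes)
```

With a real dataset, each worker would pass its own rank and the total worker count, for example make_reader(dataset_url, cur_shard=rank, shard_count=world_size), where rank and world_size come from the training framework.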

Integration with TensorFlow and PyTorch

Petastorm seamlessly integrates with TensorFlow and PyTorch for model training:

TensorFlow: Connect Petastorm data to a TensorFlow graph using tf_tensors. Alternatively, make_petastorm_dataset from the same module wraps a reader in a tf.data.Dataset.

import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import tf_tensors

# tf.Session is the TensorFlow 1.x graph-mode API.
with make_reader('file:///some/localpath/a_dataset') as reader:
    row_tensors = tf_tensors(reader)
    with tf.Session() as session:
        for _ in range(3):
            print(session.run(row_tensors))

PyTorch: Use petastorm.pytorch.DataLoader for efficient data loading. The DataLoader supports custom collating functions and data transformations.

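import torch
from petastorm import make_reader
from petastorm.pytorch import DataLoader

# train, model, device, and optimizer are assumed to be defined elsewhere.
with DataLoader(make_reader('file:///localpath/mnist/train', num_epochs=10)) as train_loader:
    train(model, device, train_loader, 10, optimizer, 1)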
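
A custom collating function decides how individual rows are merged into one batch. The sketch below assumes the function receives a list of row dictionaries and returns a dictionary of per-field lists. The name collate_rows is hypothetical, and a real collating function would typically stack the values into tensors rather than keep plain lists.

```python
from typing import Any, Dict, List


def collate_rows(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Merge a list of row dicts into a single dict mapping field name to values."""
    if not rows:
        return {}
    # Field names are taken from the first row; every row is assumed to share them.
    return {key: [row[key] for row in rows] for key in rows[0]}


batch = collate_rows([
    {'id': 0, 'label': 7},
    {'id': 1, 'label': 2},
])
print(batch)
```

Such a function would be passed to the loader through its collate_fn argument, for example DataLoader(reader, batch_size=2, collate_fn=collate_rows).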

Spark Dataset Converter API

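This API simplifies converting Spark DataFrames to TensorFlow or PyTorch datasets. The converter first materializes the DataFrame as Parquet files in a cache directory, which must be configured before a converter is created, and then exposes those files as a dataset. Here’s a brief example:

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Assumes an active SparkSession named `spark`. The converter needs a cache
# directory for the intermediate Parquet files; this path is only an example.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')

df = ...  # Spark DataFrame
converter = make_spark_converter(df)

# `model` is a Keras model defined elsewhere.
with converter.make_tf_dataset() as dataset:
    model.fit(dataset)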

Analyzing Datasets with PySpark

Petastorm datasets can be read into Spark DataFrames for analysis:

dataframe = spark.read.parquet(dataset_url)
dataframe.printSchema()
dataframe.count()
dataframe.select('id').show()

Reading Non-Petastorm Parquet Stores

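Petastorm can also read Parquet stores that were not created with Petastorm, using make_batch_reader. Unlike make_reader, which yields one row at a time, make_batch_reader yields batches of rows, with each field holding an array of values.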
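
The sketch below shows one way to consume such batches. It assumes each batch is a namedtuple whose fields hold equal-length sequences of column values. The helpers batch_to_rows and print_rows are hypothetical names introduced here, and petastorm is imported lazily so that the demonstration at the bottom, which uses a stand-in batch, runs without a dataset.

```python
from collections import namedtuple
from typing import Any, Dict, Iterator


def batch_to_rows(batch) -> Iterator[Dict[str, Any]]:
    """Yield one dict per row from a namedtuple of equal-length columns."""
    columns = batch._asdict()
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


def print_rows(dataset_url: str) -> None:
    """Read a plain Parquet store batch by batch and print every row."""
    from petastorm import make_batch_reader  # lazy import: needs petastorm installed

    with make_batch_reader(dataset_url) as reader:
        for batch in reader:
            for row in batch_to_rows(batch):
                print(row)


# Stand-in batch with two columns, used only to demonstrate batch_to_rows.
Batch = namedtuple('Batch', ['id', 'value'])
rows = list(batch_to_rows(Batch(id=[0, 1, 2], value=[0.5, 1.5, 2.5])))
print(rows)
```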

Conclusion

Petastorm lets deep learning workflows train directly from large Parquet datasets, on a single machine or in a distributed setting. Its readers for TensorFlow, PyTorch, and PySpark, together with support for plain Parquet stores, make it a practical choice for data scientists and engineers who need scalable data access for model training.