Introduction to Petastorm
Petastorm is an open-source data access library developed by Uber ATG, designed to facilitate both single-machine and distributed training of deep learning models using datasets stored in the Apache Parquet format. It's particularly useful for Python-based machine learning frameworks like TensorFlow, PyTorch, and PySpark, although it can also be used in pure Python projects.
Installation
To get started with Petastorm, users can install it via pip:
pip install petastorm
Additionally, there are optional dependencies for frameworks like TensorFlow with GPU support, PyTorch, and more. For example, to include TensorFlow GPU support and OpenCV, one can install it as follows:
pip install petastorm[opencv,tf_gpu]
Generating a Dataset
Creating datasets with Petastorm involves using Apache Parquet as the storage format together with a higher-level schema (Unischema) that adds support for multidimensional arrays. Users can define custom data compression codecs or use standard ones like JPEG or PNG. The following is an example of generating a dataset with PySpark:
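import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

from petastorm.codecs import CompressedImageCodec, NdarrayCodec, ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# The schema describes each field: name, dtype, shape, codec, nullable.
HelloWorldSchema = Unischema('HelloWorldSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),
    UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),
])

def row_generator(x):
    # Return one dataset row filled with random values.
    return {'id': x,
            'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),
            'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))}

def generate_petastorm_dataset(output_url='file:///tmp/hello_world_dataset'):
    # Example values; tune them for a real dataset.
    rowgroup_size_mb = 256
    rows_count = 10
    spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()
    sc = spark.sparkContext

    # materialize_dataset configures Spark and writes Petastorm-specific metadata on exit.
    with materialize_dataset(spark, output_url, HelloWorldSchema, rowgroup_size_mb):
        rows_rdd = sc.parallelize(range(rows_count)) \
            .map(row_generator) \
            .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))
        spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \
            .coalesce(10) \
            .write \
            .mode('overwrite') \
            .parquet(output_url)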
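The None entries in the array_4d shape mark dimensions whose size is not fixed by the schema, while the other entries must match exactly. The following is a minimal sketch of that rule in plain Python. The shape_matches helper is hypothetical and written only for illustration; it is not part of Petastorm and does not show how Petastorm validates rows internally.

```python
def shape_matches(declared, actual):
    """Return True if `actual` fits `declared`, where None matches any size."""
    if len(declared) != len(actual):
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


# Shape declared for 'array_4d' in HelloWorldSchema.
declared_shape = (None, 128, 30, None)

print(shape_matches(declared_shape, (4, 128, 30, 3)))  # True: the shape row_generator produces
print(shape_matches(declared_shape, (7, 128, 30, 1)))  # True: free dimensions may differ
print(shape_matches(declared_shape, (4, 64, 30, 3)))   # False: fixed dimension 128 is violated
```

A row whose array has a different number of dimensions is rejected as well, since the declared shape fixes the rank even when individual sizes are left open.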
Reading a Dataset
Petastorm offers a straightforward API for reading datasets. The primary class is petastorm.reader.Reader, which is created with the petastorm.make_reader factory method and supports features including column selection, shuffling, partitioning for multi-GPU training, and local caching. Here’s a quick example using the reader:
from petastorm import make_reader

with make_reader('hdfs://myhadoop/some_dataset') as reader:
    for row in reader:
        print(row)
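Partitioning for multi-GPU training means that each training process reads only its own slice of the dataset. The idea can be sketched in plain Python as follows. The argument names mirror the cur_shard and shard_count parameters of make_reader, but the shard_row_groups helper and its round-robin assignment are assumptions made for illustration; Petastorm's own assignment of row groups to shards may differ.

```python
def shard_row_groups(row_groups, cur_shard, shard_count):
    """Return the row groups assigned to one shard (round-robin, for illustration)."""
    if not 0 <= cur_shard < shard_count:
        raise ValueError('cur_shard must be in [0, shard_count)')
    return row_groups[cur_shard::shard_count]


# Integers stand in for the Parquet row groups of a dataset.
row_groups = list(range(10))

# One slice per training process, here for three processes.
shards = [shard_row_groups(row_groups, i, 3) for i in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Every row group lands in exactly one shard, so the processes together cover the dataset without reading the same data twice.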
Integration with TensorFlow and PyTorch
Petastorm integrates with TensorFlow and PyTorch for model training:
- TensorFlow: Connect Petastorm data to a TensorFlow graph using tf_tensors. Users can also integrate with the tf.data.Dataset API.
    import tensorflow as tf
    from petastorm import make_reader
    from petastorm.tf_utils import tf_tensors

    # tf.Session is the TensorFlow 1.x API; in TensorFlow 2.x it is available as tf.compat.v1.Session.
    with make_reader('file:///some/localpath/a_dataset') as reader:
        row_tensors = tf_tensors(reader)
        with tf.Session() as session:
            for _ in range(3):
                print(session.run(row_tensors))
- PyTorch: Use petastorm.pytorch.DataLoader for efficient data loading. The DataLoader supports custom collating functions and data transformations.
    import torch
    from petastorm import make_reader
    from petastorm.pytorch import DataLoader

    # model, device, optimizer, and train() are assumed to be defined elsewhere.
    with DataLoader(make_reader('file:///localpath/mnist/train', num_epochs=10)) as train_loader:
        train(model, device, train_loader, 10, optimizer, 1)
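A collating function turns a list of individual rows into a single batch. The sketch below shows the idea in plain Python, with dictionaries standing in for dataset rows. The collate_rows and batched helpers and the idx and digit field names are hypothetical, and a collating function used with a real DataLoader would typically return tensors rather than lists.

```python
def collate_rows(rows):
    """Merge a list of row dicts into one dict of lists, keyed by field name."""
    if not rows:
        return {}
    fields = rows[0].keys()
    return {name: [row[name] for row in rows] for name in fields}


def batched(rows, batch_size, collate_fn=collate_rows):
    """Yield collated batches of at most `batch_size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield collate_fn(batch)
            batch = []
    if batch:
        # The last batch may be smaller than batch_size.
        yield collate_fn(batch)


# Hypothetical rows with two scalar fields.
rows = [{'idx': i, 'digit': i % 10} for i in range(5)]
batches = list(batched(rows, batch_size=2))
print(batches[0])  # {'idx': [0, 1], 'digit': [0, 1]}
```

Passing a different collate_fn changes how rows are combined without touching the batching loop, which is the role a custom collating function plays in a data loader.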
Spark Dataset Converter API
This API simplifies converting Spark DataFrames to TensorFlow or PyTorch datasets by first saving them as Parquet files. Here’s a brief example:
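from petastorm.spark import SparkDatasetConverter, make_spark_converter

# The converter materializes the DataFrame under a cache directory, which must be set first.
# The path below is only an example.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///tmp/petastorm_cache')

df = ...  # Spark DataFrame
converter = make_spark_converter(df)  # saves df as Parquet files in the cache directory
with converter.make_tf_dataset() as dataset:
    model.fit(dataset)  # dataset is a tf.data.Dataset; model is assumed to be defined elsewhere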
Analyzing Datasets with PySpark
Petastorm datasets can be read into Spark DataFrames for analysis:
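# spark is an existing SparkSession; dataset_url points at the dataset, e.g. 'file:///tmp/hello_world_dataset'.
dataframe = spark.read.parquet(dataset_url)
dataframe.printSchema()  # show the schema
dataframe.count()  # count all rows
dataframe.select('id').show()  # show a single column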
Reading Non-Petastorm Parquet Stores
Petastorm also supports accessing data from non-Petastorm Parquet stores using make_batch_reader.
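Where make_reader yields one row at a time, make_batch_reader yields batches of rows in which each field holds a whole column of values. The following plain-Python sketch shows how such a batch relates to individual rows. The Batch and Row types, the iter_rows helper, and the id and value fields are hypothetical, and lists stand in for the NumPy arrays a real batch would carry.

```python
from collections import namedtuple

# A batch holds one sequence of values per field.
Batch = namedtuple('Batch', ['id', 'value'])


def iter_rows(batch):
    """Yield one namedtuple per row from a column-oriented batch."""
    Row = namedtuple('Row', batch._fields)
    for values in zip(*batch):
        yield Row(*values)


batch = Batch(id=[1, 2, 3], value=[0.5, 1.5, 2.5])
rows = list(iter_rows(batch))
print(rows[0])  # Row(id=1, value=0.5)
```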
Conclusion
Petastorm provides a framework for reading large Parquet datasets in deep learning workflows, on a single machine or in distributed training. Its integrations with TensorFlow, PyTorch, and PySpark make it a practical choice for data scientists and engineers who need scalable data loading.