Delta Sharing: An Overview
Delta Sharing is an open protocol for the secure, real-time exchange of large datasets. It lets organizations share data seamlessly, regardless of the computing platforms they use. Delta Sharing is built on a straightforward REST protocol that securely grants access to data stored in cloud systems such as S3, ADLS, or GCS, ensuring reliable data transfer.
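The wire format is simple enough to sketch. As a hedged illustration (the endpoint and token below are hypothetical placeholders that a recipient's profile file would normally supply), listing the shares available to a recipient is a single authenticated HTTPS GET; here the request object is only constructed, not sent:

```python
import urllib.request

# Hypothetical values for illustration; a real profile file
# supplies the actual endpoint and bearer token.
endpoint = "https://sharing.example.com/delta-sharing"
token = "<bearer-token>"

# The protocol's share-listing operation is a plain HTTPS GET,
# authorized by the bearer token. (Built here, not sent.)
request = urllib.request.Request(
    url=f"{endpoint}/shares",
    headers={"Authorization": f"Bearer {token}"},
)
```

A real client sends this request and walks the JSON response; the connectors described below wrap these calls so users never issue them by hand.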
Key Features of Delta Sharing
- Cross-Platform Compatibility: Delta Sharing enables users to connect directly to shared data through common tools and platforms like pandas, Tableau, Apache Spark, Rust, and others that support the open protocol, eliminating the need to set up specific computing infrastructure beforehand.
- Ease of Data Access: Data providers can share a single dataset with a wide array of consumers, allowing those consumers to quickly begin using the data without extensive setup.
Components Included
Delta Sharing encompasses various components to facilitate data sharing:
- Protocol Specification: Defines the rules and formats for secure data sharing.
- Python Connector: A library that allows Python users to read shared tables as pandas or Apache Spark DataFrames.
- Apache Spark Connector: Facilitates reading shared tables from a Delta Sharing Server, enabling access in SQL, Python, Java, Scala, or R.
- Delta Sharing Server: A reference server implementation for developing and testing the sharing of tables in Delta Lake and Apache Parquet format on cloud storage.
Python Connector
The Python Connector for Delta Sharing provides a library to access shared tables efficiently. This connector transforms tables into pandas DataFrames or Apache Spark DataFrames, enabling diverse data analysis capabilities. Installation is simple using pip, and access to shared data is managed through JSON profile files containing user credentials.
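Concretely, a profile file is a small JSON document. A minimal sketch with placeholder values (the field names follow the Delta Sharing profile format; a data provider supplies the real endpoint and token):

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>"
}
```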
Quick Start with Python
```python
import delta_sharing

# Point to the JSON profile file holding the sharing credentials.
profile_file = "<profile-file-path>"

# Initialize a SharingClient to discover what has been shared with you.
client = delta_sharing.SharingClient(profile_file)

# List every shared table across all shares and schemas.
tables = client.list_all_tables()

# A table URL is the profile path plus "#<share>.<schema>.<table>".
table_url = profile_file + "#<share-name>.<schema-name>.<table-name>"

# Load the shared table as a pandas DataFrame.
dataframe = delta_sharing.load_as_pandas(table_url)
```
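The `#`-separated table URL above can be taken apart mechanically. A small hypothetical helper (not part of the `delta_sharing` library) that splits such a URL into its profile path, share, schema, and table components:

```python
def parse_table_url(table_url: str):
    """Split '<profile-path>#<share>.<schema>.<table>' into its parts."""
    # Split on '#' first, since the profile path itself may contain dots.
    profile, _, table_ref = table_url.partition("#")
    share, schema, table = table_ref.split(".")
    return profile, share, schema, table
```

For example, `parse_table_url("config.share#sales.retail.orders")` yields `("config.share", "sales", "retail", "orders")`.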
Apache Spark Connector
The Apache Spark Connector enables accessing shared tables through Spark, supporting various programming languages. It integrates into Spark environments, allowing seamless data operations in SQL, Python, Scala, Java, or R. Setting up involves configuring Spark environments or adding dependencies to Maven or SBT projects.
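One common setup path is launching Spark with the connector package on the classpath; a hedged sketch (the Maven coordinates are `io.delta:delta-sharing-spark_2.12`, and the version is a placeholder to fill in from the project's releases):

```shell
# Start a Spark shell with the Delta Sharing connector available.
# Replace <version> with a released connector version.
spark-shell --packages io.delta:delta-sharing-spark_2.12:<version>
```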
Example in Scala
```scala
// Table path: profile file location plus "#<share>.<schema>.<table>".
val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"

// Read the shared table into a Spark DataFrame.
val dataframe = spark.read.format("deltaSharing").load(tablePath)
```
Community and Extensions
Delta Sharing is extended by a growing set of community-maintained connectors, such as those for Power BI, Clojure, and Node.js, which bring the open protocol to additional platforms and broaden its reach.
Setting up the Delta Sharing Reference Server
For testing and development purposes, the Delta Sharing Reference Server allows users to experiment with their connector implementations. Managed service options also exist, like those offered by Databricks, providing comprehensive solutions for data sharing.
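To give a flavor of the setup, the reference server reads a YAML config declaring which tables are shared. A hedged sketch with placeholder values (the field names are assumptions based on the reference server's config format; consult its documentation for the authoritative schema):

```yaml
version: 1
shares:
- name: "my_share"
  schemas:
  - name: "my_schema"
    tables:
    - name: "my_table"
      location: "s3a://<bucket>/<path-to-delta-table>"
authorization:
  bearerToken: "<token>"
```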
By embracing open protocols like Delta Sharing, organizations can enhance their data integration capabilities, promoting secure, timely, and scalable data exchanges across diverse environments.