Introduction to DataComPy
DataComPy is a Python package for comparing two data tables, represented as DataFrames, built on the Pandas library. Originally conceived as an enhanced alternative to SAS's PROC COMPARE, it goes beyond the basic pandas.DataFrame.equals() method by reporting statistical differences and letting users tune matching tolerance. Support has since been extended to Spark DataFrames, making it a versatile tool for data analysis.
Installation Process
Getting started with DataComPy is easy. Users can install it with either pip or conda:
pip install datacompy
or
conda install datacompy
For those interested in additional functionalities or working with other data processing backends, DataComPy offers extra installation options:
pip install datacompy[spark]
pip install datacompy[dask]
pip install datacompy[duckdb]
pip install datacompy[ray]
pip install datacompy[snowflake]
Spark Implementation Notice
With the release of version v0.12.0, DataComPy shifted from the original SparkCompare implementation to a more efficient Pandas-on-Spark model, aiming to align APIs and improve consistency. For those still using SparkCompare, it remains accessible under a new LegacySparkCompare module:
from datacompy.spark.legacy import LegacySparkCompare
A subsequent release, v0.13.0, introduced SparkSQLCompare, a new class designed to handle pyspark.sql.DataFrame objects with better performance. The earlier Pandas-on-Spark implementation is now known as SparkPandasCompare.
As of v0.14.1, SparkPandasCompare is slated for deprecation. Users are encouraged to transition to SparkSQLCompare, which offers better performance, though they should note the compatibility caveats associated with numpy 2+.
Compatibility Matrix
The support matrix for various versions of Spark, Pandas, and Python is an important reference to ensure compatibility when using DataComPy:
| | Spark 3.2.4 | Spark 3.3.4 | Spark 3.4.2 | Spark 3.5.1 |
|---|---|---|---|---|
| Python 3.9 | ✅ | ✅ | ✅ | ✅ |
| Python 3.10 | ✅ | ✅ | ✅ | ✅ |
| Python 3.11 | ❌ | ❌ | ✅ | ✅ |
| Python 3.12 | ❌ | ❌ | ❌ | ❌ |
| | Pandas < 1.5.3 | Pandas >= 2.0.0 |
|---|---|---|
| Compare | ✅ | ✅ |
| SparkPandasCompare | ✅ | ❌ |
| SparkSQLCompare | ✅ | ✅ |
| Fugue | ✅ | ✅ |
Python 3.12 is notably unsupported by the Spark functionality and the Fugue framework; the Pandas and Polars backends, however, are supported and thoroughly tested on it.
Supported Backends
DataComPy seamlessly integrates with several data processing backends, offering broad applicability:
- Pandas: Suitable for local data manipulation and analysis.
- Spark: Handles large-scale data processing on distributed systems.
- Polars: Known for its speed and low memory usage.
- Snowflake/Snowpark: Enables data processing within Snowflake environments.
- Fugue: Provides a consistent interface for data processing across various backends including Pandas, DuckDB, Polars, Arrow, Spark, Dask, and Ray.
Community and Contributions
DataComPy thrives on community contributions. Potential contributors should sign the Contributor License Agreement before participating. The project adheres to the Open Source Code of Conduct, requiring all participants to respect these guidelines.
Looking Ahead
The DataComPy team continuously seeks to improve and expand its functionality. Users can consult the project's roadmap for planned enhancements.
In conclusion, DataComPy provides a comprehensive solution for data comparison, catering to a wide array of use cases across different data environments. Whether for straightforward Pandas operations or complex Spark analyses, DataComPy offers a robust toolkit for data professionals.