Introduction to DataComPy
DataComPy is a Python package for comparing two data tables, represented as DataFrames, built on the Pandas library. Originally conceived as an enhanced alternative to SAS's PROC COMPARE, it goes beyond the basic pandas.DataFrame.equals() method by reporting statistical differences and letting users tune matching tolerance. Support has since been extended to Spark DataFrames, making it a versatile tool for data analysis.
Installation Process
Getting started with DataComPy is easy. Users can install it with either pip or conda:
pip install datacompy
or
conda install datacompy
For those interested in additional functionalities or working with other data processing backends, DataComPy offers extra installation options:
pip install datacompy[spark]
pip install datacompy[dask]
pip install datacompy[duckdb]
pip install datacompy[ray]
pip install datacompy[snowflake]
Spark Implementation Notice
With the release of version v0.12.0, DataComPy shifted from the original SparkCompare implementation to a more efficient Pandas-on-Spark model, aiming to align APIs and improve consistency. For those still using SparkCompare, it remains accessible under a new LegacySparkCompare module:
from datacompy.spark.legacy import LegacySparkCompare
A subsequent release, v0.13.0, introduced SparkSQLCompare, a new class designed to handle pyspark.sql.DataFrame objects with better performance. The earlier Pandas-on-Spark implementation is now known as SparkPandasCompare.
As of v0.14.1, SparkPandasCompare is slated for deprecation. Users are encouraged to transition to SparkSQLCompare, which offers better performance, though they should note the compatibility caveats associated with numpy 2+.
Compatibility Matrix
The support matrix for various versions of Spark, Pandas, and Python is an important reference to ensure compatibility when using DataComPy:
| | Spark 3.2.4 | Spark 3.3.4 | Spark 3.4.2 | Spark 3.5.1 |
|---|---|---|---|---|
| Python 3.9 | ✅ | ✅ | ✅ | ✅ |
| Python 3.10 | ✅ | ✅ | ✅ | ✅ |
| Python 3.11 | ❌ | ❌ | ✅ | ✅ |
| Python 3.12 | ❌ | ❌ | ❌ | ❌ |
| | Pandas < 1.5.3 | Pandas >= 2.0.0 |
|---|---|---|
| Compare | ✅ | ✅ |
| SparkPandasCompare | ✅ | ❌ |
| SparkSQLCompare | ✅ | ✅ |
| Fugue | ✅ | ✅ |
Python 3.12 is notably unsupported by the Spark functionality and the Fugue framework; the Pandas and Polars backends, however, are supported and thoroughly tested on it.
Supported Backends
DataComPy seamlessly integrates with several data processing backends, offering broad applicability:
- Pandas: Suitable for local data manipulation and analysis.
- Spark: Handles large-scale data processing on distributed systems.
- Polars: Known for its speed and low memory usage.
- Snowflake/Snowpark: Enables data processing within Snowflake environments.
- Fugue: Provides a consistent interface for data processing across various backends including Pandas, DuckDB, Polars, Arrow, Spark, Dask, and Ray.
Community and Contributions
DataComPy thrives on community contributions. Potential contributors should sign the Contributor License Agreement before participating. The project adheres to the Open Source Code of Conduct, requiring all participants to respect these guidelines.
Looking Ahead
The DataComPy team continuously seeks to improve and expand its functionality. Users can consult the project's roadmap for planned enhancements.
In conclusion, DataComPy provides a comprehensive solution for data comparison, catering to a wide array of use cases across different data environments. Whether for straightforward Pandas operations or complex Spark analyses, DataComPy offers a robust toolkit for data professionals.