Introduction to the Text-Dedup Project
Text-dedup is a project that provides a suite of tools and scripts for text deduplication. As datasets grow into the terabyte range, ensuring uniqueness in the data is crucial for maintaining data quality and integrity. Text-dedup offers a range of techniques to identify and remove duplicate text entries effectively.
Installation
To start using text-dedup, install it via pip:
pip install text-dedup
Alternatively, it can be installed directly from the GitHub repository:
pip install git+https://github.com/ChenghaoMou/text-dedup
Features
Text-dedup is equipped with a variety of deduplication methods:
- RETSim/UniSim: An embedding-based method for near-deduplication (work in progress).
- MinHash + MinHashLSH: Includes a Spark implementation ideal for handling very large datasets.
- SimHash: Available in both 64- and 128-bit versions.
- SuffixArray Substring: Used for substring-based deduplication.
- Bloom Filter: A probabilistic method for detecting duplicates.
- Exact Hash: Can be applied at the document or line level.
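To illustrate the simplest of these methods, here is a minimal sketch of document-level exact-hash deduplication. This is not text-dedup's own implementation; the function and variable names are illustrative only.

```python
import hashlib

def exact_hash_dedup(docs):
    """Keep only the first occurrence of each document, keyed by its MD5 digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["hello world", "foo bar", "hello world"]
print(exact_hash_dedup(docs))  # ['hello world', 'foo bar']
```

Hashing each document instead of storing it whole keeps memory bounded by the number of unique documents; line-level deduplication would apply the same idea to each line rather than each document.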
Future Directions
The project has several ambitions for the future:
- Conducting memory benchmarks for streaming processing.
- Developing strategies for inter-dataset deduplication.
- Rewriting the suffix array in Python.
- Exploring additional deduplication methods such as SuperMinHash and ProbMinHash.
Acknowledgements
The text-dedup project draws inspiration from numerous sources:
- Influences from projects like Datasketch and simhash-py.
- Learnings from contributions to BigScience and BigCode communities.
- Publications such as "Deduplicating Training Data Makes Language Models Better."
A blog post detailing the journey and learnings from the project is available, inviting feedback from users and contributors.
Usage Examples
Native PySpark
The PySpark implementation allows for efficient processing of large datasets. Before running it, adjust the script and the spark-submit parameters below to suit your cluster and project:
spark-submit --executor-memory 16g \
--driver-memory 20g \
--executor-cores 3 \
--num-executors 2 \
--packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
text_dedup/minhash_spark.py \
--input "./temp-data" \
--output "./temp-output" \
--column "text" \
--threshold 0.7
Other Methods
Examples of other text deduplication methods include:
- UniSim: Utilizes Google's RETSim model for embedding-based deduplication.
- Suffix Array Substring Deduplication: Finds exact duplicate substrings in a given dataset.
- MinHash/SimHash: Detects near duplicates using MinHash and SimHash algorithms.
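As a rough illustration of how MinHash estimates similarity between two documents, here is a toy sketch. It is not the library's implementation, and the names are illustrative; real implementations use faster hash families and LSH banding rather than full signature comparison.

```python
import hashlib

def minhash_signature(tokens, num_perm=64):
    """One minimum per salted hash function; the vector of minima is the signature."""
    signature = []
    for i in range(num_perm):
        salt = str(i).encode("utf-8")
        signature.append(min(
            hashlib.sha1(salt + t.encode("utf-8")).hexdigest() for t in tokens
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox".split())
b = minhash_signature("the quick brown dog".split())
print(estimated_jaccard(a, b))  # a value between 0 and 1
```

Because each slot matches with probability equal to the true Jaccard similarity of the token sets, averaging over many slots gives an unbiased estimate, which is what makes MinHash suitable for near-duplicate detection at scale.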
Benchmarks
Benchmarks demonstrate the effectiveness of the algorithms in text-dedup. MinHash and its variants, for instance, showed a strong balance of precision and recall in identifying duplicates. The project was evaluated on datasets such as pinecone/core-2020-05-10 and NEWS-COPY.
License
The project operates under the Apache 2.0 license, permitting a wide range of applications and modifications, though users must also comply with the license terms when distributing modified versions of the project.
Citations
If you wish to reference text-dedup in your work, please use the following citation format:
@software{chenghao_mou_2023_8364980,
author = {Chenghao Mou and Chris Ha and Kenneth Enevoldsen and Peiyuan Liu},
title = {ChenghaoMou/text-dedup: Reference Snapshot},
month = sep,
year = 2023,
publisher = {Zenodo},
version = {2023.09.20},
doi = {10.5281/zenodo.8364980},
url = {https://doi.org/10.5281/zenodo.8364980}
}
Text-dedup serves as a versatile toolkit for handling and maintaining text data integrity, especially in large datasets. Whether you are a researcher, a data engineer, or a developer, text-dedup offers valuable resources and methods for efficient text deduplication.