text-dedup
This project is a collection of scripts for text deduplication. It covers near-duplicate detection with MinHash and SimHash, substring-level deduplication with suffix arrays (SuffixArray Substring), embedding-based detection, and exact hashing, and it is aimed at large datasets. It suits developers who prefer customizable, script-based solutions over general-purpose libraries, since each script can be tailored to the deduplication task at hand. The project includes detailed documentation and implementation examples, and serves as a resource for improving data quality by reducing redundancy.
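To make the MinHash idea concrete, here is a minimal, standard-library-only sketch of how near-duplicate similarity can be estimated. It is an illustration, not this project's implementation: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the word 3-gram shingling, and the choice of 128 hash functions are all assumptions made for the example.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=128, n=3):
    """For each of `num_perm` hash functions, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    return [
        min((_hash(g, seed) for g in grams), default=2**64 - 1)
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated sentence about compiling kernels on a laptop"

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)

print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated texts score low
```

In a real pipeline, documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. Comparing every pair of signatures does not scale to large datasets, which is why MinHash is usually paired with locality-sensitive hashing to find candidate pairs.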