Introduction to the Similarities Project
The Similarities project is a toolkit for similarity calculation and semantic search over text and images. It integrates a range of algorithms and supports text-to-text, text-to-image, and image-to-image search over large datasets. It is written in Python and can be installed with pip.
Features
Text Similarity Calculation and Search
- Semantic Matching Models: Similarities builds on the text2vec framework and implements the CoSENT model for text similarity calculation and search. It provides pre-trained models for multiple languages, including SentenceBERT, and offers several similarity measures: cosine similarity, dot product, Hamming distance, and Euclidean distance. It handles search across millions of entries and includes command-line tools for converting text to vectors, building indexes, and batch searching.
- Literal Matching Models: This component includes models such as Word2Vec, BM25, and TF-IDF for literal text matching. These models are useful for tasks such as cold-start matching in text search applications.
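The four similarity measures listed under Semantic Matching Models can be sketched in plain Python. This is an illustrative, standard-library-only sketch of the underlying math, not the Similarities API; the function names below are hypothetical, and the vectors stand in for embeddings that a model would normally produce.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (lower means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

Cosine similarity and dot product are similarity scores (higher means more alike), while Hamming and Euclidean are distances (lower means more alike), so a search routine has to sort in the right direction for the measure it uses.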
Image and Text-Image Similarity Calculation and Search
- CLIP Model: The project implements the CLIP (Contrastive Language-Image Pre-Training) model for image-text matching, image search, and zero-shot image classification. It supports both OpenAI's CLIP and Chinese-CLIP models, and retrieval can run with GPU acceleration. Supported operations include extracting image and text embeddings, building indexes, and running batch searches.
- Image Feature Extraction: Using algorithms such as pHash and SIFT, the project extracts image features for similarity computation and search.
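To give a feel for hash-based image features, here is a minimal sketch of an average hash (aHash), a simpler relative of pHash. It is not the project's implementation: real pHash applies a DCT before thresholding, and the 4x4 grayscale grids below are toy stand-ins for images that would normally be loaded and resized first. All names are hypothetical.

```python
def average_hash(pixels):
    """Return a bit list: 1 where a pixel is brighter than the image mean.

    `pixels` is a 2D list of grayscale values, assumed already resized
    to a small fixed grid.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hash_distance(hash_a, hash_b):
    """Hamming distance between two hashes; 0 means identical hashes."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Toy images: a dark-to-bright pattern, a uniformly brightened copy,
# and an inverted copy.
image_a = [[10, 20, 200, 210], [15, 25, 205, 215]] * 2
image_b = [[v + 10 for v in row] for row in image_a]
image_c = [[255 - v for v in row] for row in image_a]

same = hash_distance(average_hash(image_a), average_hash(image_b))       # 0
different = hash_distance(average_hash(image_a), average_hash(image_c))  # 16
```

Because each bit depends only on whether a pixel is above the mean, a uniform brightness change leaves the hash untouched, while inverting the image flips every bit.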
Usage
Installation
Both PyTorch and the Similarities library can be installed with pip. Users can also clone the repository from GitHub and install from source.
Examples
The project repository offers multiple examples that demonstrate how to use the tool for tasks such as:
- Text Vector Similarity Calculation: Demonstrates usage of models to compute similarity scores between texts.
- Semantic Search: Shows how to find the most relevant text matches using semantic search techniques.
- Literal Text Search: Shows how to use literal matching algorithms for similarity computation.
- Image Search: Shows how image embeddings and features can be used to search and classify images.
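As an illustration of the literal text search item above, the following is a minimal sketch of Okapi BM25 ranking using only the standard library. It is not the repository's example code: the `BM25` class, whitespace tokenizer, and parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems vary)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        if not corpus:
            raise ValueError("corpus must not be empty")
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(doc) for doc in corpus]
        self.doc_len = [len(doc) for doc in self.docs]
        self.avg_len = sum(self.doc_len) / len(self.docs)
        self.term_freqs = [Counter(doc) for doc in self.docs]
        # Document frequency: in how many documents each term appears.
        doc_freq = Counter()
        for doc in self.docs:
            doc_freq.update(set(doc))
        n = len(self.docs)
        self.idf = {
            term: math.log(1 + (n - freq + 0.5) / (freq + 0.5))
            for term, freq in doc_freq.items()
        }

    def score(self, query, index):
        """BM25 score of the document at `index` for `query`."""
        tf = self.term_freqs[index]
        length_norm = self.k1 * (
            1 - self.b + self.b * self.doc_len[index] / self.avg_len
        )
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                f = tf[term]
                total += self.idf[term] * f * (self.k1 + 1) / (f + length_norm)
        return total

    def search(self, query, topn=3):
        """Return (document index, score) pairs, best match first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.docs))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:topn]


corpus = [
    "how to change the bank card",
    "the weather is nice today",
    "bank card replacement steps",
]
results = BM25(corpus).search("bank card")
# The two documents containing "bank card" rank above the weather sentence.
```

Literal scorers like this need no trained model, which is why they suit cold-start matching; the trade-off is that they miss paraphrases that share no words with the query.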
Advanced Features
- Clustering: The toolkit supports clustering of large datasets to identify groups of similar sentences, which helps with search and categorization.
- Semantic Deduplication: Using paraphrase mining algorithms, Similarities can identify duplicate or semantically equivalent entries, which helps reduce redundancy.
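The idea behind semantic deduplication can be sketched as a greedy threshold filter over embedding vectors. This is a simplified stand-in, not the paraphrase mining algorithm the project uses: the `deduplicate` function, the 0.9 threshold, and the toy 2D vectors are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(embeddings, threshold=0.9):
    """Greedily keep entries that are not too similar to any kept entry.

    Returns (kept, duplicates): `kept` is a list of retained indices and
    `duplicates` maps each dropped index to the kept index it matched.
    """
    kept = []
    duplicates = {}
    for i, vector in enumerate(embeddings):
        match = next(
            (k for k in kept if cosine(vector, embeddings[k]) >= threshold),
            None,
        )
        if match is None:
            kept.append(i)
        else:
            duplicates[i] = match
    return kept, duplicates


# Toy embeddings: entries 0 and 1 point almost the same way, as do 2 and 3.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 0.98]]
kept, duplicates = deduplicate(vectors)
# kept == [0, 2]; duplicates == {1: 0, 3: 2}
```

This greedy pass compares each entry against every kept entry, so it is quadratic in the worst case; handling millions of entries would need an index or approximate nearest-neighbor search.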
Command Line Interface (CLI)
The project includes a CLI for batch vector embedding, indexing, and filtering. The CLI can also be used to deploy embedding and search services on a server.
Contact and Contribution
The project is open to contributions and welcomes suggestions via GitHub issues. Contributors are encouraged to add unit tests for any new features and to make sure their changes pass all existing tests. Contact details, including an email address and a WeChat account, are provided for further help and for joining the developer community.
Licensing and Citation
The project is released under the Apache License 2.0, which permits free commercial use. Users are encouraged to cite the project in their research as specified in the project documentation.
The Similarities project gives developers a way to add similarity calculation and semantic search to their software, with tools for large-scale text and image retrieval.