Fast Vector Similarity Library
Introduction
The Fast Vector Similarity Library is a high-performance tool crafted in Rust, designed to efficiently calculate similarity measures between vectors. This library is especially useful for tasks in data analysis, machine learning, and statistics where comparing vectors is key. It includes advanced similarity measures, performance enhancements, and is compatible with Python for seamless use in Python workflows.
Features
Similarity Measures
The library supports various traditional and contemporary similarity measures, such as:
- Spearman's Rank-Order Correlation (spearman_rho)
- Kendall's Tau Rank Correlation (kendall_tau): Optimized for faster processing with large datasets.
- Approximate Distance Correlation (approximate_distance_correlation): Vectorized for increased speed and accuracy.
- Jensen-Shannon Dependency Measure (jensen_shannon_dependency_measure): Revised to better measure dependencies.
- Hoeffding's D Measure (hoeffding_d)
- Normalized Mutual Information (normalized_mutual_information): Recently added for analyzing dependencies between variables.
Bootstrapping Technique
The library incorporates a powerful bootstrapping feature that estimates the distribution of similarity measures. This technique enhances confidence in results by repeatedly resampling the dataset.
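The resampling idea can be sketched in plain Python. This is an illustration under stated assumptions, not the library's Rust implementation: the function names are hypothetical, Pearson correlation stands in for whichever similarity measure is being bootstrapped, and pairs are assumed to be drawn with replacement. The parameter names sample_size and number_of_bootstraps mirror those passed to the Python bindings in the example further down.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation; a stand-in for any similarity measure."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_distribution(x, y, measure, sample_size, number_of_bootstraps, seed=0):
    """Evaluate `measure` on paired resamples (assumed: drawn with replacement)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(x)
    estimates = []
    for _ in range(number_of_bootstraps):
        # Resample indices so that each x value stays paired with its y value.
        idx = [rng.randrange(n) for _ in range(sample_size)]
        estimates.append(measure([x[i] for i in idx], [y[i] for i in idx]))
    return estimates
```

The returned list approximates the sampling distribution of the measure; a summary such as statistics.median(estimates) can then be reported in place of a single point estimate.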
Performance Optimizations
Several enhancements ensure optimal efficiency:
- Parallel Processing: Uses Rust's rayon crate for parallel computing, enabling operations to scale with the number of CPU cores.
- Efficient Algorithms: Implements merge sort for inversion counting, boosting the speed of measures like Kendall's Tau.
- Vectorized Operations: Many functions use vectorized operations backed by the ndarray crate, maximizing performance in Rust.
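To show why merge sort helps with Kendall's Tau, here is a minimal Python sketch of inversion counting. It is not the library's Rust code: the function names count_inversions and kendall_tau_no_ties are hypothetical, and the sketch assumes the inputs contain no tied values (the tau-a variant). Once the pairs are ordered by x, every discordant pair shows up as an inversion in y, so the count takes O(n log n) time instead of the O(n^2) needed to compare all pairs.

```python
def count_inversions(values):
    """Return (sorted_copy, inversion_count) using a top-down merge sort."""
    n = len(values)
    if n <= 1:
        return list(values), 0
    mid = n // 2
    left, left_inv = count_inversions(values[:mid])
    right, right_inv = count_inversions(values[mid:])
    merged = []
    inversions = left_inv + right_inv
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every remaining element of left,
            # so it forms one inversion with each of them.
            merged.append(right[j])
            inversions += len(left) - i
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def kendall_tau_no_ties(x, y):
    """Kendall's tau-a for paired samples, assuming no tied values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    # Order y by ascending x; discordant pairs are then inversions in y.
    y_by_x = [yi for _, yi in sorted(zip(x, y))]
    _, discordant = count_inversions(y_by_x)
    total_pairs = n * (n - 1) // 2
    # concordant - discordant = total_pairs - 2 * discordant when there are no ties
    return (total_pairs - 2 * discordant) / total_pairs
```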
Benchmarking and Verification
The library includes a benchmarking suite that checks the accuracy of numerical results while measuring performance improvements. This helps confirm that speed optimizations leave results unchanged, except where a change was intentional, as with the revised Jensen-Shannon measure.
Python Bindings
The library provides Python bindings, making its core functionality easily accessible in Python environments. Python users can use two main functions:
- py_compute_vector_similarity_stats: For computing various vector similarity measures.
- py_compute_bootstrapped_similarity_stats: For conducting bootstrapped similarity calculations.
The results are returned in JSON format, simplifying integration into Python workflows.
Installation
Rust
To use the library in Rust, add it to your project's Cargo.toml.
Python
For Python, the library can be installed through PyPI with:
pip install fast_vector_similarity
Usage with Text Embedding Vectors from LLMs
The library works effectively with modern language models such as Llama2, facilitating the analysis of text embeddings. It can process high-dimensional embeddings, such as 4096-dimensional vectors, and integrates with services like the Llama2 Embeddings FastAPI Service.
Example Workflow
- Load Embeddings into a DataFrame: Convert text embeddings from JSON into a Pandas DataFrame.
- Compute Similarities: Utilize the Fast Vector Similarity Library to compute similarity measures between embeddings with optimized functions.
- Analyze Results: Obtain a ranked list of the most similar vectors using measures like Hoeffding's D.
Example Python Code
Here's a Python example demonstrating how to use the library with large embedding vectors:
# Sample Python code leveraging the library
import time
import numpy as np
import json
import pandas as pd
import fast_vector_similarity as fvs
from random import choice


# Function to convert JSON embeddings to a Pandas DataFrame
def convert_embedding_json_to_pandas_df(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    texts = [item['text'] for item in data]
    embeddings = [item['embedding'] for item in data]
    df = pd.DataFrame(embeddings, index=texts)
    return df


# Function to apply vector similarity using the library
def apply_fvs_to_vector(row_embedding, query_embedding):
    params = {
        "vector_1": query_embedding.tolist(),
        "vector_2": row_embedding.tolist(),
        "similarity_measure": "all"
    }
    similarity_stats_str = fvs.py_compute_vector_similarity_stats(json.dumps(params))
    return json.loads(similarity_stats_str)


# Main function to demonstrate exact and bootstrapped similarity calculations
def main():
    # Generate test vectors
    length_of_test_vectors = 15000
    vector_1 = np.linspace(0., length_of_test_vectors - 1, length_of_test_vectors)
    vector_2 = vector_1 ** 0.2 + np.random.rand(length_of_test_vectors)

    # Parameters for exact similarity calculations
    similarity_measure = "all"
    params = {
        "vector_1": vector_1.tolist(),
        "vector_2": vector_2.tolist(),
        "similarity_measure": similarity_measure
    }

    # Time and compute exact similarity measures
    start_time_exact = time.perf_counter()
    similarity_stats_str = fvs.py_compute_vector_similarity_stats(json.dumps(params))
    similarity_stats_json = json.loads(similarity_stats_str)
    elapsed_time_exact = time.perf_counter() - start_time_exact
    print(f"Exact similarity measures computed in {elapsed_time_exact:.2f} seconds:")
    print(json.dumps(similarity_stats_json, indent=4))

    # Bootstrapped similarity calculations
    number_of_bootstraps = 2000
    sample_size = length_of_test_vectors // 15
    params_bootstrapped = {
        "x": vector_1.tolist(),
        "y": vector_2.tolist(),
        "sample_size": sample_size,
        "number_of_bootstraps": number_of_bootstraps,
        "similarity_measure": similarity_measure
    }
    start_time_bootstrapped = time.perf_counter()
    bootstrapped_similarity_stats_str = fvs.py_compute_bootstrapped_similarity_stats(json.dumps(params_bootstrapped))
    bootstrapped_similarity_stats_json = json.loads(bootstrapped_similarity_stats_str)
    elapsed_time_bootstrapped = time.perf_counter() - start_time_bootstrapped
    print(f"Bootstrapped similarity measures computed in {elapsed_time_bootstrapped:.2f} seconds:")
    print(json.dumps(bootstrapped_similarity_stats_json, indent=4))

    # Analyze using Llama2 embeddings
    input_file_path = "sample_input_files/Shakespeare_Sonnets_small.json"
    embeddings_df = convert_embedding_json_to_pandas_df(input_file_path)
    query_embedding_index = choice(embeddings_df.index)
    query_embedding = embeddings_df.loc[query_embedding_index]
    embeddings_df = embeddings_df.drop(index=query_embedding_index)
    json_outputs = embeddings_df.apply(lambda row: apply_fvs_to_vector(row, query_embedding), axis=1)
    # Pass a plain list of dicts so pandas builds one column per measure
    vector_similarity_results_df = pd.DataFrame.from_records(json_outputs.tolist())
    vector_similarity_results_df.index = embeddings_df.index

    # Sort by Hoeffding's D and output the top 10 most similar results
    vector_similarity_results_df = vector_similarity_results_df.sort_values(by="hoeffding_d", ascending=False)
    print("\nTop 10 most similar embedding results by Hoeffding's D:")
    print(vector_similarity_results_df.head(10))


if __name__ == "__main__":
    main()
Usage
In Rust
Rust projects can directly use core functions such as compute_vector_similarity_stats or compute_bootstrapped_similarity_stats for efficient computations.
In Python
After installing the Python package, users can compute vector similarity or perform bootstrapped analysis, as demonstrated in the example.
Detailed Overview of Similarity Measures
- Spearman's Rank-Order Correlation (spearman_rho): Captures monotonic relationships by ranking values and calculating Pearson correlation on these ranks.
- Kendall's Tau Rank Correlation (kendall_tau): Evaluates ordinal association through the relative ordering of data points, optimized with efficient sorting and counting techniques.
- Approximate Distance Correlation (approximate_distance_correlation): Detects dependencies, both linear and non-linear, by computing and comparing pairwise distance matrices.
- Jensen-Shannon Dependency Measure (jensen_shannon_dependency_measure): Assesses dependency based on the divergence of probability distributions, factoring in a baseline comparison.
- Hoeffding's D Measure (hoeffding_d): A non-parametric measure sensitive to complex relationships between variables.
- Normalized Mutual Information (normalized_mutual_information): Quantifies mutual dependence using entropy-based calculations, ideal for non-linear associations.
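As a concrete illustration of the first entry, Spearman's rho can be computed by ranking both vectors and taking the Pearson correlation of the ranks. The sketch below is plain Python written for this overview, not the library's implementation; the function names are hypothetical, and giving tied values the average of their positions is an assumed convention.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_from_ranks(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ranks enter the calculation, any strictly increasing transformation of either vector leaves the result unchanged, which is what makes the measure sensitive to monotonic rather than strictly linear relationships.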
Bootstrapping Technique for Robust Estimation
The library's bootstrapping technique makes similarity estimates more reliable by resampling the dataset many times, which improves robustness, reduces the effect of outliers, and allows model-free estimation.
Advantages of Bootstrapping
- Robust to Outliers: Mitigates outlier influence, delivering more reliable estimates.
- Model-Free: Applicable across various datasets without relying on specific assumptions.
- Confidence Intervals & Insights: Provides interpretability and deeper understanding of data relationships by offering confidence intervals based on resampled distributions.
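One common way to turn resampled estimates into a confidence interval is the percentile method, sketched below in plain Python. Whether the library reports intervals this way is not stated here, so treat this as an assumption; the function names quantile and percentile_interval are hypothetical, and estimates is taken to be a list of similarity values, one per bootstrap resample.

```python
def quantile(ordered, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def percentile_interval(estimates, confidence=0.95):
    """Percentile bootstrap interval: trim equal tails from the estimates."""
    if not estimates:
        raise ValueError("estimates must be non-empty")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    ordered = sorted(estimates)
    tail = (1.0 - confidence) / 2.0
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)
```

A narrow interval suggests the measured similarity is stable across resamples, while a wide one signals that the point estimate depends heavily on which observations happen to be included.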