Chroma Project Overview
Chroma is an open-source embedding database for quickly building Python or JavaScript applications that need memory, such as applications built on large language models (LLMs). It gives these applications a simple way to embed, store, and retrieve data.
Installation and Basic Usage
Chroma is easy to set up, catering to both Python and JavaScript developers. For Python users, installing Chroma is as simple as executing:
pip install chromadb
JavaScript developers can incorporate Chroma with:
npm install chromadb
Additionally, Chroma can be run in a client-server mode using:
chroma run --path /chroma_db_path
Core API
The core API of Chroma is small, consisting of just four main functions. You can try it in Google Colab or through a Replit template.
Here's a quick overview of its usage:
- Setup: First, initialize Chroma for in-memory use, which simplifies prototyping. Persistence can be added later.
import chromadb
client = chromadb.Client()
- Create Collection: Create a collection to store your documents. Collections can also be retrieved and deleted.
collection = client.create_collection("all-my-documents")
- Add Documents: Add documents, attach metadata for filtering, and assign a unique ID to each document.
collection.add(
    documents=["This is document1", "This is document2"],
    metadatas=[{"source": "notion"}, {"source": "google-docs"}],
    ids=["doc1", "doc2"],
)
- Query/Search: Query for the most similar documents, specifying the number of results and, optionally, a metadata filter.
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"source": "notion"},  # optional metadata filter
)
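To make the query step more concrete, here is a minimal pure-Python sketch of what a filtered similarity query does conceptually: keep the records whose metadata matches, then rank them by distance to the query vector. This is not Chroma's implementation. The names `cosine_distance` and `toy_query`, the record layout, and the two-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def toy_query(records, query_vector, n_results=2, where=None):
    """Return the ids of the n_results records closest to query_vector.

    `where` is an optional dict of metadata key/value pairs that a
    record must match exactly to be considered (hypothetical semantics).
    """
    candidates = [
        r for r in records
        if where is None
        or all(r["metadata"].get(k) == v for k, v in where.items())
    ]
    ranked = sorted(
        candidates, key=lambda r: cosine_distance(r["vector"], query_vector)
    )
    return [r["id"] for r in ranked[:n_results]]


# Hypothetical records with hand-written vectors instead of real embeddings.
records = [
    {"id": "doc1", "vector": [1.0, 0.0], "metadata": {"source": "notion"}},
    {"id": "doc2", "vector": [0.0, 1.0], "metadata": {"source": "google-docs"}},
    {"id": "doc3", "vector": [0.8, 0.6], "metadata": {"source": "notion"}},
]

closest = toy_query(records, [1.0, 0.1], n_results=2)
filtered = toy_query(records, [1.0, 0.1], where={"source": "google-docs"})
```

The sketch compares the query against every stored vector, which is fine for a handful of records. Real embedding databases usually rely on an index to avoid that exhaustive scan.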
Features
- Simplicity: Chroma prioritizes ease of use, with an API that is fully typed, fully tested, and documented.
- Integrations: The platform integrates seamlessly with tools like LangChain and LlamaIndex for both Python and JavaScript.
- Versatile Deployment: Developers can use the same API across different environments—from development and testing to a production cluster.
- Rich Functionality: It offers advanced functionalities, including complex queries, filtering, and density estimation.
- Free & Open Source: Distributed under the Apache 2.0 License, Chroma is both free and open for community contributions and enhancements.
Application Example: ChatGPT for Data
A practical use case for Chroma could be creating a "Chat your data" application. The steps include:
- Adding documents to the database, either using Chroma's default embeddings or providing custom embeddings.
- Querying relevant documents using natural language inputs.
- Composing those documents into the context window of large language models like GPT-3 for further processing.
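The last step above, composing retrieved documents into a model's context window, can be sketched in plain Python. This is only one possible approach, under stated assumptions: `build_prompt`, the prompt wording, and the character budget `max_context_chars` are hypothetical, and a character count is used as a crude stand-in for a real token limit. In an actual application the `retrieved` list would come from a query against a collection rather than being written by hand.

```python
def build_prompt(question, documents, max_context_chars=2000):
    """Pack documents (most relevant first) into a prompt for an LLM.

    Documents are added in order until the next one would exceed the
    character budget; the remaining documents are dropped.
    """
    context_parts = []
    used = 0
    for doc in documents:
        if used + len(doc) > max_context_chars:
            break
        context_parts.append(doc)
        used += len(doc)

    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieval results, ordered from most to least similar.
retrieved = [
    "Chroma is an embedding database.",
    "Collections store documents.",
]
prompt = build_prompt("What is Chroma?", retrieved)
```

The resulting string is what would be sent to the language model. Keeping the most similar documents first means that, when the budget is tight, the least relevant material is what gets cut.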
Understanding Embeddings
Embeddings are numerical vector representations of data (text, images, audio) that machine learning models can work with. An embedding gives each document a position in the latent space at a specific layer of a neural network. Documents with similar content end up close together, which is what makes similarity search possible.
By default, Chroma uses Sentence Transformers to compute embeddings, but it can also use embeddings from OpenAI or Cohere, or custom embeddings as needed.
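As a rough illustration of turning text into vectors and comparing them, the sketch below builds a toy word-count embedding over a fixed vocabulary. It is deliberately simplistic and is not how Sentence Transformers or any neural embedding model works. The names `toy_embed` and `similarity`, the vocabulary, and the sample sentences are all hypothetical.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Map text to a unit-length vector of word counts over `vocabulary`."""
    counts = Counter(text.lower().split())
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [v / norm for v in vector] if norm else vector


def similarity(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


vocabulary = ["cat", "dog", "pet", "stock", "market"]

pets = toy_embed("my cat and my dog are each a pet", vocabulary)
animals = toy_embed("a dog is a loyal pet", vocabulary)
finance = toy_embed("the stock market fell", vocabulary)

pets_vs_animals = similarity(pets, animals)
pets_vs_finance = similarity(pets, finance)
```

The two sentences about pets share vocabulary and score higher against each other than against the finance sentence. Learned embeddings aim for the same effect, but they capture meaning rather than exact word overlap.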
Community and Contribution
Chroma is an emerging project that thrives with community support. Contributors are encouraged to get involved:
- Join discussions on Discord
- Review and suggest new ideas via the Roadmap
- Tackle issues and contribute PRs through GitHub
Release and License
Chroma has a regular release schedule, with new versions rolled out weekly. Hotfixes are published as necessary throughout the week. The project is licensed under Apache 2.0, ensuring freedom of use and modification for all developers.