Introduction to DataChad V3
DataChad V3 is an application for asking questions about any data source. It combines embeddings, vector databases, large language models, and langchains to retrieve relevant content from the data and answer questions about it.
How Does It Work?
- Creating Knowledge Bases: Users can upload any file or enter a path or URL to create Knowledge Bases. These bases can include multiple files of diverse types, formats, and content. Additionally, users can create Smart FAQs, which are organized lists of curated questions and answers.
- Data Loading and Chunking: The data from these sources or files is loaded into the system and divided into smaller text document chunks to ease processing.
- Embedding the Data: These text chunks are then transformed into embeddings, using an embedding model from a provider such as OpenAI or Hugging Face.
- Storing Embeddings: The embeddings are stored as a vector dataset in Activeloop's database hub, ensuring efficient retrieval and storage.
- Building a Langchain: A langchain is constructed by selecting a large language model (LLM), like gpt-3.5-turbo, along with multiple vector stores that act as knowledge bases, plus a unique smart FAQ vector store.
- Question-Answering Process: When a user asks a question, the system embeds the input query and performs a similarity search across the vector stores. The most similar chunks serve as context for the LLM, which generates the response.
- Chat History: To enhance the user experience, the chat history is stored locally, mimicking a ChatGPT-style conversation.
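The chunking, embedding, and similarity-search steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DataChad's actual code: the function names (chunk_text, toy_embed, cosine, similarity_search, build_prompt) are hypothetical, the word-count vector is a toy stand-in for a real embedding model from OpenAI or Hugging Face, and an in-memory list stands in for the vector dataset.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into overlapping character windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def toy_embed(text: str) -> Counter:
    """Toy embedding: a sparse word-count vector.

    Assumption: a real deployment would call an embedding model instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query: str, store: list, k: int = 2) -> list:
    """Return the k chunks whose embeddings are most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list) -> str:
    """Assemble the retrieved chunks and the question into one LLM prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Build a tiny in-memory "vector store" of (chunk, embedding) pairs.
docs = [
    "Embeddings are stored as a vector dataset in a database hub.",
    "The chat history is kept on the local machine.",
]
store = [(chunk, toy_embed(chunk)) for doc in docs for chunk in chunk_text(doc)]

question = "Where are embeddings stored?"
print(build_prompt(question, similarity_search(question, store, k=1)))
```

In the application itself, the assembled prompt would be sent to the selected LLM; the sketch only prints it, so it runs without credentials.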
Important Information
- System Requirements: The application requires Python version 3.10 or higher.
- Setup Instructions: For local execution or deployment, users need to copy the template environment file (.env.template) and set their credentials. Alternatively, they can set system environment variables or store credentials in .streamlit/secrets.toml when hosting via Streamlit.
- Configuration Adjustments: Users can modify configurations in datachad/backend/constants.py to enable advanced features.
- Support and Contribution: If users encounter data loading issues, they are encouraged to open an Issue or Pull Request and contribute to the project.
- Previous Versions: For users preferring the original functionality and UI, previous releases like V1 or V2 are available.
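As a rough illustration of the setup notes above, the following sketch checks the Python version and looks up a credential first in the system environment and then in an environment file. It is a sketch under assumptions, not how DataChad itself loads settings: the helper names, the credential name EXAMPLE_API_KEY, the .env file name, and the precedence order (environment variable before file) are all hypothetical.

```python
import os
import sys
from pathlib import Path

MIN_PYTHON = (3, 10)  # the application requires Python 3.10 or higher


def check_python(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def parse_env_file(path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    file_path = Path(path)
    if not file_path.exists():
        return values
    for raw_line in file_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_credential(name: str, env_file=".env", environ=None):
    """Look up a credential in the environment, then in the env file.

    Assumption: environment variables take precedence over the file.
    Returns None when the credential is not set in either place.
    """
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    return parse_env_file(env_file).get(name)


if not check_python():
    print("Warning: Python 3.10 or higher is required.")

# EXAMPLE_API_KEY is a placeholder name, not a documented setting.
print("credential found:", resolve_credential("EXAMPLE_API_KEY") is not None)
```

When hosting via Streamlit, the same lookup would instead read from .streamlit/secrets.toml; that path is left out here to keep the sketch free of third-party imports.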
Current Application Interface
DataChad V3 features an updated user interface designed to simplify user interactions.
Contribution and Development Opportunities
The project is open to contributions, with several tasks available:
- Refactoring utility loaders
- Adding model and embedding options
- Supporting fully local/private modes
- Enabling multi-file uploads to a single dataset
- Decoupling DataChad modules from Streamlit
- Introducing smart FAQs
- Implementing enhancements like user creation, asynchronous I/O, and more
- Developing a frontend and possibly containerizing the app
Contributors are welcome to take on any of these tasks and help shape the evolution of DataChad V3.