DataChad - Advanced Q&A Using Large Language Models and Vector Databases

Introduction to DataChad V3

DataChad V3 is an application for asking questions about any data source. It combines embeddings, vector databases, large language models (LLMs), and LangChain chains to retrieve relevant content and generate answers from it.

How Does It Work?

Creating Knowledge Bases: Users can upload any file or enter a path or URL to create Knowledge Bases. These bases can include multiple files of diverse types, formats, and content. Additionally, users can create Smart FAQs, which are organized lists of curated questions and answers.
Data Loading and Chunking: The data from these sources or files is loaded into the system and divided into smaller text document chunks to ease processing.
Embedding the Data: These text chunks are then transformed into embeddings, using embedding models from providers such as OpenAI or Hugging Face.
Storing Embeddings: The embeddings are stored as a vector dataset in Activeloop's database hub, where they can later be retrieved by similarity search.
Building a LangChain Chain: A chain is constructed from a large language model (LLM) such as gpt-3.5-turbo, multiple vector stores that act as knowledge bases, and a single dedicated Smart FAQ vector store.
Question-Answering Process: When a user asks a question, the system embeds the query and performs a similarity search across the vector stores. The most similar results are passed to the LLM as context, and the LLM generates the response.
Chat History: The chat history is stored locally to enable a ChatGPT-style conversation.
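
The chunking, embedding, storage, and retrieval steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DataChad's actual code: chunk_text, toy_embed, VectorStore, and build_prompt are hypothetical names, the hashed bag-of-words embedding stands in for an OpenAI or Hugging Face embedding model, and an in-memory list stands in for the Activeloop vector dataset. The LLM call itself is omitted; the sketch stops at the prompt that would be sent to the model.

```python
import hashlib
import math
from dataclasses import dataclass, field


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not a pure repeat of the overlap.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Stand-in for a real embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip(".,!?;:\"'()")
        if not token:
            continue
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class VectorStore:
    """In-memory stand-in for a vector dataset (one per knowledge base)."""
    name: str
    entries: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.entries.append((toy_embed(chunk), chunk))

    def search(self, query: str, k: int = 2) -> list[tuple[float, str]]:
        """Return the k chunks with the highest cosine similarity to the query."""
        q = toy_embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), text) for vec, text in self.entries]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(question: str, stores: list[VectorStore], k: int = 2) -> str:
    """Search every store, keep the best k hits overall, and format them as context."""
    hits: list[tuple[float, str]] = []
    for store in stores:
        hits.extend(store.search(question, k))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    context = "\n".join(text for _, text in hits[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    docs = VectorStore("docs")
    docs.add(chunk_text("Embeddings are stored as a vector dataset. " * 5, chunk_size=80, overlap=20))
    faq = VectorStore("smart_faq")
    faq.add(["Q: Which Python version is required? A: 3.10 or higher."])
    print(build_prompt("Where are embeddings stored?", [docs, faq]))
```

In a real deployment the two stand-ins would be swapped for an embedding model and a persistent vector store, while the overall flow (chunk, embed, store, search, assemble context) stays the same.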

Important Information

System Requirements: The application requires Python version 3.10 or higher.
Setup Instructions: For local execution or deployment, users need to copy a template environment file (.env.template) and set credentials. Alternatively, they can set system environment variables or store credentials in .streamlit/secrets.toml when hosting via Streamlit.
Configuration Adjustments: Users can modify configurations in datachad/backend/constants.py to enable advanced features.
Support and Contribution: If users encounter data loading issues, they are encouraged to open an Issue or Pull Request and contribute to the project.
Previous Versions: For users preferring the original functionality and UI, previous releases like V1 or V2 are available.
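
The Python version requirement and the credential setup described above can be illustrated with a small sketch. It is not taken from the DataChad code base: the helper names, the .env file name (assumed to be the copy made from .env.template), the example variable name, and the lookup order (system environment variable first, then the file) are all assumptions made for illustration.

```python
import os
import sys
from pathlib import Path
from typing import Optional

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def check_python(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def parse_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_credential(name: str, env_file: str = ".env", environ=None) -> Optional[str]:
    """Look up a credential: environment variable first, then the env file.

    The precedence order is an assumption of this sketch.
    """
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    return parse_env_file(env_file).get(name)


if __name__ == "__main__":
    if not check_python():
        sys.exit("Python 3.10 or higher is required")
    # EXAMPLE_API_KEY is a hypothetical variable name used only for illustration.
    key = load_credential("EXAMPLE_API_KEY")
    print("credential found" if key else "credential missing")
```

When hosting via Streamlit, the same values would instead be placed in .streamlit/secrets.toml, as noted above.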

Current Application Interface

DataChad V3 features an updated user interface designed to simplify user interactions.

[Image: DataChad V3 interface]

Contribution and Development Opportunities

The project is open to contributions, with several tasks available:

Refactoring utility loaders
Adding model and embedding options
Supporting fully local/private modes
Enabling multi-file uploads to a single dataset
Decoupling DataChad modules from Streamlit
Introducing smart FAQs
Implementing enhancements like user creation, asynchronous I/O, and more
Developing a frontend and possibly containerizing the app

Contributors are welcome to pick up any of these tasks.