Introduction to Mercury: Your Custom GPT Trainer
Chat with Any Document or Website
Mercury is a project designed to let users interact with documents and websites conversationally. It allows you to train a custom GPT (Generative Pre-trained Transformer) on data you choose, whether from specific websites or from documents you upload. You can build up a dialogue history, and answers are backed by cited sources, so responses can be checked against the original material.
Supported File Types
Mercury supports a variety of file types to train your custom GPT, including:
- PDFs (.pdf)
- Word documents (.docx)
- Markdown files (.md)
- Text files (.txt)
- Images (.png and .jpg)
- HTML and JSON files
There are plans to support additional formats such as .csv and .pptx, along with integrations for Notion and the Next 13 App Dir.
Training Process
Mercury offers two main methods to embed data for training:
1. Upload Method:
Using the /api/embed-file endpoint, you can upload any supported file. The system converts the file to plain text, splits the text into 1000-character segments, and runs each segment through OpenAI's embedding API (the "text-embedding-ada-002" model) to generate an embedding, which is stored in a Pinecone namespace. A sketch of this pipeline follows the list.
2. Web Scraping Method:
Through the /api/embed-webpage endpoint, whole web pages can be scraped. They are then cleaned and split into segments, and, just as with uploads, embeddings are generated and stored in Pinecone.
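To make the flow concrete, here is a minimal sketch of that embed-and-store pipeline in TypeScript, assuming the current openai and @pinecone-database/pinecone Node SDKs. The "mercury" index name, the chunkText helper, and the embedDocument wrapper are illustrative, not the project's actual code:

```ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();     // reads OPENAI_API_KEY from the environment
const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment

// Split cleaned plain text into fixed-size 1000-character segments.
function chunkText(text: string, size = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Embed every segment and upsert the vectors into a Pinecone namespace.
async function embedDocument(docId: string, text: string, namespace: string) {
  const chunks = chunkText(text);
  const { data } = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: chunks, // the API accepts an array and embeds each entry
  });
  const vectors = data.map((d, i) => ({
    id: `${docId}-${i}`,
    values: d.embedding,           // 1536-dimensional vector
    metadata: { text: chunks[i] }, // keep the raw text for prompt-building later
  }));
  // "mercury" is a placeholder index name
  await pinecone.index("mercury").namespace(namespace).upsert(vectors);
}
```

Storing the original text as metadata alongside each vector is what makes the later query step cheap: the matched chunks can be dropped straight into a prompt without a second lookup.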
Query System
Users ask questions through the /api/query endpoint. Mercury generates a single embedding from the user's question and matches it against the vector database with a similarity search. The most similar results are used to build a comprehensive prompt for GPT-3, whose response is streamed back to the user.
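A sketch of what that query flow could look like with the same SDKs; the topK value, index name, completion model, and prompt template here are assumptions for illustration, not Mercury's exact implementation:

```ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();
const pinecone = new Pinecone();

async function answerQuestion(question: string, namespace: string) {
  // 1. Embed the question with the same model used at indexing time.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: question,
  });

  // 2. Similarity-search the vector database for the closest stored segments.
  const results = await pinecone
    .index("mercury") // placeholder index name
    .namespace(namespace)
    .query({ vector: data[0].embedding, topK: 5, includeMetadata: true });

  // 3. Assemble the matched text into a prompt and stream a completion.
  const context = results.matches
    .map((m) => m.metadata?.text)
    .join("\n---\n");
  return openai.completions.create({
    model: "gpt-3.5-turbo-instruct", // stand-in for the GPT-3 completion model
    prompt: `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}\nAnswer:`,
    stream: true,
  });
}
```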
How to Get Started
To begin using Mercury:
1. Clone Repository and Install Dependencies
Use degit to clone the repository and install dependencies with npm:
```bash
npx degit https://github.com/Jordan-Gilliam/ai-template ai-template
cd ai-template
npm i
```
2. Set Up Pinecone
Create an account on Pinecone, set up a new Pinecone Index with 1536 dimensions (the output size of the "text-embedding-ada-002" model), and obtain your API key. Record the environment name and index name.
3. Set Up OpenAI API
Create an account on OpenAI, then generate and copy your API key.
4. Configure Environment Settings
Duplicate the .env.example file as .env.local and configure it with your OpenAI and Pinecone API keys (see the example after these steps).
5. Launch the Application
Run the application in development mode and open it in your browser:
```bash
npm run dev
```
Then access it at http://localhost:3000.
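For step 4, the finished .env.local typically ends up looking something like this; the exact variable names are defined in the repo's .env.example, so treat the names and values below as placeholders:

```bash
# Placeholder values — copy the real variable names from .env.example
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=your-pinecone-key
PINECONE_ENVIRONMENT=us-east1-gcp
PINECONE_INDEX_NAME=mercury
```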
Project Features
Key features of Mercury include API integration with OpenAI, storage through Pinecone, API routes via Next.js, and a stylish UI with Tailwind CSS. The project supports dark mode and utilizes various libraries to enhance the development experience.
Inspiration
Mercury draws inspiration from various technology projects and developers, including @gannonh and @mayooear.
How Embeddings Work
Embeddings are a key concept in Mercury. An embedding is a vector representation of a piece of text; vectors for related text land close together, and cosine similarity between two vectors measures how related the corresponding segments are. This is what makes domain-specific conversation models possible: the GPT-3 model can answer niche queries accurately because it is handed the most closely related text snippets from the training data.
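Pinecone performs the similarity search server-side, but the underlying measure is simple. A minimal sketch of cosine similarity between two embedding vectors, for intuition only:

```ts
// Cosine similarity between two vectors of equal length.
// Values near 1 mean the texts are highly related; values near 0 mean unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```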
By using Mercury, users are equipped with the tools to develop advanced, domain-specific AI applications that can provide accurate and cited information, enhancing digital interactions in various fields.