GPT Crawler: A Comprehensive Guide
GPT Crawler is a versatile and powerful tool designed to crawl websites and generate knowledge files. These files can then be used to create custom Generative Pre-trained Transformers (GPTs) from one or multiple web pages. Whether you're looking to synthesize information from a single site or combine data from various sources, GPT Crawler simplifies the process.
Example
One practical example of GPT Crawler's capabilities is its integration with Builder.io's documentation. By crawling Builder.io’s documentation pages, it can generate a comprehensive file that serves as the foundation for a custom GPT. This custom GPT can efficiently answer queries related to integrating Builder.io into other sites.
Try it out: You can see this custom GPT in action by exploring this link. Note that some features might require a subscription to a paid ChatGPT plan.
Getting Started
To make the most of GPT Crawler, it's essential to understand how to configure and run it, either locally or through alternative methods such as Docker or its API.
Running Locally
To run GPT Crawler on your local machine, follow these key steps:
- Clone the Repository
Ensure Node.js version 16 or higher is installed on your system. Use the following command to clone the repository:
git clone https://github.com/builderio/gpt-crawler
- Install Dependencies
Next, navigate to the cloned directory and install the necessary dependencies:
cd gpt-crawler
npm i
- Configure the Crawler
Adjust the configuration settings by editing the config.ts file. You need to specify details such as the starting URL, the link patterns to match, and the CSS selector for grabbing text (a sketch of the overall configuration shape follows these steps). Here's an example configuration designed for Builder.io:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
- Run Your Crawler
Execute the crawler with the command:
npm start
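For reference, the fields used in the configuration example above correspond to a shape roughly like the one sketched below. This is an approximation inferred from the example, not the project's authoritative type, which is defined in the repository's source.

// Approximate shape of the crawler configuration, inferred from the example
// above; consult the repository's own Config type for the authoritative definition.
interface Config {
  url: string;             // page the crawl starts from
  match: string;           // pattern that followed links must match
  selector: string;        // CSS selector whose text content is extracted
  maxPagesToCrawl: number; // upper bound on the number of pages visited
  outputFileName: string;  // path of the generated knowledge file
}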
Alternative Methods
Docker Container: The tool can be executed within a container environment. Modifications can be made in the containerapp directory's config.ts, and the output file will be located in the designated data directory.
API Usage: GPT Crawler can also be launched as an API server using Express.js. Run the server with:
npm run start:server
The server listens on port 3000 by default, and you can interact with it through POST requests to the /crawl endpoint.
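As a sketch of how a client might call the server, the example below assumes the /crawl endpoint accepts a crawl configuration as a JSON request body and responds with the collected pages; the exact request and response shapes are defined by the server code, so treat the field names here as assumptions.

// Hypothetical client for the crawl API; the body mirrors the config.ts example,
// but the server's accepted fields may differ.
const response = await fetch("http://localhost:3000/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://www.builder.io/c/docs/developers",
    match: "https://www.builder.io/c/docs/**",
    selector: ".docs-builder-container",
    maxPagesToCrawl: 50,
  }),
});
console.log(await response.json());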
Uploading Your Data to OpenAI
Upon successful crawling, a file named output.json is generated. This file can be uploaded to OpenAI to construct custom GPTs or assistants.
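The exact contents depend on the crawler version and configuration, but the file is typically a JSON array with one entry per crawled page; the field names below are illustrative rather than guaranteed.

[
  {
    "title": "Example page title",
    "url": "https://www.builder.io/c/docs/developers",
    "html": "…extracted page text…"
  }
]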
Creating a Custom GPT
For those looking to offer a user interface for their generated knowledge, creating a custom GPT via OpenAI's platform is an ideal solution. After navigating to OpenAI Chat, you can create a GPT and upload the necessary files.
If the file size exceeds upload limits, consider splitting the file or reducing its size via the settings in config.ts.
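If you prefer to split an existing output file rather than re-crawl, a short Node script along these lines can break the JSON array into smaller files. This is a hypothetical helper, not part of GPT Crawler, and it assumes the output is a plain JSON array of page entries.

// split-output.ts - hypothetical helper script, not part of GPT Crawler.
// Splits output.json (assumed to be a JSON array) into chunks of 100 entries.
import fs from "node:fs";

const pages: unknown[] = JSON.parse(fs.readFileSync("output.json", "utf8"));
const chunkSize = 100;

for (let i = 0; i < pages.length; i += chunkSize) {
  const chunk = pages.slice(i, i + chunkSize);
  const name = `output-${Math.floor(i / chunkSize) + 1}.json`;
  fs.writeFileSync(name, JSON.stringify(chunk, null, 2));
  console.log(`wrote ${name} with ${chunk.length} entries`);
}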
Creating a Custom Assistant
To enable API access to the generated knowledge for product integration, you can create a custom assistant by uploading your file through the OpenAI Assistant platform.
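One way to do this programmatically is with OpenAI's official Node SDK, uploading the generated file with the assistants purpose. The snippet below is a minimal sketch; it assumes output.json is in the working directory and that an OPENAI_API_KEY environment variable is set, and the uploaded file still needs to be attached to an assistant afterwards.

// Sketch: upload the crawled knowledge file for use with an OpenAI assistant.
// Assumes the official "openai" npm package and an OPENAI_API_KEY environment variable.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

const file = await openai.files.create({
  file: fs.createReadStream("output.json"),
  purpose: "assistants",
});

console.log(`Uploaded file id: ${file.id}`);
// Attach this file to an assistant from the OpenAI dashboard or via the Assistants API.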
Contributing
Contributions are welcomed to enhance the functionality of GPT Crawler. If you have suggestions or improvements, consider submitting a pull request for review.
By following this guide, you can use GPT Crawler to combine content from one or many sites into comprehensive knowledge files and turn them into custom GPTs and assistants.