GPT Crawler: A Comprehensive Guide
GPT Crawler is a versatile and powerful tool designed to crawl websites and generate knowledge files. These files can then be used to create custom Generative Pre-trained Transformers (GPTs) from one or multiple web pages. Whether you're looking to synthesize information from a single site or combine data from various sources, GPT Crawler simplifies the process.
Example
One practical example of GPT Crawler's capabilities is its integration with Builder.io's documentation. By crawling Builder.io’s documentation pages, it can generate a comprehensive file that serves as the foundation for a custom GPT. This custom GPT can efficiently answer queries related to integrating Builder.io into other sites.
Try it out: You can see this custom GPT in action by exploring this link. Note that some features might require a subscription to a paid ChatGPT plan.
Getting Started
To make the most of GPT Crawler, it's essential to understand how to configure and run it, either locally or through alternative methods such as Docker or its API.
Running Locally
To run GPT Crawler on your local machine, follow these key steps:
- Clone the Repository
Ensure Node.js version 16 or higher is installed on your system. Use the following command to clone the repository:
git clone https://github.com/builderio/gpt-crawler
- Install Dependencies
Next, navigate to the cloned directory and install the necessary dependencies:
cd gpt-crawler
npm i
- Configure the Crawler
Adjust the configuration settings by editing the config.ts file. You need to specify details such as the starting URL, the link patterns to match, and the CSS selector for grabbing text (a sketch of the overall configuration shape follows these steps). Here's an example configuration designed for Builder.io:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
- Run Your Crawler
Execute the crawler with the command:
npm start
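For reference, the fields used in the configuration example above correspond to a shape roughly like the one sketched below. This is an approximation inferred from the example, not the project's authoritative type, which is defined in the repository's source.

// Approximate shape of the crawler configuration, inferred from the example
// above; consult the repository's own Config type for the authoritative definition.
interface Config {
  url: string;             // page the crawl starts from
  match: string;           // pattern that followed links must match
  selector: string;        // CSS selector whose text content is extracted
  maxPagesToCrawl: number; // upper bound on the number of pages visited
  outputFileName: string;  // path of the generated knowledge file
}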
Alternative Methods
Docker Container: The tool can be executed within a container environment. Modifications can be made in the containerapp directory's config.ts, and the output file will be located in the designated data directory.
API Usage: GPT Crawler can also be launched as an API server using Express.js. Run the server with:
npm run start:server
The server listens on port 3000 by default, and you can interact with it through POST requests to the /crawl endpoint.
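As a sketch of how a client might call the server, the example below assumes the /crawl endpoint accepts a crawl configuration as a JSON request body and responds with the collected pages; the exact request and response shapes are defined by the server code, so treat the field names here as assumptions.

// Hypothetical client for the crawl API; the body mirrors the config.ts example,
// but the server's accepted fields may differ.
const response = await fetch("http://localhost:3000/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://www.builder.io/c/docs/developers",
    match: "https://www.builder.io/c/docs/**",
    selector: ".docs-builder-container",
    maxPagesToCrawl: 50,
  }),
});
console.log(await response.json());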
Uploading Your Data to OpenAI
Upon successful crawling, a file named output.json is generated. This file can be uploaded to OpenAI to construct custom GPTs or assistants.
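The exact contents depend on the crawler version and configuration, but the file is typically a JSON array with one entry per crawled page; the field names below are illustrative rather than guaranteed.

[
  {
    "title": "Example page title",
    "url": "https://www.builder.io/c/docs/developers",
    "html": "…extracted page text…"
  }
]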
Creating a Custom GPT
For those looking to offer a user interface for their generated knowledge, creating a custom GPT via OpenAI's platform is an ideal solution. After navigating to OpenAI Chat, you can create a GPT and upload the necessary files.
If the file size exceeds upload limits, consider splitting the file or reducing its size via the settings in config.ts.
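If you prefer to split an existing output file rather than re-crawl, a short Node script along these lines can break the JSON array into smaller files. This is a hypothetical helper, not part of GPT Crawler, and it assumes the output is a plain JSON array of page entries.

// split-output.ts - hypothetical helper script, not part of GPT Crawler.
// Splits output.json (assumed to be a JSON array) into chunks of 100 entries.
import fs from "node:fs";

const pages: unknown[] = JSON.parse(fs.readFileSync("output.json", "utf8"));
const chunkSize = 100;

for (let i = 0; i < pages.length; i += chunkSize) {
  const chunk = pages.slice(i, i + chunkSize);
  const name = `output-${Math.floor(i / chunkSize) + 1}.json`;
  fs.writeFileSync(name, JSON.stringify(chunk, null, 2));
  console.log(`wrote ${name} with ${chunk.length} entries`);
}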
Creating a Custom Assistant
To enable API access to the generated knowledge for product integration, you can create a custom assistant by uploading your file through the OpenAI Assistant platform.
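One way to do this programmatically is with OpenAI's official Node SDK, uploading the generated file with the assistants purpose. The snippet below is a minimal sketch; it assumes output.json is in the working directory and that an OPENAI_API_KEY environment variable is set, and the uploaded file still needs to be attached to an assistant afterwards.

// Sketch: upload the crawled knowledge file for use with an OpenAI assistant.
// Assumes the official "openai" npm package and an OPENAI_API_KEY environment variable.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

const file = await openai.files.create({
  file: fs.createReadStream("output.json"),
  purpose: "assistants",
});

console.log(`Uploaded file id: ${file.id}`);
// Attach this file to an assistant from the OpenAI dashboard or via the Assistants API.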
Contributing
Contributions are welcomed to enhance the functionality of GPT Crawler. If you have suggestions or improvements, consider submitting a pull request for review.
By following this guide, you can use GPT Crawler to combine content from one or many sites into comprehensive knowledge files and turn them into custom GPTs and assistants.