AI Web Scraper: A Comprehensive Overview
The AI Web Scraper simplifies gathering information from web pages. It uses GPT-4 to generate scraping code from an HTML source and a description of the data you need, then runs that code to extract the data.
Prerequisites
Before diving into the world of AI-assisted web scraping, ensure you have the necessary tools at your disposal:
- Python 3.x: The programming language used to develop and run the application.
- Python Packages: Specific packages are needed and are listed in the requirements.txt file provided in the project.
- OpenAI GPT-4 API Key: Access this key through OpenAI's services to enable the AI capabilities of the scraper.
Installation Steps
To start using the AI Web Scraper, follow these straightforward steps:
- Clone the Repository: First, copy the project files onto your local machine by executing:
  git clone https://github.com/dirkjbreeuwer/gpt-automated-web-scraper
- Navigate to the Project Directory: Move into the directory where the project files are located using:
  cd gpt-automated-web-scraper
- Install Required Packages: Ensure you have all the necessary Python packages installed by running:
  pip install -r requirements.txt
- Set Up the API Key:
  - Obtain an API key from OpenAI by following their registration and subscription information.
  - Rename the file named .env.example to .env in the directory of the project.
  - Insert your API key in the .env file with this line:
    OPENAI_API_KEY=YOUR_API_KEY
    Replace YOUR_API_KEY with the actual key.
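Before running the scraper, it can help to confirm that the .env file really contains a key and not the placeholder. The following is a minimal sketch using only the standard library; it is not the project's own loading code, and the helper names load_env_file and has_real_api_key are hypothetical.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Hypothetical helper for illustration; the project may load the file differently.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_real_api_key(values):
    """Return True if OPENAI_API_KEY is present and is not the placeholder."""
    key = values.get("OPENAI_API_KEY", "")
    return bool(key) and key != "YOUR_API_KEY"


if __name__ == "__main__":
    if has_real_api_key(load_env_file(".env")):
        print("OPENAI_API_KEY found in .env")
    else:
        print("OPENAI_API_KEY is missing or still set to the placeholder")
```

Run this from the project directory so that the relative path .env resolves to the file you just edited.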
Using the AI Web Scraper
To operate the AI Web Scraper, run the gpt-scraper.py script from the command line with arguments tailored to your scraping needs.
Command-line Arguments
Customize your scraping operation by using these command-line options:
- --source: Defines the URL or file path of the HTML source to scrape.
- --source-type: Identifies whether the source is online ("url") or local ("file").
- --requirements: Specifies the data extraction needs.
- --target-string: Provides an example text fragment that helps the AI narrow down where the desired data is within the HTML, crucial given GPT-4's token limits.
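To make the shape of these options concrete, here is a sketch of how the four flags could be declared with Python's argparse module. It is an illustration only, not the parser from gpt-scraper.py; in particular, marking every flag as required and restricting --source-type to "url" or "file" are assumptions based on the descriptions above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="gpt-scraper.py",
        description="Generate and run scraping code for an HTML source.",
    )
    # Assumption: all four flags are required; the README does not say so.
    parser.add_argument("--source", required=True,
                        help="URL or file path of the HTML source to scrape")
    parser.add_argument("--source-type", required=True, choices=["url", "file"],
                        help='"url" for an online source, "file" for a local one')
    parser.add_argument("--requirements", required=True,
                        help="description of the data to extract")
    parser.add_argument("--target-string", required=True,
                        help="example text fragment located near the desired data")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts dashes to underscores: --source-type becomes args.source_type.
    print(args.source_type, args.source)
```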
Example Command
The following command shows how to fetch information with the AI Web Scraper:
python3 gpt-scraper.py --source-type "url" --source "https://www.scrapethissite.com/pages/forms/" --requirements "Print a JSON file with all the information available for the Chicago Blackhawks" --target-string "Chicago Blackhawks"
In the command above, adapt the --source, --requirements, and --target-string to fit your specific web scraping project.
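One way to picture why --target-string helps with token limits: a short, distinctive string lets a tool keep only the part of the page near the data and discard the rest before prompting the model. The sketch below shows that idea with a plain character window. It is an assumption about the general technique, not a description of how this project locates the data, and the function name snippet_around and its context_chars parameter are hypothetical.

```python
def snippet_around(html, target, context_chars=500):
    """Return the slice of html within context_chars of the first match of target.

    Hypothetical helper: a character window is the simplest possible approach;
    a real tool might select the enclosing HTML element instead.
    """
    index = html.find(target)
    if index == -1:
        raise ValueError(f"target string {target!r} not found in HTML")
    start = max(0, index - context_chars)
    end = min(len(html), index + len(target) + context_chars)
    return html[start:end]


if __name__ == "__main__":
    page = "<table>" + "<tr><td>filler</td></tr>" * 200 \
        + "<tr><td>Chicago Blackhawks</td><td>1990</td></tr></table>"
    trimmed = snippet_around(page, "Chicago Blackhawks", context_chars=40)
    print(len(page), "characters reduced to", len(trimmed))
```

A target string that appears only once on the page, close to the data you want, gives the most useful window.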
Licensing Information
The AI Web Scraper is distributed under the MIT License, allowing you the freedom to modify and utilize the tool as you see fit for your projects.
This cutting-edge tool makes web data extraction easier and more accessible, empowering users to gather information efficiently using artificial intelligence.