Introduction to ScrapeGraphAI
ScrapeGraphAI is a Python web scraping library that simplifies extracting information from websites and from local documents such as XML, HTML, JSON, and Markdown. The library uses Large Language Models (LLMs) and direct graph logic to build scraping pipelines. In essence, you specify the information you need, and ScrapeGraphAI does the rest.
Quick Installation
To get started, install ScrapeGraphAI with pip, preferably inside a virtual environment to avoid conflicts with other libraries you might be using. The second command downloads the browsers that Playwright uses to load pages:
pip install scrapegraphai
playwright install
Optional dependencies can also be installed to extend the functionality of the library, such as supporting additional language models, enabling advanced semantic processing, and managing browser options effectively.
Usage
ScrapeGraphAI provides various standard scraping pipelines to extract information. One of the most commonly used is the SmartScraperGraph, which simplifies extracting details from a single webpage using a user prompt and a source URL. For example:
import json
from scrapegraphai.graphs import SmartScraperGraph
# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_APIKEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}
# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Find some information about what the company does, its name, and a contact email.",
    source="https://scrapegraphai.com/",
    config=graph_config,
)
# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
Running this example returns a dictionary containing the requested information, such as the company name, what the company does, and a contact email from the specified website.
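Because the result is a plain dictionary, it can be stored for later use. The following is a minimal sketch and not part of ScrapeGraphAI: `save_result` is a hypothetical helper, and the sample dictionary is an invented stand-in, since the keys the LLM actually returns depend on the prompt.

```python
import json
import tempfile
from pathlib import Path


def save_result(result: dict, path) -> Path:
    """Write a scraping result to a UTF-8 JSON file and return its path."""
    out = Path(path)
    # ensure_ascii=False keeps non-ASCII text (names, addresses) readable.
    out.write_text(
        json.dumps(result, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return out


# Placeholder standing in for the value returned by smart_scraper_graph.run();
# these keys and values are illustrative only.
sample_result = {
    "company": "Example Corp",
    "description": "Builds example products.",
    "contact_email": "info@example.com",
}

# Write to the system temp directory so the example leaves the project clean.
saved_path = save_result(
    sample_result, Path(tempfile.gettempdir()) / "scrape_result.json"
)
reloaded = json.loads(saved_path.read_text(encoding="utf-8"))
```

Reading the file back, as in the last line, is a quick way to confirm that the stored data round-trips unchanged.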
Versatile Pipelines
Aside from the single-page scraper, ScrapeGraphAI offers several other pipelines:
- SearchGraph: Scrapes multiple pages from search engine results.
- SpeechGraph: Scrapes a page and creates an audio file of the extracted information.
- ScriptCreatorGraph: Extracts data and generates a Python script for future use.
- MultiPage Pipelines: Extends functionalities to handle multi-page scraping with parallel LLM calls.
ScrapeGraphAI supports various LLMs through APIs, including OpenAI, Groq, Azure, and Gemini, or local models via Ollama.
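Switching between a hosted API and a local model comes down to the "llm" entry of the configuration. The sketch below uses a hypothetical helper (`build_graph_config` is not part of the library) to assemble a dictionary in the same shape as the `graph_config` shown in the Usage section. The model name `ollama/llama3.2` is an assumption chosen for illustration; check the documentation for the identifiers your installed version accepts.

```python
from typing import Optional


def build_graph_config(
    model: str,
    api_key: Optional[str] = None,
    verbose: bool = True,
    headless: bool = True,
) -> dict:
    """Assemble a config dict shaped like graph_config in the Usage section."""
    llm = {"model": model}
    # Hosted providers need a key; a locally served model typically does not.
    if api_key is not None:
        llm["api_key"] = api_key
    return {"llm": llm, "verbose": verbose, "headless": headless}


# Hosted model, mirroring the values used in the Usage example.
openai_config = build_graph_config(
    "openai/gpt-4o-mini", api_key="YOUR_OPENAI_APIKEY"
)

# Local model served by Ollama; the model name is an assumed placeholder.
local_config = build_graph_config("ollama/llama3.2")
```

Either dictionary could then be passed as the `config` argument when creating a pipeline, keeping the provider choice in one place.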
Demos and Documentation
To see ScrapeGraphAI in action, try the Streamlit demo application or run the library directly in a Google Colab notebook. Comprehensive online documentation guides users through the features and options ScrapeGraphAI offers.
Contributing and Community
The ScrapeGraphAI team welcomes contributions and engages with the community through platforms like Discord, LinkedIn, and Twitter. They encourage developers and enthusiasts to participate in discussions and improvements of the project.
Telemetry
To improve the quality and user experience of ScrapeGraphAI, anonymous usage metrics are collected. Users can opt out if they prefer not to share this data.
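One possible way to opt out from Python is to set an environment variable before the library is imported. The variable name used below, `SCRAPEGRAPHAI_TELEMETRY_ENABLED`, is an assumption; confirm the exact name and accepted values in the documentation for your version.

```python
import os

# Assumed variable name; verify it against the ScrapeGraphAI documentation.
# Set it before importing scrapegraphai, in case it is read at import time.
os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"] = "false"

telemetry_setting = os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"]
```

Setting the same variable in the shell or in a deployment's environment configuration would have the same effect without touching the code.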
In summary, ScrapeGraphAI is a robust and versatile tool that simplifies the process of web scraping through intuitive interfaces and advanced technological support, empowering users to efficiently gather and utilize online data.