Scrapegraph-ai - Streamlined Web Data Extraction Using Language Models and Graph Logic

Introduction to ScrapeGraphAI

ScrapeGraphAI is a Python web scraping library designed to simplify extracting information from websites and local documents such as XML, HTML, JSON, and Markdown. It combines Large Language Models (LLMs) with direct graph logic to build scraping pipelines. In essence, you specify the information you need, and ScrapeGraphAI does the rest.

Quick Installation

To get started, install ScrapeGraphAI with pip, then run playwright install to download the browser binaries that Playwright needs. Setting it up in a virtual environment is recommended to avoid conflicts with other libraries you might be using:

pip install scrapegraphai

playwright install

Optional dependencies can also be installed to extend the library, for example to support additional language models, enable advanced semantic processing, or manage browser options.

Usage

ScrapeGraphAI provides several standard scraping pipelines. One of the most commonly used is SmartScraperGraph, which extracts details from a single webpage given a user prompt and a source URL. For example:

import json
from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_APIKEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Find some information about what does the company do, the name and a contact email.",
    source="https://scrapegraphai.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

Running this example returns a dictionary containing the requested information from the specified website, such as the company name, what the company does, and a contact email.
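Because the dictionary is produced by a language model, its keys may vary with the prompt and the model. The following is a minimal sketch of defensive post-processing. The key names (company_name, description, contact_email) and the helper pick_fields are hypothetical and are not part of ScrapeGraphAI; the sample dictionary stands in for a real result.

```python
from typing import Any

# Hypothetical key names; the actual keys depend on the prompt and the model.
EXPECTED_KEYS = ("company_name", "description", "contact_email")


def pick_fields(result: dict[str, Any], keys=EXPECTED_KEYS) -> dict[str, Any]:
    """Return only the expected keys, filling missing ones with None."""
    return {key: result.get(key) for key in keys}


# Stand-in for the dictionary returned by smart_scraper_graph.run()
sample_result = {
    "company_name": "ScrapeGraphAI",
    "contact_email": "contact@example.com",
}

fields = pick_fields(sample_result)
print(fields)
```

Filling missing keys with None keeps downstream code from raising KeyError when the model omits a field.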

Versatile Pipelines

Aside from the single-page scraper, ScrapeGraphAI offers several other pipelines:

SearchGraph: Scrapes multiple pages from search engine results.
SpeechGraph: Scrapes a page and creates an audio file of the extracted information.
ScriptCreatorGraph: Extracts data and generates a Python script for future use.
MultiPage Pipelines: Multi-page versions of the pipelines that scrape several pages using parallel LLM calls.
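As an illustration of one of these pipelines, here is a minimal sketch of calling SearchGraph. It assumes that SearchGraph is imported from scrapegraphai.graphs like SmartScraperGraph, that it takes a prompt and a config but no source (since it finds pages through a search engine), and that it accepts the same configuration shape as the single-page example. The wrapper run_search and the prompt text are hypothetical.

```python
def run_search(prompt, graph_config):
    """Run a SearchGraph pipeline for the given prompt and return its result."""
    # Imported lazily so this module still loads when scrapegraphai is not installed.
    from scrapegraphai.graphs import SearchGraph

    search_graph = SearchGraph(prompt=prompt, config=graph_config)
    return search_graph.run()


# Same configuration shape as the SmartScraperGraph example
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_APIKEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
}

# Uncomment to run with a valid API key:
# result = run_search("List the main Python web scraping libraries.", graph_config)
# print(result)
```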

ScrapeGraphAI supports various LLMs through APIs, including OpenAI, Groq, Azure, and Gemini, or local models via Ollama.
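Switching providers is mostly a matter of changing the "llm" section of the configuration. The sketch below builds that dictionary for a hosted model and for a local Ollama model. It assumes the "provider/model" naming seen in the earlier example carries over to Ollama, and that a "base_url" key is accepted for local servers. The helper make_graph_config and the model name ollama/llama3.2 are illustrative assumptions, and http://localhost:11434 is Ollama's default local address.

```python
from typing import Optional


def make_graph_config(
    model: str,
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
) -> dict:
    """Build a graph configuration, including only the LLM options that are set."""
    llm = {"model": model}
    if api_key is not None:
        llm["api_key"] = api_key
    if base_url is not None:
        # Assumed key for pointing at a locally hosted model server.
        llm["base_url"] = base_url
    return {"llm": llm, "verbose": True, "headless": False}


# Hosted model through an API key
openai_config = make_graph_config("openai/gpt-4o-mini", api_key="YOUR_OPENAI_APIKEY")

# Local model served by Ollama (no API key needed)
ollama_config = make_graph_config("ollama/llama3.2", base_url="http://localhost:11434")
```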

Demos and Documentation

To see ScrapeGraphAI in action, you can try the Streamlit demo application or run the library directly in the browser using Google Colab. Comprehensive online documentation covers the library's features and options.

Contributing and Community

The ScrapeGraphAI team welcomes contributions and engages with the community through platforms like Discord, LinkedIn, and Twitter. They encourage developers and enthusiasts to participate in discussions and improvements of the project.

Telemetry

To improve the quality and user experience of ScrapeGraphAI, the library collects anonymous usage metrics. Users who prefer not to share this data can opt out.
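A minimal sketch of opting out from Python, assuming the switch is the environment variable SCRAPEGRAPHAI_TELEMETRY_ENABLED. That name is an assumption here and should be confirmed against the project documentation. The variable is set before the library is imported so that it is already in place when the library loads.

```python
import os

# Assumed variable name; confirm it in the project documentation.
os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"] = "false"

# Import the library only after the variable is set, for example:
# from scrapegraphai.graphs import SmartScraperGraph
```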

In summary, ScrapeGraphAI simplifies web scraping by letting you describe the data you want in a prompt and leaving the extraction to LLM-driven graph pipelines, whether the source is a single page, a set of search results, or a local document.