llm-scraper - Utilize TypeScript and LLMs for Comprehensive Webpage Data Extraction

Introduction to LLM Scraper

LLM Scraper is a TypeScript library that extracts structured data from any webpage with the help of large language models (LLMs). It is aimed at developers and businesses that need data extraction without building and maintaining a complex scraping pipeline.

Key Features

Support for Multiple Providers: LLM Scraper works with local models (for example through Ollama, or models in GGUF format), with OpenAI, and with providers available through the Vercel AI SDK. This flexibility allows users to choose the provider that best fits their needs.
Schema Definition with Zod: It utilizes Zod for defining schemas, ensuring that the data extracted is structured and validated according to user requirements.
Type-Safety: Written in TypeScript, the library offers full type-safety, minimizing errors and enhancing code reliability.
Integration with Playwright: Built on the robust Playwright framework, LLM Scraper can handle web page rendering and interaction seamlessly.
Streaming Capabilities: It supports streaming objects, which allows for real-time data handling, especially with Vercel AI SDK.
Code-Generation Feature: A newly introduced feature enables users to generate code for reusable Playwright scripts, streamlining repetitive scraping tasks.
Multiple Formatting Modes: LLM Scraper provides diverse formatting options, such as:
- html for raw HTML content
- markdown for markdown content
- text for plain text extraction using Readability.js
- image for capturing screenshots (available in multi-modal mode)
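
The format names listed above map naturally onto a TypeScript string-literal union. The sketch below is a hypothetical helper, not part of llm-scraper: `SCRAPE_FORMATS`, `ScrapeFormat`, `isScrapeFormat`, and `resolveFormat` are names invented here, and the set of four modes is taken from the list above (the version you install may accept a different set). It shows one way to validate a format string that arrives from configuration before passing it as the `format` option.

```typescript
// Hypothetical helper (not part of llm-scraper): the four modes are those listed above.
const SCRAPE_FORMATS = ['html', 'markdown', 'text', 'image'] as const

// Union type derived from the array: 'html' | 'markdown' | 'text' | 'image'
type ScrapeFormat = (typeof SCRAPE_FORMATS)[number]

// Type guard: narrows an arbitrary string to one of the known format names.
function isScrapeFormat(value: string): value is ScrapeFormat {
  return (SCRAPE_FORMATS as readonly string[]).includes(value)
}

// Returns the requested format if it is known, otherwise the fallback.
function resolveFormat(
  requested: string | undefined,
  fallback: ScrapeFormat = 'html'
): ScrapeFormat {
  return requested !== undefined && isScrapeFormat(requested) ? requested : fallback
}
```

Deriving the union from a single array keeps the runtime check and the compile-time type in sync, so adding a mode means editing one line.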

Getting Started

To start using LLM Scraper, follow these steps:

Install Dependencies: Start by installing the necessary npm packages using the command:
```
npm i zod playwright llm-scraper
```
Initialize LLM: Depending on the AI provider you choose, initialize the LLM with the corresponding setup. For instance, using OpenAI:
```
npm i @ai-sdk/openai
```
And then:
```
import { openai } from '@ai-sdk/openai'
const llm = openai.chat('gpt-4o')
```
Create a Scraper Instance: Use the initialized LLM to create a new scraper instance:
```
import LLMScraper from 'llm-scraper'
const scraper = new LLMScraper(llm)
```

Practical Example

The following example extracts the top stories from Hacker News in three steps:

- Launch a browser with Playwright.
- Define the schema for the data you wish to extract.
- Run the scraper on a webpage and print the extracted data.

```
import { chromium } from 'playwright'
import { z } from 'zod'
import { openai } from '@ai-sdk/openai'
import LLMScraper from 'llm-scraper'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize the LLM and create the scraper
const llm = openai.chat('gpt-4o')
const scraper = new LLMScraper(llm)

// Open a new page and navigate to Hacker News
const page = await browser.newPage()
await page.goto('https://news.ycombinator.com')

// Define the schema for the data to extract
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper and print the extracted stories
const { data } = await scraper.run(page, schema, { format: 'html' })
console.log(data.top)

// Clean up
await page.close()
await browser.close()
```
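
Hacker News links to its comment pages with relative paths, so depending on what the model returns, `commentsURL` may come back as something like `item?id=123` rather than a full URL. That is an assumption about the model output, not documented library behaviour. The sketch below post-processes the extracted array with the standard `URL` constructor; the `Story` interface mirrors the schema above, and `absolutizeComments` is a hypothetical helper name.

```typescript
// Mirrors the shape of one entry in the `top` array from the schema above.
interface Story {
  title: string
  points: number
  by: string
  commentsURL: string
}

// Hypothetical helper: resolves each commentsURL against the page it was scraped from.
// Absolute URLs pass through unchanged; relative ones are joined to `base`.
function absolutizeComments(stories: Story[], base: string): Story[] {
  return stories.map((story) => ({
    ...story,
    commentsURL: new URL(story.commentsURL, base).href,
  }))
}
```

With the example above, this could be called as `absolutizeComments(data.top, page.url())` before the page is closed.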

Streaming and Code Generation

LLM Scraper also offers streaming and code generation. To receive results incrementally, call the stream function instead of run. To produce a reusable Playwright script for repeated scraping of the same page structure, call the generate function.
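
To make the streaming idea concrete without needing a browser or an API key, the sketch below consumes a simulated stream. `fakeStoryStream` is a stand-in async generator invented for illustration. It is assumed here that the library's stream function exposes an async iterable of progressively more complete objects, which should be confirmed against the version you install. `PartialResult` and `lastSnapshot` are likewise hypothetical names.

```typescript
// Shape of a partially filled result: any field may still be missing mid-stream.
type PartialResult = { top?: Array<{ title?: string; points?: number }> }

// Stand-in for a real stream (assumption): yields progressively more complete snapshots.
async function* fakeStoryStream(): AsyncGenerator<PartialResult> {
  yield {}
  yield { top: [{ title: 'First story' }] }
  yield { top: [{ title: 'First story', points: 120 }] }
}

// Consumes any async iterable of snapshots, reporting each one and returning the last.
async function lastSnapshot<T>(
  stream: AsyncIterable<T>,
  onUpdate: (snapshot: T) => void = () => {}
): Promise<T | undefined> {
  let latest: T | undefined
  for await (const snapshot of stream) {
    onUpdate(snapshot) // e.g. refresh a progress view with the partial data
    latest = snapshot
  }
  return latest
}
```

Because `lastSnapshot` accepts any `AsyncIterable`, the simulated generator can be swapped for a real stream without changing the consumer. Intermediate snapshots may be incomplete, so validation against the full schema belongs on the final value only.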

Contribution

LLM Scraper is an open-source project, welcoming contributions from the community. Users are encouraged to report bugs or suggest improvements through issues or pull requests.

In summary, LLM Scraper offers a comprehensive solution for data extraction from web pages, combining the power of LLMs with flexible provider support, robust schema definition, and efficient code generation and streaming capabilities.