x-crawl: A Comprehensive Guide
Introduction
x-crawl is an advanced Node.js library designed to simplify the intricacies of web crawling through artificial intelligence (AI) assistance. This library stands out due to its flexibility in usage and the power that AI brings to enhance the efficiency, intelligence, and convenience of web scraping tasks.
x-crawl comprises two main components:
- Crawler: A set of APIs and functionalities that can operate even without AI assistance.
- AI: Uses OpenAI's models to streamline complex crawling and parsing steps.
Key Features
- AI Assistance: The AI component significantly enhances the efficiency, intelligence, and convenience of the web crawling process.
- Flexible Writing: The library's crawling API adapts to various configurations, each offering unique benefits.
- Versatile Use Cases: Capable of crawling both dynamic and static pages, as well as accessing data from interfaces and files.
- Page Control: Supports automated actions on dynamic pages including keyboard inputs and event handling.
- Device Fingerprinting: Offers zero-configuration or customizable settings to prevent fingerprint tracking.
- Asynchronous & Synchronous Crawling: Allows for both synchronous and asynchronous operations without needing to switch APIs.
- Interval Crawling: Supports no interval, fixed intervals, or random intervals to handle high concurrency situations.
- Failed Retry: Allows for customizable retry attempts to prevent failure due to temporary issues.
- Rotation Proxy: Automatically rotates proxies with each retry, accommodating custom error counts and HTTP status codes.
- Priority Queue: Enables targeting and prioritizing specific crawling tasks.
- Crawling Information: Outputs color-coded logs to the terminal for easier monitoring of crawl progress.
- TypeScript Support: Provides full type support through generics, enhancing the development experience.
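To make the interval and retry features above concrete, here is a small standalone sketch of how random intervals and bounded retries can be combined. This is plain JavaScript, not x-crawl's actual internals; the helper names (`randomInterval`, `runWithRetry`) are made up for illustration.

```javascript
// Hypothetical sketch mirroring the "Interval Crawling" and "Failed Retry"
// features described above — not x-crawl's real implementation.
function randomInterval(min, max) {
  // Pick a delay between min and max milliseconds (inclusive).
  return min + Math.floor(Math.random() * (max - min + 1))
}

async function runWithRetry(task, { maxRetry = 3, interval = { min: 1000, max: 2000 } } = {}) {
  let lastError
  for (let attempt = 0; attempt <= maxRetry; attempt++) {
    try {
      return await task(attempt) // success: return immediately
    } catch (err) {
      lastError = err
      if (attempt === maxRetry) break // out of retries
      // Wait a random interval before the next attempt.
      await new Promise((resolve) =>
        setTimeout(resolve, randomInterval(interval.min, interval.max))
      )
    }
  }
  throw lastError
}
```

x-crawl exposes equivalent behavior declaratively through its `maxRetry` and `intervalTime` options, as shown in the example later in this guide.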
Sponsors
x-crawl is sponsored by:
- Capsolver: An AI-driven solution for seamless captcha solving, aiding in automatic web unblocking.
- 123proxy: Offers enterprise-grade HTTP proxy IPs, with free trials and cashback promotions.
AI-Assisted Crawling
The integration of AI makes x-crawl adept at handling the frequent website updates that challenge traditional crawlers, such as changes to the class names or DOM structures that conventional selector-based strategies depend on. By using AI to interpret the semantic structure of a web page, x-crawl can extract data more accurately and remain resilient when the markup changes.
For example, AI in x-crawl can extract image links from websites without being dependent on specific class names or structures. This capability is vital as it provides resilience against structural changes in web pages, making data extraction more reliable and intelligent.
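For contrast, consider what a traditional extraction approach looks like. The sketch below (plain JavaScript, independent of x-crawl) pulls image links out of HTML with a regular expression; because it is coupled to the exact markup, a renamed attribute or restructured tag silently breaks it, which is precisely the brittleness AI-based parsing avoids.

```javascript
// Naive, markup-coupled extraction: any change to how the page writes
// <img ... src="..."> breaks this, unlike semantic AI-based parsing.
function extractImageSrcs(html) {
  const srcs = []
  const re = /<img[^>]*\bsrc="([^"]+)"/g
  let match
  while ((match = re.exec(html)) !== null) {
    srcs.push(match[1])
  }
  // De-duplicate while preserving order.
  return [...new Set(srcs)]
}
```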
Example Usage
Below is an example demonstrating how x-crawl can be used in conjunction with AI to gather images of highly-rated vacation rentals from a website:
import { createCrawl, createCrawlOpenAI } from 'x-crawl'

// Initialize a crawler
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
})

// Initialize an AI application
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})

// Execute a page crawl
crawlApp.crawlPage('https://www.airbnb.cn/s/select_homes').then(async (res) => {
  const { page, browser } = res.data

  // Wait for and capture the HTML of the target element
  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
  await page.waitForSelector(targetSelector)
  const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)

  // Use AI to extract unique image links
  const srcResult = await crawlOpenAIApp.parseElements(
    highlyHTML,
    'Get the image links and deduplicate them'
  )

  await browser.close()

  // Download files using the crawled data
  crawlApp.crawlFile({
    targets: srcResult.elements.map((item) => item.src),
    storeDirs: './upload'
  })
})
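Since AI output is not guaranteed to be perfectly clean, it can be worth validating the returned links locally before downloading. A small defensive sketch (plain JavaScript, independent of x-crawl; the function name `sanitizeTargets` is made up for illustration):

```javascript
// Keep only well-formed http(s) URLs and drop duplicates before passing
// the list on to a file-download step.
function sanitizeTargets(elements) {
  const urls = elements
    .map((item) => item.src)
    .filter((src) => typeof src === 'string' && /^https?:\/\//.test(src))
  return [...new Set(urls)]
}
```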
By relying on AI's ability to understand and parse web pages, x-crawl provides a robust solution to handle dynamic web content, ensuring successful data collection even when website content changes frequently. This makes x-crawl an indispensable tool for developers looking for an effective and intelligent web scraping solution.