Introduction to Crawlee
Crawlee is a versatile web scraping and browser automation library designed to make building reliable scrapers fast and efficient. It lets developers crawl the web, collect data, and store it on disk or in the cloud. A standout feature is that its crawlers behave in a human-like way by default, which helps them get past modern bot detection.
Installation and Getting Started
System Requirements
Crawlee requires Node.js 16 or higher. A Python version of Crawlee is also available for early adopters.
Using Crawlee CLI
The easiest way to start using Crawlee is via its command-line interface (CLI). This tool will install all necessary dependencies and provide sample code to kickstart your project. Here’s how to do it:
npx crawlee create my-crawler
cd my-crawler
npm start
Manual Installation
For manual installation, especially if integrating Crawlee into an existing project, you can install it along with the Playwright library like so:
npm install crawlee playwright
You can then set up a basic PlaywrightCrawler that handles page requests asynchronously, extracts links, and stores the scraped data.
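Here is a minimal sketch of such a crawler, closely modeled on the examples in Crawlee's documentation; the start URL and request limit are arbitrary:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Called for every page the crawler visits.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save the extracted data to the default dataset (./storage/datasets/default).
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Enqueue links found on the page; by default this stays on the same hostname.
        await enqueueLinks();
    },
    // Keep the example short by capping the number of requests.
    maxRequestsPerCrawl: 20,
});

await crawler.run(['https://crawlee.dev']);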
Key Features of Crawlee
Crawlee offers an array of features tailored to web crawling and scraping:
- Unified Interface: Use a single interface for both HTTP and headless browser crawling.
- Persistent Queue: A persistent queue of URLs to crawl, supporting both breadth-first and depth-first strategies.
- Scalable Storage: Flexibility in storing data, whether tabular or file-based.
- Automatic Scaling: Scales crawling automatically based on available system resources.
- Proxy Support: Integrated proxy rotation and session management (see the sketch after this list).
- Custom Hooks: Lifecycle hooks that let developers customize crawler behavior.
- Command-Line Interface (CLI): Aids in project setup and management.
- Error Handling and Retries: Configurable request routing and automatic retries on errors.
- Ready-to-Deploy Dockerfiles: Useful for containerized deployments.
- TypeScript Compatibility: Written in TypeScript, ensuring robust type-checking and error detection.
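To show how proxy rotation and retries fit together, here is a rough sketch using CheerioCrawler; the proxy URLs are placeholders and the retry limit is arbitrary:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Requests will be rotated across these (placeholder) proxies.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestRetries: 5,   // retry failed requests up to five times
    useSessionPool: true,   // on by default; ties cookies and proxy sessions together
    async requestHandler({ request, $ }) {
        // $ is the Cheerio handle for the parsed HTML.
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://crawlee.dev']);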
HTTP Crawling Capabilities
- HTTP/2 Support: HTTP/2 works out of the box, including over proxies.
- Header Automation: Automatic browser-like header generation.
- TLS Fingerprinting: Replicates browser TLS fingerprints.
- HTML Parsing: Fast HTML parsing with Cheerio and JSDOM support.
- JSON API Scraping: Scrape JSON APIs directly, with no extra configuration (sketched below).
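A hedged sketch of JSON API scraping with HttpCrawler; the endpoint and the shape of its response are hypothetical, so the parsed body is simply stored as-is:

import { HttpCrawler, Dataset } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) {
        // body holds the raw response; parse it as JSON ourselves.
        const data = JSON.parse(body.toString());
        await Dataset.pushData({ url: request.url, data });
    },
});

// Hypothetical JSON endpoint used only for illustration.
await crawler.run(['https://api.example.com/v1/items']);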
Real Browser Crawling
- JavaScript Rendering: Capable of rendering JavaScript content and taking screenshots.
- Headless and Headful Modes: Supports both invisible (headless) and visible (headful) modes.
- Fingerprint Generation: Able to create human-like fingerprints with no additional settings required.
- Browser Management: Automatic management of browser processes.
- Multiple Browser Options: Works with both Playwright and Puppeteer, supporting Chrome, Firefox, and WebKit (see the sketch below).
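As an illustrative sketch, the following switches PlaywrightCrawler from its defaults (headless Chromium) to a visible Firefox window and saves a screenshot; the output filename is arbitrary:

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,                    // use Firefox instead of the default Chromium
        launchOptions: { headless: false },   // open a visible (headful) browser window
    },
    async requestHandler({ page }) {
        // Capture the fully rendered page as an image.
        await page.screenshot({ path: 'example.png', fullPage: true });
    },
});

await crawler.run(['https://crawlee.dev']);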
Usage on Apify Platform
Crawlee is open source and runs in any Node.js environment, but because it is developed by Apify, it also deploys easily to the Apify platform. For deployment details, see the Apify SDK website.
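A minimal sketch of what such a deployment can look like, assuming the apify package is installed alongside crawlee; Actor.init() and Actor.exit() wrap the crawler so it uses platform storages and events when running on Apify:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        log.info(`Visited ${request.url}, title: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);

// Gracefully shut down and flush storages.
await Actor.exit();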
Support and Community
Developers encountering bugs or issues are encouraged to raise them on Crawlee’s GitHub repository. For questions, the community can engage through Stack Overflow, GitHub Discussions, or join the Discord server dedicated to Crawlee.
Contributing
Crawlee welcomes code contributions and ideas for improvements. Contributors can submit issues or pull requests and are invited to review the contribution guidelines and code of conduct available on GitHub.
License
Crawlee is distributed under the Apache License 2.0. For more specifics, users can refer to the LICENSE.md file on Crawlee’s GitHub repository.
With this feature set, Crawlee simplifies web scraping and browser automation, making it a practical tool for data collection and analysis across projects of varying complexity.