crawl4ai - Optimize your web data extraction processes using AI-driven web crawling solutions

🔥🕷️ Crawl4AI: A User-Friendly Web Crawler & Scraper Tailored for AI

Crawl4AI is an open-source tool for asynchronous web crawling and data extraction, built so that its output can be fed directly to large language models (LLMs) and AI-driven applications. It aims for both simplicity and efficiency, making web scraping accessible to developers and AI enthusiasts alike.

🌟 Introducing the Crawl4AI Assistant

The Crawl4AI Assistant acts as your AI-powered copilot, guiding you through complex crawling and extraction tasks. With its support, users can:

Effortlessly generate code for complicated web tasks.
Access customized support and examples.
Learn the ropes with detailed, step-by-step instructions.

Latest Updates in Version 0.3.72 ✨

Crawl4AI recently unveiled several exciting features:

Markdown generation for quick extraction of main article content.
An advanced "Magic" mode that bypasses sophisticated anti-bot systems.
Enhanced support for seamless switching among multiple browsers, such as Chromium, Firefox, and WebKit.
Innovative chunking strategies for data processing.
A more efficient caching system to boost performance.
Optimized batch processing with automatic rate limiting.
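
The update list mentions chunking strategies without describing them. As a rough illustration of the general idea, here is a minimal sketch of one common approach, a sliding window over words with some overlap between neighbouring chunks. The function name `chunk_words` and its `size` and `overlap` parameters are hypothetical and are not part of the Crawl4AI API.

```python
def chunk_words(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into word windows of `size` words that overlap by `overlap` words.

    Illustrative helper only; not a Crawl4AI function.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the current window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to matter when each chunk is passed to an LLM separately.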

Explore Crawl4AI Today!

Dive into the world of Crawl4AI with our Colab Notebook and detailed Documentation.

Key Features ✨

Free and Open-Source: Completely open-source, allowing for flexible use and modification.
High Performance: Executes tasks rapidly, often outpacing commercial alternatives.
Multiple Output Formats: Supports JSON, cleaned HTML, and Markdown outputs, making it friendly for LLM integration.
Browser Compatibility: Operates across different browsers seamlessly.
Concurrent URL Processing: Capable of handling multiple URLs simultaneously.
Media and Link Extraction: Efficiently extracts various media types and links.
Data Extraction and Customization: Offers metadata extraction, user-agent customization, and complex JavaScript execution.
Proxy and Session Management: Supports proxy usage with authentication and manages multi-page crawling sessions effortlessly.
Enhanced Image and Content Handling: Features improved image processing and a sophisticated system for handling delayed content loading.
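
To make the idea of concurrent URL processing concrete, the following sketch shows a generic bounded-concurrency pattern in plain `asyncio`. It is not Crawl4AI's internal implementation: `fake_fetch` is a stand-in that sleeps instead of touching the network, and `crawl_all` and `max_concurrency` are names chosen for this example.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl call; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def crawl_all(urls, fetch=fake_fetch, max_concurrency: int = 3):
    """Fetch every URL concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # The semaphore blocks here once the concurrency limit is reached.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(crawl_all([f"https://example.com/{i}" for i in range(5)]))
    print(len(pages))
```

In a real crawl, a coroutine that wraps the crawler's own fetch call could be passed as `fetch`; the semaphore then acts as a simple cap on how many pages are loaded at once.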

Installation 🛠️

Crawl4AI offers multiple installation paths, depending on user preference and project needs:

Using pip for Python:
- Basic asynchronous installation:
```
pip install crawl4ai
```
- If the browsers were not set up automatically, install the Playwright browser binaries:
```
playwright install
```
- Synchronous version via Selenium (the quotes keep shells such as zsh from interpreting the brackets):
```
pip install "crawl4ai[sync]"
```

For Developers:

```
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```

A Docker version will soon be available for containerized environments.

Quick Start and Advanced Usage 🚀

Initiate your first crawl using the following Python script:

```
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager starts the browser and closes it on exit.
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Crawl a single page and print its content as Markdown.
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

For advanced scenarios, such as executing JavaScript on the page, narrowing extraction with CSS selectors, or routing requests through proxies, Crawl4AI exposes additional configuration options.
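
As a rough sketch of how these options might be combined, the example below passes JavaScript snippets and a CSS selector to `arun` and a proxy to the crawler constructor. The parameter names `js_code`, `css_selector`, and `proxy` are assumptions based on the 0.3.x releases and should be checked against the documentation for your installed version. The helpers `build_arun_kwargs` and `crawl` are hypothetical names introduced only for this example.

```python
import asyncio


def build_arun_kwargs(url, js_snippets=None, css_selector=None):
    """Collect per-crawl options into keyword arguments, skipping unset ones.

    `js_code` and `css_selector` are assumed parameter names for `arun`.
    """
    kwargs = {"url": url}
    if js_snippets:
        kwargs["js_code"] = list(js_snippets)
    if css_selector:
        kwargs["css_selector"] = css_selector
    return kwargs


async def crawl(url, proxy=None, **options):
    """Run one crawl with optional proxy, JavaScript, and CSS selector settings."""
    # Imported lazily so build_arun_kwargs works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    crawler_kwargs = {"verbose": True}
    if proxy:
        crawler_kwargs["proxy"] = proxy  # assumed constructor parameter
    async with AsyncWebCrawler(**crawler_kwargs) as crawler:
        return await crawler.arun(**build_arun_kwargs(url, **options))
```

A call such as `asyncio.run(crawl("https://www.nbcnews.com/business", js_snippets=["window.scrollTo(0, document.body.scrollHeight);"], css_selector="article"))` would then scroll the page before extraction and limit the output to the matched elements, provided the assumed parameters behave as described.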

Speed and Performance Comparison 🚀

Crawl4AI is built for speed and aims to outperform many paid services. In tests against Firecrawl, a competing service, Crawl4AI ran faster and extracted more content.

Documentation and Contribution 📚

For comprehensive details on using Crawl4AI, visit our Documentation Website. Contributions from developers are warmly welcomed. For guidelines, see our contribution page.

License and Contact 📄

Crawl4AI is available under the Apache 2.0 License. For more information or to share feedback, reach out via:

GitHub: unclecode
Twitter: @unclecode
Website: crawl4ai.com

Happy Crawling! 🕸️🚀