Crawl4AI: A User-Friendly Web Crawler & Scraper Tailored for AI
Crawl4AI is an open-source tool for asynchronous web crawling and data extraction, with output designed for use with large language models (LLMs) and AI-driven applications. It aims to keep web scraping simple and efficient for developers and AI enthusiasts alike.
Introducing the Crawl4AI Assistant
The Crawl4AI Assistant acts as your AI-powered copilot, guiding you through complex crawling and extraction tasks. With its support, users can:
- Effortlessly generate code for complicated web tasks.
- Access customized support and examples.
- Learn the ropes with detailed, step-by-step instructions.
Latest Updates in Version 0.3.72
Version 0.3.72 introduces several new features:
- Markdown generation for quick extraction of main article content.
- An advanced "Magic" mode that bypasses sophisticated anti-bot systems.
- Enhanced support for seamless switching among multiple browsers, such as Chromium, Firefox, and WebKit.
- Innovative chunking strategies for data processing.
- A more efficient caching system to boost performance.
- Optimized batch processing with automatic rate limiting.
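To make the chunking idea concrete, here is a minimal sketch of one common strategy: a sliding window over words with overlap between consecutive chunks. It is an illustration only and not Crawl4AI's own implementation; the function name `sliding_window_chunks` and its parameters are hypothetical.

```python
def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most `window_size` words.

    Hypothetical helper for illustration; not part of the Crawl4AI API.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")

    words = text.split()
    if not words:
        return []

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once a window has reached the end of the text.
        if start + window_size >= len(words):
            break
    return chunks


chunks = sliding_window_chunks("a b c d e f g", window_size=4, step=2)
print(chunks)
```

With `window_size=4` and `step=2`, consecutive chunks share two words, so content that straddles a chunk boundary appears in both neighbors.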
Explore Crawl4AI Today!
Dive into the world of Crawl4AI with our Colab Notebook and detailed Documentation.
Key Features
- Free and Open-Source: Completely open-source, allowing for flexible use and modification.
- High Performance: Executes tasks rapidly, often outpacing commercial alternatives.
- Multiple Output Formats: Supports JSON, cleaned HTML, and Markdown outputs, making it friendly for LLM integration.
- Browser Compatibility: Operates across different browsers seamlessly.
- Concurrent URL Processing: Capable of handling multiple URLs simultaneously.
- Media and Link Extraction: Efficiently extracts various media types and links.
- Data Extraction and Customization: Offers metadata extraction, user-agent customization, and complex JavaScript execution.
- Proxy and Session Management: Supports proxy usage with authentication and manages multi-page crawling sessions effortlessly.
- Enhanced Image and Content Handling: Features improved image processing and a sophisticated system for handling delayed content loading.
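As a rough illustration of concurrent URL processing, the sketch below runs several fetches at once while capping how many are in flight. It uses only the standard library; `fake_fetch` is a hypothetical stand-in for a real crawl call and performs no network I/O, and `crawl_many` is likewise an invented name, not a Crawl4AI function.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real crawl call; does no network I/O."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def crawl_many(urls, max_concurrency: int = 3):
    """Fetch all URLs concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
results = asyncio.run(crawl_many(urls))
print(len(results), "pages fetched")
```

Bounding concurrency with a semaphore is one simple way to avoid overwhelming a target site; in a real crawl, `fake_fetch` would be replaced by an actual crawler call.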
Installation
Crawl4AI offers multiple installation paths, depending on user preference and project needs:
- Using pip for Python:
  - Basic asynchronous installation:
        pip install crawl4ai
  - If needed, install Playwright using:
        playwright install
- Synchronous Version via Selenium:
      pip install "crawl4ai[sync]"
- For Developers:
      git clone https://github.com/unclecode/crawl4ai.git
      cd crawl4ai
      pip install -e .
A Docker version will soon be available for containerized environments.
Quick Start and Advanced Usage
Initiate your first crawl using the following Python script:

    import asyncio
    from crawl4ai import AsyncWebCrawler

    async def main():
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(url="https://www.nbcnews.com/business")
            print(result.markdown)

    if __name__ == "__main__":
        asyncio.run(main())

For advanced crawling scenarios, such as executing JavaScript, using CSS selectors, or employing proxies, Crawl4AI offers extensive configurability and customization options.
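A sketch of how these options might be combined is shown here. The keyword names `js_code`, `css_selector`, `bypass_cache`, and `proxy` are assumptions recalled from the project's 0.3.x README and should be checked against the current documentation; the selector, the proxy address, and the helper `advanced_arun_kwargs` are placeholders invented for illustration.

```python
import asyncio

# JavaScript to run in the page before extraction: clicks a "Load More" button if one exists.
JS_CODE = [
    "const btn = Array.from(document.querySelectorAll('button'))"
    ".find(b => b.textContent.includes('Load More')); btn && btn.click();"
]


def advanced_arun_kwargs(url: str) -> dict:
    """Collect per-crawl options (hypothetical helper; keyword names are assumptions)."""
    return {
        "url": url,
        "js_code": JS_CODE,         # assumed: list of JS snippets to execute in the page
        "css_selector": "article",  # assumed: restrict extraction to matching elements
        "bypass_cache": True,       # assumed: skip the cache and fetch a fresh copy
    }


async def main() -> None:
    # Imported here so the helper above can be used without the package installed.
    from crawl4ai import AsyncWebCrawler

    # `proxy` is an assumed constructor argument; the address is a placeholder.
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(**advanced_arun_kwargs("https://www.nbcnews.com/business"))
        print(result.markdown)


# To run (requires crawl4ai, a reachable proxy, and network access):
# asyncio.run(main())
```

Keeping the options in one dictionary makes it easy to reuse the same settings across several `arun` calls.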
Speed and Performance Comparison
Crawl4AI is built with speed in mind and reports better performance than many paid services. In a test against Firecrawl, a competing service, Crawl4AI showed faster execution and stronger data extraction.
Documentation and Contribution
For comprehensive details on using Crawl4AI, visit our Documentation Website. Contributions from developers are warmly welcomed. For guidelines, see our contribution page.
License and Contact
Crawl4AI is available under the Apache 2.0 License. For more information or to share feedback, reach out via:
- GitHub: unclecode
- Twitter: @unclecode
- Website: crawl4ai.com
Happy Crawling!