search-result-scraper-markdown - Utilize AI-Driven Web Scraping for Markdown Output with FastAPI

Introduction to Search Result Scraper with Markdown Output

Description

The Search Result Scraper extracts web search results and converts them to Markdown. It combines FastAPI, SearXNG, and Browserless to give developers concise, formatted search data over an HTTP API. An AI integration adds a filtering step so that the most relevant results are highlighted. The project is an alternative to tools such as Jina.ai, FireCrawl AI, Exa AI, and 2markdown.

Features

FastAPI: Serves the HTTP API.
SearXNG: A metasearch engine that aggregates results from multiple search engines.
Browserless: Provides browser automation for scraping web pages.
Markdown Output: Converts HTML content to Markdown.
Proxy Support: Routes scraping requests through a proxy for anonymity.
AI Integration: Improves search result relevance through AI filtering.
YouTube Transcriptions: Retrieves transcripts of YouTube videos.
Image and Video Search: Searches for images and videos via SearXNG.
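
To make the Markdown Output feature concrete, the following is a minimal sketch of an HTML-to-Markdown conversion using only Python's standard library. It is not the project's converter: the class and function names are hypothetical, and it handles only headings, paragraphs, links, and list items.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Convert a small subset of HTML (h1-h3, p, a, li) to Markdown."""

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.href = None  # target of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n")
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    """Hypothetical helper: feed HTML through the sketch parser."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

A real converter also has to deal with scripts, styles, tables, nested lists, and whitespace normalization, which this sketch ignores.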

Installation Prerequisites

To set up this project, ensure the following are installed:

Python 3.11 or later
Virtualenv for environment management
Docker for container-based deployment

Setup Instructions

Docker Setup

For a simplified setup using Docker:

Clone the repository:

```
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
```

Build and run using Docker Compose:
```
docker compose up --build
```

Manual Setup

For a manual setup without Docker:

Clone the repository:

```
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
```

Create and activate a virtual environment:

```
virtualenv venv
source venv/bin/activate
```

Install dependencies:
```
pip install -r requirements.txt
```

Configure environment variables in a .env file at the repository root.

Run the Docker services for SearXNG and Browserless:
```
./run-services.sh
```

Start the application:

```
uvicorn main:app --host 0.0.0.0 --port 8000
```
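
The configuration step above does not list the variables the application expects, so consult the repository for the actual names. As a rough illustration of the .env format itself, here is a small sketch that parses KEY=VALUE lines. The function name is hypothetical, and the keys in the usage note are placeholders, not the project's settings.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
```

For example, `parse_env("EXAMPLE_URL=http://localhost:8080\nEXAMPLE_TOKEN='abc'")` returns a dictionary with the two placeholder keys and their unquoted values.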

Usage

Performing a Search

To query search results:

```
curl "http://localhost:8000/?q=python&num_results=5&format=json" # JSON format
curl "http://localhost:8000/?q=python&num_results=5" # Default Markdown format
```
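
The same requests can be made from Python. The sketch below uses only the standard library and assumes the service is running at http://localhost:8000 as in the curl examples. The function names and the `endpoint` parameter are hypothetical conveniences, not part of the project.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(query, num_results=5, fmt=None, endpoint="",
                     base="http://localhost:8000"):
    """Build a query URL; leave fmt as None for the default Markdown."""
    params = {"q": query, "num_results": num_results}
    if fmt is not None:
        params["format"] = fmt
    # urlencode escapes spaces and special characters in the query.
    return f"{base}/{endpoint}?{urlencode(params)}"


def search(query, num_results=5, fmt=None, endpoint="", timeout=30):
    """Send the request and return the response body as text."""
    url = build_search_url(query, num_results, fmt, endpoint)
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `search("python", fmt="json")` mirrors the first curl command. Passing `endpoint="images"` or `endpoint="videos"` targets the image and video endpoints described later in this document, which take the same `q` and `num_results` parameters.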

Fetching URL Content

To convert a webpage to Markdown:

```
curl "http://localhost:8000/r/https://example.com?format=json" # JSON format
curl "http://localhost:8000/r/https://example.com" # Default Markdown
```
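
When the target URL carries its own query string, its `?` and `&` characters can be mistaken for parameters of the scraper API. The sketch below shows one way to build the request URL safely. It assumes that `format` is passed as an ordinary query parameter after `?` and that the server percent-decodes the path, which is standard for ASGI servers but not confirmed here for this project. The function name is hypothetical.

```python
from urllib.parse import quote


def build_fetch_url(target, fmt=None, base="http://localhost:8000"):
    """Build a /r/ URL for the page at `target`."""
    # Percent-encode "?", "&", "=" and "#" in the target so they are not
    # read as part of the API's own query string; keep ":" and "/" readable.
    url = f"{base}/r/{quote(target, safe=':/')}"
    if fmt is not None:
        url += f"?format={fmt}"
    return url
```

For a plain target such as https://example.com the result is identical to the curl examples; encoding only changes URLs that contain reserved characters.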

Fetching Images and Videos

For images:

```
curl "http://localhost:8000/images?q=puppies&num_results=5"
```

For videos:

```
curl "http://localhost:8000/videos?q=cooking+recipes&num_results=5"
```

Using Proxies

Scraping requests can be routed through a proxy service to improve privacy. Proxy support is provided in partnership with providers such as Geonode.
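
How the project wires its proxy settings is not described here. As a general illustration of routing outbound HTTP requests through a proxy, the sketch below uses Python's standard library. The function name and the proxy URL in the usage note are placeholders.

```python
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)
```

An opener created with `make_proxy_opener("http://user:pass@proxy.example:8080")` can then fetch pages with `opener.open(url, timeout=30)`, and every request it makes goes through the given proxy.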

Roadmap

Key milestones include establishing strong AI integration and enhancing media content extraction capabilities.

License

Distributed under the MIT License, encouraging open collaboration.

Contributions

Developers are invited to contribute by submitting Pull Requests for continuous improvement.

Additional Resources

For more details, visit the project repository and follow updates through the star history chart.