Introduction to Search Result Scraper with Markdown Output
Description
The Search Result Scraper extracts web search results and converts them to Markdown. It combines FastAPI, SearXNG, and Browserless to give developers concise, formatted search data, and its AI integration filters results so that the most relevant information is highlighted. The project is an alternative to tools such as Jina.ai, FireCrawl AI, Exa AI, and 2markdown.
Features
- FastAPI: Ensures fast and easy API development.
- SearXNG: A metasearch engine that aggregates results from multiple search engines.
- Browserless: Facilitates browser automation for efficient web scraping.
- Markdown Output: Transforms HTML content into Markdown seamlessly.
- Proxy Support: Provides anonymous and secure web scraping access.
- AI Integration: Improves search result relevance through AI filtering.
- YouTube Transcriptions: Retrieves text from YouTube videos for easy access.
- Image and Video Search: Offers visual content search capabilities via SearXNG.
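To make the Markdown Output feature concrete, here is a minimal sketch of an HTML-to-Markdown conversion using only the Python standard library. It is not the project's converter, which this document does not describe; the class name MiniMarkdown and the function html_to_markdown are hypothetical, and only headings, paragraphs, links, and list items are handled.

```python
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Tiny HTML-to-Markdown converter (illustrative only, not the project's code)."""

    def __init__(self):
        super().__init__()
        self.out = []     # collected Markdown fragments
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h2> becomes "## ", and so on.
            self.out.append("#" * int(tag[1]) + " ")
        elif tag == "li":
            self.out.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n\n")  # blank line after block elements
        elif tag == "li":
            self.out.append("\n")
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.out.append(data)


def html_to_markdown(html):
    """Convert a small HTML fragment to Markdown text."""
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

A real page needs far more than this (nested lists, images, tables, scripts to skip), which is why a service like this one is useful in the first place.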
Installation Prerequisites
To set up this project, ensure the following are installed:
- Python 3.11 or later
- Virtualenv for environment management
- Docker for container-based deployment
Setup Instructions
Docker Setup
For a simplified setup using Docker:
- Clone the repository:
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
- Build and run using Docker Compose:
docker compose up --build
Manual Setup
For a manual setup without Docker:
- Clone the repository:
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
- Create and activate a virtual environment:
virtualenv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Configure environment variables in a .env file at the root.
- Run Docker services for SearXNG and Browserless:
./run-services.sh
- Start the application:
uvicorn main:app --host 0.0.0.0 --port 8000
Usage
Performing a Search
To query search results:
curl "http://localhost:8000/?q=python&num_results=5&format=json" # JSON format
curl "http://localhost:8000/?q=python&num_results=5" # Default Markdown format
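The same search request can be issued from Python. The sketch below uses only the standard library and the q, num_results, and format parameters from the curl commands; the helper names build_search_url and search are hypothetical, and the base URL assumes the service is running locally on port 8000.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: local instance on port 8000


def build_search_url(query, num_results=5, fmt=None, base_url=BASE_URL):
    """Build the search URL from the q, num_results and format parameters."""
    params = {"q": query, "num_results": num_results}
    if fmt is not None:
        params["format"] = fmt  # e.g. "json"; omit for the default Markdown
    return f"{base_url}/?{urlencode(params)}"


def search(query, num_results=5, fmt=None, timeout=30):
    """Send the request and return the response body as text.

    Requires a running server; the response structure is not assumed here.
    """
    url = build_search_url(query, num_results, fmt)
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling search("python", fmt="json") should return the same body as the first curl command, provided the server is up.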
Fetching URL Content
To convert a webpage to Markdown:
curl "http://localhost:8000/r/https://example.com?format=json" # JSON format
curl "http://localhost:8000/r/https://example.com" # Default Markdown
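A page conversion can also be scripted. This is a rough sketch that fetches the default Markdown for a page and writes it to a file. The names build_reader_url and save_page_as_markdown are hypothetical, and the code assumes that format is read as an ordinary query parameter appended with "?" and that the target URL is placed unencoded after /r/, as in the curl commands.

```python
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: local instance on port 8000


def build_reader_url(target_url, fmt=None, base_url=BASE_URL):
    """Place the target URL after /r/, optionally requesting a format such as "json"."""
    url = f"{base_url}/r/{target_url}"
    if fmt is not None:
        # Assumption: `format` is read as a normal query parameter.
        url += f"?format={fmt}"
    return url


def save_page_as_markdown(target_url, out_path, timeout=60):
    """Fetch the default Markdown for a page and write it to out_path.

    Requires a running server.
    """
    with urlopen(build_reader_url(target_url), timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return out_path
```

Target URLs that carry their own query string may need extra care, since it is not clear from this document how the service separates them from its own parameters.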
Fetching Images and Videos
For images:
curl "http://localhost:8000/images?q=puppies&num_results=5"
For videos:
curl "http://localhost:8000/videos?q=cooking+recipes&num_results=5"
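Because the image and video endpoints take the same parameters, one helper can cover both. The sketch below is illustrative: build_media_url and media_search are hypothetical names, and only the /images and /videos paths shown in the curl commands are accepted.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: local instance on port 8000
MEDIA_ENDPOINTS = ("images", "videos")  # the two paths documented above


def build_media_url(kind, query, num_results=5, base_url=BASE_URL):
    """Build a URL for the /images or /videos endpoint."""
    if kind not in MEDIA_ENDPOINTS:
        raise ValueError(f"kind must be one of {MEDIA_ENDPOINTS}, got {kind!r}")
    params = urlencode({"q": query, "num_results": num_results})
    return f"{base_url}/{kind}?{params}"


def media_search(kind, query, num_results=5, timeout=30):
    """Send the request and return the raw response text (needs a running server)."""
    with urlopen(build_media_url(kind, query, num_results), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, build_media_url("videos", "cooking recipes") produces the URL used in the video curl command.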
Using Proxies
To improve privacy, route scraping requests through a proxy service such as Geonode, a partner of the project.
Roadmap
Key milestones include establishing strong AI integration and enhancing media content extraction capabilities.
License
Distributed under the MIT License, encouraging open collaboration.
Contributions
Developers are invited to contribute by submitting Pull Requests for continuous improvement.
Additional Resources
For more details, visit the project repository and follow updates through the star history chart.