AutoScraper: A Smart, Automatic, Fast, and Lightweight Web Scraper for Python
AutoScraper makes web scraping easy. You give it a URL or the HTML content of a page, plus a list of sample data you want to extract from that page. The sample data can be text, a URL, or any HTML tag value. AutoScraper learns the scraping rules from these samples and returns similar elements from the page. Once trained, the same object can be used on new URLs to get similar content or the exact same elements from those pages.
Installation
AutoScraper is compatible with Python 3. Install it in one of the following ways:
- From the Git repository using pip:
$ pip install git+https://github.com/alirezamika/autoscraper.git
- Directly from PyPI:
$ pip install autoscraper
- From a local copy of the source code (run in the repository root):
$ python setup.py install
How to Use
Getting Similar Results
AutoScraper can fetch entries that are similar to a sample you provide. For example, to extract the related post titles from a Stack Overflow page:
from autoscraper import AutoScraper
url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
wanted_list = ["What are metaclasses in Python?"]
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
This outputs a list of similar post titles from the given page.
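Once the scraper has been built, the learned rules can be applied to other pages with get_result_similar. The following is a minimal sketch rather than the project's own example: the helper name similar_results is hypothetical, and it assumes it is given a scraper that has already been trained with build().

```python
def similar_results(scraper, urls):
    """Apply an already-trained scraper to several pages.

    `scraper` is assumed to be an AutoScraper instance on which build()
    has been called. Returns a dict mapping each URL to the list of
    similar items found on that page.
    """
    results = {}
    for url in urls:
        # get_result_similar reuses the rules learned during build()
        results[url] = scraper.get_result_similar(url)
    return results


# Usage (needs network access and the autoscraper package):
# from autoscraper import AutoScraper
# scraper = AutoScraper()
# scraper.build(url, wanted_list)
# print(similar_results(scraper, ['https://stackoverflow.com/questions/606191/convert-bytes-to-a-string']))
```

The second Stack Overflow URL in the usage comment is only an illustration; any page with the same layout as the training page should work.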
Getting Exact Results
You can also extract exact data points, such as live stock prices from Yahoo Finance:
from autoscraper import AutoScraper
url = 'https://finance.yahoo.com/quote/AAPL/'
wanted_list = ["124.81"]
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
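To read the same data point from other pages with the same layout, a trained scraper offers get_result_exact. The sketch below is an illustration under assumptions: the helper name prices_for is hypothetical, the URL pattern is copied from the example above, and the ticker symbols in the usage comment are arbitrary.

```python
def prices_for(scraper, symbols):
    """Fetch the exact learned element for each ticker symbol.

    `scraper` is assumed to be an AutoScraper instance already trained
    on a Yahoo Finance quote page. Returns a dict mapping each symbol
    to whatever get_result_exact returns for its quote page.
    """
    prices = {}
    for symbol in symbols:
        # Same URL shape as the training page, with a different symbol
        url = f"https://finance.yahoo.com/quote/{symbol}/"
        prices[symbol] = scraper.get_result_exact(url)
    return prices


# Usage (needs network access and the autoscraper package):
# print(prices_for(scraper, ["MSFT", "GOOG"]))
```

Because the sample value in wanted_list is a live price, it has to match what the page shows at the moment the scraper is built.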
For more control over the HTTP request, you can pass custom requests module parameters, such as proxies or headers, through the request_args argument of build.
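As a rough sketch of how proxies and headers might be passed along, the helper below collects them into a dict and hands it to build() as request_args. The helper name build_with_request_args is hypothetical, and the proxy address and User-Agent string in the usage comment are placeholders, not working values.

```python
def build_with_request_args(scraper, url, wanted_list, proxies=None, headers=None):
    """Train `scraper` while forwarding extra options to the requests module.

    `scraper` is assumed to be an AutoScraper instance. Only the options
    that were actually supplied are forwarded.
    """
    request_args = {}
    if proxies:
        request_args["proxies"] = proxies
    if headers:
        request_args["headers"] = headers
    # build() passes request_args on to the underlying HTTP request
    return scraper.build(url, wanted_list=wanted_list, request_args=request_args)


# Usage (placeholder values, needs network access and the autoscraper package):
# proxies = {"http": "http://127.0.0.1:8001", "https": "https://127.0.0.1:8001"}
# headers = {"User-Agent": "my-scraper/0.1"}
# result = build_with_request_args(scraper, url, wanted_list, proxies, headers)
```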
Additional Examples
AutoScraper can be used for various purposes, such as scraping GitHub repo details like the about text, star count, and issues link:
from autoscraper import AutoScraper
url = 'https://github.com/alirezamika/autoscraper'
wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']
scraper = AutoScraper()
scraper.build(url, wanted_list)
Saving the Model
You can save the trained AutoScraper model for future use:
scraper.save('yahoo-finance')
To load the saved model:
scraper.load('yahoo-finance')
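One possible way to combine save and load is to train only when no saved model exists yet. This is a sketch of that pattern, not something the project prescribes: the helper name load_or_train is hypothetical, and it assumes save() writes a file at exactly the path it is given.

```python
import os


def load_or_train(scraper, model_path, url, wanted_list):
    """Reuse a saved model if one exists, otherwise train and save it.

    `scraper` is assumed to be a fresh AutoScraper instance. Returns the
    same object, ready for get_result_similar or get_result_exact.
    """
    if os.path.exists(model_path):
        # A model was saved earlier: skip the network request and training
        scraper.load(model_path)
    else:
        scraper.build(url, wanted_list=wanted_list)
        scraper.save(model_path)
    return scraper


# Usage (needs network access on the first run and the autoscraper package):
# from autoscraper import AutoScraper
# scraper = load_or_train(AutoScraper(), 'yahoo-finance', url, wanted_list)
```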
Tutorials and Further Learning
For more advanced usage, you can refer to this gist and a helpful article on integrating AutoScraper with Flask: Create an API from Any Website in Less Than 5 Minutes.
Issues and Support
If you run into problems while using AutoScraper, feel free to open an issue.
If you'd like to support the project, consider buying the developer a coffee.
Happy Coding! ❤️