autoscraper - Streamline Web Data Collection with AutoScraper for Python

AutoScraper: A Smart, Automatic, Fast, and Lightweight Web Scraper for Python

AutoScraper makes web scraping easy. You give it the URL or HTML content of a page together with a list of sample data you want to extract from that page. The samples can be text, URLs, or any HTML tag value. AutoScraper learns the scraping rules from those samples and returns similar elements from the page. Once trained, the same object can be used on new URLs to extract similar or exact elements from those pages.

Installation

AutoScraper is compatible with Python 3. Install it using one of the following methods:

From the Git repository using pip:

```
$ pip install git+https://github.com/alirezamika/autoscraper.git
```

Directly from PyPI:
```
$ pip install autoscraper
```
From the source code:
```
$ python setup.py install
```

How to Use

Getting Similar Results

Suppose we want to fetch all related post titles on a Stack Overflow page. Pass one sample title, and AutoScraper returns the elements on the page that follow the same pattern:

```
from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
```

This outputs a list of similar post titles from the given page.
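
Once trained, the same scraper object can be pointed at other pages. The sketch below assumes the `get_result_similar` method of the trained `AutoScraper` object. The `collect_similar` helper, the `demo` function, and the second question URL are illustrative names chosen here, not part of the library, and the actual output depends on what the live pages contain.

```python
def collect_similar(scraper, urls):
    """Map each URL to the elements a trained scraper finds on that page.

    `scraper` is expected to be an AutoScraper instance on which build()
    has already been called. This helper is hypothetical, not library code.
    """
    return {url: scraper.get_result_similar(url) for url in urls}


def demo():
    # Imported inside the function so the helper above has no hard dependency.
    from autoscraper import AutoScraper

    scraper = AutoScraper()
    scraper.build(
        'https://stackoverflow.com/questions/2081586/web-scraping-with-python',
        ["What are metaclasses in Python?"],
    )
    # Illustrative second page; any Stack Overflow question URL could be used.
    pages = ['https://stackoverflow.com/questions/606191/convert-bytes-to-a-string']
    for url, titles in collect_similar(scraper, pages).items():
        print(url, titles)
```

Calling `demo()` performs live HTTP requests, so it is left uncalled here.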

Getting Exact Results

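You can also extract exact data points, such as the live stock price from Yahoo Finance. The sample value in wanted_list must match what the page currently shows, so update it before running this code: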

```
from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'
wanted_list = ["124.81"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
```

You can also pass any custom parameter of the requests module, such as proxies or headers, through the request_args argument of build.
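
As a minimal sketch of passing custom request parameters, the helper below assembles a dictionary of `requests` keyword arguments and hands it to `build` through `request_args`. The names `make_request_args` and `build_with_options` are hypothetical helpers introduced for illustration, and the proxy address and user agent in the usage note are placeholders.

```python
def make_request_args(proxy=None, user_agent=None, timeout=None):
    """Assemble keyword arguments for the underlying requests call.

    Only the options that were actually supplied end up in the result.
    This is a hypothetical helper, not part of AutoScraper.
    """
    args = {}
    if proxy:
        # requests expects one proxy URL per scheme.
        args['proxies'] = {'http': proxy, 'https': proxy}
    if user_agent:
        args['headers'] = {'User-Agent': user_agent}
    if timeout is not None:
        args['timeout'] = timeout
    return args


def build_with_options(url, wanted_list, **options):
    """Train a new scraper, forwarding `options` as request parameters."""
    # Imported here so make_request_args can be used without the dependency.
    from autoscraper import AutoScraper

    scraper = AutoScraper()
    result = scraper.build(
        url, wanted_list, request_args=make_request_args(**options)
    )
    return scraper, result
```

For example, `build_with_options(url, wanted_list, proxy='http://127.0.0.1:8080', user_agent='my-bot/0.1')` would route the training request through a local proxy, assuming one is listening at that address.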

Additional Examples

As another example, the following scrapes the about text, star count, and issues link of a GitHub repository. As before, the sample values must match what the page currently shows:

```
from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'
wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)
```
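
A scraper trained this way can then be applied to other repositories with `get_result_exact`. The sketch below pairs the returned values with field names. It assumes the results come back in the same order as the samples in `wanted_list`, which is worth verifying on your own pages. The `label_results` helper, the field names, and the second repository URL are illustrative only.

```python
def label_results(labels, values):
    """Pair field names with scraped values.

    Assumes `values` is ordered like `labels`; raises if the lengths differ
    so a missing or extra element is not silently mislabeled.
    """
    if len(labels) != len(values):
        raise ValueError(
            f"expected {len(labels)} values, got {len(values)}"
        )
    return dict(zip(labels, values))


def demo():
    # Imported inside the function so label_results has no hard dependency.
    from autoscraper import AutoScraper

    scraper = AutoScraper()
    scraper.build(
        'https://github.com/alirezamika/autoscraper',
        [
            'A Smart, Automatic, Fast and Lightweight Web Scraper for Python',
            '6.2k',  # must match the star count currently shown on the page
            'https://github.com/alirezamika/autoscraper/issues',
        ],
    )
    # Illustrative target; any repository page with the same layout could be used.
    values = scraper.get_result_exact('https://github.com/django/django')
    print(label_results(['about', 'stars', 'issues'], values))
```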

Saving the Model

You can save the trained AutoScraper model for future use:

```
scraper.save('yahoo-finance')
```

To load the saved model:

```
scraper.load('yahoo-finance')
```
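
Saving and loading combine naturally into a "train once, reuse later" pattern. The following is a minimal sketch under that assumption: `load_or_train` is a hypothetical helper, not part of AutoScraper, and it relies only on the `build`, `save`, and `load` methods shown above.

```python
import os


def load_or_train(scraper, path, url, wanted_list):
    """Load saved rules from `path` if the file exists; otherwise train and save.

    Returns 'loaded' or 'trained' so the caller can tell which branch ran.
    """
    if os.path.exists(path):
        scraper.load(path)
        return 'loaded'
    scraper.build(url, wanted_list)
    scraper.save(path)
    return 'trained'
```

With a real scraper this could be called as `load_or_train(AutoScraper(), 'yahoo-finance', url, wanted_list)`. The first run fetches the page and trains, and later runs skip the training request.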

Tutorials and Further Learning

For more advanced usage, refer to this gist. There is also a helpful article on integrating AutoScraper with Flask: "Create an API from Any Website in Less Than 5 Minutes".

Issues and Support

If you run into any problem while using AutoScraper, feel free to open an issue.

Additionally, if you'd like to support the project, consider buying the developer a coffee.

Happy Coding! ❤️