Introduction to news-please
Overview
news-please is an open-source news crawler that makes it simpler to collect and analyze news from online sources. It extracts structured data from almost any news website. It can follow internal hyperlinks and parse RSS feeds, and it handles both current and archived articles. Given only the root URL of a news website, it can crawl the entire site. news-please builds on established libraries such as Scrapy, Newspaper, and Readability.
Key Features
- Ease of Setup: With a simple installation using pip, users just need to add the URLs of the target pages, and the tool does the rest.
- CLI and Library Modes: It can operate via a Command Line Interface (CLI) or be integrated into Python programs as a library.
- Extensive Data Storage Options: Stores extracted data in multiple formats, including JSON files, and supports storage in PostgreSQL, ElasticSearch, Redis, and others.
- Commoncrawl.org Integration: Allows extraction of articles from the extensive archive provided by commoncrawl.org, filtering them by publishers or date.
Extracted Information
The tool extracts the following information from each article:
- The headline, lead paragraph, and main text.
- Main image and author names.
- Publication date and language.
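The fields listed above map naturally onto a small record type. The following is a minimal sketch of such a record and its JSON serialization, not the library's own class: `ArticleRecord` and its field names are hypothetical, chosen to mirror the list, and the sample values are invented for illustration.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical container mirroring the fields listed above."""
    headline: str
    lead_paragraph: Optional[str] = None
    main_text: Optional[str] = None
    main_image_url: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    publication_date: Optional[datetime] = None
    language: Optional[str] = None

    def to_json(self) -> str:
        """Serialize to JSON, writing the date as an ISO 8601 string."""
        data = asdict(self)
        if self.publication_date is not None:
            data["publication_date"] = self.publication_date.isoformat()
        return json.dumps(data, ensure_ascii=False, sort_keys=True)


# Invented sample values, for illustration only.
record = ArticleRecord(
    headline="Example headline",
    authors=["Jane Doe"],
    publication_date=datetime(2024, 1, 15, 9, 30),
    language="en",
)
serialized = record.to_json()
print(serialized)
```

Fields that could not be extracted stay `None` in this sketch, so downstream code can tell a missing value apart from an empty one.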
Usage Modes
CLI Mode
Ideal for users who prefer a command-line interface. It allows storing data in various formats and supports simple yet extensive configuration.
Library Mode
Ideal for developers who want to incorporate news-please functionality directly into their own Python code. In this mode, the caller supplies one or more article URLs, and news-please downloads each page and extracts the article information from it.
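When processing many URLs in library mode, it can help to keep one failed download from aborting the whole batch. The sketch below shows one way to do that. The helper `extract_all` and the stand-in `fake_fetch` are hypothetical names introduced here so that the example runs offline; they are not part of news-please.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_all(
    urls: Iterable[str],
    fetch: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run `fetch` on each URL, collecting results and per-URL errors."""
    articles: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for url in urls:
        try:
            articles[url] = fetch(url)
        except Exception as exc:  # one bad URL should not stop the batch
            errors[url] = f"{type(exc).__name__}: {exc}"
    return articles, errors


def fake_fetch(url: str) -> dict:
    """Stand-in extractor so the sketch runs offline (not part of news-please)."""
    if "broken" in url:
        raise ValueError("could not download")
    return {"title": f"Title for {url}"}


articles, errors = extract_all(
    ["https://example.com/a", "https://example.com/broken"],
    fake_fetch,
)
print(articles)
print(errors)
```

With news-please installed, the `NewsPlease.from_url` function shown under Sample Usage in Code could be passed as `fetch` in place of the stand-in.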
Commoncrawl Integration
This mode extracts articles from the large news archive provided by commoncrawl.org instead of crawling publisher sites directly. Results can be filtered by publisher and by date range.
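The filtering idea can be illustrated independently of the archive itself. The following is a rough sketch that filters already-extracted records by publisher host and publication date. It does not use news-please's Commoncrawl interface: the function `filter_articles`, the record layout (`url` and `published` keys), and the sample data are all assumptions made for this example.

```python
from datetime import date
from typing import Dict, Iterable, Iterator, Optional, Set
from urllib.parse import urlparse


def filter_articles(
    records: Iterable[Dict],
    publishers: Optional[Set[str]] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> Iterator[Dict]:
    """Yield records whose host and date pass the given filters.

    A filter set to None is not applied. Date bounds are inclusive.
    """
    for rec in records:
        host = urlparse(rec["url"]).hostname or ""
        if publishers is not None and host not in publishers:
            continue
        if start is not None and rec["published"] < start:
            continue
        if end is not None and rec["published"] > end:
            continue
        yield rec


# Invented sample records, for illustration only.
sample = [
    {"url": "https://news.example.com/a", "published": date(2023, 5, 1)},
    {"url": "https://news.example.com/b", "published": date(2024, 2, 1)},
    {"url": "https://other.example.org/c", "published": date(2023, 6, 1)},
]

selected = list(
    filter_articles(
        sample,
        publishers={"news.example.com"},
        start=date(2023, 1, 1),
        end=date(2023, 12, 31),
    )
)
print(selected)
```

Writing the filter as a generator keeps memory use flat, which matters when the input is a large archive rather than a short list.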
Getting Started
news-please requires Python 3.8+. Installation is straightforward with pip:
$ pip install news-please
Sample Usage in Code
Using news-please in a Python program takes one import and one call per article:
from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.nytimes.com/some-article')
print(article.title)
Storage Options
news-please offers multiple storage options for extracted data:
- ElasticSearch allows for powerful query capabilities and versioning of articles.
- PostgreSQL is used for structured storage, with a recommendation to use psycopg2 for production environments.
- Redis provides a fast, in-memory data store compatible with AWS ElastiCache and GCP Memorystore.
Support and Contributions
For questions or to contribute, users are encouraged to use the GitHub Discussions page, and developers are welcome to contribute through pull requests. The team appreciates donations to help further the development and maintenance of the project.
Conclusion
news-please is a robust and flexible tool for anyone needing to crawl and extract news articles efficiently. Its multiple functionalities cater to different user needs, from casual data extraction to integration into larger software systems, making it a versatile choice in the domain of news analysis.