news-please - Efficient Open-Source Crawler for News Article Extraction

Introduction to news-please

Overview

news-please is an open-source news crawler that simplifies collecting and analyzing news from online sources by extracting structured data from almost any news website. It can follow internal hyperlinks, parse RSS feeds, and fetch both current and archived articles. Given only the root URL of a news website, it can crawl the entire site. news-please builds on established libraries, including Scrapy, Newspaper, and Readability.

Key Features

Ease of Setup: After installing with pip, users only need to add the URLs of the target sites; the tool handles crawling and extraction.
CLI and Library Modes: It can operate via a Command Line Interface (CLI) or be integrated into Python programs as a library.
Extensive Data Storage Options: Writes extracted articles to JSON files or to backends such as PostgreSQL, ElasticSearch, and Redis, among others.
Commoncrawl.org Integration: Allows extraction of articles from the extensive archive provided by commoncrawl.org, filtering them by publishers or date.

Extracted Information

The tool extracts valuable information from articles, including:

The headline, lead paragraph, and main text.
Main image and author names.
Publication date and language.
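
As a rough illustration of what one extracted record might look like, the sketch below models the fields listed above as a Python dataclass and serializes one instance to JSON. The class name ArticleRecord and the sample values are hypothetical. The attribute names are chosen to resemble those news-please exposes (for example title, maintext, date_publish), but they are assumptions here; check the project's documentation for the exact schema.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical container mirroring the fields listed above."""
    title: Optional[str] = None          # headline
    description: Optional[str] = None    # lead paragraph
    maintext: Optional[str] = None       # main text
    image_url: Optional[str] = None      # main image
    authors: List[str] = field(default_factory=list)
    date_publish: Optional[datetime] = None
    language: Optional[str] = None


def to_json(record: ArticleRecord) -> str:
    """Serialize a record to JSON, rendering the date as ISO 8601 text."""
    data = asdict(record)
    if record.date_publish is not None:
        data["date_publish"] = record.date_publish.isoformat()
    return json.dumps(data, ensure_ascii=False, indent=2)


# Made-up sample values, for illustration only.
sample = ArticleRecord(
    title="Example headline",
    description="A one-sentence lead paragraph.",
    maintext="Body text of the article.",
    image_url="https://example.com/image.jpg",
    authors=["Jane Doe"],
    date_publish=datetime(2024, 1, 15, 9, 30),
    language="en",
)
print(to_json(sample))
```

Every field defaults to an empty value because an extractor cannot always find each item on a page, so downstream code should expect missing values.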

Usage Modes

CLI Mode

Ideal for users who prefer a command-line interface. It allows storing data in various formats and supports simple yet extensive configuration.

Library Mode

Ideal for developers who want to use news-please directly from their own Python code. In this mode, you pass one URL or a list of URLs, and the library returns the extracted information for each.
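
When processing many URLs, one failing page should not abort the whole batch. The following sketch shows that pattern with a pluggable extractor function. The names extract_all and fake_extract are hypothetical and not part of news-please; the stub stands in for a real extractor so the example runs without network access.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_all(
    urls: Iterable[str],
    extract: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Apply `extract` to each URL, separating successes from failures."""
    results: Dict[str, object] = {}
    errors: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = extract(url)
        except Exception as exc:  # keep going; record what went wrong
            errors[url] = f"{type(exc).__name__}: {exc}"
    return results, errors


def fake_extract(url: str) -> dict:
    """Stub extractor used only for this demonstration."""
    if "broken" in url:
        raise ValueError("could not download page")
    return {"url": url, "title": "Title for " + url}


results, errors = extract_all(
    ["https://example.com/a", "https://example.com/broken"],
    fake_extract,
)
print(sorted(results))
print(errors)
```

In real use, a function such as NewsPlease.from_url could presumably be passed as the extract argument in place of the stub.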

Commoncrawl Integration

This mode extracts articles from the large news archive maintained by commoncrawl.org. Results can be filtered by publisher and by date range.
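
To make the publisher and date filter concrete, here is a minimal sketch of such a check. It illustrates the idea only and is not the project's actual implementation; the function keep_article, its parameters, and the half-open date window are all assumptions.

```python
from datetime import datetime
from typing import Collection, Optional
from urllib.parse import urlparse


def keep_article(
    url: str,
    published: Optional[datetime],
    hosts: Collection[str],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> bool:
    """Return True if the article matches the publisher and date filters.

    `hosts` holds lowercase publisher domains; the date window is
    half-open: start <= published < end.
    """
    host = (urlparse(url).hostname or "").lower()
    # Accept an exact host match or any subdomain of a listed host.
    if not any(host == h or host.endswith("." + h) for h in hosts):
        return False
    if start is None and end is None:
        return True
    # With a date filter active, articles without a known date are rejected.
    if published is None:
        return False
    if start is not None and published < start:
        return False
    if end is not None and published >= end:
        return False
    return True


hosts = ("example.com",)
start, end = datetime(2024, 1, 1), datetime(2024, 2, 1)
print(keep_article("https://www.example.com/a", datetime(2024, 1, 15), hosts, start, end))  # True
print(keep_article("https://other.org/a", datetime(2024, 1, 15), hosts, start, end))        # False
print(keep_article("https://example.com/b", datetime(2024, 3, 1), hosts, start, end))       # False
```

Rejecting undated articles whenever a date filter is active is one possible design choice; a more lenient variant could keep them instead.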

Getting Started

news-please requires Python 3.8+. Installation is straightforward with pip:

$ pip install news-please

Sample Usage in Code

To use news-please in a Python program, import the NewsPlease class and call one of its extraction functions:

from newsplease import NewsPlease

# Download and parse a single article (the URL is a placeholder).
article = NewsPlease.from_url('https://www.nytimes.com/some-article')
print(article.title)

Storage Options

news-please offers multiple storage options for extracted data:

ElasticSearch allows for powerful query capabilities and versioning of articles.
PostgreSQL is used for structured storage, with a recommendation to use psycopg2 for production environments.
Redis provides a fast, in-memory data store and is compatible with AWS ElastiCache and GCP Memorystore.

Support and Contributions

Questions are best posted on the project's GitHub Discussions page, and code contributions are welcome as pull requests. The team also appreciates donations, which support further development and maintenance of the project.

Conclusion

news-please is a flexible tool for crawling news websites and extracting articles efficiently. Its CLI, library, and commoncrawl.org modes cover needs ranging from one-off data extraction to integration into larger software systems, making it a versatile choice for news analysis.