trafilatura

Flexible and Efficient Web Text Extraction and Data Structuring Tool

Product Description

Trafilatura is a Python package and command-line tool for extracting text from web pages. It turns raw HTML into structured data and filters out unwanted elements. It requires no database and supports multiple output formats, including JSON, CSV, and XML. Its strong benchmark performance makes it suitable for academic and professional applications in NLP and the social sciences.
Project Details