#information extraction
ontogpt
OntoGPT is a Python package that uses large language models and ontology-based grounding to extract structured information from text. It can be run from the command line and also includes a minimalist web app interface. Model APIs such as OpenAI and Azure are enabled by setting API keys, and open models are supported through the ollama package. OntoGPT converts unstructured text into structured data, which is useful for managing biological data. Its capabilities and evaluations are documented so that results can be verified and reproduced.
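The grounding step can be pictured as mapping extracted mentions to ontology identifiers. The sketch below is a minimal illustration under stated assumptions, not OntoGPT's implementation or API: the `LEXICON` table and the `ground` function are hypothetical, and a real system would load complete ontologies rather than a hand-written dictionary.

```python
# Tiny hand-written lexicon for illustration only (hypothetical);
# a real system would load full ontologies instead.
LEXICON = {
    "apoptotic process": "GO:0006915",
    "apoptosis": "GO:0006915",
    "seizure": "HP:0001250",
}


def ground(mentions):
    """Map each extracted mention to an ontology ID, or None if no label matches."""
    return {m: LEXICON.get(m.strip().lower()) for m in mentions}


# Mentions as they might come back from a language model (assumed input).
grounded = ground(["Apoptosis", "Seizure", "unknown term"])
print(grounded)
```

Mentions that do not match any label stay ungrounded (`None`), so downstream code can tell free text apart from ontology terms.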
news-please
News-please is an open-source tool for extracting news articles from websites, covering both recent and archived content. It builds on libraries such as Scrapy and Newspaper, and works as either a command-line tool or a Python library. Extracted articles can be stored in JSON, PostgreSQL, or ElasticSearch, which helps when managing large news datasets. Related projects cover sentiment analysis and event extraction.
InvoiceNet
InvoiceNet extracts data from invoices through a simple interface and supports PDF, JPG, and PNG files. Users can train custom models on their own datasets and configure which invoice fields to extract. Extracted data can be saved for later use. Pre-trained models are limited, but InvoiceNet provides tools intended to help build a large public invoice dataset. It runs on Ubuntu 20.04 and Windows 10, with installation instructions for both.
wiseflow
wiseflow extracts useful information from diverse online sources and filters out noise. A web parser combined with asynchronous tasks covers most news pages. It includes an LLM-based tagging system that provides dynamic explanations for complex tags. It is designed for local deployment with low resource use and supports multiple SDK languages.
pyresparser
Pyresparser is a resume parser that extracts key details such as names, emails, phone numbers, and skills from PDF and DOCX files using the spaCy and NLTK libraries. It installs with pip and can be used from the command line or from Python, with customization through regex and skills CSV files. Results are returned as JSON, which suits HR professionals and developers who want to simplify resume processing.
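As a rough picture of regex-and-skills-list extraction, here is a minimal sketch using only the Python standard library. It is not pyresparser's code or API: the patterns, the `extract_fields` function, and the output keys are assumptions made for illustration, and the real package relies on spaCy and NLTK for fields such as names.

```python
import json
import re

# Hypothetical patterns for illustration; pyresparser's own rules differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_fields(text, known_skills):
    """Return emails, phone numbers, and matched skills found in text."""
    lowered = text.lower()
    # Keep a skill only if it appears as a whole word in the text.
    skills = [
        s for s in known_skills
        if re.search(r"\b" + re.escape(s.lower()) + r"\b", lowered)
    ]
    return {
        "email": EMAIL_RE.findall(text),
        "mobile_number": [m.strip() for m in PHONE_RE.findall(text)],
        "skills": skills,
    }


# Plain text as it might look after PDF or DOCX conversion (assumed input).
sample = "Jane Doe\njane.doe@example.com\n+1 555 010 9999\nSkills: Python, SQL"
result = extract_fields(sample, ["Python", "SQL", "Java"])
print(json.dumps(result, indent=2))
```

The skills list plays the role of the skills CSV file mentioned above: changing the list changes which terms are reported, without touching the patterns.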
awesome-bioie
awesome-bioie is a curated guide to how language models and freely accessible resources are used to turn unstructured biomedical data into reliable knowledge. It covers recent methods, datasets, and key technologies relevant to clinical and scientific research. Contents include in-depth analyses, practical tutorials, and a wide range of shared datasets backed by open science initiatives, with an emphasis on data transparency and accessibility.