TextDescriptives - A spaCy Extension for Calculating a Wide Range of Text Metrics

Introduction to TextDescriptives

TextDescriptives is a Python library for extracting a wide range of metrics from text. It is built on spaCy v3 pipeline components and extensions, so the metrics are computed as part of an ordinary spaCy pipeline.

Installation

Install the library with pip:

pip install textdescriptives

Latest Updates

Recent changes to TextDescriptives:

A new web application lets users extract and download text metrics without writing any code.
Version 2.0 introduced a new API and a new component, along with updated documentation and tutorials. The new coherence component calculates semantic coherence between sentences; see the documentation for details.
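
To give a feel for what "coherence between sentences" can mean, here is a minimal sketch under stated assumptions: each sentence is represented by a vector, and coherence is taken as the cosine similarity between a sentence and the one that follows it (or the one two positions later). The helper names and the toy two-dimensional vectors are hypothetical and purely illustrative; this is not the library's implementation, which works on real embeddings from a spaCy model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def order_coherence(vectors, order=1):
    """Similarity between each sentence vector and the one `order` positions later."""
    return [
        cosine(vectors[i], vectors[i + order])
        for i in range(len(vectors) - order)
    ]


# Made-up "sentence embeddings" for three sentences (illustrative only).
sentence_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

first_order = order_coherence(sentence_vectors, order=1)
second_order = order_coherence(sentence_vectors, order=2)
mean_first_order = sum(first_order) / len(first_order)

print(first_order, second_order, mean_first_order)
```

In this toy case the first two sentences point the same way and the third is orthogonal to both, so the adjacent-sentence similarities are high and then low.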

Quick Start Guide

The quickest way to extract metrics is the extract_metrics function. To list the metrics it accepts, call:

import textdescriptives as td
td.get_valid_metrics()

This returns the set of valid metric names, covering quality, readability, descriptive statistics, dependency distance, and more.

extract_metrics needs to know which spaCy model to use. Pass the spacy_model parameter to name a specific model, or pass only the language (lang) and the library will automatically download a suitable model for it. Setting either one is enough.

For example, to extract all available metrics from a piece of text:

import textdescriptives as td

text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
df = td.extract_metrics(text=text, lang="en", metrics=None)

Integration with spaCy

TextDescriptives can be seamlessly integrated with spaCy pipelines. To do so, import both spaCy and TextDescriptives and add the necessary component(s) to your pipeline:

import spacy
import textdescriptives as td

# Load a spaCy model
nlp = spacy.load("en_core_web_sm")

# Add all TextDescriptives components
nlp.add_pipe("textdescriptives/all")

# Process text
doc = nlp("The world is changed. I feel it in the water...")

Extracting Metrics

TextDescriptives provides the functions extract_df and extract_dict, which pull the metrics from a Doc object into a pandas DataFrame or a dictionary for further manipulation and analysis.

Documentation

Comprehensive documentation, including Jupyter notebook tutorials, is available for users to master TextDescriptives. The tutorials can be found in the docs/tutorials folder or on the documentation site.

Some useful resources:

Getting started guide: Introduction and basic usage instructions.
Demo: Live demonstration of TextDescriptives.
Tutorials: Step-by-step guides to maximize the library's potential.
API references: Detailed documentation of the library’s API.
Paper: Preprint of the TextDescriptives research paper.

With documented methods and tutorials, TextDescriptives makes text metric extraction accessible to anyone working in text analytics.