Introduction to php-text-analysis
php-text-analysis is a PHP library for Information Retrieval (IR) and Natural Language Processing (NLP) tasks. It provides tools for tokenizing, normalizing, classifying, comparing, and summarizing text documents directly in PHP.
Key Features
The php-text-analysis library includes a variety of tools for different text analysis purposes:
- Document Classification: Automatically categorize text documents into predefined classes.
- Sentiment Analysis: Determine the emotional tone of a text, such as positive, negative, or neutral.
- Document Comparison: Compare multiple documents to identify similarities or differences.
- Frequency Analysis: Analyze the frequency of words or terms within text documents.
- Tokenization: Break down text into individual words or phrases, known as tokens.
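- Stemming: Reduce words to their base or root form so that inflected variants (for example, "running" and "runs") are counted as the same term.
- Collocations with Pointwise Mutual Information: Identify word combinations that appear together more often than their individual frequencies would predict.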
- Lexical Diversity: Measure the variety of unique words used in a text.
- Corpus Analysis: Analyze large collections of text (corpora) for various linguistic or thematic elements.
- Text Summarization: Condense long pieces of text into shorter summaries while retaining key information.
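To make the collocation feature above more concrete, here is a minimal plain-PHP sketch of pointwise mutual information (PMI) over adjacent word pairs. It does not use the library: the function name sketch_bigram_pmi and the sample tokens are hypothetical, and the library's own collocation tooling may compute and rank scores differently.

```php
<?php
declare(strict_types=1);

/**
 * Pointwise mutual information for each adjacent word pair (bigram).
 * Hypothetical helper for illustration; not part of php-text-analysis.
 *
 * PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
 *
 * Assumes tokens contain no spaces, because a space joins each pair.
 *
 * @param string[] $tokens
 * @return array<string, float> bigram ("x y") => PMI, highest first
 */
function sketch_bigram_pmi(array $tokens): array
{
    $tokens = array_values($tokens);
    $total = count($tokens);
    if ($total < 2) {
        return [];
    }

    // Unigram counts and adjacent-pair counts.
    $wordCounts = array_count_values($tokens);
    $pairCounts = [];
    for ($i = 0; $i < $total - 1; $i++) {
        $pair = $tokens[$i] . ' ' . $tokens[$i + 1];
        $pairCounts[$pair] = ($pairCounts[$pair] ?? 0) + 1;
    }
    $totalPairs = $total - 1;

    $scores = [];
    foreach ($pairCounts as $pair => $count) {
        [$x, $y] = explode(' ', (string) $pair, 2);
        $pXY = $count / $totalPairs;
        $pX = $wordCounts[$x] / $total;
        $pY = $wordCounts[$y] / $total;
        $scores[$pair] = log($pXY / ($pX * $pY), 2);
    }

    arsort($scores);
    return $scores;
}

$tokens = ['new', 'york', 'is', 'big', 'and', 'new', 'york', 'is', 'busy'];
$scores = sketch_bigram_pmi($tokens);
```

Note that raw PMI favors rare pairs: in this sample, "big and" (seen once, built from words seen once) scores higher than "new york" (seen twice). Practical collocation finders usually apply a minimum frequency cutoff for that reason.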
Documentation
The library's usage and features are documented in an accompanying book and in a dedicated wiki. Both can be accessed, and contributed to, through the library's GitHub repository.
Installation
Install php-text-analysis in your PHP project with Composer, a dependency manager for PHP:
composer require yooper/php-text-analysis
Once the package is installed and Composer's autoloader (vendor/autoload.php) is loaded, you can call the functions shown in the examples below.
Usage Examples
Tokenization
Tokenization splits text into individual units (tokens), such as words. A basic call uses the default tokenizer:
$tokens = tokenize($text);
To use a different tokenizer, pass its class name as the second argument:
$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);
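To show the kind of output a word tokenizer produces, here is a minimal plain-PHP sketch that splits on anything that is not a letter, digit, or apostrophe. The function sketch_tokenize is hypothetical and is not the library's tokenizer; the library's tokenizer classes may treat punctuation and contractions differently.

```php
<?php
declare(strict_types=1);

/**
 * Split text into word tokens on any run of characters that is not a
 * letter, digit, or apostrophe.
 * Hypothetical helper for illustration; not part of php-text-analysis.
 *
 * @return string[]
 */
function sketch_tokenize(string $text): array
{
    // The "u" modifier makes \p{L} and \p{N} match Unicode letters/digits.
    $tokens = preg_split('/[^\p{L}\p{N}\']+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure (e.g. invalid UTF-8 input).
    return $tokens === false ? [] : $tokens;
}

$tokens = sketch_tokenize("The quick brown fox doesn't stop.");
// ['The', 'quick', 'brown', 'fox', "doesn't", 'stop']
```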
Normalization
Normalization converts tokens to a consistent form, such as lowercase. The second argument to normalize_tokens is a callable applied to each token; it can be a function name or a closure:
$normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');
$normalizedTokens = normalize_tokens($tokens, function($token) { return mb_strtoupper($token); });
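The idea of applying a callable to every token can be sketched in plain PHP with array_map. The function sketch_normalize_tokens below is a hypothetical stand-in, not the library's implementation, and its lowercase default is an assumption made for this example. It requires the mbstring extension.

```php
<?php
declare(strict_types=1);

/**
 * Apply a normalizer to every token.
 * Hypothetical helper for illustration; not part of php-text-analysis.
 *
 * @param string[]      $tokens
 * @param callable|null $normalizer function name or closure; defaults to
 *                                  mb_strtolower (an assumption of this sketch)
 * @return string[]
 */
function sketch_normalize_tokens(array $tokens, ?callable $normalizer = null): array
{
    $normalizer = $normalizer ?? 'mb_strtolower';
    return array_map($normalizer, $tokens);
}

$tokens = ['The', 'QUICK', 'Fox'];

// Function name passed as a string.
$lower = sketch_normalize_tokens($tokens, 'mb_strtolower');
// ['the', 'quick', 'fox']

// Closure, here combining trimming with uppercasing.
$upper = sketch_normalize_tokens($tokens, function (string $token): string {
    return mb_strtoupper(trim($token));
});
// ['THE', 'QUICK', 'FOX']
```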
Frequency Distributions
Calculate how frequently each token appears in the text:
$freqDist = freq_dist(tokenize($text));
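As a rough picture of what a frequency distribution contains, the following plain-PHP sketch counts tokens and derives each token's share of the total. The function names are hypothetical, and the object returned by the library's freq_dist may expose its data through a different interface.

```php
<?php
declare(strict_types=1);

/**
 * Count how often each token occurs, most frequent first.
 * Hypothetical helper for illustration; not part of php-text-analysis.
 *
 * @param string[] $tokens
 * @return array<string, int> token => count
 */
function sketch_freq_counts(array $tokens): array
{
    $counts = array_count_values($tokens);
    arsort($counts);
    return $counts;
}

/**
 * Convert raw counts into relative frequencies that sum to 1.
 *
 * @param array<string, int> $counts
 * @return array<string, float> token => share of all tokens
 */
function sketch_relative_freq(array $counts): array
{
    $total = array_sum($counts);
    if ($total === 0) {
        return [];
    }
    return array_map(function (int $count) use ($total): float {
        return $count / $total;
    }, $counts);
}

$tokens = ['to', 'be', 'or', 'not', 'to', 'be'];
$counts = sketch_freq_counts($tokens);     // 'to' => 2, 'be' => 2, 'or' => 1, 'not' => 1
$weights = sketch_relative_freq($counts);  // 'to' accounts for 2 of 6 tokens
```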
Ngram Generation
Generate ngrams, which are contiguous sequences of n tokens. By default, bigrams (n = 2) are created:
$bigrams = ngrams($tokens);
To customize the output, pass the ngram size and a delimiter. The following creates trigrams joined by a pipe character:
$trigrams = ngrams($tokens, 3, '|');
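The sliding-window idea behind ngram generation can be sketched in plain PHP as follows. The function sketch_ngrams is hypothetical, and its space default for the delimiter is an assumption of this example rather than a statement about the library.

```php
<?php
declare(strict_types=1);

/**
 * Build contiguous n-token sequences with a sliding window.
 * Hypothetical helper for illustration; not part of php-text-analysis.
 *
 * @param string[] $tokens
 * @return string[] each ngram joined by $separator
 */
function sketch_ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }

    $tokens = array_values($tokens); // ensure 0-based consecutive keys
    $ngrams = [];
    $lastStart = count($tokens) - $n;

    // One window per start position; no output if there are fewer than n tokens.
    for ($i = 0; $i <= $lastStart; $i++) {
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }

    return $ngrams;
}

$tokens = ['a', 'b', 'c', 'd'];
$bigrams = sketch_ngrams($tokens);           // ['a b', 'b c', 'c d']
$trigrams = sketch_ngrams($tokens, 3, '|');  // ['a|b|c', 'b|c|d']
```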
Stemming
Reduce words to their base forms (stems):
$stemmedTokens = stem($tokens);
Choose a different stemmer if needed:
$stemmedTokens = stem($tokens, \TextAnalysis\Stemmers\MorphStemmer::class);
Keyword Extraction with Rake
Extract keywords with the RAKE (Rapid Automatic Keyword Extraction) algorithm. Clean your tokens before calling it; the second argument sets the ngram size of the keywords to extract:
$rake = rake($tokens, 3);
$results = $rake->getKeywordScores();
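For readers unfamiliar with RAKE, the following plain-PHP sketch shows the scoring scheme from the original algorithm: candidate phrases are runs of tokens between stop words, each word is scored as degree divided by frequency, and a phrase's score is the sum of its word scores. This is a simplified illustration, not the library's implementation: sketch_rake_scores and the stop-word list are hypothetical, and the sketch does not take an ngram size the way the library's rake function does.

```php
<?php
declare(strict_types=1);

/**
 * Simplified RAKE scoring over a token list.
 * Hypothetical helper for illustration; not part of php-text-analysis.
 *
 * @param string[] $tokens    cleaned, normalized tokens
 * @param string[] $stopWords words that separate candidate phrases
 * @return array<string, float> phrase => score, highest first
 */
function sketch_rake_scores(array $tokens, array $stopWords): array
{
    // 1. Candidate phrases: maximal runs of non-stop-word tokens.
    $phrases = [];
    $current = [];
    foreach ($tokens as $token) {
        if (in_array($token, $stopWords, true)) {
            if ($current !== []) {
                $phrases[] = $current;
                $current = [];
            }
        } else {
            $current[] = $token;
        }
    }
    if ($current !== []) {
        $phrases[] = $current;
    }

    // 2. Per-word frequency and degree (sum of the lengths of the
    //    phrases the word occurs in, counted per occurrence).
    $freq = [];
    $degree = [];
    foreach ($phrases as $phrase) {
        $length = count($phrase);
        foreach ($phrase as $word) {
            $freq[$word] = ($freq[$word] ?? 0) + 1;
            $degree[$word] = ($degree[$word] ?? 0) + $length;
        }
    }

    // 3. Phrase score = sum of degree / frequency over its words.
    $scores = [];
    foreach ($phrases as $phrase) {
        $score = 0.0;
        foreach ($phrase as $word) {
            $score += $degree[$word] / $freq[$word];
        }
        $scores[implode(' ', $phrase)] = $score;
    }

    arsort($scores);
    return $scores;
}

$tokens = ['compatibility', 'of', 'systems', 'of', 'linear', 'constraints'];
$scores = sketch_rake_scores($tokens, ['of']);
// 'linear constraints' scores 4.0; the single-word phrases score 1.0 each
```

Because degree grows with phrase length, longer multi-word phrases tend to outrank single words, which is why cleaning the input (removing noise that would glue unrelated words together) matters.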
Sentiment Analysis with Vader
Perform sentiment analysis with VADER (Valence Aware Dictionary and sEntiment Reasoner). Normalize your tokens beforehand:
$sentimentScores = vader($tokens);
Document Classification with Naive Bayes
Classify documents with a Naive Bayes classifier. The following example trains on two cuisine labels and then classifies a new sentence:
$nb = naive_bayes();
$nb->train('mexican', tokenize('taco nacho enchilada burrito'));
$nb->train('american', tokenize('hamburger burger fries pop'));
$prediction = $nb->predict(tokenize('my favorite food is a burrito'));
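To illustrate what happens behind a train and predict interface like the one above, here is a minimal multinomial Naive Bayes sketch in plain PHP with add-one (Laplace) smoothing. The class SketchNaiveBayes is hypothetical and is not the library's classifier; the library may use different smoothing and may return predictions in a different shape.

```php
<?php
declare(strict_types=1);

/**
 * Minimal multinomial Naive Bayes with add-one smoothing.
 * Hypothetical class for illustration; not part of php-text-analysis.
 */
final class SketchNaiveBayes
{
    /** @var array<string, array<string, int>> label => token => count */
    private array $tokenCounts = [];

    /** @var array<string, int> label => number of training documents */
    private array $docCounts = [];

    /** @var array<string, bool> set of all tokens seen in training */
    private array $vocabulary = [];

    /** @param string[] $tokens */
    public function train(string $label, array $tokens): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($tokens as $token) {
            $this->tokenCounts[$label][$token] =
                ($this->tokenCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    /**
     * @param string[] $tokens
     * @return array<string, float> label => log-probability score, best first
     */
    public function predict(array $tokens): array
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        if ($totalDocs === 0 || $vocabSize === 0) {
            return [];
        }

        $scores = [];
        foreach ($this->docCounts as $label => $docCount) {
            $labelTotal = array_sum($this->tokenCounts[$label] ?? []);

            // Log prior plus the sum of smoothed log likelihoods.
            $score = log($docCount / $totalDocs);
            foreach ($tokens as $token) {
                $count = $this->tokenCounts[$label][$token] ?? 0;
                $score += log(($count + 1) / ($labelTotal + $vocabSize));
            }
            $scores[$label] = $score;
        }

        arsort($scores);
        return $scores;
    }
}

$nb = new SketchNaiveBayes();
$nb->train('mexican', ['taco', 'nacho', 'enchilada', 'burrito']);
$nb->train('american', ['hamburger', 'burger', 'fries', 'pop']);

$scores = $nb->predict(['my', 'favorite', 'food', 'is', 'a', 'burrito']);
$best = array_key_first($scores); // 'mexican'
```

Unseen words such as "my" and "favorite" contribute the same smoothed penalty to both labels, so the decision here rests on "burrito", which was only seen under the "mexican" label.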
In summary, php-text-analysis provides a suite of text analysis tools for PHP, covering tasks such as tokenization, sentiment analysis, text summarization, and document classification.