Introduction to Similarity
similarity is a Java toolkit for computing similarity scores between text strings at several granularities, from single words to paragraphs. It supports text similarity computation and sentiment analysis, and was developed to share the similarity-measurement methods used in natural language processing. Its stated strengths are practical functionality, high efficiency, clear architecture, up-to-date corpora, and customizability.
Features of similarity
similarity offers a wide array of functionalities across various text granularities:
- Word Similarity Calculation:
  - Cilin Similarity Method [Recommended]: Based on the Cilin Chinese thesaurus.
  - Chinese Semantic Method.
  - HowNet-based Word Similarity.
  - Character Edit Distance Method.
- Phrase Similarity Calculation:
  - Simple Phrase Similarity [Recommended]: A straightforward approach based on character matching and position.
- Sentence Similarity Calculation:
  - Word Form and Word Order Combination [Recommended]: Considers both the words two sentences share and the order in which they appear.
  - Various Edit Distance Algorithms.
- Paragraph Similarity Calculation:
  - Cosine Similarity [Recommended]: Weights words by frequency and part of speech to compute similarity.
  - Several other distance and similarity algorithms, such as Euclidean distance, Jaccard similarity, and SimHash + Hamming distance.
- Semantic Analysis Using HowNet Semantic Primitives:
  - Word Semantic Primitive Trees.
- Sentiment Analysis:
  - Determines the degree of positive and negative sentiment in text.
- Approximate Word Matching:
  - Uses Word2vec to suggest synonyms.
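The edit distance methods listed above can be illustrated with a minimal sketch. This is not the library's implementation: `EditDistanceSketch` and its methods are hypothetical names, and dividing the Levenshtein distance by the length of the longer string is one common normalization that is assumed here rather than taken from the project.

```java
// Hypothetical helper for illustration; not part of the similarity library's API.
final class EditDistanceSketch {

    /** Classic Levenshtein distance, counted in Unicode code points. */
    static int distance(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j; // building the first j symbols of t from "" takes j insertions
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i; // reducing the first i symbols of s to "" takes i deletions
            for (int j = 1; j <= t.length; j++) {
                int substitution = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = swap;
        }
        return prev[t.length];
    }

    /** Maps the distance into [0, 1]; two empty strings are treated as identical. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.codePointCount(0, a.length()),
                               b.codePointCount(0, b.length()));
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
        System.out.println(similarity("电动车", "自行车"));  // two of three characters differ
    }
}
```

Counting code points rather than `char` values keeps each Chinese character, including characters outside the Basic Multilingual Plane, as a single edit unit.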
The modules in similarity are loosely coupled and lazily loaded, and the dictionaries are published as plain text, which makes them easy to customize and to extend with your own corpora.
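To make the idea of a lazily loaded plain-text dictionary concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the class name `LazyDictionarySketch`, the tab-separated `word<TAB>code` line format, and the sample entries are hypothetical and do not describe the library's actual files or loaders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical class; the "word<TAB>code" line format is assumed for illustration only.
final class LazyDictionarySketch {
    private final Supplier<Reader> source;
    private volatile Map<String, String> entries; // stays null until the first lookup
    private int loadCount = 0;

    LazyDictionarySketch(Supplier<Reader> source) {
        this.source = source;
    }

    /** Returns the code stored for a word, or null if the word is unknown. */
    String lookup(String word) {
        return entries().get(word);
    }

    /** Number of times the underlying text has been parsed (0 or 1). */
    synchronized int loadCount() {
        return loadCount;
    }

    // Double-checked locking: parse the text once, on first use.
    private Map<String, String> entries() {
        Map<String, String> local = entries;
        if (local == null) {
            synchronized (this) {
                local = entries;
                if (local == null) {
                    local = parse(source.get());
                    loadCount++;
                    entries = local;
                }
            }
        }
        return local;
    }

    private static Map<String, String> parse(Reader reader) {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    map.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        String text = "# made-up entries\nbicycle\tB01\nscooter\tB02\n";
        LazyDictionarySketch dict = new LazyDictionarySketch(() -> new StringReader(text));
        System.out.println(dict.loadCount());        // nothing parsed yet
        System.out.println(dict.lookup("bicycle"));  // triggers the one-time load
        System.out.println(dict.loadCount());
    }
}
```

Because the text is only parsed on the first lookup, a module that is never used costs nothing at startup, and swapping in a custom dictionary only requires a different `Supplier<Reader>`.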
Usage
To integrate similarity into a project, add the necessary dependencies in your build configuration:
Maven
Add the JitPack repository and the similarity dependency to your pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.github.shibing624</groupId>
    <artifactId>similarity</artifactId>
    <version>1.1.6</version>
</dependency>
Gradle
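Follow the JitPack instructions: add the JitPack repository (https://jitpack.io) to your build script and declare the dependency com.github.shibing624:similarity:1.1.6.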
Example Usage
To demonstrate how similarity works, here's a simple Java example for computing word similarity and sentiment tendency:
import org.xm.Similarity;
import org.xm.tendency.word.HownetWordTendency;

public class Demo {
    public static void main(String[] args) {
        // Word similarity via the Cilin thesaurus: "电动车" (electric bike) vs. "自行车" (bicycle)
        double result = Similarity.cilinSimilarity("电动车", "自行车");
        System.out.println(result);

        // Sentiment tendency of a single word via HowNet: "混蛋" (bastard)
        String word = "混蛋";
        HownetWordTendency hownetWordTendency = new HownetWordTendency();
        result = hownetWordTendency.getTendency(word);
        System.out.println(word + " word sentiment tendency value: " + result);
    }
}
Demonstration of Functions
- Word Similarity: Uses the Cilin Similarity method, which is recommended because it scores word pairs against a thesaurus.
- Phrase Similarity: Compares the characters two phrases share and the positions of those characters.
- Sentence Similarity: Combines word-form and word-order measures, so both the words used and their order affect the score.
- Paragraph Similarity: Uses measures such as cosine similarity, which weights words by how often they occur.
- Sentiment Analysis: Uses a semantic primitive tree to analyze the sentiment of words.
- Synonym Recommendation: Uses trained word vectors to suggest synonyms, which can help in language modeling tasks.
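As a rough illustration of the cosine measure mentioned for paragraphs, the sketch below compares two token lists by their term-frequency vectors. It is a simplified stand-in rather than the library's code: `CosineSketch` is a hypothetical name, the input is assumed to be already tokenized, and the part-of-speech weighting is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; ignores part-of-speech weights and expects pre-tokenized input.
final class CosineSketch {

    /** Counts how often each token occurs. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine of the angle between the two term-frequency vectors, in [0, 1]. */
    static double cosine(List<String> a, List<String> b) {
        Map<String, Integer> ta = termFrequencies(a);
        Map<String, Integer> tb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            Integer other = tb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other; // only shared tokens contribute
            }
        }
        double norms = norm(ta) * norm(tb);
        return norms == 0.0 ? 0.0 : dot / norms; // an empty text is similar to nothing
    }

    private static double norm(Map<String, Integer> tf) {
        double sum = 0.0;
        for (int count : tf.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        List<String> second = Arrays.asList("the", "cat", "lay", "on", "the", "rug");
        System.out.println(cosine(first, second)); // approximately 0.75
    }
}
```

For Chinese text, the token lists would come from a word segmenter; that step is outside the scope of this sketch.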
A related project is pytextclassifier, which applies deep learning and SVM models to sentiment analysis.
Contact and Contribution
Users and developers are encouraged to contribute improvements and new ideas. The project is licensed under Apache 2.0, which permits free commercial use with attribution. Contributions should include tests, and all unit tests should pass before a pull request is opened.
For suggestions or inquiries, issues can be submitted on GitHub, or the developer can be contacted via email or WeChat for group discussions.
References and Further Reading
Several foundational texts and studies are referenced to provide a deeper understanding of the methods and technologies integrated into similarity.
In summary, similarity provides a range of methods for calculating text similarity at the word, phrase, sentence, and paragraph levels, along with sentiment analysis, for developers working on natural language processing tasks.