sensitive-word - Efficient Sensitive Word Detection Tool Utilizing DFA Algorithm

Introduction to the Sensitive-Word Project

The sensitive-word project is a high-performance Java tool for detecting and managing sensitive words, built on a DFA (Deterministic Finite Automaton) algorithm. It is aimed at developers and organizations that need to filter sensitive content for moderation and content management.

Purpose and Goals

The primary goal of the sensitive-word project is to deliver an effective tool for managing sensitive words. The tool is built on the DFA (Deterministic Finite Automaton) algorithm, which lets it scan text quickly against a dictionary of more than 60,000 words. The source word file originally held more than 180,000 entries; it was condensed to improve performance and accuracy. Ongoing updates aim to expand the word repository and refine the algorithm's performance.
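
The DFA idea can be pictured as a trie in which every node is a state and every character is a transition. The following is a minimal sketch of that technique, not the project's actual implementation: the class DfaWordMatcher and its methods are hypothetical names chosen for illustration. It restarts the walk at each text position and prefers the longest word found there.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; NOT part of the sensitive-word library.
class DfaWordMatcher {

    // One state of the automaton: outgoing transitions plus an accepting flag.
    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accepting; // true if a dictionary word ends in this state
    }

    private final State root = new State();

    // Adds a word by walking (and creating) one state per character.
    public void addWord(String word) {
        if (word == null || word.isEmpty()) {
            return;
        }
        State state = root;
        for (int i = 0; i < word.length(); i++) {
            state = state.next.computeIfAbsent(word.charAt(i), c -> new State());
        }
        state.accepting = true;
    }

    // Length of the longest dictionary word starting at `start`, or 0 if none.
    private int matchLength(String text, int start) {
        State state = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            state = state.next.get(text.charAt(i));
            if (state == null) {
                break; // no transition for this character
            }
            if (state.accepting) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    // True if any dictionary word occurs anywhere in the text.
    public boolean contains(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (matchLength(text, i) > 0) {
                return true;
            }
        }
        return false;
    }

    // All non-overlapping matches, scanning left to right, longest match first.
    public List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                found.add(text.substring(i, i + len));
                i += len;
            } else {
                i++;
            }
        }
        return found;
    }
}
```

Because each lookup follows at most one transition per character, the cost of a scan depends on the text length and the longest dictionary word rather than on the number of words in the dictionary, which is why this structure suits large word lists.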

Features

Extensive Word Library: Boasts a database of over 60,000 words, with regular updates to keep it current.
Fluent API Interface: Designed for elegance and simplicity in use.
High Performance: Handles up to 70,000 queries per second (QPS).
Comprehensive Operations: Supports operations like word detection, replacement, and various transformations.
Flexible Format Conversion: Supports full-width/half-width character conversion, letter case conversion, common number format conversion, and Simplified/Traditional Chinese conversion.
Advanced Detection Strategies: Can detect sensitive terms, emails, numbers, URLs, and IPv4 addresses.
Custom Replacement Strategies: Allows users to define their own replacement logic.
Personalized Whitelists and Blacklists: Lets users define their own sensitive word lists and exemptions.
Dynamic Data Updates: Applies updates to the sensitive word list in real time, without a restart.
Tagging Interface for Sensitive Words: Helps categorize and manage sensitive words efficiently.
Special Character Handling: Permits skipping specific characters for more flexible matching.
Incremental List Updates: Supports adding or modifying blacklist and whitelist entries without a full list reinitialization.
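
The format conversion and special character features listed above amount to normalizing text before it is matched. Below is a minimal sketch of that idea, assuming a simple per-character mapping; TextNormalizer and its methods are hypothetical names and do not reflect how the library implements these features.

```java
// Hypothetical illustration class; NOT part of the sensitive-word library.
class TextNormalizer {

    // Maps one character to its canonical form: half-width, lower case.
    public static char normalizeChar(char c) {
        if (c == 0x3000) {
            c = ' '; // ideographic (full-width) space
        } else if (c >= 0xFF01 && c <= 0xFF5E) {
            // Full-width ASCII variants sit at a fixed offset from plain ASCII.
            c = (char) (c - 0xFEE0);
        }
        return Character.toLowerCase(c);
    }

    // Normalizes a string, dropping every character listed in `ignored`.
    public static String normalize(String text, String ignored) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (ignored.indexOf(c) >= 0) {
                continue; // skipped character, e.g. '-' or '*' inserted to evade matching
            }
            out.append(normalizeChar(c));
        }
        return out.toString();
    }
}
```

With this sketch, normalize("B-a*D", "-*") yields "bad". Note that building a new string discards the original character positions; a real implementation that reports or replaces the original text would need to track offsets as well.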

Recent Updates

V0.19.0

Added initial support for adding or removing a single word without full re-initialization.
Added empty (no-op) implementations for the allow and deny lists.

V0.20.0

Enabled full-word matching for alphanumeric strings.

V0.21.0

Fixed an issue where a long whitelist entry could mistakenly be flagged because it contained a blacklisted term.
Added support for editing a single whitelist entry.

Quick Start Guide

Requirements

Before getting started, ensure you have JDK 1.8+ and Maven 3.x installed.

Maven Integration

Add the following dependency to your Maven project:

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.21.0</version>
</dependency>

Core Methods and Usage

Key functionalities of the SensitiveWordHelper class include:

Checking for Sensitive Words: Determine if a string contains sensitive content.
Finding and Replacing Words: Locate and sanitize sensitive words using default or custom strategies.

Example: Check if a Text Contains Sensitive Words

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
Assert.assertTrue(SensitiveWordHelper.contains(text));

Example: Retrieve the First Sensitive Word

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("五星红旗", word);

Additional Capabilities

To improve detection rates, the sensitive-word tool also handles a range of evasion patterns and special content, including case differences, alternative number formats, URLs and IP addresses, and repeated characters.
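
As an illustration of the IP detection mentioned above, the sketch below finds IPv4 addresses with a regular expression. It is only one plausible approach under stated assumptions: the Ipv4Finder class and the pattern are hypothetical, and the library's own detection rules may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; NOT part of the sensitive-word library.
class Ipv4Finder {

    // One octet in the range 0-255.
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";

    // Four dot-separated octets that are not part of a longer dotted number.
    private static final Pattern IPV4 = Pattern.compile(
            "(?<![\\d.])" + OCTET + "(?:\\." + OCTET + "){3}(?!\\d|\\.\\d)");

    // Returns every IPv4 address found in the text, in order of appearance.
    public static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = IPV4.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }
}
```

The lookarounds keep the pattern from matching inside longer dotted numbers such as 1.2.3.4.5 while still allowing an address to be followed by a sentence-ending period.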

Exploring these options helps tailor the tool to the content moderation and sensitive word management needs of a particular application.