spark-nlp - Multilingual NLP Library with Advanced Annotation Capabilities

Spark NLP: State-of-the-Art Natural Language Processing & LLMs Library

Spark NLP is an advanced Natural Language Processing (NLP) library built on top of Apache Spark. It is designed to offer simple, high-performance, and accurate NLP annotations in machine learning pipelines, which can efficiently scale in distributed environments.

What Does Spark NLP Offer?

Spark NLP stands out with over 83,000 pretrained pipelines and models available in more than 200 languages. It caters to numerous NLP tasks such as:

Tokenization and Word Segmentation
Part-of-Speech Tagging
Word and Sentence Embeddings
Named Entity Recognition
Dependency Parsing
Spell Checking, Text Classification, and Sentiment Analysis
Token Classification
Machine Translation across 180 languages
Summarization, Question Answering, and Text Generation
Image Classification and Captioning
Automatic Speech Recognition
Zero-Shot Learning

State-of-the-Art Transformers

Spark NLP is the only open-source NLP library in production that offers cutting-edge transformers such as BERT, CamemBERT, and RoBERTa, among many others. Because it integrates with Apache Spark, these transformers are available not only in Python and R but also across the JVM ecosystem (Java, Scala, and Kotlin).

Model Importing Support

Spark NLP makes it easy to import models from popular machine learning frameworks, including:

TensorFlow
ONNX
OpenVINO
Llama.cpp (GGUF)

This flexibility allows users to incorporate models from diverse sources into their NLP workflows effectively.

Quick Start Guide

To get started with Spark NLP using Python and PySpark, first make sure Java 8 or 11 is installed. Then create a Python environment (for example with Conda) and install the spark-nlp and pyspark packages. The following snippet loads a pretrained pipeline and annotates a short text:

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark NLP Session
spark = sparknlp.start()

# Load a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th-century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""

# Annotate your dataset
result = pipeline.annotate(text)

# View results
print(result['entities'])
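
The dictionary returned by annotate() holds more than the entities. As a minimal sketch, the hypothetical helper below pairs each token with its part-of-speech and NER tag. It assumes the result contains 'token', 'pos', and 'ner' lists that are aligned one-to-one, which is an assumption about this pipeline's output rather than something stated here. The sample dictionary is hand-written for illustration and is not real pipeline output.

```python
def tokens_with_tags(result):
    """Pair each token with its part-of-speech and NER tag.

    Assumes `result` is the dict returned by pipeline.annotate(text) for a
    single string, with 'token', 'pos', and 'ner' lists of equal length
    (an assumption about the pipeline's output, not verified here).
    """
    tokens = result.get("token", [])
    pos_tags = result.get("pos", [])
    ner_tags = result.get("ner", [])
    if not (len(tokens) == len(pos_tags) == len(ner_tags)):
        raise ValueError("token, pos, and ner lists differ in length")
    return list(zip(tokens, pos_tags, ner_tags))


# Hand-written stand-in for an annotate() result, for illustration only.
sample_result = {
    "token": ["Leonardo", "painted", "it"],
    "pos": ["NNP", "VBD", "PRP"],
    "ner": ["B-PER", "O", "O"],
}

# Print one aligned row per token.
for token, pos, ner in tokens_with_tags(sample_result):
    print(f"{token:<10} {pos:<5} {ner}")
```

With a live session, the same helper could be called as tokens_with_tags(result) on the result from the snippet above, provided those three keys are present.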

Compatibility and Support

Spark NLP version 5.5.1 is compatible with Apache Spark versions 3.0 to 3.5 and runs on platforms including Databricks and Amazon EMR. It also supports several Python and Scala versions.
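
The version requirements stated in this document (Java 8 or 11, Apache Spark 3.0 to 3.5) can be turned into a small preflight check. The sketch below is illustrative only: the function and constant names are hypothetical and not part of Spark NLP, and the ranges are copied from the text above for version 5.5.1.

```python
import re

# Values taken from this document; adjust them for other Spark NLP releases.
SUPPORTED_JAVA_MAJORS = {8, 11}
SPARK_MIN, SPARK_MAX = (3, 0), (3, 5)


def java_major(version):
    """Return the major version from strings such as '1.8.0_292' or '11.0.19'."""
    parts = re.findall(r"\d+", version)
    if not parts:
        raise ValueError(f"no version number found in {version!r}")
    first = int(parts[0])
    # Java 8 and earlier report themselves as 1.x, so the major is the second field.
    if first == 1 and len(parts) > 1:
        return int(parts[1])
    return first


def spark_supported(version):
    """Return True if the major.minor of `version` lies within SPARK_MIN..SPARK_MAX."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Spark version {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return SPARK_MIN <= major_minor <= SPARK_MAX


# Example inputs; in a real environment the Spark string could come from
# pyspark.__version__ and the Java string from the output of `java -version`.
print(java_major("1.8.0_292") in SUPPORTED_JAVA_MAJORS)
print(spark_supported("3.5.1"))
```

Comparing (major, minor) tuples rather than raw strings avoids mistakes such as treating "3.10" as lower than "3.5".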

Installation Options

Spark NLP can be loaded from the command line or added as a dependency to Scala or Python projects. Detailed instructions are provided for various platforms.

Community and Contributions

Spark NLP relies on community support and welcomes contributions, whether ideas, documentation, or bug reports. Users can engage through GitHub, Slack, and other platforms for discussions and updates.

Final Note

Spark NLP brings state-of-the-art NLP capabilities into machine learning pipelines and can also be used offline. Visit the Spark NLP website for a full list of features, models, and support documentation.