spark-nlp by JohnSnowLabs

NLP library for scalable ML pipelines

Created 8 years ago

4,095 stars

Top 11.9% on SourcePulse

3 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Luis Capelo

Cofounder of Lightning AI

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

Spark NLP is a library for Natural Language Processing (NLP) and Large Language Models (LLMs) built on Apache Spark. It provides NLP annotators designed to run inside distributed machine learning pipelines, and it targets data scientists and engineers working with large datasets. The library offers over 100,000 pre-trained pipelines and models across 200+ languages, covering a wide range of NLP tasks and modern transformer architectures.

How It Works

Spark NLP uses Apache Spark's distributed computing to run NLP tasks at scale. It exposes a single API for a broad range of functionality, from basic tokenization and parsing to named entity recognition, sentiment analysis, and question answering. Because it integrates natively with Spark ML, users can compose end-to-end NLP pipelines from individual stages and deploy them on distributed clusters. It can also import models from TensorFlow, ONNX, OpenVINO, and Llama.cpp (GGUF), which improves interoperability with other toolchains.
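As a rough illustration of the Spark ML integration described above, the sketch below chains two Spark NLP stages (a DocumentAssembler and a Tokenizer) into a standard pyspark.ml Pipeline. The function names build_pipeline and tokenize, and the column names "text", "document", and "token", are choices made for this example rather than anything prescribed by the project. It assumes spark-nlp and pyspark are installed, and it is a minimal sketch, not a recommended production setup.

```python
def build_pipeline():
    """Assemble a two-stage pipeline: raw text -> document -> tokens.

    Requires an active Spark session (see tokenize below), because the
    stages are backed by JVM objects.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import Tokenizer
    from sparknlp.base import DocumentAssembler

    # Wrap the raw "text" column in Spark NLP's document annotation type.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    # Split each document into token annotations.
    tokens = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    return Pipeline(stages=[document, tokens])


def tokenize(texts):
    """Run the pipeline over a list of strings; return one token list per string."""
    import sparknlp

    spark = sparknlp.start()  # creates (or reuses) a SparkSession
    df = spark.createDataFrame([(t,) for t in texts], ["text"])
    model = build_pipeline().fit(df)
    rows = model.transform(df).select("token.result").collect()
    return [list(row[0]) for row in rows]
```

Calling tokenize(["Spark NLP runs on Apache Spark."]) would start a local session and return the tokens for that sentence. Further annotators can be appended to the stages list in the same way.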

Quick Start & Requirements

Install: pip install spark-nlp==6.0.0 pyspark==3.3.1
Prerequisites: Java 8 or 11 (Oracle or OpenJDK).
Usage: Call sparknlp.start() to create a SparkSession; pass gpu=True for GPU support or apple_silicon=True on Apple Silicon Macs (M1/M2).
Documentation: https://sparknlp.org/
Examples: spark-nlp/examples
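The start-up flags listed above could be selected automatically, for instance as in the sketch below. The helper start_options is hypothetical and exists only for this example; the only Spark NLP details it relies on are the gpu and apple_silicon keyword arguments of sparknlp.start() mentioned above. Preferring apple_silicon over gpu on an arm64 Mac is an assumption of this sketch.

```python
import platform


def start_options(has_gpu=False, system=None, machine=None):
    """Pick keyword arguments for sparknlp.start() (hypothetical helper).

    system and machine default to the current host; they are parameters
    so the logic can be exercised without the matching hardware.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    # Assumption: on Apple Silicon Macs, apple_silicon=True takes priority.
    if system == "Darwin" and machine == "arm64":
        return {"apple_silicon": True}
    if has_gpu:
        return {"gpu": True}
    # No flags: plain CPU session.
    return {}


# Intended use (requires spark-nlp and pyspark to be installed):
#   import sparknlp
#   spark = sparknlp.start(**start_options(has_gpu=True))
```

Keeping the decision in a small pure function means the same script can run unchanged on a laptop and on a GPU machine.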

Highlighted Details

Supports over 100,000 pre-trained pipelines and models in 200+ languages.
Integrates state-of-the-art transformers like BERT, Llama-2, Mistral, and Vision Transformers.
Offers native support for Python, R, and JVM languages (Java, Scala, Kotlin).
Provides seamless integration with TensorFlow, ONNX, OpenVINO, and Llama.cpp (GGUF) models.

Maintenance & Community

The project is actively maintained by John Snow Labs and has a strong community presence on Slack and GitHub for discussions, bug reports, and contributions.

Licensing & Compatibility

Licensed under Apache 2.0, which permits commercial use and use in closed-source applications.

Limitations & Caveats

Support for M1/M2 and AArch64 architectures is experimental and may have limitations, because some dependencies are built by the community. Apache Spark versions earlier than 3.0 are not supported.
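The version constraints on this page (Java 8 or 11, Spark 3.0 or later) could be checked before starting a session, for example with the sketch below. The names java_major and check_environment are hypothetical, the version strings are supplied by the caller, and the accepted versions are taken only from the requirements stated on this page.

```python
import re

SUPPORTED_JAVA = {8, 11}   # from the listed prerequisites
MIN_SPARK_MAJOR = 3        # pre-3.0 Spark is listed as unsupported


def _leading_int(text):
    """Return the integer at the start of text, e.g. '0_292' -> 0."""
    match = re.match(r"\d+", text)
    if match is None:
        raise ValueError(f"no leading number in {text!r}")
    return int(match.group())


def java_major(version):
    """Map a Java version string to its major version.

    Handles the legacy '1.8.0_292' scheme (Java 8) and the modern
    '11.0.19' scheme (Java 11).
    """
    parts = version.split(".")
    first = _leading_int(parts[0])
    if first == 1 and len(parts) > 1:
        return _leading_int(parts[1])
    return first


def check_environment(java_version, spark_version):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if java_major(java_version) not in SUPPORTED_JAVA:
        problems.append(f"Java {java_version} is not Java 8 or 11")
    if _leading_int(spark_version) < MIN_SPARK_MAJOR:
        problems.append(f"Spark {spark_version} is older than 3.0")
    return problems
```

For example, check_environment("1.8.0_292", "3.3.1") reports no problems, while an older Spark or a different Java major version yields one message per violated constraint.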

Health Check

Last Commit

4 days ago

Responsiveness

Inactive

Star History

12 stars in the last 30 days