spark-nlp by JohnSnowLabs

NLP library for scalable ML pipelines

created 7 years ago
4,020 stars

Top 12.4% on sourcepulse

Project Summary

Spark NLP is a library for state-of-the-art Natural Language Processing (NLP) and Large Language Models (LLMs) built on Apache Spark. It offers scalable, performant, and accurate NLP annotations for distributed machine learning pipelines, and targets data scientists and engineers working with large datasets. The library provides over 100,000 pre-trained pipelines and models across 200+ languages and supports a wide range of NLP tasks and modern transformer architectures.

How It Works

Spark NLP leverages Apache Spark's distributed computing capabilities to process NLP tasks at scale. It provides a unified API for a vast range of NLP functionalities, from basic tokenization and parsing to advanced tasks like named entity recognition, sentiment analysis, and question answering. The library's strength lies in its native integration with Spark ML, allowing users to build complex, end-to-end NLP pipelines that can be seamlessly deployed on distributed clusters. It also supports importing models from TensorFlow, ONNX, OpenVINO, and Llama.cpp (GGUF), enhancing interoperability.
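As a rough illustration of the Spark ML integration described above, the following minimal sketch chains two Spark NLP stages (DocumentAssembler and Tokenizer) into a standard pyspark.ml Pipeline. The function names build_token_pipeline and tokenize are hypothetical helpers introduced only for this example, and running tokenize assumes Java and the packages from the Quick Start section are installed.

```python
def build_token_pipeline():
    """Build a two-stage Spark ML pipeline: raw text -> document -> tokens."""
    # Imported lazily so this module loads even where Spark is not installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer

    # Wrap the raw "text" column in Spark NLP's document annotation.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    # Split each document into token annotations.
    tokens = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    return Pipeline(stages=[document, tokens])


def tokenize(texts):
    """Start a local session and return the tokens for each input string."""
    import sparknlp

    # A session must exist before the annotators are constructed.
    spark = sparknlp.start()
    df = spark.createDataFrame([(t,) for t in texts], ["text"])
    model = build_token_pipeline().fit(df)
    rows = model.transform(df).select("token.result").collect()
    return [row["result"] for row in rows]
```

Because the result is an ordinary Spark ML Pipeline, further stages (for example a pre-trained annotator) could in principle be appended to the stages list in the same way.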

Quick Start & Requirements

  • Install: pip install spark-nlp==6.0.0 pyspark==3.3.1
  • Prerequisites: Java 8 or 11 (Oracle or OpenJDK).
  • Usage: Start a SparkSession with sparknlp.start(). Pass gpu=True for GPU support or apple_silicon=True on Apple Silicon (M1/M2) Macs.
  • Documentation: https://sparknlp.org/
  • Examples: spark-nlp/examples
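
The Java prerequisite is easy to overlook, so a small pre-flight check can help before starting a session. The sketch below is not part of Spark NLP; parse_java_major, java_is_supported, and SUPPORTED_JAVA are hypothetical names, and the parsing assumes the usual `version "..."` line printed by `java -version`.

```python
import re
import subprocess

SUPPORTED_JAVA = {8, 11}  # from the prerequisites listed above


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_381").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_supported():
    """Return True if the `java` on PATH is one of the listed versions."""
    try:
        # `java -version` prints its banner to stderr, not stdout.
        proc = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return False
    return parse_java_major(proc.stderr) in SUPPORTED_JAVA
```

Calling java_is_supported() before sparknlp.start() gives an early, readable failure instead of a JVM error later on.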

Highlighted Details

  • Supports over 100,000 pre-trained pipelines and models in 200+ languages.
  • Integrates state-of-the-art transformers like BERT, Llama-2, Mistral, and Vision Transformers.
  • Offers native support for Python, R, and JVM languages (Java, Scala, Kotlin).
  • Provides seamless integration with TensorFlow, ONNX, OpenVINO, and Llama.cpp (GGUF) models.

Maintenance & Community

The project is actively maintained by John Snow Labs and has a strong community presence on Slack and GitHub for discussions, bug reports, and contributions.

Licensing & Compatibility

Apache 2.0 License. Compatible with commercial use and closed-source applications.

Limitations & Caveats

Support for M1/M2 and AArch64 architectures is experimental and may have limitations because the required dependencies are built by the community. Apache Spark versions earlier than 3.0 are not supported.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 21
  • Issues (30d): 3
  • Star History: 80 stars in the last 90 days
