NLP library for scalable ML pipelines
Spark NLP is a library for state-of-the-art Natural Language Processing (NLP) and Large Language Models (LLMs), built on Apache Spark. It provides scalable, performant, and accurate NLP annotations for distributed machine learning pipelines, and targets data scientists and engineers working with large datasets. The library includes more than 100,000 pre-trained pipelines and models across 200+ languages, covering a wide range of NLP tasks and modern transformer architectures.
How It Works
Spark NLP uses Apache Spark's distributed computing capabilities to process NLP tasks at scale. It provides a unified API for NLP functionality ranging from basic tokenization and parsing to advanced tasks such as named entity recognition, sentiment analysis, and question answering. The library's strength lies in its native integration with Spark ML, which lets users build end-to-end NLP pipelines and deploy them on distributed clusters. It also supports importing models from TensorFlow, ONNX, OpenVINO, and llama.cpp (GGUF), which improves interoperability.
Quick Start & Requirements
Install: pip install spark-nlp==6.0.0 pyspark==3.3.1
Use sparknlp.start(gpu=True) for GPU support or sparknlp.start(apple_silicon=True) for macOS M1/M2.
Highlighted Details
Maintenance & Community
The project is actively maintained by John Snow Labs and has a strong community presence on Slack and GitHub for discussions, bug reports, and contributions.
Licensing & Compatibility
Apache 2.0 License. Compatible with commercial use and closed-source applications.
Limitations & Caveats
Support for M1/M2 and AArch64 architectures is experimental and may have limitations, because the required dependencies are built by the community. Apache Spark versions older than 3.0 are not supported.