Llama-2-Open-Source-LLM-CPU-Inference by kennethleungty

CPU inference for document Q&A using open-source LLMs

created 2 years ago
964 stars

Top 39.0% on sourcepulse

Project Summary

This project provides a guide and implementation for running open-source Large Language Models (LLMs), specifically Llama 2, on CPU for document question-answering (Q&A). It targets developers and researchers who want private, self-managed LLM deployments without the cost of GPU hardware, enabling local document analysis.

How It Works

The project combines several technologies for efficient CPU-based LLM inference. It uses a GGML-quantized model, Llama-2-7B-Chat, optimized for CPU execution. LangChain orchestrates the pipeline, with C Transformers providing the Python bindings to the underlying GGML C/C++ library. For document processing, Sentence-Transformers (all-MiniLM-L6-v2) converts document chunks into vector embeddings, which FAISS indexes for efficient similarity search, enabling semantic retrieval for Q&A.
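The retrieval step described above can be illustrated with a runnable toy sketch. The real project uses Sentence-Transformers embeddings and a FAISS index; here a bag-of-words "embedding" and brute-force cosine similarity stand in for them, so only the flow (embed chunks, embed query, rank by similarity) matches the actual stack.

```python
# Toy sketch of semantic retrieval for document Q&A.
# Real stack: Sentence-Transformers (all-MiniLM-L6-v2) for embeddings,
# FAISS for the similarity search; both are replaced with stand-ins here.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "embedding": lower-cased token counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Rank document chunks by similarity to the query
    # (what FAISS does at scale over dense vectors).
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The invoice total for March is 4200 dollars.",
    "Llama 2 is a family of open-source large language models.",
    "FAISS performs fast approximate nearest-neighbor search.",
]
print(retrieve("what is the invoice total", chunks, k=1))
# → ['The invoice total for March is 4200 dollars.']
```

In the actual pipeline, the top-k retrieved chunks are passed to the Llama-2-7B-Chat model as context for answering the query.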

Quick Start & Requirements

  • Install/Run: Place a GGML binary file (e.g., Llama-2-7B-Chat-GGML) into the models/ folder. Run from the project directory: poetry run python main.py "<user query>".
  • Prerequisites: GGML binary file (e.g., from Hugging Face), Poetry for dependency management.
  • Resources: CPU-only inference; no GPU or CUDA requirements are mentioned.
  • Docs: Step-by-step guide available at TowardsDataScience: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8.

Highlighted Details

  • Enables local, private LLM deployment on CPU.
  • Uses GGML-quantized models for CPU optimization.
  • Integrates LangChain, C Transformers, FAISS, and Sentence-Transformers.
  • Focuses on document Q&A use cases.

Maintenance & Community

No specific contributors, sponsorships, or community links (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state the license for the project's code or the included models. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project focuses on a specific quantized model (Llama-2-7B-Chat-GGML) and may not be directly compatible with other LLM architectures or quantization formats without modification. Performance will be heavily dependent on the user's CPU capabilities.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days
