CPU inference for document Q&A using open-source LLMs
This project provides a guide and implementation for running open-source Large Language Models (LLMs), specifically Llama 2, on CPU for document question answering (Q&A). It targets developers and researchers who want private, self-managed LLM deployments for local document analysis without the cost of GPU hardware.
How It Works
The project combines several technologies for efficient CPU-based LLM inference. It uses a GGML-quantized model, Llama-2-7B-Chat, optimized for CPU execution. LangChain serves as the orchestration framework and integrates C Transformers, which provides Python bindings to the GGML C/C++ library. For document processing, Sentence-Transformers (all-MiniLM-L6-v2) creates vector embeddings of the documents and FAISS performs efficient similarity search, enabling semantic retrieval for Q&A.
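As a rough illustration of the document-processing side of this stack, the sketch below embeds local documents with all-MiniLM-L6-v2 and stores them in a FAISS index using LangChain. The data/ and vectorstore/ paths, the PDF loader, and the chunking parameters are illustrative assumptions, not details taken from the project.

```python
# Illustrative ingestion sketch (paths, loader, and parameters are assumptions).
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load PDFs from a hypothetical data/ folder.
loader = DirectoryLoader("data/", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split documents into chunks suitable for retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Embed chunks with Sentence-Transformers and index them with FAISS.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("vectorstore/db_faiss")
```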
Quick Start & Requirements
Place the quantized Llama-2-7B-Chat GGML model in the models/ folder, then run from the project directory: poetry run python main.py "<user query>".
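For orientation, here is a minimal sketch of what such a query run might look like internally, assuming a GGML model file in models/ and a previously built FAISS index under vectorstore/db_faiss; the model filename, paths, and generation settings are assumptions, not the project's actual code.

```python
# Illustrative query sketch (model filename, paths, and config are assumptions).
import sys

from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import CTransformers
from langchain.vectorstores import FAISS

query = sys.argv[1]  # e.g. poetry run python main.py "<user query>"

# Load the CPU-friendly GGML model through the C Transformers bindings.
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # assumed filename
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.01},
)

# Reload the FAISS index built from the documents.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.load_local("vectorstore/db_faiss", embeddings)

# Retrieval-augmented Q&A over the indexed documents.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)
result = qa({"query": query})
print(result["result"])
```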
Highlighted Details
Maintenance & Community
No specific contributors, sponsorships, or community links (Discord/Slack) are mentioned in the README.
Licensing & Compatibility
The README does not explicitly state the license for the project's code or the included models. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project focuses on a specific quantized model (Llama-2-7B-Chat-GGML) and may not be directly compatible with other LLM architectures or quantization formats without modification. Performance will be heavily dependent on the user's CPU capabilities.