C++ inference for real-time chatting with RAG
This project provides a pure C++ implementation for running various large language models (LLMs) for real-time chatting, supporting both CPU and GPU inference with advanced quantization techniques. It targets developers and researchers looking for efficient, local LLM deployment with features like RAG and continuous chatting, offering a performant alternative to Python-heavy frameworks.
How It Works
The core of ChatLLM.cpp is built upon ggerganov/ggml, leveraging its C++ tensor library for accelerated, memory-efficient inference. It employs object-oriented programming to manage similarities across different Transformer-based models, enabling support for a wide range of architectures. Key optimizations include int4/int8 quantization, an optimized KV cache, and parallel computing for enhanced performance.
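To make the quantization point concrete, here is a minimal sketch of block-wise int4 quantization with one shared scale per block, which is the basic idea behind ggml-style Q4 formats. The names (BlockQ4, quantize_block), the block size, and the exact encoding are simplified assumptions for illustration, not the on-disk layout used by ggml or chatllm.cpp.

```cpp
// Simplified illustration of block-wise int4 quantization (not ggml's exact Q4 layout).
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;  // weights are quantized in blocks of 32

struct BlockQ4 {
    float scale;                      // per-block dequantization scale
    uint8_t nibbles[kBlockSize / 2];  // two 4-bit weights packed per byte
};

// Quantize one block of 32 floats to 4-bit codes plus a shared scale.
BlockQ4 quantize_block(const float *x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i)
        amax = std::max(amax, std::fabs(x[i]));

    BlockQ4 b{};
    b.scale = amax / 7.0f;  // map [-amax, amax] roughly onto [-7, 7]
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;

    for (int i = 0; i < kBlockSize; i += 2) {
        // offset by 8 so the signed code fits an unsigned nibble in [0, 15]
        int lo = std::clamp(int(std::lround(x[i]     * inv)) + 8, 0, 15);
        int hi = std::clamp(int(std::lround(x[i + 1] * inv)) + 8, 0, 15);
        b.nibbles[i / 2] = uint8_t(lo | (hi << 4));
    }
    return b;
}

// Dequantize back to floats, as a matmul kernel would do on the fly.
void dequantize_block(const BlockQ4 &b, float *out) {
    for (int i = 0; i < kBlockSize; i += 2) {
        out[i]     = (int(b.nibbles[i / 2] & 0x0F) - 8) * b.scale;
        out[i + 1] = (int(b.nibbles[i / 2] >> 4)   - 8) * b.scale;
    }
}
```

The trade-off this illustrates: each weight shrinks from 32 bits to roughly 4 bits plus a small per-block overhead, at the cost of bounded rounding error within each block.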
Quick Start & Requirements
- Clone: git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp, followed by git submodule update --init --recursive if submodules were not fetched.
- Requirements: Python dependencies (pip install -r requirements.txt) and CMake for building.
- Convert: use convert.py to transform Hugging Face models to the project's GGML format.
- Build: cmake -B build && cmake --build build -j --config Release
- Run: ./build/bin/main -m model.bin (interactive mode: ./build/bin/main -m model.bin -i).

Highlighted Details
Maintenance & Community
This is a hobby project under active development. While feature PRs are not accepted, bug fix PRs are welcome.
Licensing & Compatibility
The project's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The generated .bin file format is different from the GGUF format used by llama.cpp, so models converted for one runtime cannot be loaded by the other (a quick way to tell the two apart is sketched below). The project is a hobbyist endeavor, and PRs for new features are not accepted, which may limit its future development direction.
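If you are unsure which runtime a model file targets, a magic-byte check is enough to distinguish the formats. The sketch below only tests for the "GGUF" magic defined by the GGUF specification; it makes no assumption about the layout of chatllm.cpp's own .bin header and is illustrative rather than part of either project.

```cpp
// Minimal check for the GGUF magic bytes at the start of a model file.
#include <cstdio>
#include <cstring>

bool is_gguf_file(const char *path) {
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return false;
    char magic[4] = {};
    size_t n = std::fread(magic, 1, sizeof(magic), f);
    std::fclose(f);
    return n == sizeof(magic) && std::memcmp(magic, "GGUF", 4) == 0;
}

int main(int argc, char **argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
    std::printf("%s: %s\n", argv[1],
                is_gguf_file(argv[1]) ? "GGUF (llama.cpp)" : "not GGUF (e.g. a chatllm.cpp .bin)");
    return 0;
}
```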