OpenAI-compatible API for ChatGLM models and embeddings
This project provides an OpenAI-compatible API endpoint for the ChatGLM-6B/ChatGLM2-6B large language models and Chinese embedding models. It targets developers and researchers who want to integrate these powerful Chinese NLP models into applications that already use the OpenAI API structure, offering a seamless transition and local deployment option.
How It Works
The project wraps ChatGLM models and Chinese embedding models in a FastAPI web server, exposing endpoints that mimic the OpenAI API. Models are downloaded from Hugging Face, and users can select quantized variants (e.g., `chatglm-6b-int4`) for reduced VRAM usage. The architecture supports multi-GPU inference and integrates with tunneling services such as ngrok or Cloudflare Tunnel for external accessibility.
Quick Start & Requirements
Copy `config.toml.example` to `config.toml` and install dependencies via `pip install -r requirements.txt`. Launch the server with `python main.py`, passing arguments such as `--llm_model` (e.g., `chatglm2-6b-int4`) and `--tunnel` (e.g., `cloudflared`). Multi-GPU support is available via `CUDA_VISIBLE_DEVICES` and the `--gpus` argument; a combined launch sketch follows.
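Putting those steps together, a minimal session might look like the following. The flag names come from the README summary above; the `--gpus` value semantics and the specific combinations shown are assumptions.

```bash
# Prepare configuration and install dependencies.
cp config.toml.example config.toml
pip install -r requirements.txt

# Start the server with a quantized model and expose it via Cloudflare Tunnel.
python main.py --llm_model chatglm2-6b-int4 --tunnel cloudflared

# Multi-GPU launch: restrict visible devices and pass --gpus.
# (Interpreting --gpus as a GPU count is an assumption.)
CUDA_VISIBLE_DEVICES=0,1 python main.py --llm_model chatglm2-6b --gpus 2
```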
Highlighted Details
- Quantization support (`int4`, `int8`) for lower VRAM consumption.
Maintenance & Community
The project is actively maintained by ninehills. Community interaction channels are not explicitly mentioned in the README.
Licensing & Compatibility
The repository's license is not specified in the README. Compatibility for commercial use or closed-source linking would depend on the underlying ChatGLM model licenses and any specific terms set by the project maintainers.
Limitations & Caveats
The project requires a stable internet connection to download models from Hugging Face. The ngrok tunnel is limited for non-paying users, and multi-GPU support for LLM inference appears to be experimental.