Inference code for Llama 2 models (deprecated)
This repository provides inference code for Meta's Llama models, specifically Llama 2. It's designed for researchers and businesses to load and run pre-trained and fine-tuned language models, ranging from 7B to 70B parameters, enabling experimentation and application development.
How It Works
The project utilizes PyTorch and a model-parallelism approach for efficient inference. It allows loading model weights and tokenizers, with specific scripts for text completion and chat-based interactions. The architecture supports varying model-parallel (MP) values depending on model size (7B=1, 13B=2, 70B=8) and allows customization of sequence length and batch size for hardware optimization.
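For instance, the example scripts are launched with torchrun, with --nproc_per_node set to the model's MP value. A minimal sketch following the command shape in the upstream README (the checkpoint and tokenizer paths are placeholders for wherever the downloaded weights live):

    # 7B base model: MP=1, so a single process/GPU
    torchrun --nproc_per_node 1 example_text_completion.py \
        --ckpt_dir llama-2-7b/ \
        --tokenizer_path tokenizer.model \
        --max_seq_len 128 --max_batch_size 4

    # chat-tuned weights go through the chat script; a 70B model would need --nproc_per_node 8
    torchrun --nproc_per_node 1 example_chat_completion.py \
        --ckpt_dir llama-2-7b-chat/ \
        --tokenizer_path tokenizer.model \
        --max_seq_len 512 --max_batch_size 6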
Quick Start & Requirements
Install with pip install -e . within a conda environment with PyTorch/CUDA. Requirements: wget, md5sum, and PyTorch with CUDA support. Model weights must be downloaded separately from Meta's website after accepting their license.
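End to end, a typical setup looks like the sketch below. The repository URL and the download.sh script name are taken from the upstream project and may change as the repo is archived; the script prompts for the signed URL Meta emails once the license is accepted:

    # inside a conda environment that already has PyTorch with CUDA support
    git clone https://github.com/meta-llama/llama.git
    cd llama
    pip install -e .
    # requires wget and md5sum; paste the signed download URL when prompted
    ./download.sh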
Maintenance & Community
This repository is deprecated in favor of the consolidated Llama Stack. New development and support are directed to llama-models, PurpleLlama, llama-toolchain, llama-agentic-system, and llama-cookbook; issues can be filed on those new repositories.
Licensing & Compatibility
Model weights and code are licensed for both research and commercial use. An Acceptable Use Policy is provided.
Limitations & Caveats
This repository is deprecated. Users are directed to use the new Llama Stack repositories for current development and support. The README notes that testing has not covered all potential use scenarios, and users should consult the Responsible Use Guide.