open_llama by openlm-research

Open-source reproduction of LLaMA models

Created 2 years ago
7,529 stars

Top 6.9% on SourcePulse

View on GitHub
Project Summary

OpenLLaMA provides open-source reproductions of Meta AI's LLaMA models (3B, 7B, and 13B parameters), trained on 1T tokens using permissively licensed datasets. It targets researchers and developers seeking LLaMA-compatible models without restrictive licensing, offering PyTorch and JAX weights for broad integration.

How It Works

OpenLLaMA replicates the LLaMA architecture and training methodology, including hyperparameters and context length. The v1 models are trained on the RedPajama dataset, while the v2 models mix Falcon, StarCoder, and parts of RedPajama. This keeps the models compatible with existing LLaMA implementations while relying only on openly available data. Training was performed on TPU-v4s with the JAX-based EasyLM framework, combining data parallelism with fully sharded data parallelism (ZeRO stage 3) for efficiency.
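Because the architecture and hyperparameters mirror LLaMA, OpenLLaMA checkpoints load through the standard LLaMA classes in Transformers. A minimal sketch, assuming the openlm-research/open_llama_3b Hub id, that inspects the shared config:

    # Illustrative only: the Hub id is assumed from the project's organization name.
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("openlm-research/open_llama_3b")
    print(config.model_type)              # expected: "llama"
    print(config.hidden_size,             # architecture hyperparameters mirror LLaMA
          config.num_hidden_layers,
          config.num_attention_heads,
          config.max_position_embeddings) # context length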

Quick Start & Requirements

  • Install/Run: Load the checkpoints through Hugging Face Transformers (the transformers library); see the sketch after this list.
  • Prerequisites: PyTorch and transformers. For v2 models, avoid the auto-converted fast tokenizer; use LlamaTokenizer or AutoTokenizer.from_pretrained(..., use_fast=False).
  • Resources: Enough VRAM for the chosen model size (e.g., the 7B model loaded with torch_dtype=torch.float16).
  • Docs: Hugging Face Transformers LLaMA documentation
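A minimal loading sketch, assuming the openlm-research/open_llama_7b Hub id (check the repository for exact checkpoint names) and the accelerate package for device placement:

    import torch
    from transformers import LlamaTokenizer, LlamaForCausalLM

    model_id = "openlm-research/open_llama_7b"  # assumed Hub id; see the repo README

    # The plain LlamaTokenizer avoids the fast-tokenizer pitfall noted above for v2 models.
    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # ~14 GB of VRAM for the 7B weights
        device_map="auto",          # requires the accelerate package
    )

    prompt = "Q: What is the largest animal?\nA:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))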

Highlighted Details

  • Permissively licensed under Apache 2.0.
  • v2 models offer improved performance and dataset mixture.
  • The v1 tokenizer merges multiple spaces, which can hurt code generation tasks.
  • Performance comparable to the original LLaMA and GPT-J across a range of benchmarks.

Maintenance & Community

Developed by Xinyang Geng and Hao Liu from Berkeley AI Research. Feedback is welcomed via GitHub issues.

Licensing & Compatibility

Apache 2.0 license. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The v1 tokenizer's handling of whitespace (merging consecutive spaces) may cause issues with code generation tasks. The project also reports suspected benchmark data contamination for its models on two tasks (CB and WSC).
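A quick round-trip check (illustrative; the Hub id is assumed) shows whether a given checkpoint's tokenizer preserves consecutive spaces before relying on it for code:

    from transformers import AutoTokenizer

    model_id = "openlm-research/open_llama_7b"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

    text = "def f():\n    return 1"  # indentation relies on consecutive spaces
    round_trip = tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)
    print(repr(text))
    print(repr(round_trip))  # merged indentation here reflects the v1 whitespace caveat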

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

17 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack

0.4%
265
Efficiently train foundation models with PyTorch
Created 1 year ago
Updated 1 month ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Travis Fischer (Founder of Agentic), and 6 more.

picotron by huggingface

4.8%
2k
Minimalist distributed training framework for educational use
Created 1 year ago
Updated 3 weeks ago
Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

EasyLM by young-geng

0.0%
2k
LLM training/finetuning framework in JAX/Flax
Created 2 years ago
Updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), and 20 more.

TinyLlama by jzhang38

0.1%
9k
Tiny pretraining project for a 1.1B Llama model
Created 2 years ago
Updated 1 year ago