PhoGPT by VinAIResearch

Vietnamese generative model (research paper & models)

created 1 year ago
792 stars

Top 45.2% on sourcepulse

Project Summary

PhoGPT is a series of state-of-the-art generative language models trained specifically for Vietnamese. It comprises a 3.7B-parameter base model (PhoGPT-4B) and a chat-tuned variant (PhoGPT-4B-Chat), both with an 8192-token context length. The models are aimed at researchers and developers working on Vietnamese NLP, providing a strong foundation for applications that require Vietnamese text generation and understanding.

How It Works

PhoGPT-4B was pre-trained from scratch on a 102B-token Vietnamese corpus. PhoGPT-4B-Chat was then fine-tuned on 70K instructional prompt-and-response pairs, augmented with 290K conversational turns. The large context window and Vietnamese-specific vocabulary are intended to capture the nuances of the language, targeting stronger generative performance than existing open-source Vietnamese models.
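To make the instruction-tuning setup concrete, here is a minimal sketch of turning one prompt-and-response pair into a single training string. The "### Câu hỏi:" / "### Trả lời:" ("Question" / "Answer") markers are an assumed instruction template, not confirmed by this summary; check the official model card for the exact format.

    # Minimal sketch: format one instruction/response pair as a training string.
    # The "### Câu hỏi:" / "### Trả lời:" markers are an assumption about the
    # chat template; confirm against the official model card.
    def format_example(instruction: str, response: str) -> str:
        return f"### Câu hỏi: {instruction}\n### Trả lời: {response}"

    print(format_example(
        "Thủ đô của Việt Nam là gì?",       # "What is the capital of Vietnam?"
        "Thủ đô của Việt Nam là Hà Nội.",   # "The capital of Vietnam is Hanoi."
    ))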

Quick Start & Requirements

  • Inference: Supports vLLM, Text Generation Inference, and llama.cpp; llama.cpp requires compiling the tool and converting the model to GGUF format.
  • Transformers: Can be run directly with Hugging Face Transformers in torch.bfloat16 or torch.float16 (see the loading sketch after this list).
  • Dependencies: Python, PyTorch, and Transformers; CUDA for GPU acceleration; bitsandbytes for optional quantization.
  • Resources: Loading in float16 requires ~7GB of GPU memory.
  • Docs: Technical report available at arXiv:2311.02945.
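
As referenced in the Transformers bullet above, a minimal loading-and-generation sketch. The vinai/PhoGPT-4B-Chat model ID, the trust_remote_code flag, and the prompt template are assumptions based on common Hugging Face conventions; verify all three against the official model card.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "vinai/PhoGPT-4B-Chat"  # assumed Hugging Face model ID

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # or torch.float16 on GPUs without bf16
        device_map="auto",            # requires the accelerate package
        trust_remote_code=True,
    )
    model.eval()

    # Assumed instruction template; confirm the exact format in the model card.
    prompt = "### Câu hỏi: Thủ đô của Việt Nam là gì?\n### Trả lời:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128)

    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))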

Highlighted Details

  • 3.7B parameter model trained on 102B Vietnamese tokens.
  • 8192-token context length for extended text processing.
  • Chat variant fine-tuned on instruction-following and conversational data.
  • Supports multiple inference engines and quantization methods (see the 4-bit loading sketch after this list).
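
For the quantization point in the last bullet: a hedged 4-bit loading sketch using bitsandbytes, which trades some output quality for a memory footprint well below the ~7GB float16 figure. The model ID is again an assumption.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "vinai/PhoGPT-4B-Chat"  # assumed Hugging Face model ID

    # 4-bit NF4 quantization via bitsandbytes (pip install bitsandbytes).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for matmuls
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",   # requires the accelerate package
        trust_remote_code=True,
    )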

Maintenance & Community

Developed by VinAI Research. Further details on fine-tuning can be found in the llm-foundry documentation.

Licensing & Compatibility

The README does not explicitly state a license. Do not assume a permissive default: confirm the license terms in the repository and on the Hugging Face model cards, particularly before any commercial use.

Limitations & Caveats

The model is noted to perform poorly on reasoning, coding, and mathematics tasks. It may generate harmful, biased, or factually incorrect content, requiring cautious use and output validation.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days
