Code-LMs by VHellendoorn

A guide to using pre-trained large language models of source code

created 3 years ago
1,835 stars

Top 24.1% on sourcepulse

1 Expert Loves This Project
Project Summary

This repository provides access to PolyCoder, a family of large language models specifically trained on source code across 12 programming languages. It offers pre-trained checkpoints and guidance for researchers and developers to leverage these models for code generation and analysis tasks, aiming to advance the understanding and application of LLMs in software engineering.

How It Works

PolyCoder models are based on the GPT-NeoX architecture, trained on a 249GB multi-lingual corpus of source code. The project provides checkpoints hosted on Zenodo and a modified GPT-NeoX toolkit (available via a public fork) for running inference and evaluation. Key modifications include explicit handling of whitespace tokens (tabs and newlines), which are essential to code structure and enable more faithful code generation.
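To make the whitespace point concrete, the minimal sketch below round-trips an indented snippet through the tokenizer. It assumes the checkpoints are mirrored on the Hugging Face Hub under an id such as NinedayWang/PolyCoder-160M (an assumption here, not stated in this summary; substitute whichever checkpoint or local path you actually use).

    # Minimal sketch, not taken from the repository: check that tabs and
    # newlines survive an encode/decode round trip. The hub id is assumed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("NinedayWang/PolyCoder-160M")

    snippet = "def add(a, b):\n\treturn a + b\n"
    ids = tokenizer.encode(snippet)

    # Whitespace is tokenized explicitly, so decoding should reproduce the
    # original indentation and line breaks exactly.
    print(tokenizer.decode(ids) == snippet)  # expected: True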

Quick Start & Requirements

  • Hugging Face Transformers: pip install transformers==4.23.0 (a usage sketch follows this list).
  • Checkpoint Download: Download .tar files from Zenodo (e.g., 2-7B-150K.tar, ~6GB).
  • GPU: Recommended; the 2.7B-parameter model needs roughly 6GB of VRAM. CPU inference is untested.
  • Docker: docker pull vhellendoorn/code-lms-neox:base
  • Code Generation: Use ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/2-7B.yml (adjust config for model size).
  • Resources: Checkpoints up to 6GB; Docker image is 5.4GB.
  • Docs: https://github.com/VHellendoorn/Code-LMs
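As a minimal end-to-end sketch of the Transformers path mentioned above: the hub id NinedayWang/PolyCoder-2.7B and the sampling settings are illustrative assumptions, not prescriptions from the README; swap in the checkpoint you actually downloaded or converted.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed hub id; replace with your own checkpoint or local path.
    model_id = "NinedayWang/PolyCoder-2.7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    # The prompt ends mid-expression so the model continues the function body.
    prompt = "def binarySearch(arr, left, right, x):\n\tmid = (left +"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.2,
            top_p=0.95,
        )

    print(tokenizer.decode(output[0]))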

Highlighted Details

  • Models range from 160M to 2.7B parameters.
  • Trained on 249GB of code from 12 languages.
  • Includes evaluation results on HumanEval and multilingual perplexity benchmarks.
  • Modified GPT-NeoX toolkit for improved whitespace handling.

Maintenance & Community

The project is maintained by Vincent Hellendoorn. The README does not describe further community channels or contribution guidelines.

Licensing & Compatibility

The README does not explicitly state a license for the code or models. The dataset is described as publicly available.

Limitations & Caveats

The models are not explicitly trained for problem-solving benchmarks such as HumanEval and may perform poorly compared to models also trained on natural language. Generation can run past the end of a completed file and start a new one, possibly because end-of-document tokens were missing from the training data. The models are sensitive to whitespace, so input formatting needs careful handling.
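Illustrative only (a plausible precaution following from the whitespace sensitivity noted above, not explicit README guidance): stripping accidental trailing whitespace and ending the prompt at a clean line break can help keep completions well-formed.

    # Hypothetical prompt clean-up before generation.
    raw_prompt = "def fibonacci(n):   "    # accidental trailing spaces
    prompt = raw_prompt.rstrip() + "\n\t"  # end at a newline plus a tab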

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Travis Fischer (founder of Agentic), and 6 more.

codellama by meta-llama

Top 0.1%
16k stars
Inference code for CodeLlama models
created 1 year ago, updated 11 months ago