Code-LMs by VHellendoorn

A guide to using pre-trained large language models of source code

created 3 years ago
1,835 stars

Top 24.1% on sourcepulse

1 Expert Loves This Project
Project Summary

This repository provides access to PolyCoder, a family of large language models specifically trained on source code across 12 programming languages. It offers pre-trained checkpoints and guidance for researchers and developers to leverage these models for code generation and analysis tasks, aiming to advance the understanding and application of LLMs in software engineering.

How It Works

PolyCoder models are based on the GPT-NeoX architecture, trained on a 249GB multi-lingual corpus of source code. The project provides checkpoints hosted on Zenodo and a modified GPT-NeoX toolkit (available via a public fork) for running inference and evaluation. Key modifications include explicit handling of whitespace tokens (tabs and newlines), which are essential to code structure and enable more faithful code generation.
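To make the whitespace point concrete, the minimal sketch below round-trips an indented snippet through the tokenizer. It assumes the checkpoints are mirrored on the Hugging Face Hub under an id such as NinedayWang/PolyCoder-160M (an assumption here, not stated in this summary; substitute whichever checkpoint or local path you actually use).

    # Minimal sketch, not taken from the repository: check that tabs and
    # newlines survive an encode/decode round trip. The hub id is assumed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("NinedayWang/PolyCoder-160M")

    snippet = "def add(a, b):\n\treturn a + b\n"
    ids = tokenizer.encode(snippet)

    # Whitespace is tokenized explicitly, so decoding should reproduce the
    # original indentation and line breaks exactly.
    print(tokenizer.decode(ids) == snippet)  # expected: True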

Quick Start & Requirements

  • Hugging Face Transformers: pip install transformers==4.23.0 (a usage sketch follows this list).
  • Checkpoint Download: Download .tar files from Zenodo (e.g., 2-7B-150K.tar, ~6GB).
  • GPU: Recommended; the 2.7B-parameter model needs roughly 6GB of VRAM. CPU inference is untested.
  • Docker: docker pull vhellendoorn/code-lms-neox:base
  • Code Generation: Use ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/2-7B.yml (adjust config for model size).
  • Resources: Checkpoints up to 6GB; Docker image is 5.4GB.
  • Docs: https://github.com/VHellendoorn/Code-LMs
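As a minimal end-to-end sketch of the Transformers path mentioned above: the hub id NinedayWang/PolyCoder-2.7B and the sampling settings are illustrative assumptions, not prescriptions from the README; swap in the checkpoint you actually downloaded or converted.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed hub id; replace with your own checkpoint or local path.
    model_id = "NinedayWang/PolyCoder-2.7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    # The prompt ends mid-expression so the model continues the function body.
    prompt = "def binarySearch(arr, left, right, x):\n\tmid = (left +"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.2,
            top_p=0.95,
        )

    print(tokenizer.decode(output[0]))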

Highlighted Details

  • Models range from 160M to 2.7B parameters.
  • Trained on 249GB of code from 12 languages.
  • Includes evaluation results on HumanEval and multilingual perplexity benchmarks.
  • Modified GPT-NeoX toolkit for improved whitespace handling.

Maintenance & Community

The project is maintained by Vincent Hellendoorn. The README does not describe further community channels or contribution guidelines.

Licensing & Compatibility

The README does not explicitly state a license for the code or models. The dataset is described as publicly available.

Limitations & Caveats

The models are not explicitly trained for problem-solving benchmarks such as HumanEval and may perform poorly compared to models also trained on natural language. Generation can run past the end of a completed file and start a new one, possibly because end-of-document tokens were missing from the training data. The models are sensitive to whitespace, so input formatting needs careful handling.
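Illustrative only (a plausible precaution following from the whitespace sensitivity noted above, not explicit README guidance): stripping accidental trailing whitespace and ending the prompt at a clean line break can help keep completions well-formed.

    # Hypothetical prompt clean-up before generation.
    raw_prompt = "def fibonacci(n):   "    # accidental trailing spaces
    prompt = raw_prompt.rstrip() + "\n\t"  # end at a newline plus a tab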

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Travis Fischer (founder of Agentic), and 6 more.

codellama by meta-llama

Top 0.1%
16k stars
Inference code for CodeLlama models
created 1 year ago, updated 11 months ago