Code-LMs: a guide to using pre-trained large language models of source code
This repository provides access to PolyCoder, a family of large language models specifically trained on source code across 12 programming languages. It offers pre-trained checkpoints and guidance for researchers and developers to leverage these models for code generation and analysis tasks, aiming to advance the understanding and application of LLMs in software engineering.
How It Works
PolyCoder models are based on the GPT-NeoX architecture and were trained on a 249GB multilingual corpus of source code. The project provides checkpoints hosted on Zenodo and a modified GPT-NeoX toolkit (available via a public fork) for running inference and evaluation. Key modifications include handling of whitespace tokens (tabs, newlines), which are essential to code structure and enable more accurate code generation.
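With a recent transformers release (4.23.0 or newer, as noted in the Quick Start below), the converted checkpoints can be loaded like any causal language model. The following is a minimal sketch, assuming the 2.7B checkpoint is published on the Hugging Face Hub as NinedayWang/PolyCoder-2.7B; verify the exact identifier against the upstream README.

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Hub identifier assumed here; check the upstream README for the exact name.
    model_id = "NinedayWang/PolyCoder-2.7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Note the literal newline and indentation: PolyCoder tokenizes whitespace explicitly.
    prompt = "def binarySearch(arr, left, right, x):\n    mid = (left +"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))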
Quick Start & Requirements
- Install transformers: pip install transformers==4.23.0
- Download checkpoint .tar files from Zenodo (e.g., 2-7B-150K.tar, ~6GB).
- Pull the modified GPT-NeoX image: docker pull vhellendoorn/code-lms-neox:base
- Run generation inside the container: ./deepy.py generate.py configs/text_generation.yml checkpoints/configs/local_setup.yml checkpoints/configs/2-7B.yml (adjust the config for the model size).
Highlighted Details
Maintenance & Community
The project is associated with Vincent Hellendoorn. The repository was last updated about a year ago and is listed as inactive; further community interaction details are not provided in the README.
Licensing & Compatibility
The README does not explicitly state a license for the code or models. The dataset is described as publicly available.
Limitations & Caveats
The models are not explicitly trained for problem-solving benchmarks such as HumanEval and may perform poorly compared to models also trained on natural language. Because end-of-document tokens may be missing from training, a model can run past the end of the intended output and begin generating a new, unrelated file. The models are also whitespace-sensitive, so prompts should preserve tabs and newlines exactly as they would appear in source files.
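The last two caveats can be worked around in post-processing. Below is a minimal sketch reusing the tokenizer and model loaded in the earlier example; the stopping heuristic (cutting at the next top-level definition) is illustrative rather than part of the project.

    # Keep tabs and newlines exactly as they should appear in the completed file.
    prompt = "def fibonacci(n):\n\tif n < 2:\n\t\treturn n\n\t"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

    # Decode only the newly generated tokens, then cut before any spill-over into a
    # "new file", since an end-of-document token may never be emitted.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    continuation = tokenizer.decode(new_tokens, skip_special_tokens=True)
    cut = continuation.find("\ndef ")
    print(prompt + (continuation[:cut] if cut != -1 else continuation))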