Evaluation harness for LLMs on Verilog code generation and spec-to-RTL tasks
This repository provides an evaluation harness for benchmarking Large Language Models (LLMs) on Verilog hardware description language (HDL) code generation tasks. It targets researchers and engineers evaluating LLMs for hardware design automation, offering improved prompts, support for specification-to-RTL tasks, and detailed error analysis.
How It Works
The harness uses a Makefile to orchestrate the evaluation workflow and supports two primary tasks: `code-complete-iccad2023` and `spec-to-rtl`. Datasets are managed as plain text files, and LLM parameters such as model choice, number of in-context learning examples (0-4 shots), number of samples, temperature, and top-p can all be configured. The evaluation process generates Verilog code from LLM prompts and then verifies its correctness using `iverilog` and `verilator`.
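As an illustration, a run might be launched as shown below; the variable names (`TASK`, `MODEL`, `SHOTS`, `SAMPLES`, `TEMPERATURE`, `TOP_P`) are assumptions for this sketch rather than the harness's documented interface, so consult the repository's Makefile for the actual targets and variables.

```sh
# Hypothetical invocation; variable names are assumptions, not the documented interface.
make TASK=spec-to-rtl \
     MODEL=gpt-4 \
     SHOTS=1 \
     SAMPLES=20 \
     TEMPERATURE=0.8 \
     TOP_P=0.95
```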
Quick Start & Requirements
- `make`
- `iverilog` (v12; v13 is not supported)
- `verilator`
- `python3` (v3.11.0 recommended, e.g., via `conda create -n codex python=3.11`)
- Python packages: `langchain`, `langchain-openai`, `langchain-nvidia-ai-endpoints`
- Build `iverilog` from source (v12 branch); a setup sketch follows this list.
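The following is a minimal setup sketch under stated assumptions: the conda command and package names come from the requirements above, while the `iverilog` repository URL and branch name are assumptions that should be checked against the upstream iverilog documentation.

```sh
# Python environment and LLM client libraries (from the requirements list).
conda create -n codex python=3.11
conda activate codex
pip install langchain langchain-openai langchain-nvidia-ai-endpoints

# Build iverilog v12 from source (repository URL and branch name are assumptions).
git clone https://github.com/steveicarus/iverilog.git
cd iverilog
git checkout v12-branch
sh autoconf.sh
./configure
make
sudo make install
```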
Highlighted Details
The harness provides detailed error analysis, including reporting of `iverilog` compilation errors.
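To illustrate the kind of check underlying that analysis, here is a minimal sketch that compiles a generated sample with `iverilog` and summarizes compiler messages; the file names and the error pattern are assumptions for illustration, not the harness's actual scripts.

```sh
# Hypothetical check; file names and the error pattern are assumptions.
iverilog -g2012 -o /dev/null candidate.sv testbench.sv 2> compile.log
if [ -s compile.log ]; then
    echo "Compilation failed; most frequent error lines:"
    grep -i "error" compile.log | sort | uniq -c | sort -rn | head
fi
```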
Maintenance & Community
The project is associated with NVlabs and has published research papers detailing its methodology and findings. Links to the relevant papers are provided for citation.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is currently Linux-only and requires manual compilation of a specific `iverilog` version (v12). `MachineEval` is not supported, and the original Pass@10 metric is no longer reported. A Dockerfile and prebuilt JSONL support are planned but not yet available.