nCPU by robertcprice

GPU-native CPU executes operations via trained neural networks

Created 4 months ago
634 stars

Top 52.1% on SourcePulse

View on GitHub
Project Summary

This project introduces nCPU, a CPU architecture in which all state (registers, memory, and the program counter) is represented as GPU tensors and all ALU operations are executed by trained neural networks. It targets researchers and power users exploring alternative computing paradigms, demonstrating that fundamental arithmetic and logic operations can be carried out entirely by deep learning models.

How It Works

The nCPU architecture runs entirely on the GPU, with all state managed as PyTorch tensors. Instruction fetch, decode, execution, and state updates occur on-device, eliminating host CPU round-trips. Each ALU operation is routed through a specific trained neural network model: addition uses a Kogge-Stone carry-lookahead network, multiplication employs a learned byte-pair lookup table, bitwise operations utilize neural truth tables, and shifts are handled by attention-based bit routing. This model-native approach aims for high accuracy and explores the transferability of classical hardware design principles to neural architectures.
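The Kogge-Stone scheme mentioned above can be illustrated with a plain-Python sketch (a toy at an assumed 8-bit width, not the project's actual code): generate/propagate pairs are combined in O(log n) parallel-prefix rounds, a pattern that maps naturally onto batched tensor operations.

```python
def kogge_stone_add(a, b, width=8):
    # Little-endian bit vectors stand in for the batched GPU tensors
    # the project uses; this is an illustrative sketch of the algorithm.
    A = [(a >> i) & 1 for i in range(width)]
    B = [(b >> i) & 1 for i in range(width)]
    p = [x ^ y for x, y in zip(A, B)]        # propagate bits
    g = [x & y for x, y in zip(A, B)]        # generate bits
    gp, pp = g[:], p[:]
    d = 1
    while d < width:                         # O(log n) prefix rounds
        # Each position looks d places back and combines (g, p) pairs.
        gp = [(gp[i] | (pp[i] & gp[i - d])) if i >= d else gp[i]
              for i in range(width)]
        pp = [(pp[i] & pp[i - d]) if i >= d else pp[i]
              for i in range(width)]
        d *= 2
    carry = [0] + gp[:-1]                    # carry into bit i
    s = [p[i] ^ carry[i] for i in range(width)]
    return sum(bit << i for i, bit in enumerate(s))
```

Every lane in a round is independent, which is why the trained network can evaluate all bit positions in parallel rather than rippling carries sequentially.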

Quick Start & Requirements

  • Installation: pip install -e ".[dev]"
  • Execution: python main.py --program programs/sum_1_to_10.asm
  • Prerequisites: PyTorch 2.10.0; a GPU with Metal support (published benchmarks target the Apple Silicon MPS backend).
  • Resources: ~135 MB for 23 trained models.
  • Links: Official Docs, Research Paper, Benchmarks, DOOM Demo

Highlighted Details

  • Achieves 100% accuracy on integer arithmetic, validated by 347 automated tests.
  • Multiplication is 12x faster than addition (21 µs vs. 248 µs) because its parallel LUT lookups avoid the O(log n) carry propagation that addition requires.
  • Kogge-Stone carry-lookahead implemented via a trained network yields a 3.3x speedup for ADD/SUB/CMP operations.
  • Vectorized shift operations achieve a 6.5x speedup through attention-based routing.
  • Offers two modes: Neural Mode (default, model inference) and Fast Mode (native tensor ops, targeting 1.35M IPS on Apple Silicon).
  • Includes native Metal GPU implementations (MLX and Rust) for zero CPU-GPU synchronization.
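The multiplication speedup above comes from replacing carry propagation with independent table lookups. A minimal sketch, assuming a precomputed 256x256 table in place of the project's learned byte-pair model (the function name and 16-bit width are hypothetical):

```python
# Precomputed byte-pair product table; stands in for the learned
# lookup model described in the README (65,536 entries).
LUT = [[x * y for y in range(256)] for x in range(256)]

def lut_mul16(a, b):
    # Split 16-bit operands into bytes. All four byte-pair partial
    # products are plain table lookups, then shifted into place and
    # summed; the result wraps at 16 bits.
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF
    return (LUT[a_lo][b_lo]
            + (LUT[a_lo][b_hi] << 8)
            + (LUT[a_hi][b_lo] << 8)
            + (LUT[a_hi][b_hi] << 16)) & 0xFFFF
```

Each of the four partial products is an independent lookup, so on a GPU they can be fetched in a single batched gather rather than computed round by round.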

Maintenance & Community

The README names no maintainers, community channels (e.g., Discord, Slack), or project roadmap.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration with closed-source projects, with attribution as the only significant requirement.

Limitations & Caveats

As a research runtime, nCPU may not be production-ready. Performance benchmarks are primarily demonstrated on Apple Silicon, and broader hardware compatibility for optimal performance is not detailed. The project explores a highly experimental architecture, and long-term maintenance or community support is not explicitly indicated.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 60 stars in the last 30 days

Starred by David Cournapeau (author of scikit-learn), Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

Explore Similar Projects

  • lectures by gpu-mode: Lecture series for GPU-accelerated computing (Top 0.3% on SourcePulse; 6k stars; created 2 years ago, updated 2 months ago).