backend.ai  by lablup

Container-based computing cluster platform

created 8 years ago
576 stars

Top 55.9% on SourcePulse

Project Summary

Backend.AI is a container-based computing cluster platform designed for multi-tenant, on-demand computation sessions. It supports a wide range of programming languages and ML frameworks, with pluggable heterogeneous accelerator support including GPUs (CUDA, ROCm), TPUs, IPUs, and NPUs. The platform suits research institutions, data science teams, and organizations that need scalable, isolated compute environments.

How It Works

Backend.AI uses a distributed architecture: a central Manager handles routing and scaling, while Agents running on compute nodes manage the containers. Sokovan serves as its orchestrator. Sessions are exposed through REST and GraphQL APIs, with direct WebSocket tunneling to in-container applications such as Jupyter, VSCode, and SSH. A storage abstraction layer (vfolders) provides unified access to network storage with customizable access controls.
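
The Manager/Agent split described above can be illustrated with a toy model. This is a minimal sketch, not Backend.AI's code or Sokovan's actual scheduling logic: the `Agent` and `Manager` classes, the slot names (`cpu`, `cuda`), and the first-fit placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Agent:
    """Hypothetical stand-in for an Agent running on one compute node."""
    name: str
    free_slots: Dict[str, int]  # e.g. {"cpu": 8, "cuda": 2}; slot names are assumed
    sessions: List[str] = field(default_factory=list)

    def can_fit(self, request: Dict[str, int]) -> bool:
        # A request fits only if every requested slot type has enough capacity.
        return all(self.free_slots.get(k, 0) >= v for k, v in request.items())

    def start(self, session_id: str, request: Dict[str, int]) -> None:
        # Reserve the requested slots and record the session.
        for k, v in request.items():
            self.free_slots[k] = self.free_slots.get(k, 0) - v
        self.sessions.append(session_id)


class Manager:
    """Hypothetical stand-in for the central Manager."""

    def __init__(self, agents: List[Agent]) -> None:
        self.agents = agents

    def create_session(self, session_id: str, request: Dict[str, int]) -> Optional[str]:
        # First-fit placement (an assumption): use the first agent with room.
        for agent in self.agents:
            if agent.can_fit(request):
                agent.start(session_id, request)
                return agent.name
        return None  # no capacity; a real system might queue the request


manager = Manager([
    Agent("node-a", {"cpu": 4}),
    Agent("node-b", {"cpu": 8, "cuda": 2}),
])
placed_gpu = manager.create_session("s1", {"cpu": 2, "cuda": 1})
placed_cpu = manager.create_session("s2", {"cpu": 4})
placed_none = manager.create_session("s3", {"cuda": 4})
print(placed_gpu, placed_cpu, placed_none)
```

The GPU request lands on the only node that advertises a `cuda` slot, the CPU-only request goes to the first node with room, and the oversized request is rejected. Treating accelerators as named slots is one simple way to model pluggable, heterogeneous hardware.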

Quick Start & Requirements

Highlighted Details

  • Pluggable heterogeneous accelerator support (CUDA and ROCm GPUs, TPUs, IPUs, NPUs).
  • Integrated support for Jupyter, VSCode Server, and SSH within compute sessions.
  • vfolders for unified, permission-controlled network storage access.
  • REST and GraphQL API endpoints for programmatic control.
  • SCIE-based installer for self-contained executables.
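
To make the idea of permission-controlled vfolders concrete, the sketch below models a shared folder with per-user grants. It is illustrative only: the `VFolder` class, the `Permission` values, and the sharing rules are hypothetical and do not reflect Backend.AI's actual data model or API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Permission(Enum):
    # Hypothetical permission levels chosen for this sketch.
    READ_ONLY = 1
    READ_WRITE = 2


@dataclass
class VFolder:
    """Hypothetical model of a storage folder shared among users."""
    name: str
    owner: str
    grants: Dict[str, Permission] = field(default_factory=dict)

    def share(self, user: str, permission: Permission) -> None:
        # Grant (or replace) a user's permission on this folder.
        self.grants[user] = permission

    def can_read(self, user: str) -> bool:
        # The owner and any user holding a grant may read.
        return user == self.owner or user in self.grants

    def can_write(self, user: str) -> bool:
        # Only the owner and users granted READ_WRITE may write.
        return user == self.owner or self.grants.get(user) is Permission.READ_WRITE


datasets = VFolder(name="datasets", owner="alice")
datasets.share("bob", Permission.READ_ONLY)
datasets.share("carol", Permission.READ_WRITE)
print(datasets.can_write("bob"), datasets.can_write("carol"))
```

In this model a compute session would check `can_read` or `can_write` before mounting the folder, so one storage backend can serve several tenants with different access levels.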

Maintenance & Community

  • Active development with clear versioning and migration guides.
  • Client SDKs available for Python, Java, and JavaScript.
  • Links to legacy per-package repositories are provided for historical context.

Licensing & Compatibility

  • Server-side components: LGPLv3.
  • Shared libraries and client SDKs: MIT License.
  • Commercial consulting and licensing options are available via contact@lablup.com.

Limitations & Caveats

The README mentions an "enterprise edition" with additional features, implying some functionality may be proprietary. While a single-node development script is provided, multi-node production setup details are deferred to external documentation.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 185
  • Issues (30d): 143
  • Star history: 6 stars in the last 30 days
