clearml-agent  by clearml

MLOps orchestration for distributed AI workloads

Created 6 years ago
278 stars

Top 93.3% on SourcePulse

Project Summary

ClearML Agent provides a distributed scheduler and orchestration solution for ML/DL/GenAI workloads, simplifying MLOps and LLMOps. It targets ML engineers and researchers seeking to manage experiments across diverse compute resources, offering automated execution, resource utilization optimization, and simplified cluster management with minimal DevOps overhead.

How It Works

The ClearML Agent functions as a job scheduler that monitors specified queues, retrieves experiments, and manages their execution. It automates the creation of isolated execution environments using virtual environments or Docker containers, clones the relevant code, installs dependencies (including automatic PyTorch version selection based on CUDA), executes the task, and streams logs and progress back to the ClearML Server UI. This approach enables a "fire-and-forget" execution model with flexible resource allocation across bare-metal, Kubernetes, and HPC environments.
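As a sketch of the two halves of this model (queue name, Docker image, and script name are illustrative; the flags follow the `clearml-agent` and `clearml-task` CLIs):

```shell
# Worker side: start an agent that polls the "default" queue and runs
# each pulled task inside a Docker container (image name is an example)
clearml-agent daemon --queue default --docker nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Client side: package a local script as a task and enqueue it for the
# agent to pick up (the clearml-task CLI ships with the clearml SDK)
clearml-task --project examples --name remote-train --script train.py --queue default
```

Once enqueued, the task's logs and progress stream back to the ClearML Server UI while the agent handles environment creation and execution.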

Quick Start & Requirements

  • Primary install: pip install clearml-agent
  • Prerequisites: Requires a ClearML Server (self-hosted or the free hosted tier). Supports Linux, macOS, and Windows. Optional integration with Kubernetes and SLURM is available.
  • Setup: Designed for minimal configuration; a one-time clearml-agent init points the agent at your server.
  • Links: Dockerfiles are available in the docker folder. Kubernetes integration details can be found via the clearml-helm-charts repository. Example automation scripts are located in the ClearML example/automation folder.
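A minimal end-to-end setup might look like the following (queue name is illustrative; clearml-agent init prompts interactively for your server URL and credentials):

```shell
# Install the agent
pip install clearml-agent

# One-time configuration (writes ~/clearml.conf with server URLs and credentials)
clearml-agent init

# Start a worker that pulls jobs from the "default" queue
clearml-agent daemon --queue default
```

These commands assume a reachable ClearML Server; without one, the daemon has no queues to poll.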

Highlighted Details

  • Supports hybrid execution environments, combining on-premises and cloud resources without complex configuration.
  • Offers optional Kubernetes integration, adding scheduling capabilities to clusters without requiring direct user access to Kubernetes.
  • Features "Services Mode" for running persistent tasks like auto-scalers, controllers, and optimizers, though this mode is currently CPU-only.
  • Provides fractional GPU support, allowing multiple isolated containers to share a single GPU with defined resource limits.
  • Enables priority-based job queuing and management via the ClearML Web UI.

Maintenance & Community

The project solicits GitHub stars as a signal of community support, but the provided README does not detail specific contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

The project is licensed under the Apache License, Version 2.0. This permissive license generally allows for commercial use and integration with closed-source projects.

Limitations & Caveats

The "Services Mode" currently supports CPU-only configurations, limiting its direct application for GPU-intensive background services. The full functionality and orchestration capabilities are dependent on the availability and configuration of a ClearML Server instance.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 0 stars in the last 30 days
