clearml-agent  by clearml

MLOps orchestration for distributed AI workloads

Created 6 years ago
278 stars

Top 93.3% on SourcePulse

Project Summary

ClearML Agent provides a distributed scheduler and orchestration solution for ML/DL/GenAI workloads, simplifying MLOps and LLMOps. It targets ML engineers and researchers seeking to manage experiments across diverse compute resources, offering automated execution, resource utilization optimization, and simplified cluster management with minimal DevOps overhead.

How It Works

The ClearML Agent functions as a job scheduler that monitors specified queues, retrieves experiments, and manages their execution. It automates the creation of isolated execution environments using virtual environments or Docker containers, clones the relevant code, installs dependencies (including automatic PyTorch version selection based on CUDA), executes the task, and streams logs and progress back to the ClearML Server UI. This approach enables a "fire-and-forget" execution model with flexible resource allocation across bare-metal, Kubernetes, and HPC environments.
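As a sketch of the two halves of this model (queue name, Docker image, and script name are illustrative; the flags follow the `clearml-agent` and `clearml-task` CLIs):

```shell
# Worker side: start an agent that polls the "default" queue and runs
# each pulled task inside a Docker container (image name is an example)
clearml-agent daemon --queue default --docker nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Client side: package a local script as a task and enqueue it for the
# agent to pick up (the clearml-task CLI ships with the clearml SDK)
clearml-task --project examples --name remote-train --script train.py --queue default
```

Once enqueued, the task's logs and progress stream back to the ClearML Server UI while the agent handles environment creation and execution.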

Quick Start & Requirements

  • Primary install: pip install clearml-agent
  • Prerequisites: Requires a ClearML Server (self-hosted or the free hosted tier). Supports Linux, macOS, and Windows. Optional integration with Kubernetes and SLURM is available.
  • Setup: Designed for minimal configuration; a one-time clearml-agent init points the agent at your server.
  • Links: Dockerfiles are available in the docker folder. Kubernetes integration details can be found via the clearml-helm-charts repository. Example automation scripts are located in the ClearML example/automation folder.
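A minimal end-to-end setup might look like the following (queue name is illustrative; clearml-agent init prompts interactively for your server URL and credentials):

```shell
# Install the agent
pip install clearml-agent

# One-time configuration (writes ~/clearml.conf with server URLs and credentials)
clearml-agent init

# Start a worker that pulls jobs from the "default" queue
clearml-agent daemon --queue default
```

These commands assume a reachable ClearML Server; without one, the daemon has no queues to poll.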

Highlighted Details

  • Supports hybrid execution environments, combining on-premises and cloud resources without complex configuration.
  • Offers optional Kubernetes integration, adding scheduling capabilities to clusters without requiring direct user access to Kubernetes.
  • Features "Services Mode" for running persistent tasks like auto-scalers, controllers, and optimizers, though this mode is currently CPU-only.
  • Provides fractional GPU support, allowing multiple isolated containers to share a single GPU with defined resource limits.
  • Enables priority-based job queuing and management via the ClearML Web UI.

Maintenance & Community

The project solicits GitHub stars as a signal of community support, but the provided README does not detail specific contributors, sponsorships, or community channels such as Discord or Slack.

Licensing & Compatibility

The project is licensed under the Apache License, Version 2.0. This permissive license generally allows for commercial use and integration with closed-source projects.

Limitations & Caveats

The "Services Mode" currently supports CPU-only configurations, limiting its direct application for GPU-intensive background services. The full functionality and orchestration capabilities are dependent on the availability and configuration of a ClearML Server instance.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 0 stars in the last 30 days
