awesome-ai-infrastructures by 1duo

AI infrastructures for scalable ML production workflows

Created 7 years ago
428 stars

Top 69.2% on SourcePulse

Project Summary

This repository curates real-world AI infrastructures: production machine learning systems, pipelines, and platforms. It is aimed at engineers and researchers who want to understand the technology stacks required for stable, scalable, and reliable ML training and inference in production. The collection gives a broad overview of how complex ML systems are architected and deployed.

How It Works

The repository lists and categorizes actively maintained AI infrastructure projects, with a focus on four areas: the overall architecture of end-to-end ML training pipelines, scalable inference for cloud and edge devices, compiler and optimization stacks for diverse hardware, and novel approaches to large-scale distributed training. Its value lies in presenting complete production ML systems in one structured list rather than isolated frameworks.

Quick Start & Requirements

This repository is a curated list of resources and does not require installation or specific prerequisites. It serves as a directory to explore individual projects.

Highlighted Details

  • Key Platforms: Comprehensive coverage of major AI infrastructures including Google's TFX and Kubeflow, NVIDIA's RAPIDS, Uber's Michelangelo, Facebook's FBLearner, Intel's BigDL, Amazon's SageMaker, and Microsoft's NNI.
  • Performance Benchmarks: Tracks milestones in large-scale distributed training, showing ImageNet training times falling from hours to minutes and then seconds through techniques such as LARS (Layer-wise Adaptive Rate Scaling), mixed-precision training, and optimized communication on GPUs and TPUs.
  • Hardware & Deployment Focus: Features solutions for GPU acceleration (RAPIDS, H2O4GPU), on-device inference (TensorFlow Lite, Core ML), and compiler optimization stacks (TVM, MLIR, TensorRT) for diverse hardware backends.
  • ML Lifecycle Management: Includes platforms like MLflow for end-to-end ML lifecycle management, covering tracking, projects, and model deployment.
  • Specialized Tools: Highlights AutoML capabilities (Auto-Keras, NNI, TransmogrifAI), model compression (PocketFlow, Distiller), and distributed execution frameworks (Project Ray, BigDL).

Maintenance & Community

The list is maintained for personal learning purposes, with an open invitation for contributions, forks, and pull requests. The README provides no community channels or maintainer details; it names only the companies behind the listed projects.

Licensing & Compatibility

The license for the curated list itself is not specified in the README. The individual projects linked within the repository are subject to their own respective licenses.

Limitations & Caveats

This repository is a meta-list and does not provide direct tooling or code for implementation. It serves as a directory and educational resource, requiring users to explore individual projects for their specific needs. The information is presented "in no specific order."

Health Check

Last Commit: 6 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

1.2%
4k
AI inference pipeline framework
Created 1 year ago
Updated 12 hours ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

0.0%
3k
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago
Updated 1 year ago