GPU Kubernetes is hard. Aligning kernels, drivers, container runtimes, operators, and Kubernetes versions is a version compatibility minefield. A single misconfigured component can take down an entire GPU fleet, and root cause analysis can take days. Typically, these known-good configurations live as tribal knowledge in “runbooks” and internal pipelines, not as portable, reproducible artifacts.
The recently open-sourced AI Cluster Runtime (AICR) project is designed to eliminate that friction. It provides optimized, validated, and reproducible configurations for GPU-accelerated Kubernetes clusters.
The Workflow
AICR takes a composable, standards-based approach built on YAML recipes and Helm bundles. The workflow breaks down into five stages:
- Snapshot (optional): Captures the current state of an existing GPU cluster
- Recipe: Generates an optimized, version-locked config for your environment
- Bundle: Converts the recipe into deployment-ready artifacts for a given deployer (e.g. Helm or ArgoCD, with more to come)
- Deploy: Applies the bundle using your existing CD pipeline — this step is yours, no new tooling required
- Validate: Verifies deployment, performance, and conformance against the cluster
AICR is about the optimized configuration and validation of your cluster, not its deployment or management; for those, it leans on the wealth of existing open source options.
Getting Started
The AICR CLI can be installed on Linux, macOS, or Windows (via WSL). You can use Homebrew:
```shell
brew tap NVIDIA/aicr
brew install aicr
```
Alternatively, you can use the install script, which performs SLSA Level 3 provenance verification automatically:
```shell
curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s
```
Generating a Recipe
A recipe captures the exact constraints and component versions required for your target state. You can specify your environment explicitly:
```shell
aicr recipe \
  --service eks \
  --accelerator h100 \
  --intent training \
  --os ubuntu \
  --platform kubeflow \
  --output recipe.yaml
```
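The resulting recipe.yaml pins the stack to a known-good combination. The field names and versions below are purely illustrative — the actual schema is defined by the AICR release, not documented here — but a generated recipe might look roughly like this:

```yaml
# Illustrative sketch only -- field names and versions are assumptions,
# not the authoritative AICR recipe schema.
version: v1
intent: training
environment:
  service: eks
  accelerator: h100
  os: ubuntu
  platform: kubeflow
components:          # version-locked stack resolved by AICR
  gpu-operator: "24.9.1"
  network-operator: "24.7.0"
  driver: "550.127.05"
```

The point is that the recipe is a portable, version-locked artifact you can review and commit, rather than tribal knowledge in a runbook.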
Or, you can point aicr at any existing GPU-accelerated Kubernetes cluster (EKS, GKE, AKS, or self-managed) to auto-discover that state:
```shell
aicr snapshot \
  --namespace aicr-validation \
  --node-selector nodeGroup=gpu-worker \
  --output snapshot.yaml

aicr recipe \
  --snapshot snapshot.yaml \
  --intent training \
  --node-selector nodeGroup=gpu-worker \
  --output recipe.yaml
```
Bundling for Deployment
Once your recipe.yaml is generated, you convert it into deployable artifacts. For modern GitOps workflows, you can output directly to an OCI registry, optionally with cryptographic attestation via Sigstore:
```shell
aicr bundle \
  --recipe recipe.yaml \
  --deployer argocd \
  --output oci://ghcr.io/nvidia/bundle \
  --attest
```
Supply chain security is built into every layer. Every bundle includes checksums for all generated artifacts end-to-end, and the project itself ships with SLSA Level 3 provenance, SPDX SBOMs, and cosign image attestations.
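To make the checksum guarantee concrete, here is a minimal verifier sketch. It assumes a sha256sum-style manifest (`<digest>  <relative-path>` per line), which is a common convention — AICR's actual bundle layout may differ:

```python
import hashlib
from pathlib import Path

def verify_checksums(manifest_path: str, root: str = ".") -> list[str]:
    """Check each '<sha256>  <relative-path>' line of a sha256sum-style
    manifest against the file on disk; return the paths that fail."""
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        digest, _, rel_path = line.partition("  ")
        actual = hashlib.sha256((Path(root) / rel_path).read_bytes()).hexdigest()
        if actual != digest.strip():
            failures.append(rel_path)
    return failures
```

A CI gate could fail the pipeline whenever this returns a non-empty list, catching artifacts that were modified after bundling.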
Validate the Cluster
Once deployed, AICR validates the cluster against the same recipe used to create the bundle. Validation runs across three phases:
| Phase | What it proves |
|---|---|
| deployment | Fabric is properly configured and operational |
| performance | Synthetic workloads meet expected throughput |
| conformance | Cluster meets conformance requirements (e.g. CNCF AI Conformance) |
These phases can be run individually or all at once. Validation reports are output in Common Test Report Format (CTRF) to enable programmatic post-processing.
```shell
aicr validate \
  --recipe recipe.yaml \
  --phase all \
  --output report.json
```
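Because the report is CTRF, post-processing is straightforward. CTRF is JSON with a top-level `results` object containing a `summary` and a `tests` array; the exact check names AICR emits are assumptions here. A small sketch that extracts failures, e.g. for gating a CD pipeline:

```python
import json

def failed_tests(report_path: str) -> list[str]:
    """Return the names of failed checks from a CTRF report."""
    with open(report_path) as fh:
        results = json.load(fh)["results"]
    return [t["name"] for t in results.get("tests", [])
            if t.get("status") == "failed"]
```

A deployment job could call this on report.json and block promotion whenever the list is non-empty.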
To verify CNCF AI Conformance, you can also add --evidence-dir to export evidence artifacts for CNCF submission.
Contributing
The matrix of potential platform, GPU, OS, intent, and service configurations is bigger than any one team can manage. We built the framework, but we need the community to help fill it in. If you've already validated and optimized a new configuration combination, run performance benchmarks, or written new deployment tests, please contribute to the project by opening a PR.
Do star the repo, try it on your cluster, and open an issue with your findings: github.com/NVIDIA/aicr.