GPU Kubernetes is hard. Aligning kernels, drivers, container runtimes, operators, and Kubernetes versions is a version compatibility minefield. A single misconfigured component can take down an entire GPU fleet, and root cause analysis can take days. Typically, these known-good configurations live as tribal knowledge in “runbooks” and internal pipelines, not as portable, reproducible artifacts.
The recently open-sourced AI Cluster Runtime (AICR) project is designed to eliminate that friction. It provides optimized, validated, and reproducible configurations for GPU-accelerated Kubernetes clusters.
The Workflow
AICR takes a composable, standards-based approach built on YAML recipes and Helm bundles. The workflow breaks down into five stages:
- Snapshot (optional): Captures the current state of an existing GPU cluster
- Recipe: Generates an optimized, version-locked config for your environment
- Bundle: Converts the recipe into deployment-ready artifacts for a given deployer (e.g. Helm or ArgoCD, with more to come)
- Deploy: Applies the bundle using your existing CD pipeline — this step is yours, no new tooling required
- Validate: Verifies deployment, performance, and conformance against the cluster
AICR is about the optimized configuration and validation of your cluster, not its deployment or management; for those, it leans on the wealth of existing open source options.
Getting Started
The AICR CLI can be installed on Linux, macOS, or Windows (via WSL). You can use Homebrew:
```shell
brew tap NVIDIA/aicr
brew install aicr
```
Alternatively, you can use the install script, which performs SLSA Level 3 provenance verification automatically:
```shell
curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s
```
Generating a Recipe
A recipe captures the exact constraints and component versions required for your target state. You can specify your environment explicitly:
```shell
aicr recipe \
  --service eks \
  --accelerator h100 \
  --intent training \
  --os ubuntu \
  --platform kubeflow \
  --output recipe.yaml
```
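The resulting recipe.yaml pins the stack to a known-good combination. The field names and versions below are purely illustrative — the actual schema is defined by the AICR release, not documented here — but a generated recipe might look roughly like this:

```yaml
# Illustrative sketch only -- field names and versions are assumptions,
# not the authoritative AICR recipe schema.
version: v1
intent: training
environment:
  service: eks
  accelerator: h100
  os: ubuntu
  platform: kubeflow
components:          # version-locked stack resolved by AICR
  gpu-operator: "24.9.1"
  network-operator: "24.7.0"
  driver: "550.127.05"
```

The point is that the recipe is a portable, version-locked artifact you can review and commit, rather than tribal knowledge in a runbook.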
Or, you can point aicr at any existing GPU-accelerated Kubernetes cluster (EKS, GKE, AKS, or self-managed) to auto-discover that state:
```shell
aicr snapshot \
  --namespace aicr-validation \
  --node-selector nodeGroup=gpu-worker \
  --output snapshot.yaml

aicr recipe \
  --snapshot snapshot.yaml \
  --intent training \
  --node-selector nodeGroup=gpu-worker \
  --output recipe.yaml
```
Bundling for Deployment
Once your recipe.yaml is generated, you convert it into deployable artifacts. For modern GitOps workflows, you can output directly to an OCI registry, optionally with cryptographic attestation via Sigstore:
```shell
aicr bundle \
  --recipe recipe.yaml \
  --deployer argocd \
  --output oci://ghcr.io/nvidia/bundle \
  --attest
```
Supply chain security is built into every layer. Every bundle includes checksums for all generated artifacts end-to-end, and the project itself ships with SLSA Level 3 provenance, SPDX SBOMs, and cosign image attestations.
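To make the checksum guarantee concrete, here is a minimal verifier sketch. It assumes a sha256sum-style manifest (`<digest>  <relative-path>` per line), which is a common convention — AICR's actual bundle layout may differ:

```python
import hashlib
from pathlib import Path

def verify_checksums(manifest_path: str, root: str = ".") -> list[str]:
    """Check each '<sha256>  <relative-path>' line of a sha256sum-style
    manifest against the file on disk; return the paths that fail."""
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        digest, _, rel_path = line.partition("  ")
        actual = hashlib.sha256((Path(root) / rel_path).read_bytes()).hexdigest()
        if actual != digest.strip():
            failures.append(rel_path)
    return failures
```

A CI gate could fail the pipeline whenever this returns a non-empty list, catching artifacts that were modified after bundling.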
Validate the Cluster
Once deployed, AICR validates the cluster against the same recipe used to create the bundle. Validation runs across three phases:
| Phase | What it proves |
|---|---|
| deployment | Fabric is properly configured and operational |
| performance | Synthetic workloads meet expected throughput |
| conformance | Cluster meets conformance requirements (e.g. CNCF AI Conformance) |
These phases can be run individually or all at once. Validation reports are output in Common Test Report Format (CTRF) to enable programmatic post-processing.
```shell
aicr validate \
  --recipe recipe.yaml \
  --phase all \
  --output report.json
```
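Because the report is CTRF, post-processing is straightforward. CTRF is JSON with a top-level `results` object containing a `summary` and a `tests` array; the exact check names AICR emits are assumptions here. A small sketch that extracts failures, e.g. for gating a CD pipeline:

```python
import json

def failed_tests(report_path: str) -> list[str]:
    """Return the names of failed checks from a CTRF report."""
    with open(report_path) as fh:
        results = json.load(fh)["results"]
    return [t["name"] for t in results.get("tests", [])
            if t.get("status") == "failed"]
```

A deployment job could call this on report.json and block promotion whenever the list is non-empty.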
To verify CNCF AI Conformance, you can also add --evidence-dir to export evidence artifacts for CNCF submission.
Contributing
The matrix of potential platform, GPU, OS, intent, and service configurations is bigger than any one team can manage. We built the framework, but we need the community to help fill it in. If you've already validated and optimized a new configuration combination, run performance benchmarks, or written new deployment tests, please contribute to the project by opening a PR.
Do star the repo, try it on your cluster, and open an issue with your findings: github.com/NVIDIA/aicr.