AI Cluster Runtime: Reproducible Configs for GPU-Accelerated Kubernetes Clusters

GPU Kubernetes is hard. Aligning kernels, drivers, container runtimes, operators, and Kubernetes versions is a version-compatibility minefield. A single misconfigured component can take down an entire GPU fleet, and root cause analysis can take days. Typically, known-good configurations live as tribal knowledge in “runbooks” and internal pipelines, not as portable, reproducible artifacts. ...

2026-03-12 · 3 min · Mark Chmarny

How to debug container image content

When dealing with file permissions in a non-root image or building apps that include static content (like CSS or templates), I sometimes get an error because the final image content does not match my expectations. Most of the time the errors are pretty obvious, and a simple fix and rebuild will do. Sometimes, though, you want to look inside the image and understand what the actual layout looks like. ...

2019-08-27 · 2 min · Mark Chmarny

How to run containerized workloads in GCE VM

While the idea of combining a serverless platform with long-running workloads does seem somewhat “unnatural” at first, smart people are already working on that (looking at you, @Knative community). In the meantime, a simple approach is sometimes all you need. ...

2019-08-21 · 4 min · Mark Chmarny