Scaling Actions Runner Controller on EKS Without Melting Your Cluster
After about a year of running Actions Runner Controller (ARC) as the primary CI substrate for a mid-sized engineering org, I've collected enough scars to be useful. This post is the writeup I wish I'd had when we started — the parts that aren't in the README.
We're running the modern gha-runner-scale-set mode (the one that talks to GitHub's actions service over a long-lived HTTP/2 connection), on EKS 1.31, with Karpenter handling node provisioning. Workloads are a mix of Linux x86, Linux arm64, and a handful of larger jobs that need GPUs.
The mental model that actually matters
The single most useful reframing: a runner scale set is not an autoscaling pool of long-lived workers. It's a queue consumer. The listener pod opens a session with actions.githubusercontent.com, gets told "you have N jobs assigned, please produce N ephemeral runner pods", and then ARC creates exactly that many pods. Each pod takes one job and dies.
That has two consequences people consistently get wrong:
- Pod startup latency is your CI latency. If your runner image is 4 GB and your node has to pull it cold, your "queued" time is now image-pull time. This dominates everything else.
- You cannot smooth bursts with minRunners. Setting minRunners keeps idle pods warm, which keeps idle nodes warm, which costs money 24/7. It does not give you headroom for a 500-job burst at 9:03 AM when everyone pushes their morning PR.
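To put numbers on the second point, here is a minimal Python sketch of the arithmetic. It assumes the target runner count is roughly minRunners plus assigned jobs, capped at maxRunners. That is a simplified model for illustration, not ARC's actual code, and the function names are made up.

```python
def target_runners(assigned_jobs: int, min_runners: int, max_runners: int) -> int:
    """Simplified model: idle floor plus one pod per assigned job, capped."""
    return min(min_runners + assigned_jobs, max_runners)


def cold_pods_needed(assigned_jobs: int, min_runners: int, max_runners: int,
                     warm_pods: int) -> int:
    """Pods that still have to be scheduled, pulled, and started for a burst."""
    target = target_runners(assigned_jobs, min_runners, max_runners)
    return max(target - warm_pods, 0)


# A 500-job burst arriving while 10 warm runners are idle.
burst = cold_pods_needed(assigned_jobs=500, min_runners=10,
                         max_runners=1000, warm_pods=10)
print(burst)
```

Under this model the ten warm pods absorb ten jobs, and the cluster still has to create 500 pods from cold. The warm floor barely dents the burst.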
Image strategy: this is 80% of the win
The default runner image works. It is also enormous and pulls a kitchen-sink toolchain you mostly don't use. Build a slim base per language family:
FROM ghcr.io/actions/actions-runner:2.319.1
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
git curl ca-certificates jq unzip \
&& rm -rf /var/lib/apt/lists/*
# Language-specific layer
COPY --from=node:20-slim /usr/local /usr/local
USER runner
Then — and this is the part most teams skip — pre-pull the image onto every node. With Karpenter, you do this by baking it into a custom AMI, or by running a DaemonSet that does nothing but crictl pull your runner images on node startup. Image pull goes from 45 seconds to ~0.5 seconds. Your p95 queue time drops by an order of magnitude.
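As a rough sketch of what the pre-pull step could look like, the Python below shells out to crictl pull for each image and reports the ones that failed. The image names are hypothetical placeholders, and the injectable run parameter is only there so the logic can be exercised without a container runtime.

```python
import subprocess

# Hypothetical image list; replace with the runner images your scale sets use.
RUNNER_IMAGES = [
    "registry.example.com/ci/runner-node:latest",
    "registry.example.com/ci/runner-go:latest",
]


def pull_command(image: str) -> list[str]:
    """Build the crictl invocation for one image."""
    return ["crictl", "pull", image]


def prepull(images: list[str], run=subprocess.run) -> list[str]:
    """Try to pull every image and return the ones that failed."""
    failed = []
    for image in images:
        result = run(pull_command(image), check=False)
        if result.returncode != 0:
            failed.append(image)
    return failed
```

Calling prepull(RUNNER_IMAGES) at node startup is the whole job. If this runs inside a DaemonSet, the container needs access to the node's container runtime socket, and it should sleep after pulling rather than exit, because DaemonSet pods are restarted when they finish.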
Sizing the listener and the controller
The ARC controller itself is tiny. The listeners are the interesting part — there's one listener pod per runner scale set, and it holds a persistent connection to GitHub. Two failure modes I've hit:
- Listener OOM under burst. When 800 jobs land at once, the listener has to track all of their assignments. Default memory limit (64Mi) is laughably low above ~200 concurrent jobs. We run with 256Mi request, 512Mi limit.
- Listener restarts during deploys. If you kubectl rollout restart the controller, every listener restarts, every session re-establishes, and for ~10 seconds GitHub has nowhere to dispatch jobs. They queue, then resolve. It's fine, but alert routing needs to know this is not an incident.
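For the memory override, one way to keep the numbers in code is to generate the Helm values fragment. This is a sketch that assumes your version of the gha-runner-scale-set chart exposes the listener pod spec under listenerTemplate with a container named listener. Check the chart's values.yaml before relying on it.

```python
import json


def listener_values(memory_request: str = "256Mi",
                    memory_limit: str = "512Mi") -> dict:
    """Helm values fragment that overrides the listener container's memory."""
    return {
        "listenerTemplate": {  # assumed chart key; verify against your chart version
            "spec": {
                "containers": [
                    {
                        "name": "listener",
                        "resources": {
                            "requests": {"memory": memory_request},
                            "limits": {"memory": memory_limit},
                        },
                    }
                ]
            }
        }
    }


print(json.dumps(listener_values(), indent=2))
```

JSON is valid YAML, so the printed output can be saved to a file and passed to helm with -f alongside your other values.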
Node pools, not "one big pool"
Mixing CI workloads on one node pool is a trap. We run four Karpenter NodePools:
| Pool | Instances | Used for |
|---|---|---|
| ci-small | c7i.large, c7i.xlarge, spot | 90% of jobs: lint, unit tests |
| ci-large | c7i.4xlarge, spot with on-demand fallback | integration tests, large builds |
| ci-arm | c7g.xlarge, spot | arm64 container builds |
| ci-gpu | g6.xlarge, on-demand only | ML eval jobs |
Each runner scale set has nodeSelector and tolerations matched to one pool. A misconfigured workflow can't accidentally schedule a 100-runner integration test burst onto your GPU nodes. (Ask me how I know.)
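A cheap guard against that kind of mistake is a lint step that refuses any scale set not pinned to exactly one pool. The sketch below assumes each NodePool is both labeled and tainted with a single key, here called ci-pool. That key and the ScaleSet shape are hypothetical, not something ARC or Karpenter defines.

```python
from dataclasses import dataclass, field

# Hypothetical label and taint key; use whatever your NodePools actually set.
POOL_KEY = "ci-pool"
POOLS = {"ci-small", "ci-large", "ci-arm", "ci-gpu"}


@dataclass
class ScaleSet:
    name: str
    node_selector: dict = field(default_factory=dict)
    tolerations: list = field(default_factory=list)


def pinning_errors(scale_set: ScaleSet) -> list[str]:
    """Return the reasons this scale set is not pinned to exactly one known pool."""
    errors = []
    pool = scale_set.node_selector.get(POOL_KEY)
    if pool not in POOLS:
        errors.append(f"{scale_set.name}: nodeSelector {POOL_KEY!r} missing or unknown")
    # Collect every pool this scale set tolerates; it must be only its own.
    tolerated = {t.get("value") for t in scale_set.tolerations
                 if t.get("key") == POOL_KEY}
    if tolerated != {pool}:
        errors.append(f"{scale_set.name}: tolerations must match only pool {pool!r}")
    return errors
```

Running this over every scale set definition in CI catches the case where an integration-test scale set quietly tolerates the GPU taint.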
Spot interruptions and the two-minute warning
About 4% of our jobs run on spot nodes that get reclaimed mid-job. GitHub Actions does not retry a job that died because its runner pod vanished — the job fails with a useless "The runner has received a shutdown signal" message. Two mitigations:
- Karpenter consolidation off during business hours. We disable consolidation between 08:00 and 19:00 UTC, because a consolidated node takes its runner pods with it just like a reclaimed one. Costs maybe 8% more in compute; saves us about 30 spurious CI failures per week, which is a great trade.
- A reusable retry-on-runner-loss composite action. It checks the job's failure annotation, and if it matches the shutdown signature, it requeues via workflow_dispatch. Crude but effective.
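The decision at the heart of that retry action is small enough to sketch in Python. The attempt cap is my own assumption, added to stop a genuinely broken job from requeueing forever. Fetching the annotations and firing workflow_dispatch are left out.

```python
import re

# Message quoted in this post; the exact wording may vary by runner version.
SHUTDOWN_SIGNATURE = re.compile(r"runner has received a shutdown signal",
                                re.IGNORECASE)


def is_runner_loss(annotations: list[str]) -> bool:
    """True if any failure annotation matches the shutdown signature."""
    return any(SHUTDOWN_SIGNATURE.search(message) for message in annotations)


def should_requeue(annotations: list[str], attempt: int,
                   max_attempts: int = 2) -> bool:
    """Requeue only for runner loss, and only while under the attempt cap."""
    return attempt < max_attempts and is_runner_loss(annotations)
```

Matching on the message text is brittle by nature, so it is worth alerting when a job fails without any recognised signature rather than silently not retrying.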
Observability: what to actually watch
Most ARC dashboards I've seen are useless because they show pod counts. Pod counts are an output, not a signal. The metrics that predict pain:
- gha_assigned_jobs vs gha_running_jobs gap. This is queue depth. If it's growing, you're behind.
- Image pull duration p95, scraped from kubelet. If this trends up, something changed in your image and you'll feel it tomorrow.
- Node provision latency from Karpenter. When this exceeds ~90 seconds, your "fast" jobs aren't fast anymore.
- Listener session age. A listener that hasn't re-sessioned in 6+ hours is healthy. One that's churning every few minutes is being rate-limited or has a network problem.
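The first signal is easy to turn into an alert condition. Here is a minimal sketch, assuming you already have aligned samples of gha_assigned_jobs and gha_running_jobs as plain lists. The window size and function names are illustrative choices, not anything ARC ships.

```python
def queue_depth(assigned: list[float], running: list[float]) -> list[float]:
    """Per-sample gap between assigned and running jobs, floored at zero."""
    return [max(a - r, 0.0) for a, r in zip(assigned, running)]


def is_falling_behind(depth: list[float], window: int = 3) -> bool:
    """True when each of the last `window` samples exceeds the one before it."""
    recent = depth[-(window + 1):]
    if len(recent) < window + 1:
        return False  # not enough history to call it a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Requiring several consecutive increases keeps a single bursty scrape from paging anyone. A sustained climb is the thing worth waking up for.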
What I'd do differently
If I were starting over on a new cluster today:
- Start with the runner image work before anything else. Slim image + pre-pulled = you can defer a lot of other tuning.
- Don't bother with minRunners above zero until you have data showing you need it.
- Set up the listener memory limits up front. The default is wrong.
- Put each team on its own runner scale set, not a shared one. Noisy-neighbor isolation is free and the labels make billing tractable.
ARC is genuinely good software now in a way it wasn't in 2023. Most of the pain in 2026 is no longer ARC itself — it's the choices you make around it.