Scaling Actions Runner Controller on EKS Without Melting Your Cluster

After about a year of running Actions Runner Controller (ARC) as the primary CI substrate for a mid-sized engineering org, I've collected enough scars to be useful. This post is the writeup I wish I'd had when we started — the parts that aren't in the README.

We're running the modern gha-runner-scale-set mode (the one that talks to GitHub's actions service over a long-lived HTTP/2 connection), on EKS 1.31, with Karpenter handling node provisioning. Workloads are a mix of Linux x86, Linux arm64, and a handful of larger jobs that need GPUs.

The mental model that actually matters

The single most useful reframing: a runner scale set is not an autoscaling pool of long-lived workers. It's a queue consumer. The listener pod opens a session with actions.githubusercontent.com, gets told "you have N jobs assigned, please produce N ephemeral runner pods", and then ARC creates exactly that many pods. Each pod takes one job and dies.

That has two consequences people consistently get wrong:

Pod startup latency is your CI latency. If your runner image is 4 GB and your node has to pull it cold, your "queued" time is now image-pull time. This dominates everything else.
You cannot smooth bursts with minRunners. minRunners keeps idle pods warm, which keeps idle nodes warm, which costs money 24/7. It does not give you headroom for a 500-job burst at 9:03 AM when everyone pushes their morning PR.
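
To put rough numbers on that second point, here is a back-of-the-envelope sketch in Python. The function name and the figure of 20 warm runners are illustrative assumptions, not measurements from our cluster; only the 500-job burst comes from the example above.

```python
def burst_coverage(burst_jobs: int, min_runners: int) -> dict:
    """Estimate how much of a burst the warm runners can absorb.

    Each ephemeral runner takes exactly one job, so warm pods can only
    cover as many jobs as there are warm pods.
    """
    if burst_jobs <= 0:
        raise ValueError("burst_jobs must be positive")
    warm = max(min_runners, 0)
    covered = min(burst_jobs, warm)
    return {
        "covered_immediately": covered,
        "need_cold_start": burst_jobs - covered,
        "covered_fraction": covered / burst_jobs,
        # Upper bound: runner-hours per day if the warm pods sat idle all day.
        "idle_runner_hours_per_day": warm * 24,
    }


if __name__ == "__main__":
    # Hypothetical: 20 warm runners facing the 500-job morning burst.
    print(burst_coverage(burst_jobs=500, min_runners=20))
```

With those assumed numbers, the warm pool absorbs 20 of the 500 jobs (4%), and the other 480 still wait on pod and node startup.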

Image strategy: this is 80% of the win

The default runner image works. It is also enormous and pulls a kitchen-sink toolchain you mostly don't use. Build a slim base per language family:

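# Pin the runner version so image rebuilds are reproducible
FROM ghcr.io/actions/actions-runner:2.319.1

# Package installation needs root
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
      git curl ca-certificates jq unzip \
  && rm -rf /var/lib/apt/lists/*

# Language-specific layer: copy Node.js from the official slim image
COPY --from=node:20-slim /usr/local /usr/local

# Drop back to the unprivileged user the runner process runs as
USER runner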

Then — and this is the part most teams skip — pre-pull the image onto every node. With Karpenter, you do this by baking it into a custom AMI, or by running a DaemonSet that does nothing but crictl pull your runner images on node startup. Image pull goes from 45 seconds to ~0.5 seconds. Your p95 queue time drops by an order of magnitude.
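
A minimal sketch of the DaemonSet route, with one deliberate substitution: instead of calling crictl, this variant lets the kubelet do the pull by listing each runner image as an init container that exits immediately, then parks on a pause container. The Python below only builds the manifest and prints it as JSON, which kubectl accepts. The DaemonSet name, the `arc-runners` namespace, and the image name are hypothetical, and the init container assumes the image ships a `true` binary.

```python
import json


def prepull_daemonset(images: list[str], namespace: str = "arc-runners") -> dict:
    """Build a DaemonSet manifest that makes every node pull `images`."""
    init_containers = [
        {
            "name": f"prepull-{i}",
            "image": image,
            # Assumes the image contains `true`; the container exits at
            # once, leaving the image in the node's local cache.
            "command": ["true"],
        }
        for i, image in enumerate(images)
    ]
    labels = {"app": "runner-image-prepull"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "runner-image-prepull", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": init_containers,
                    # The pause container keeps the pod Running at near-zero cost.
                    "containers": [
                        {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
                    ],
                    # Tolerate every taint so tainted CI pools are covered too.
                    "tolerations": [{"operator": "Exists"}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image name; substitute your own slim runner images.
    manifest = prepull_daemonset(["ghcr.io/example-org/runner-node:latest"])
    print(json.dumps(manifest, indent=2))
```

Piping the output into `kubectl apply -f -` would create the DaemonSet. The trade-off against a baked AMI is that the first pull still happens after the node joins, so a job that lands on a brand-new node can still race the pre-pull.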

Sizing the listener and the controller

The ARC controller itself is tiny. The listeners are the interesting part — there's one listener pod per runner scale set, and it holds a persistent connection to GitHub. Two failure modes I've hit:

Listener OOM under burst. When 800 jobs land at once, the listener has to track all of their assignments. Default memory limit (64Mi) is laughably low above ~200 concurrent jobs. We run with 256Mi request, 512Mi limit.
Listener restarts during deploys. If you kubectl rollout restart the controller, every listener restarts, every session re-establishes, and for ~10 seconds GitHub has nowhere to dispatch jobs. They queue, then resolve. It's fine, but alert routing needs to know this is not an incident.

Node pools, not "one big pool"

Mixing CI workloads on one node pool is a trap. We run four Karpenter NodePools:

Pool	Instances	Used for
`ci-small`	`c7i.large`, `c7i.xlarge`, spot	90% of jobs — lint, unit tests
`ci-large`	`c7i.4xlarge`, spot with on-demand fallback	integration tests, large builds
`ci-arm`	`c7g.xlarge`, spot	arm64 container builds
`ci-gpu`	`g6.xlarge`, on-demand only	ML eval jobs

Each runner scale set has nodeSelector and tolerations matched to one pool. A misconfigured workflow can't accidentally schedule a 100-runner integration test burst onto your GPU nodes. (Ask me how I know.)
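
The isolation comes from two independent checks: the nodeSelector has to match the pool's labels, and every taint on the pool has to be tolerated. Here is a simplified Python model of that matching, as a sketch rather than the scheduler's real logic. It treats every taint as a hard block and ignores tolerationSeconds, and the `example.com/ci-pool` label and taint key is a hypothetical name.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified Kubernetes toleration match (ignores tolerationSeconds)."""
    # An unset effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def can_schedule(pod: dict, pool: dict) -> bool:
    """True if the pod's nodeSelector and tolerations both fit the pool."""
    labels_ok = all(
        pool["labels"].get(key) == value
        for key, value in pod.get("nodeSelector", {}).items()
    )
    taints_ok = all(
        any(tolerates(t, taint) for t in pod.get("tolerations", []))
        for taint in pool.get("taints", [])
    )
    return labels_ok and taints_ok


def make_pool(name: str) -> dict:
    """Hypothetical pool: one identifying label plus a matching taint."""
    key = "example.com/ci-pool"
    return {
        "labels": {key: name},
        "taints": [{"key": key, "value": name, "effect": "NoSchedule"}],
    }


GPU_POOL = make_pool("ci-gpu")
LARGE_POOL = make_pool("ci-large")

# Runner pod template for the integration-test scale set.
INTEGRATION_RUNNER = {
    "nodeSelector": {"example.com/ci-pool": "ci-large"},
    "tolerations": [
        {"key": "example.com/ci-pool", "operator": "Equal",
         "value": "ci-large", "effect": "NoSchedule"}
    ],
}

if __name__ == "__main__":
    print("ci-large:", can_schedule(INTEGRATION_RUNNER, LARGE_POOL))
    print("ci-gpu:", can_schedule(INTEGRATION_RUNNER, GPU_POOL))
```

Tainting each pool matters as much as the selector: a pod with no nodeSelector at all is still kept off the GPU nodes because it does not tolerate their taint.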

Spot interruptions and the two-minute warning

About 4% of our jobs run on spot nodes that get reclaimed mid-job. GitHub Actions does not retry a job that died because its runner pod vanished — the job fails with a useless "The runner has received a shutdown signal" message. Two mitigations:

Karpenter consolidation off during business hours. We disable consolidation between 08:00 and 19:00 UTC. This costs maybe 8% more in compute and saves us about 30 spurious CI failures per week, which is a great trade.
A reusable retry-on-runner-loss composite action. It checks the job's failure annotation, and if it matches the shutdown signature, it requeues via workflow_dispatch. Crude but effective.
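
The decision at the heart of that retry action can be sketched in a few lines of Python. This covers only the matching logic; fetching the annotations and firing workflow_dispatch are left out. The function name and the `max_attempts` cap are my assumptions for illustration, the cap being there so a job that keeps landing on doomed nodes cannot requeue forever.

```python
# Failure text GitHub shows when the runner pod disappears mid-job.
SHUTDOWN_SIGNATURE = "The runner has received a shutdown signal"


def should_requeue(annotations: list[str], attempt: int, max_attempts: int = 3) -> bool:
    """Requeue only if the failure looks like runner loss and retries remain.

    `attempt` is the 1-based number of the run that just failed.
    """
    if attempt >= max_attempts:
        return False
    signature = SHUTDOWN_SIGNATURE.lower()
    return any(signature in message.lower() for message in annotations)


if __name__ == "__main__":
    lost = ["The runner has received a shutdown signal. The job was cancelled."]
    real_failure = ["Process completed with exit code 1."]
    print(should_requeue(lost, attempt=1))
    print(should_requeue(real_failure, attempt=1))
```

Matching on a substring keeps genuine test failures from being retried, which is what stops this from masking real breakage.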

Observability: what to actually watch

Most ARC dashboards I've seen are useless because they show pod counts. Pod counts are an output, not a signal. The metrics that predict pain:

gha_assigned_jobs vs gha_running_jobs gap. This is queue depth. If it's growing, you're behind.
Image pull duration p95, scraped from kubelet. If this trends up, something changed in your image and you'll feel it tomorrow.
Node provision latency from Karpenter. When this exceeds ~90 seconds, your "fast" jobs aren't fast anymore.
Listener session age. A listener that hasn't re-sessioned in 6+ hours is healthy. One that's churning every few minutes is being rate-limited or has a network problem.
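
As a sketch of the first signal, here is how the queue-depth gap and a "growing" check could be computed from two scraped series. The function names, the three-sample window, and the sample values are illustrative assumptions; in practice this would more likely be a recording rule in your metrics stack than Python.

```python
def queue_depth(assigned: list[float], running: list[float]) -> list[float]:
    """Per-sample gap between assigned and running jobs, floored at zero."""
    if len(assigned) != len(running):
        raise ValueError("series must be the same length")
    return [max(float(a) - float(r), 0.0) for a, r in zip(assigned, running)]


def is_falling_behind(depth: list[float], window: int = 3) -> bool:
    """True if the last `window` samples are strictly increasing."""
    recent = depth[-window:]
    if len(recent) < window:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical samples of gha_assigned_jobs and gha_running_jobs.
    assigned = [10, 40, 90, 160]
    running = [10, 30, 50, 70]
    depth = queue_depth(assigned, running)
    print(depth, is_falling_behind(depth))
```

Alerting on the trend rather than the absolute gap avoids paging on a burst that is large but already draining.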

What I'd do differently

If I were starting over on a new cluster today:

Start with the runner image work before anything else. Slim image + pre-pulled = you can defer a lot of other tuning.
Don't bother with minRunners above zero until you have data showing you need it.
Set the listener memory limits up front. The default is wrong.
Put each team on its own runner scale set, not a shared one. Noisy-neighbor isolation is free and the labels make billing tractable.

ARC is genuinely good software now in a way it wasn't in 2023. Most of the pain in 2026 is no longer ARC itself — it's the choices you make around it.