This post walks through setting up GitHub Actions runners with ARC (Actions Runner Controller) on AWS EKS. It includes Terraform code for provisioning the infrastructure and a custom image for the runner pods. It also covers cost and performance optimizations, using Karpenter for autoscaling along with other best practices.
Setup
We set up EKS and Karpenter v1.0.2 using Terraform to provision the infrastructure. Complete setup code is available here: https://github.com/WarpBuilds/github-arc-setup
EKS Cluster Setup
The EKS cluster was provisioned using Terraform and runs on Kubernetes v1.30.
A key aspect of our setup was using a dedicated node group for essential add-ons, keeping them isolated from other workloads. The default-ng node group uses t3.xlarge instance types, with taints to ensure that only critical workloads (networking, DNS management, node management, ARC controllers, etc.) can be scheduled on these nodes.
module "eks" {
  source                         = "terraform-aws-modules/eks/aws"
  cluster_name                   = local.cluster_name
  cluster_version                = "1.30"
  cluster_endpoint_public_access = true

  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    vpc-cni                = {}
  }

  subnet_ids = var.private_subnet_ids
  vpc_id     = var.vpc_id

  eks_managed_node_groups = {
    default-ng = {
      # Managed node groups in recent versions of this module take *_size
      # arguments; the older *_capacity names are not read.
      desired_size = 2
      max_size     = 5
      min_size     = 1

      instance_types = ["t3.xlarge"]

      subnet_ids = var.private_subnet_ids

      # Keep runner workloads off the add-on nodes
      taints = {
        addons = {
          key    = "CriticalAddonsOnly"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }
  }

  # Lets Karpenter discover the node security group by tag
  node_security_group_tags = merge(local.tags, {
    "karpenter.sh/discovery" = local.cluster_name
  })

  enable_cluster_creator_admin_permissions = true
  tags                                     = local.tags
}
Private Subnets and NAT Gateway
The EKS nodes are in private subnets, allowing them to communicate with external resources through a NAT Gateway. This configuration ensures node connectivity without exposing them directly to external traffic.
Karpenter for Autoscaling
Karpenter provides fast and flexible autoscaling of the nodes to optimize cost and resource efficiency. We explore a few variations of configuration to reduce over-provisioning and unnecessary costs.
- Karpenter v1.0.2: We chose the latest version of Karpenter at the time of writing.
- Amazon Linux 2023 (AL2023): The default NodeClass provisions nodes with AL2023, and each node is configured with 300GiB of EBS storage. This additional storage is crucial for workloads that require high disk usage, such as CI/CD runners, preventing out-of-disk errors commonly encountered with default node storage (17GiB). This needs to be increased based on the number of jobs expected to run on a node in parallel.
- Private Subnet Selection: The NodeClass is configured to use the private subnets created earlier. This ensures that nodes are spun up in a secure, isolated environment, consistent with the EKS cluster's network setup.
- m7a Node Families: Using the NodePool resource, we restricted node provisioning to the m7a instance family. These instances were chosen for their performance-to-cost efficiency and are only provisioned in the us-east-1a and us-east-1b Availability Zones.
- On-demand Instances: While Karpenter supports Spot Instances for cost savings, we opted for on-demand instances for an equivalent cost comparison.
- Consolidation Policy: We configured a 5-minute consolidation delay with the WhenEmpty policy, preventing premature node terminations that could disrupt workflows. Karpenter only removes a node once it has been empty for at least 5 minutes, which keeps operations stable during peak workloads.
module "karpenter" {
  source       = "terraform-aws-modules/eks/aws//modules/karpenter"
  cluster_name = module.eks.cluster_name

  enable_pod_identity             = true
  create_pod_identity_association = true

  create_instance_profile = true

  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = local.tags
}

resource "helm_release" "karpenter-crd" {
  namespace        = "karpenter"
  create_namespace = true
  name             = "karpenter-crd"
  repository       = "oci://public.ecr.aws/karpenter"
  chart            = "karpenter-crd"
  version          = "1.0.2"
  wait             = true
  values           = []
}

resource "helm_release" "karpenter" {
  depends_on       = [helm_release.karpenter-crd]
  namespace        = "karpenter"
  create_namespace = true
  name             = "karpenter"
  repository       = "oci://public.ecr.aws/karpenter"
  chart            = "karpenter"
  version          = "1.0.2"
  wait             = true

  skip_crds = true

  values = [
    <<-EOT
    serviceAccount:
      name: ${module.karpenter.service_account}
    settings:
      clusterName: ${module.eks.cluster_name}
      clusterEndpoint: ${module.eks.cluster_endpoint}
    EOT
  ]
}

resource "kubectl_manifest" "karpenter_node_class" {
  yaml_body = <<-YAML
    apiVersion: karpenter.k8s.aws/v1beta1
    kind: EC2NodeClass
    metadata:
      name: default
    spec:
      amiFamily: AL2023
      detailedMonitoring: true
      blockDeviceMappings:
        - deviceName: /dev/xvda
          ebs:
            volumeSize: 300Gi
            volumeType: gp3
            deleteOnTermination: true
            iops: 5000
            throughput: 500
      instanceProfile: ${module.karpenter.instance_profile_name}
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: ${module.eks.cluster_name}
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: ${module.eks.cluster_name}
      tags:
        karpenter.sh/discovery: ${module.eks.cluster_name}
        Project: arc-test-praj
  YAML

  depends_on = [
    helm_release.karpenter,
    helm_release.karpenter-crd
  ]
}

resource "kubectl_manifest" "karpenter_node_pool" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          tags:
            Project: arc-test-praj
          nodeClassRef:
            name: default
          requirements:
            - key: "karpenter.k8s.aws/instance-category"
              operator: In
              values: ["m"]
            - key: "karpenter.k8s.aws/instance-family"
              operator: In
              values: ["m7a"]
            - key: "karpenter.k8s.aws/instance-cpu"
              operator: In
              values: ["4", "8", "16", "32", "64"]
            - key: "karpenter.k8s.aws/instance-generation"
              operator: Gt
              values: ["2"]
            - key: "topology.kubernetes.io/zone"
              operator: In
              values: ["us-east-1a", "us-east-1b"]
            - key: "kubernetes.io/arch"
              operator: In
              values: ["amd64"]
            - key: "karpenter.sh/capacity-type"
              operator: In
              values: ["on-demand"]
      limits:
        cpu: 1000
      disruption:
        consolidationPolicy: WhenEmpty
        consolidateAfter: 5m
  YAML

  depends_on = [
    kubectl_manifest.karpenter_node_class
  ]
}
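The disruption settings in the NodePool can be read as a simple rule: a node becomes a removal candidate only after it has held no runner pods for the full delay. The snippet below is a simplified model of that rule, not Karpenter's actual implementation. The node shape (`runnerPods`, `emptySince`) and both function names are hypothetical, and the duration parser only handles single-unit values such as `5m`.

```javascript
// Parse simple durations such as "30s", "5m" or "1h" into milliseconds.
// Compound values like "1h30m" are deliberately not handled in this sketch.
function parseDuration(text) {
  const match = /^(\d+)(s|m|h)$/.exec(text);
  if (!match) throw new Error(`unsupported duration: ${text}`);
  const unitMs = { s: 1000, m: 60_000, h: 3_600_000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Simplified WhenEmpty rule: a node qualifies only if it runs no runner pods
// and has been empty for at least `consolidateAfter`.
// `node.runnerPods` and `node.emptySince` (epoch ms) are hypothetical fields.
function isConsolidationCandidate(node, now, consolidateAfter = "5m") {
  if (node.runnerPods > 0) return false;
  return now - node.emptySince >= parseDuration(consolidateAfter);
}

const now = Date.now();
const busy = { runnerPods: 2, emptySince: now - 10 * 60_000 };
const justEmptied = { runnerPods: 0, emptySince: now - 2 * 60_000 };
const idle = { runnerPods: 0, emptySince: now - 6 * 60_000 };

console.log(isConsolidationCandidate(busy, now)); // false
console.log(isConsolidationCandidate(justEmptied, now)); // false
console.log(isConsolidationCandidate(idle, now)); // true
```

A node that picks up a new job within the 5-minute window is never removed under this rule, which is the behavior the delay is meant to protect.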
Variant #2: We also ran another setup with a single job per node to compare the performance and cost implications of running multiple jobs on a single node.
-   - key: "karpenter.k8s.aws/instance-cpu"
-     operator: In
-     values: ["4", "8", "16", "32", "64"]
+   - key: "karpenter.k8s.aws/instance-cpu"
+     operator: In
+     values: ["8"]
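To see why restricting `instance-cpu` to 8 yields one job per node, it helps to estimate how many runner pods fit on each allowed m7a size. The sketch below divides node capacity by the runner's resource requests (4 CPU, 16 GiB). The `reserved` overhead for the kubelet and daemonsets is an assumed placeholder rather than a measured value, and `runnersPerNode` is a hypothetical helper, so treat the output as a rough estimate.

```javascript
// Estimate how many runner pods fit on a node, judged by CPU and memory requests.
// `reserved` is an assumed allowance for kubelet/daemonset overhead.
function runnersPerNode(node, request, reserved = { cpu: 0.5, memGiB: 2 }) {
  const byCpu = Math.floor((node.cpu - reserved.cpu) / request.cpu);
  const byMem = Math.floor((node.memGiB - reserved.memGiB) / request.memGiB);
  return Math.max(0, Math.min(byCpu, byMem));
}

// vCPU counts allowed by the NodePool; m7a provides 4 GiB of memory per vCPU.
const m7aSizes = [4, 8, 16, 32, 64].map((cpu) => ({ cpu, memGiB: cpu * 4 }));
const request = { cpu: 4, memGiB: 16 }; // runner pod requests

const packing = m7aSizes.map((node) => {
  const runners = runnersPerNode(node, request);
  return {
    vcpu: node.cpu,
    runners,
    // Share of the 300 GiB root volume per runner when the node is full
    diskPerRunnerGiB: runners > 0 ? Math.floor(300 / runners) : null,
  };
});

console.table(packing);
```

Under these assumptions an 8 vCPU node holds a single runner, while a 64 vCPU node could hold around 15 runners sharing one EBS volume and one network interface.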
Actions Runner Controller and Runner Scale Set
Once Karpenter was configured, we proceeded to set up the GitHub Actions Runner Controller (ARC) and the Runner Scale Set using Helm.
The ARC setup was deployed with Helm using the following command and values:
helm upgrade arc \
  --namespace "${NAMESPACE}" \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --values runner-set-values.yaml --install
tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
This configuration applies tolerations to the controller, enabling it to run on nodes with the CriticalAddonsOnly taint (i.e. the default-ng node group), so that it doesn't compete with runner workloads.
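The matching rule between a toleration and a taint can be sketched as a small predicate. This is a simplified model of the Kubernetes rule (key, operator, value, and effect must line up), written for illustration; `tolerates` is a hypothetical helper, not part of any Kubernetes client. Note that the `NO_SCHEDULE` effect set through the EKS node group API appears on the node as `NoSchedule`.

```javascript
// Returns true when a single toleration matches a single taint.
function tolerates(toleration, taint) {
  // An empty effect on the toleration matches every taint effect.
  const effectOk = !toleration.effect || toleration.effect === taint.effect;
  // An empty key is only valid with operator "Exists" and matches every key.
  const keyOk = !toleration.key
    ? toleration.operator === "Exists"
    : toleration.key === taint.key;
  const operator = toleration.operator || "Equal"; // "Equal" is the default
  const valueOk =
    operator === "Exists" ||
    (operator === "Equal" && toleration.value === taint.value);
  return effectOk && keyOk && valueOk;
}

// A pod can be scheduled on a node only if every NoSchedule taint is tolerated.
function canSchedule(tolerations, taints) {
  return taints
    .filter((taint) => taint.effect === "NoSchedule")
    .every((taint) => tolerations.some((t) => tolerates(t, taint)));
}

const defaultNgTaints = [
  { key: "CriticalAddonsOnly", value: "true", effect: "NoSchedule" },
];
const controllerTolerations = [
  { key: "CriticalAddonsOnly", operator: "Equal", value: "true", effect: "NoSchedule" },
];

console.log(canSchedule(controllerTolerations, defaultNgTaints)); // true
console.log(canSchedule([], defaultNgTaints)); // false
```

The second call models a runner pod, which carries no tolerations and is therefore kept off the add-on nodes.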
Next, we set up the Runner Scale Set using another Helm command:
helm upgrade warp-praj-arc-test \
  --namespace "${NAMESPACE}" \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --values values.yaml --install
The key points for our Runner Scale Set configuration:
- GitHub App Integration: We connected our runners to GitHub via a GitHub App, enabling the runners to operate at the organization level.
- Listener Tolerations: Like the controller, the listener template also included tolerations to allow it to run on the default-ng node group.
- Custom Image for Runners: We used a custom Docker image for the runner pods (detailed in the next section).
- Resource Requirements: To simulate high-performance runners, the runner pods request 4 CPU cores and 16 GiB of RAM and are limited to 8 CPU cores and 32 GiB of RAM, so each pod can use up to the capacity of the 8x runner used in the workflows.
githubConfigUrl: "https://github.com/Warpbuilds"
githubConfigSecret:
  github_app_id: "<APP_ID>"
  github_app_installation_id: "<APP_INSTALLATION_ID>"
  github_app_private_key: |
    -----BEGIN RSA PRIVATE KEY-----
    [your-private-key-contents]
    -----END RSA PRIVATE KEY-----
  github_token: ""

listenerTemplate:
  spec:
    containers:
      - name: listener
        securityContext:
          runAsUser: 1000
    tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

template:
  spec:
    containers:
      - name: runner
        image: <public_ecr_image_url>
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            cpu: "8"
            memory: "32Gi"

controllerServiceAccount:
  namespace: arc-systems
  name: arc-gha-rs-controller
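Because the runner container's requests (4 CPU, 16 GiB) differ from its limits (8 CPU, 32 GiB), Kubernetes places the pod in the Burstable QoS class: the scheduler reserves only the requested amount, and the pod may burst up to the limit when the node has spare capacity. The sketch below is a simplified version of the classification rule; `qosClass` is a hypothetical helper, and it compares quantities as plain strings, so equivalent values written differently (such as "8" and "8000m") would not be treated as equal.

```javascript
// Simplified Kubernetes QoS classification for a pod's containers.
function qosClass(containers) {
  const resources = ["cpu", "memory"];

  // BestEffort: no container sets any CPU or memory request or limit.
  const anySet = containers.some((c) =>
    resources.some(
      (r) => c.requests?.[r] !== undefined || c.limits?.[r] !== undefined,
    ),
  );
  if (!anySet) return "BestEffort";

  // Guaranteed: every container has limits for both resources and its
  // requests equal those limits (an omitted request defaults to the limit).
  const guaranteed = containers.every((c) =>
    resources.every(
      (r) =>
        c.limits?.[r] !== undefined &&
        (c.requests?.[r] ?? c.limits[r]) === c.limits[r],
    ),
  );
  return guaranteed ? "Guaranteed" : "Burstable";
}

const runner = {
  requests: { cpu: "4", memory: "16Gi" },
  limits: { cpu: "8", memory: "32Gi" },
};

console.log(qosClass([runner])); // Burstable
```

Setting requests equal to limits would make the pod Guaranteed and reserve a full 8 CPU per runner, at the cost of fitting fewer runners on each node.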
Custom Image for Runner Pods
By default, the Runner Scale Sets use GitHub's official actions-runner image. However, this image doesn't include essential utilities such as wget, curl, and git, which are required by various workflows.
To address this, we created a custom Docker image based on GitHub's runner image, adding the necessary tools. This image was hosted in a public ECR repository and was used by the runner pods during our tests. The custom image allowed us to run workflows without missing dependencies and ensured smooth execution.
FROM ghcr.io/actions/actions-runner:2.319.1
RUN sudo apt-get update && sudo apt-get install -y wget curl unzip git
RUN sudo apt-get clean && sudo rm -rf /var/lib/apt/lists/*
This approach ensured that our runners were always equipped with the required utilities, preventing errors and reducing friction during workflow runs.
Tagging Infrastructure for Cost Tracking
To track costs effectively, we tagged the infrastructure resources created in this process and collected hourly cost data. AWS Cost Explorer lets us monitor and attribute costs to specific resources based on these tags. This was essential for calculating the true cost of running ARC, including EC2, EBS, VPC, S3, NAT Gateway, and data ingress/egress.
Running workflows
We use PostHog OSS as an example repo to demonstrate the cost comparison on real-world use cases over 960 jobs. The duty cycle is a representative 2-hour period with a continuous load of commits, each triggering a job every few minutes.
PostHog's Frontend CI Workflow
To simulate a real-world use case, we leveraged PostHog's Frontend CI workflow. This workflow is designed to run a series of frontend checks, followed by two sets of jobs: one for code quality checks and another for executing a matrix of Jest tests.
You can view the workflow file here: PostHog Frontend CI Workflow
Auto-Commit Simulation Script
To ensure continuous triggering of the Frontend CI workflow, we developed an automated commit script in JavaScript. This script generates commits every minute on the forked PostHog repository, which in turn triggers the CI workflow.
The script is designed to run for two hours, ensuring a consistent workload over an extended period for accurate cost measurement. The results were then analyzed to compare the costs of using ARC versus WarpBuild's BYOC runners.
Commit simulation script:
const { exec, execSync } = require("child_process");
const fs = require("fs");
const path = require("path");

const repoPath = "arc-setup/posthog";
const frontendDir = path.join(repoPath, "frontend");
const intervalTime = 1 * 60 * 1000; // every minute
const maxRunTime = 2 * 60 * 60 * 1000; // 2 hours

// Run synchronously so the identity is configured before the first commit
const setupGitConfig = () => {
  execSync('git config user.name "Auto Commit Script"', { cwd: repoPath });
  // Placeholder address
  execSync('git config user.email "auto-commit@example.com"', { cwd: repoPath });
};

const makeCommit = () => {
  const logFilePath = path.join(frontendDir, "commit_log.txt");
  // git runs with cwd = repoPath, so it needs a path relative to the repo root
  const logFileInRepo = path.relative(repoPath, logFilePath);

  // Create the frontend directory if it doesn't exist
  fs.mkdirSync(frontendDir, { recursive: true });

  // Write to commit_log.txt in the frontend directory
  fs.appendFileSync(
    logFilePath,
    `Auto commit in frontend at ${new Date().toISOString()}\n`,
  );

  // Add, commit, and push changes
  exec(`git add "${logFileInRepo}"`, { cwd: repoPath }, (err) => {
    if (err) return console.error("Error adding file:", err);
    exec(
      `git commit -m "Auto commit at ${new Date().toISOString()}"`,
      { cwd: repoPath },
      (err) => {
        if (err) return console.error("Error committing changes:", err);
        exec("git push origin master", { cwd: repoPath }, (err) => {
          if (err) return console.error("Error pushing changes:", err);
          console.log("Changes pushed successfully");
        });
      },
    );
  });
};

setupGitConfig();
const interval = setInterval(makeCommit, intervalTime);

// Stop the script after 2 hours
setTimeout(() => {
  clearInterval(interval);
  console.log("Script completed after 2 hours");
}, maxRunTime);
Results
Performance and Scalability
The following metrics show the average time taken by ARC runners for jobs in the Frontend CI workflow. All jobs run on the same underlying CPU family (m7a) and request the same amount of resources (vCPU and memory).
Test | ARC (Varied Node Sizes) | ARC (1 Job Per Node) |
---|---|---|
Code Quality Checks | ~9 minutes 30 seconds | ~7 minutes |
Jest Test (FOSS) | ~2 minutes 10 seconds | ~1 minute 30 seconds |
Jest Test (EE) | ~1 minute 35 seconds | ~1 minute 25 seconds |
ARC runners with varied node sizes exhibited slower performance primarily because multiple runners shared disk and network resources on the same node, causing bottlenecks despite larger node sizes.
To address these bottlenecks, we tested a 1 Job Per Node configuration with ARC, where each job ran on its own node. This approach significantly improved performance. However, it introduced higher job start delays due to the time required to provision new nodes.
Note: Job start delays are directly influenced by the time needed to provision a new node and pull the container image. Larger image sizes increase pull times, leading to longer delays. If the image size is reduced, additional tools would need to be installed during the action run, increasing the overall workflow run time.
Node spin-up and image pull take ~45s to 1.5m for ARC runners. This is a significant overhead for workflows that run multiple jobs.
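The relative improvement from the 1 Job Per Node configuration can be computed directly from the average timings in the table above. The snippet below is a small worked calculation using only those figures; `toSeconds` and `reductionPercent` are helper names introduced for this example, and start delays are not included.

```javascript
const toSeconds = (minutes, seconds = 0) => minutes * 60 + seconds;

// Average job durations from the performance table, in seconds
const timings = [
  { test: "Code Quality Checks", varied: toSeconds(9, 30), single: toSeconds(7) },
  { test: "Jest Test (FOSS)", varied: toSeconds(2, 10), single: toSeconds(1, 30) },
  { test: "Jest Test (EE)", varied: toSeconds(1, 35), single: toSeconds(1, 25) },
];

// Percentage reduction in run time, rounded to one decimal place
function reductionPercent(before, after) {
  return Number((((before - after) / before) * 100).toFixed(1));
}

for (const { test, varied, single } of timings) {
  console.log(`${test}: ${reductionPercent(varied, single)}% faster`);
}
// Code Quality Checks: 26.3% faster
// Jest Test (FOSS): 30.8% faster
// Jest Test (EE): 10.5% faster
```

The gain is largest for the longer, more I/O-heavy jobs, which is consistent with disk and network contention being the bottleneck on shared nodes.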
Cost Comparison
Category | ARC (Varied Node Sizes) | ARC (1 Job Per Node) |
---|---|---|
Total Jobs Run | 960 | 960 |
Node Type | m7a (varied vCPUs) | m7a.2xlarge |
Max K8s Nodes | 8 | 27 |
Storage | 300GiB per node | 150GiB per node |
IOPS | 5000 per node | 5000 per node |
Throughput | 500 MiB/s per node | 500 MiB/s per node |
Compute | $27.20 | $22.98 |
EC2-Other | $18.45 | $19.39 |
VPC | $0.23 | $0.23 |
S3 | $0.001 | $0.001 |
Total Cost | $45.88 | $42.60 |
The cost comparison shows that ARC with 1 job per node is more cost effective than ARC with varied node sizes. This is also the more performant setup.
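To put the totals in per-job terms, the figures from the table can be summed and divided by the 960 jobs in each run. The snippet below is a worked calculation using only the table's numbers; the object layout and function names are chosen for this example.

```javascript
// Cost line items (USD) and job counts taken from the cost table
const runs = {
  varied: { jobs: 960, costs: { compute: 27.2, ec2Other: 18.45, vpc: 0.23, s3: 0.001 } },
  single: { jobs: 960, costs: { compute: 22.98, ec2Other: 19.39, vpc: 0.23, s3: 0.001 } },
};

const totalCost = (run) =>
  Object.values(run.costs).reduce((sum, cost) => sum + cost, 0);

const costPerJob = (run) => totalCost(run) / run.jobs;

// Relative saving of the cheaper run compared with the baseline, in percent
const savingPercent = (baseline, candidate) =>
  ((totalCost(baseline) - totalCost(candidate)) / totalCost(baseline)) * 100;

console.log(costPerJob(runs.varied).toFixed(4)); // 0.0478
console.log(costPerJob(runs.single).toFixed(4)); // 0.0444
console.log(savingPercent(runs.varied, runs.single).toFixed(1)); // 7.1
```

That works out to roughly 4.8 cents per job with varied node sizes and 4.4 cents with one job per node, a saving of about 7%.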
Conclusion
ARC provides a flexible and scalable solution for running GitHub Actions workflows. It is important to configure it correctly to avoid performance bottlenecks and optimize costs.
However, it comes with operational overhead (Kubernetes cluster management, Terraform, etc.) and requires continuous maintenance to operate at scale and keep the runner binaries updated.
Despite these challenges, ARC is a powerful tool for running GitHub Actions workflows at scale, and it is 10x cheaper than the default GitHub Actions runners.
Tip
WarpBuild provides the same flexibility as actions-runner-controller but with none of the operational complexity.
WarpBuild runners are also more cost effective than ARC runners, with a ~41% cost saving.
Get started with WarpBuild in ~3 minutes for faster job start times, caching backed by object storage, and easy to use dashboards. Book a call or get started today!