Cost comparison: GitHub Actions Runner Controller (ARC) and WarpBuild

18 September 2024 · 26 minute read

In this case study, we explore the cost, flexibility, and management aspects of running your own GitHub Actions runners with ARC (Actions Runner Controller) versus using WarpBuild's Bring Your Own Cloud (BYOC) offering on AWS.

TL;DR

In this case study, we compare setting up GitHub's Actions Runner Controller (ARC) on EKS, with Karpenter for autoscaling, against WarpBuild's BYOC offering. We found that ARC comes with significant operational overhead and efficiency challenges. WarpBuild's BYOC solution, on the other hand, provides better performance, ease of use, and lower operational costs, making it a more suitable choice for teams, especially those running large volumes of CI/CD workflows.

Cost Comparison Highlights: The comparison covers a representative 2-hour period with a continuous stream of commits, each triggering a job. We use the PostHog OSS repository as an example to demonstrate the comparison on a real-world workload of 960 jobs.

  • ARC Setup Cost (for the analyzed period): $42.60
  • WarpBuild BYOC Cost: $25.20

This works out to roughly 41% in savings: ($42.60 − $25.20) / $42.60 ≈ 0.41. You can find the detailed breakdown in the Cost Comparison section below.

The following sections describe the setup of ARC Runners on EKS, and the assumptions that went into this.

Setting up ARC Runners on EKS

We set up Karpenter v1 and EKS using Terraform to provision the infrastructure. This approach gave us more control, automation, and consistency in deploying and managing the EKS cluster and related resources.

The complete setup code is available at https://github.com/WarpBuilds/github-arc-setup

EKS Cluster Setup

The EKS cluster was provisioned using Terraform and runs Kubernetes v1.30. A key aspect of our setup was using a dedicated node group for essential add-ons, keeping them isolated from other workloads. The default-ng node group uses t3.xlarge instances, with taints ensuring that only critical workloads, such as networking, DNS management, node management, and the ARC controller, can be scheduled on these nodes.

1module "eks" {
2  source                           = "terraform-aws-modules/eks/aws"
3  cluster_name                     = local.cluster_name
4  cluster_version                  = "1.30"
5  cluster_endpoint_public_access   = true
6
7  cluster_addons = {
8    coredns                     = {}
9    eks-pod-identity-agent      = {}
10    kube-proxy                  = {}
11    vpc-cni                     = {}
12  }
13
14  subnet_ids = var.private_subnet_ids
15  vpc_id     = var.vpc_id
16
17  eks_managed_node_groups = {
18    default-ng = {
19      desired_capacity = 2
20      max_capacity     = 5
21      min_capacity     = 1
22
23      instance_types = ["t3.xlarge"]
24
25      subnet_ids = var.private_subnet_ids
26
27      taints = {
28        addons = {
29          key    = "CriticalAddonsOnly"
30          value  = "true"
31          effect = "NO_SCHEDULE"
32        }
33      }
34    }
35  }
36
37  node_security_group_tags = merge(local.tags, {
38    "karpenter.sh/discovery" = local.cluster_name
39  })
40
41  enable_cluster_creator_admin_permissions = true
42  tags                                     = local.tags
43}

Private Subnets and NAT Gateway

To secure the infrastructure, we placed the EKS nodes in private subnets and routed their outbound traffic through a NAT Gateway. The nodes can still reach the internet for essential tasks (pulling images, talking to GitHub) without being directly exposed to inbound traffic, which improves the cluster's security posture while preserving the necessary external connectivity.
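For reference, this network layout can be sketched with the community terraform-aws-modules/vpc module. The sketch below is illustrative only; the CIDR ranges, availability zones, and names are assumptions, and the actual definitions live in the setup repository linked above.

# Illustrative VPC layout: private subnets for nodes, public subnets for the NAT Gateway
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "arc-test-vpc" # assumed name
  cidr = "10.0.0.0/16"  # assumed CIDR

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]     # EKS and Karpenter nodes live here
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"] # NAT Gateway lives here

  # Route outbound traffic from the private subnets through a single NAT Gateway
  enable_nat_gateway = true
  single_nat_gateway = true

  # Tag private subnets so Karpenter's subnetSelectorTerms can discover them
  private_subnet_tags = {
    "karpenter.sh/discovery" = local.cluster_name
  }

  tags = local.tags
}

A single NAT Gateway keeps costs down for an experiment like this; for higher availability, the same module can provision one NAT Gateway per AZ instead.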

Karpenter for Autoscaling

To manage autoscaling of the nodes and optimize cost and resource efficiency, we utilized Karpenter, which offers a more flexible and cost-effective alternative to the Kubernetes Cluster Autoscaler. Karpenter allows nodes to be created and terminated dynamically based on real-time resource needs, reducing over-provisioning and unnecessary costs.

We deployed Karpenter using Terraform and Helm, with some notable configurations:

  • Karpenter v1.0.2: We used the latest version of Karpenter available at the time of writing.
  • Amazon Linux 2023 (AL2023): The default NodeClass provisions nodes with AL2023, and each node is configured with 300GiB of EBS storage. This additional storage is crucial for workloads that require high disk usage, such as CI/CD runners, preventing out-of-disk errors commonly encountered with default node storage (17GiB). This needs to be increased based on the number of jobs expected to run on a node in parallel.
  • Private Subnet Selection: The NodeClass is configured to use the private subnets created earlier. This ensures that nodes are spun up in a secure, isolated environment, consistent with the EKS cluster's network setup.
  • m7a Node Family: Using the NodePool resource, we restricted node provisioning to the m7a instance family. These instances were chosen for their price-to-performance efficiency and are only provisioned in the us-east-1a and us-east-1b Availability Zones.
  • On-demand Instances: While Karpenter supports Spot Instances for cost savings, we opted for on-demand instances for an equivalent cost comparison.
  • Consolidation Policy: We configured a consolidationPolicy of WhenEmpty with a 5-minute consolidateAfter delay, so Karpenter only removes nodes once they have been empty for at least 5 minutes. This prevents premature node terminations that could disrupt in-flight workflows while still reclaiming idle capacity.
1module "karpenter" {
2  source       = "terraform-aws-modules/eks/aws//modules/karpenter"
3  cluster_name = module.eks.cluster_name
4
5  enable_pod_identity             = true
6  create_pod_identity_association = true
7
8  create_instance_profile = true
9
10  node_iam_role_additional_policies = {
11    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
12  }
13
14  tags = local.tags
15}
16
17resource "helm_release" "karpenter-crd" {
18  namespace        = "karpenter"
19  create_namespace = true
20  name             = "karpenter-crd"
21  repository       = "oci://public.ecr.aws/karpenter"
22  chart            = "karpenter-crd"
23  version          = "1.0.2"
24  wait             = true
25  values           = []
26}
27
28resource "helm_release" "karpenter" {
29  depends_on       = [helm_release.karpenter-crd]
30  namespace        = "karpenter"
31  create_namespace = true
32  name             = "karpenter"
33  repository       = "oci://public.ecr.aws/karpenter"
34  chart            = "karpenter"
35  version          = "1.0.2"
36  wait             = true
37
38  skip_crds = true
39
40  values = [
41    <<-EOT
42    serviceAccount:
43      name: ${module.karpenter.service_account}
44    settings:
45      clusterName: ${module.eks.cluster_name}
46      clusterEndpoint: ${module.eks.cluster_endpoint}
47    EOT
48  ]
49}
50
51resource "kubectl_manifest" "karpenter_node_class" {
52  yaml_body = <<-YAML
53    apiVersion: karpenter.k8s.aws/v1beta1
54    kind: EC2NodeClass
55    metadata:
56      name: default
57    spec:
58      amiFamily: AL2023
59      detailedMonitoring: true
60      blockDeviceMappings:
61        - deviceName: /dev/xvda
62          ebs:
63            volumeSize: 300Gi
64            volumeType: gp3
65            deleteOnTermination: true
66            iops: 5000
67            throughput: 500
68      instanceProfile: ${module.karpenter.instance_profile_name}
69      subnetSelectorTerms:
70        - tags:
71            karpenter.sh/discovery: ${module.eks.cluster_name}
72      securityGroupSelectorTerms:
73        - tags:
74            karpenter.sh/discovery: ${module.eks.cluster_name}
75      tags:
76        karpenter.sh/discovery: ${module.eks.cluster_name}
77        Project: arc-test-praj
78  YAML
79
80  depends_on = [
81    helm_release.karpenter,
82    helm_release.karpenter-crd
83  ]
84}
85
86resource "kubectl_manifest" "karpenter_node_pool" {
87  yaml_body = <<-YAML
88    apiVersion: karpenter.sh/v1beta1
89    kind: NodePool
90    metadata:
91      name: default
92    spec:
93      template:
94        spec:
95          tags:
96            Project: arc-test-praj
97          nodeClassRef:
98            name: default
99          requirements:
100            - key: "karpenter.k8s.aws/instance-category"
101              operator: In
102              values: ["m"]
103            - key: "karpenter.k8s.aws/instance-family"
104              operator: In
105              values: ["m7a"]
106            - key: "karpenter.k8s.aws/instance-cpu"
107              operator: In
108              values: ["4", "8", "16", "32", "64"]
109            - key: "karpenter.k8s.aws/instance-generation"
110              operator: Gt
111              values: ["2"]
112            - key: "topology.kubernetes.io/zone"
113              operator: In
114              values: ["us-east-1a", "us-east-1b"]
115            - key: "kubernetes.io/arch"
116              operator: In
117              values: ["amd64"]
118            - key: "karpenter.sh/capacity-type"
119              operator: In
120              values: ["on-demand"]
121      limits:
122        cpu: 1000
123      disruption:
124        consolidationPolicy: WhenEmpty
125        consolidateAfter: 5m
126  YAML
127
128  depends_on = [
129    kubectl_manifest.karpenter_node_class
130  ]
131}

We also ran a second setup with a single job per node, to compare the performance and cost implications against running multiple jobs on one node. The only change was pinning the NodePool's instance-cpu requirement to 8 vCPUs:

1- key: "karpenter.k8s.aws/instance-cpu"
2- operator: In
3- values: ["4", "8", "16", "32", "64"]
4+ key: "karpenter.k8s.aws/instance-cpu"
5+ operator: In
6+ values: ["8"]

Actions Runner Controller and Runner Scale Set

Once Karpenter was configured, we proceeded to set up the GitHub Actions Runner Controller (ARC) and the Runner Scale Set using Helm.

The ARC setup was deployed with Helm using the following command and values:

helm upgrade arc \
    --namespace "${NAMESPACE}" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
    --values runner-set-values.yaml --install
The controller values file (runner-set-values.yaml) adds the following tolerations:

tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

This configuration applies tolerations to the controller, allowing it to run on nodes with the CriticalAddonsOnly taint (i.e., the default-ng node group) and ensuring it doesn't compete with the runner workloads.

Next, we set up the Runner Scale Set using another Helm command:

helm upgrade warp-praj-arc-test \
    --namespace "${NAMESPACE}" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
    --values values.yaml --install

The key points for our Runner Scale Set configuration:

  • GitHub App Integration: We connected our runners to GitHub via a GitHub App, enabling the runners to operate at the organization level.
  • Listener Tolerations: Like the controller, the listener template also included tolerations to allow it to run on the default-ng node group.
  • Custom Image for Runners: We used a custom Docker image for the runner pods (detailed in the next section).
  • Resource Requirements: To simulate high-performance runners, the runner pods were configured with requests of 4 vCPUs and 16 GiB of RAM and limits of 8 vCPUs and 32 GiB, matching the 8x runner size used in the workflows.
githubConfigUrl: "https://github.com/Warpbuilds"
githubConfigSecret:
  github_app_id: "<APP_ID>"
  github_app_installation_id: "<APP_INSTALLATION_ID>"
  github_app_private_key: |
    -----BEGIN RSA PRIVATE KEY-----
    [your-private-key-contents]
    -----END RSA PRIVATE KEY-----
  github_token: ""

listenerTemplate:
  spec:
    containers:
      - name: listener
        securityContext:
          runAsUser: 1000
    tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

template:
  spec:
    containers:
      - name: runner
        image: <public_ecr_image_url>
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            cpu: "8"
            memory: "32Gi"

controllerServiceAccount:
  namespace: arc-systems
  name: arc-gha-rs-controller

Custom Image for Runner Pods

By default, the Runner Scale Sets use GitHub's official actions-runner image. However, this image doesn't include essential utilities such as wget, curl, and git, which are required by various workflows.

To address this, we created a custom Docker image based on GitHub's runner image, adding the necessary tools. This image was hosted in a public ECR repository and was used by the runner pods during our tests. The custom image allowed us to run workflows without missing dependencies and ensured smooth execution.

FROM ghcr.io/actions/actions-runner:2.319.1
# Install common tools missing from the stock runner image, then trim apt caches
RUN sudo apt-get update \
    && sudo apt-get install -y wget curl unzip git \
    && sudo apt-get clean \
    && sudo rm -rf /var/lib/apt/lists/*

This approach ensured that our runners were always equipped with the required utilities, preventing errors and reducing friction during the workflow runs.
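If you want to manage the hosting side with Terraform as well, the sketch below shows one way to create the public ECR repository for this image. The repository name and catalog text are assumptions, and public ECR repositories can only be created in us-east-1.

# Hypothetical public ECR repository for the custom runner image
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1" # Public ECR is only available in us-east-1
}

resource "aws_ecrpublic_repository" "runner_image" {
  provider        = aws.us_east_1
  repository_name = "arc-custom-runner" # assumed name

  catalog_data {
    about_text = "actions-runner with wget, curl, unzip, and git preinstalled"
  }
}

The image is then built, pushed to this repository, and referenced as <public_ecr_image_url> in the Runner Scale Set values shown earlier.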

Tagging Infrastructure for Cost Tracking

To track costs effectively for the ARC setup, we applied cost allocation tags across all of the resources used and collected hourly cost data. AWS Cost Explorer then let us monitor and attribute costs to specific resources based on these tags, which was essential for calculating the true cost of running ARC compared to the WarpBuild BYOC solution.
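As an illustration, one way to apply such tags consistently, alongside the explicit tags = local.tags arguments in the Terraform above, is the AWS provider's default_tags block. This is a sketch; the Project tag matches the one used in the Karpenter manifests, while the other keys are assumptions.

# Apply cost allocation tags to every resource this provider creates.
# The tags must also be activated as cost allocation tags in the AWS
# Billing console before they show up in Cost Explorer.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Project     = "arc-test-praj" # matches the tag on the Karpenter-managed nodes
      Environment = "ci-benchmark"  # assumed
      ManagedBy   = "terraform"     # assumed
    }
  }
}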

Setting up BYOC Runners on WarpBuild

Adding Cloud Account

Setting up BYOC (Bring Your Own Cloud) runners on WarpBuild begins by connecting your own cloud account.

After signing up for WarpBuild, navigate to the BYOC page and follow the process to add your cloud account. This step is critical as it allows WarpBuild to provision and manage runners directly in your own AWS environment, providing greater control and flexibility.

Add Cloud Account Flow

Creating Stack

Once your cloud account is connected, you need to create a Stack in the WarpBuild dashboard. A WarpBuild Stack represents a group of essential infrastructure components, such as VPCs, subnets, and object storage buckets, provisioned in a specific region of your cloud account. These components are required for running CI workflows on WarpBuild.

Create Stack Flow

Custom Runner Creation

For this experiment, we also created a custom 8x runner. Although WarpBuild provides default stock runner configurations, creating a custom runner allowed us to match the specifications of the ARC runners.

WarpBuild runners are based on the Ubuntu 22.04 image, which is approximately 60GB in size. This image is pre-configured to work seamlessly with GitHub Actions workflows, offering better performance and compatibility than a general-purpose runner image.

While such an image would be impractical for an ARC setup due to the high storage costs incurred every time a new node is provisioned, WarpBuild manages this efficiently through its runner orchestration.

Create Runner Flow

Tagging Infrastructure for Cost Tracking

WarpBuild simplifies cost tracking for its users by automatically tagging all provisioned resources. This allows users to monitor and manage costs more effectively. Additionally, WarpBuild offers a dedicated dashboard where users can see real-time cost breakdowns, making cost management more transparent.

Workflow Simulation

PostHog's Frontend CI Workflow

To simulate a real-world use case, we leveraged PostHog's Frontend CI workflow. This workflow runs a series of frontend checks, followed by two sets of jobs: one for code quality checks and another for executing a matrix of Jest tests. This setup provided a comprehensive load for both the ARC and WarpBuild BYOC runners, allowing us to assess their performance under typical CI workloads.

You can view the workflow file here: PostHog Frontend CI Workflow

Auto-Commit Simulation Script

To ensure continuous triggering of the Frontend CI workflow, we developed an automated commit script in JavaScript. The script makes a commit to the forked PostHog repository every minute, which in turn triggers the CI workflow. The ARC and WarpBuild BYOC runners pick up these jobs simultaneously, letting us track costs and performance over time.

The script is designed to run for two hours, ensuring a consistent workload over an extended period for accurate cost measurement. The results were then analyzed to compare the costs of using ARC versus WarpBuild's BYOC runners.

Commit simulation script:

const { exec } = require("child_process");
const fs = require("fs");
const path = require("path");

const repoPath = "arc-setup/posthog";
const frontendDir = path.join(repoPath, "frontend");
const intervalTime = 1 * 60 * 1000; // Every minute
const maxRunTime = 2 * 60 * 60 * 1000; // 2 hours

const setupGitConfig = () => {
  exec('git config user.name "Auto Commit Script"', { cwd: repoPath });
  exec('git config user.email "[email protected]"', { cwd: repoPath });
};

const makeCommit = () => {
  const logFilePath = path.join(frontendDir, "commit_log.txt");

  // Create the frontend directory if it doesn't exist
  if (!fs.existsSync(frontendDir)) {
    fs.mkdirSync(frontendDir);
  }

  // Write to commit_log.txt in the frontend directory
  fs.appendFileSync(
    logFilePath,
    `Auto commit in frontend at ${new Date().toISOString()}\n`,
  );

  // Add, commit, and push changes
  exec(`git add ${logFilePath}`, { cwd: repoPath }, (err) => {
    if (err) return console.error("Error adding file:", err);
    exec(
      `git commit -m "Auto commit at ${new Date().toISOString()}"`,
      { cwd: repoPath },
      (err) => {
        if (err) return console.error("Error committing changes:", err);
        exec("git push origin master", { cwd: repoPath }, (err) => {
          if (err) return console.error("Error pushing changes:", err);
          console.log("Changes pushed successfully");
        });
      },
    );
  });
};

setupGitConfig();
const interval = setInterval(makeCommit, intervalTime);

// Stop the script after 2 hours
setTimeout(() => {
  clearInterval(interval);
  console.log("Script completed after 2 hours");
}, maxRunTime);

Cost Comparison

| Category | ARC (Varied Node Sizes) | WarpBuild | ARC (1 Job Per Node) |
| --- | --- | --- | --- |
| Total Jobs Ran | 960 | 960 | 960 |
| Node Type | m7a (varied vCPUs) | m7a.2xlarge | m7a.2xlarge |
| Max K8s Nodes | 8 | - | 27 |
| Storage | 300 GiB per node | 150 GiB per runner | 150 GiB per node |
| IOPS | 5000 per node | 5000 per runner | 5000 per node |
| Throughput | 500 Mbps per node | 500 Mbps per runner | 500 Mbps per node |
| Compute | $27.20 | $20.83 | $22.98 |
| EC2-Other | $18.45 | $0.27 | $19.39 |
| VPC | $0.23 | $0.29 | $0.23 |
| S3 | $0.001 | $0.01 | $0.001 |
| WarpBuild Costs | - | $3.80 | - |
| Total Cost | $45.88 | $25.20 | $42.60 |

Performance and Scalability

The following metrics showcase the average time taken by WarpBuild BYOC Runners and ARC Runners for jobs in the Frontend-CI workflow:

| Test | ARC (Varied Node Sizes) | WarpBuild | ARC (1 Job Per Node) |
| --- | --- | --- | --- |
| Code Quality Checks | ~9 minutes 30 seconds | ~7 minutes | ~7 minutes |
| Jest Test (FOSS) | ~2 minutes 10 seconds | ~1 minute 30 seconds | ~1 minute 30 seconds |
| Jest Test (EE) | ~1 minute 35 seconds | ~1 minute 25 seconds | ~1 minute 25 seconds |

ARC runners exhibited slower performance primarily because multiple runners shared disk and network resources on the same node, causing bottlenecks despite larger node sizes. In contrast, WarpBuild's dedicated VM runners eliminated this resource contention, allowing jobs to complete faster.

To address these bottlenecks, we tested a 1 Job Per Node configuration with ARC, where each job ran on its own node. This approach significantly improved performance, matching the job times of WarpBuild runners. However, it introduced higher job start delays due to the time required to provision new nodes.

Note: Job start delays are directly influenced by the time needed to provision a new node and pull the container image. Larger image sizes increase pull times, leading to longer delays. If the image size is reduced, additional tools would need to be installed during the action run, increasing the overall workflow run time.

This is a trade-off you don't have to make with WarpBuild. You can optimize further using WarpBuild features such as custom images, snapshot runners, and more.

Conclusion

The cost and performance comparison between ARC and WarpBuild's BYOC offering shows clear advantages for WarpBuild. WarpBuild provides the same flexibility as ARC in configuring and scaling your own runners, but without the operational complexity and performance bottlenecks (such as resource contention on larger nodes), making it well suited for large-scale workloads. ARC's scalability is constrained by per-node resources like disk I/O and network throughput, which can degrade workflow performance even on high-performance nodes.

WarpBuild simplifies the entire process, offering better performance with lower operational overhead and lower costs. It handles provisioning and scaling seamlessly while maintaining performance, making it the ideal option for CI/CD management for high-performance teams.
