Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Listener is very slow to receive job messages and spawn worker pods. (20-30 seconds delay). #3534

Closed
4 tasks done
geekflyer opened this issue May 21, 2024 · 4 comments
Closed
4 tasks done
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers

Comments

@geekflyer
Copy link

geekflyer commented May 21, 2024

Checks

Controller Version

0.9.2

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install ARC 0.9.2.
2. Create a PR on a github repo that requests an arc-runner
3. Observe that it takes 20-30 seconds until the listener creates a worker pod

Describe the bug

Listener is very slow to receive messages and spawn worker pods.

Describe the expected behavior

Listener should spawn worker pods immediately or with like max 5s delay after they've been submitted to github.com.

With the legacy summerwind controller and webhook-driven scaling the controller used to create new pods/scale up pretty much immediately. The new controller seems to be a large step back with regards to that, unless I'm missing somthing?

Controller Logs

# note in this example I submitted a PR that requests a runner to github at 2024-05-21T06:15:00Z . As you can see the listener only does it's thing about 26 seconds later.


{"severity":"info","ts":"2024-05-21T06:14:57Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":45}
{"severity":"info","ts":"2024-05-21T06:15:26Z","logger":"listener-app.listener","message":"Processing message","messageId":46,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-05-21T06:15:26Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":1,"totalAcquiredJobs":0,"totalAssignedJobs":0,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-05-21T06:15:26Z","logger":"listener-app.listener","message":"Job available message received","jobId":2265756}
{"severity":"info","ts":"2024-05-21T06:15:26Z","logger":"listener-app.listener","message":"Acquiring jobs","count":1,"requestIds":"[2265756]"}
{"severity":"info","ts":"2024-05-21T06:15:26Z","logger":"listener-app.listener","message":"Jobs are acquired","count":1,"requestIds":"[2265756]"}
{"severity":"info","ts":"2024-05-21T06:15:26Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":46}
{"severity":"info","ts":"2024-05-21T06:15:27Z","logger":"listener-app.worker.kubernetesworker","message":"Calculated target runner count","assigned job":0,"decision":0,"min":0,"max":100,"currentRunnerCount":0,"jobsCompleted":0}
{"severity":"info","ts":"2024-05-21T06:15:27Z","logger":"listener-app.worker.kubernetesworker","message":"Compare","original":"{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}","patch":"{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"patchID\":0,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
{"severity":"info","ts":"2024-05-21T06:15:27Z","logger":"listener-app.worker.kubernetesworker","message":"Preparing EphemeralRunnerSet update","json":"{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
{"severity":"info","ts":"2024-05-21T06:15:27Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"arc-runners","name":"high-perf-docker-v2-cr5fw","replicas":0}
{"severity":"info","ts":"2024-05-21T06:15:27Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":46}
{"severity":"info","ts":"2024-05-21T06:15:32Z","logger":"listener-app.listener","message":"Processing message","messageId":47,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-05-21T06:15:32Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-05-21T06:15:32Z","logger":"listener-app.listener","message":"Job assigned message received","jobId":2265756}
{"severity":"info","ts":"2024-05-21T06:15:32Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":47}
{"severity":"info","ts":"2024-05-21T06:15:33Z","logger":"listener-app.worker.kubernetesworker","message":"Calculated target runner count","assigned job":1,"decision":1,"min":0,"max":100,"currentRunnerCount":1,"jobsCompleted":0}
{"severity":"info","ts":"2024-05-21T06:15:33Z","logger":"listener-app.worker.kubernetesworker","message":"Compare","original":"{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}","patch":"{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":71,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
{"severity":"info","ts":"2024-05-21T06:15:33Z","logger":"listener-app.worker.kubernetesworker","message":"Preparing EphemeralRunnerSet update","json":"{\"spec\":{\"patchID\":71,\"replicas\":1}}"}
{"severity":"info","ts":"2024-05-21T06:15:33Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"arc-runners","name":"high-perf-docker-v2-cr5fw","replicas":1}
{"severity":"info","ts":"2024-05-21T06:15:33Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":47}
@geekflyer geekflyer added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels May 21, 2024
Copy link
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@geekflyer geekflyer changed the title Large delay in listener creating worker pods from when a job is submitted. Listener is very slow to receive job messages and spawn worker pods. (20-30 seconds delay). May 21, 2024
@nikola-jokic
Copy link
Contributor

Hey @geekflyer,

We are already working on improvements that should improve scaling performance.
Closing this one as a duplicate of #3276

@geekflyer
Copy link
Author

@nikola-jokic are you sure #3276 is the same? From the looks of it #3276 is about scaling delays under high job spikes.
The issue I opened has nothing to do with spikes/high load, but rather is about the message processing delay when creating a single pod (under no load) from 0->1.

@nikola-jokic
Copy link
Contributor

Hey @geekflyer,

Can you please check if this is happening again? I ran many jobs today and I haven't seen this delay.

I wanted to combine and link these two since they are both performance-based issues. Although I think the latency issue was a temporary problem, I closed this one and linked it to the one submitted before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers
Projects
None yet
Development

No branches or pull requests

2 participants