Allow option to configure the kubernetes controllers MaxConcurrentReconciles #3021
Labels
attention (Requires attention), community (Community contribution), enhancement (New feature or request), needs triage (Requires review from the maintainers)
What would you like added?
We would like to be able to configure MaxConcurrentReconciles for the runnerpod and runner controllers.
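For reference, controller-runtime already exposes this knob through controller.Options when a controller is registered with the manager. Below is a minimal sketch, assuming the reconciler type is called RunnerPodReconciler and that the extra maxConcurrent parameter would be new; this is not ARC's current code, just an illustration of where the value would need to be passed.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// RunnerPodReconciler stands in for ARC's runner pod reconciler; only the
// pieces needed to show the option wiring are included here.
type RunnerPodReconciler struct {
	client.Client
}

func (r *RunnerPodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The real reconcile logic (the per-pod annotation patches) lives in ARC.
	return ctrl.Result{}, nil
}

// SetupWithManager registers the controller with the manager. The extra
// maxConcurrent parameter is the knob we are asking for: controller-runtime
// accepts it via controller.Options and otherwise defaults
// MaxConcurrentReconciles to 1.
func (r *RunnerPodReconciler) SetupWithManager(mgr ctrl.Manager, maxConcurrent int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: maxConcurrent,
		}).
		Complete(r)
}
```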
Why is this needed?
The runnerpod_controller currently makes a lot of individual patches per runner pod (each one adds a different annotation to the pod). The controller does not pass options to the underlying k8s controller, so it defaults to a MaxConcurrentReconciles of 1. There may be a way to configure this through system configuration, but we could not find any docs in ARC or the controller-runtime project, and from looking at the code it appears the value has to be passed in when the controller is created. There may also be an alternative means of configuring/injecting the setting; we are not Golang or kubernetes-controller experts, so if one exists we would appreciate a pointer to the correct docs.

The high number of requests the controller makes to the k8s API, combined with the single-threaded nature of the controller worker, means it is constantly behind and puts a lot of pressure on the cluster during scaling. We have run multiple configurations with the HRA, from:
min: 0, max: 60
min: 60, max: 60
min: 100, max: 200
min: 100, max: 100
and we have found it performs best when no scaling is applied, i.e. min == max. While the queue depth stays constant (at around 1000 requests in the queue), we observed more consistent performance this way.
Additional context
We are using the latest version of the Summerwind controller, with ephemeral runners in a RunnerDeployment and webhook-driven scaling. We have a few different RunnerDeployments, including our main one in the default organisation runner group. We forked the code, manually set MaxConcurrentReconciles to 8, and noticed a very clear correlation; a sketch of how this could be exposed at startup follows below.

We use Kyverno in our cluster, which takes anywhere from 20ms to 2 seconds to handle requests via the k8s API server. The workers are currently single-threaded, and as the load increases it puts more and more pressure on Kyverno to be quick, which isn't scalable long term.
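A minimal sketch of the kind of change our fork makes at startup, assuming a new --runner-max-concurrent-reconciles flag (the flag name and default are made up, not an existing ARC option) and the SetupWithManager signature from the earlier sketch:

```go
package main

import (
	"flag"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Illustrative flag; ARC does not currently expose this option.
	var maxConcurrentReconciles int
	flag.IntVar(&maxConcurrentReconciles, "runner-max-concurrent-reconciles", 1,
		"Maximum number of concurrent reconciles per runner controller.")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// RunnerPodReconciler is the type from the earlier sketch, assumed to be
	// in scope here; the flag value is threaded through to controller.Options.
	r := &RunnerPodReconciler{Client: mgr.GetClient()}
	if err := r.SetupWithManager(mgr, maxConcurrentReconciles); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```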
Our metrics show a clear correlation between the queue depth and the max number of concurrent reconciles. We did see the number of reconcile errors increase due to dirty reads; however, these seem to be retried successfully, and we see a big difference in our cluster (fewer pods stuck in terminating/completed), while offline runners are still being deregistered correctly.