Improve metrics #3847

Open
wants to merge 7 commits into
base: master

tiithansen

This PR makes the following changes:

  • Introduce a new metric, gha_runner_job, which can be used to link jobs to runner pods (the metric is only exported while the job is running).
  • Replace high-cardinality histograms with last-duration gauges, since jobs can run at long, irregular intervals, which makes rate functions hard to use.
  • Add a new duration gauge showing how long a job sat in the queue before being picked up.
  • Fix the name label to always contain the clean runnerScaleSetName (the value used in the GHA job's runs-on property to select a runner).
  • Only export durations if both timestamps used in the duration calculation are set.
  • Remove job_workflow_ref, runner_id and runner_name from the duration metrics, as they cause a new series to be created with each run.

Example queries:

Memory usage per job:

label_replace(gha_runner_job{repository=~"$repository.*", job_name=~"$job.*"}, "pod", "$1", "pod_name", "(.*)") * on(pod) group_right(job_name) sum(container_memory_working_set_bytes{container!=""}) by (pod, container)

CPU usage per job:

label_replace(gha_runner_job{repository=~"$repository.*", job_name=~"$job.*"}, "pod", "$1", "pod_name", "(.*)") * on(pod) group_right(job_name) sum(rate(container_cpu_usage_seconds_total{container!=""}[1m])) by (pod, container)

CPU Throttling:

label_replace(gha_runner_job{repository=~"$repository.*", job_name=~"$job.*"}, "pod", "$1", "pod_name", "(.*)") * on(pod) group_right(job_name) sum(
    sum by (container,pod)
        (rate(container_cpu_cfs_throttled_periods_total{container!=""}[1m]))
 /
    sum by (container,pod)
        (rate(container_cpu_cfs_periods_total{container!=""}[1m]))
) by (pod, container)

Screenshot from 2024-12-13 09-48-30

@@ -144,75 +144,25 @@ var (
completedJobsTotalLabels,
)

- jobStartupDurationSeconds = prometheus.NewHistogramVec(
- prometheus.HistogramOpts{
+ jobLastStartupDurationSeconds = prometheus.NewGaugeVec(

Worth adding a code comment (in addition to the commit message) explaining why a Gauge is used when a Histogram ideally seems the better data type. It might avoid a lot of WTFs and time wasted on basically reverting this change ;)

@atsu85 atsu85 Dec 13, 2024

typo in commit description?

They cause a creation of new services with each job execution.

vs

They cause a creation of new series with each job execution.

Also worth mentioning cardinality explosion and OOMs
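
To make the cardinality concern concrete, here is a minimal Go sketch (standard library only, not the PR's code) that counts distinct label sets with and without a per-run label. The label names follow the PR description; the label values, the runner naming scheme, and the helpers seriesKey and countSeries are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a stable identity for a label set, mimicking how a time
// series is identified by its sorted label pairs.
func seriesKey(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

// countSeries simulates `runs` executions of one job and returns how many
// distinct series result, with or without the per-run runner_name label.
func countSeries(runs int, withRunnerName bool) int {
	seen := map[string]struct{}{}
	for i := 0; i < runs; i++ {
		labels := map[string]string{
			"repository": "example/repo", // hypothetical value
			"job_name":   "build",        // hypothetical value
		}
		if withRunnerName {
			// Assumed naming scheme: each run gets a fresh runner name.
			labels["runner_name"] = fmt.Sprintf("runner-%d", i)
		}
		seen[seriesKey(labels)] = struct{}{}
	}
	return len(seen)
}

func main() {
	fmt.Println(countSeries(1000, true))  // prints 1000: one series per run
	fmt.Println(countSeries(1000, false)) // prints 1: the series is reused
}
```

With a histogram, each of those series is further multiplied by the number of buckets, which is why per-run labels are especially costly there.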

…on times.

It's difficult to calculate any duration if intervals between jobs are not frequent enough. Last duration would give a better overview.
They cause the creation of a new series with each job execution, leading to OOM kills and degraded performance.
…query memory, cpu and cpu throttling metrics
@tiithansen
Author

@Link- any thoughts on this one?

2 participants