Skip to content

Enhance Etcd Compaction Job Monitoring with Detailed Failure Reasons #11771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 15, 2025

Conversation

anveshreddy18
Copy link
Contributor

How to categorize this PR?

/area control-plane
/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR adds two new plutono dashboard panes for the Etcd Compaction Job monitoring. Those are titled Deadline Exceeded Jobs & Disrupted Jobs. These two, as the name suggests, tells us in more detail about the reason for the compaction job failure.

The druid PR#1039 introduced a new prometheus label alongside the existing succeeded label, named failureReason which can take values such as :

  • preempted indicates that the compaction pod has been preempted by the scheduler.
  • evicted indicates that the compaction Pod has been evicted due to various eviction reasons outlined in Project controller is reconciling on gcm update #1037
  • deadlineExceeded indicates that compaction could not finish before the activeDeadlineSeconds of the job.
  • processFailure indicates that the compaction process has failed.
  • unknown indicates that the failure reason is not known.
  • none this is used when combined with the {label, value} pair succeeded:true.

This PR also consequently adapts the existing Failed Jobs board & the alert TooManyEtcdSnapshotCompactionJobsFailing that we have set for alerting when too many control-plane namespaces have compaction jobs failing.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Earlier, the Job was considered Failed even when it was due to pod getting disrupted or deadline exceeding, and was queried using the label:value pair {succeeded="false"}. But now that we have introduced the extra label to identify the reason for failure, we now consider the Job as Failed only when it is because of process failure or un-identified reason. And for that reason, we now use {succeeded="false", failureReason=~"processFailure|unknown"} label:value pair to query the failed jobs. This is reflected in both the Failed Jobs board & TooManyEtcdSnapshotCompactionJobsFailing alert.

Release note:

Add new monitoring dashboard panes for Etcd Compaction Job with detailed failure reasons and updated existing alerts and boards.

@gardener-prow gardener-prow bot added area/control-plane Control plane related area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Mar 28, 2025
@gardener-prow gardener-prow bot requested review from ary1992 and ialidzhikov March 28, 2025 18:35
@gardener-prow gardener-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 28, 2025
@anveshreddy18
Copy link
Contributor Author

/retest

@anveshreddy18
Copy link
Contributor Author

/hold till etcd-druid v0.29.0 is released and integrated into g/g

@gardener-prow gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2025
@anveshreddy18
Copy link
Contributor Author

@gardener/monitoring-maintainers can you guys pls take a look at this PR

…reason for job failure
@anveshreddy18 anveshreddy18 force-pushed the adapt-compaction-metrics branch from 1354b4a to 2b43856 Compare April 10, 2025 08:37
@anveshreddy18
Copy link
Contributor Author

/test pull-gardener-e2e-kind-ha-multi-zone-upgrade

@chrkl
Copy link
Member

chrkl commented Apr 11, 2025

/lgtm

The PR and changes to the PromQL queries look good to me. Unfortunately, I was in my tests not able to make a compaction job failing so that it would set the failureReason label. Please merge if you are confident that etcd-druid exposes the metric with the expected label correctly.

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2025
Copy link
Contributor

gardener-prow bot commented Apr 11, 2025

LGTM label has been added.

Git tree hash: 44fbff2ba6f68623e6a83a7cf05f72eb867bde34

@chrkl
Copy link
Member

chrkl commented Apr 15, 2025

The PR and changes to the PromQL queries look good to me. Unfortunately, I was in my tests not able to make a compaction job failing so that it would set the failureReason label. Please merge if you are confident that etcd-druid exposes the metric with the expected label correctly.

The presence of the failureReason label was successfully tested with @anveshreddy18. The PR is good to merge for me.

Copy link
Contributor

gardener-prow bot commented Apr 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ary1992

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2025
@anveshreddy18
Copy link
Contributor Author

/unhold

@gardener-prow gardener-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2025
@anveshreddy18
Copy link
Contributor Author

/retest

@gardener-prow gardener-prow bot merged commit 20dbe2b into gardener:master Apr 15, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane Control plane related area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants