Enhance Etcd Compaction Job Monitoring with Detailed Failure Reasons #11771

anveshreddy18 · 2025-03-28T18:35:20Z

How to categorize this PR?

/area control-plane
/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR adds two new Plutono dashboard panes for Etcd Compaction Job monitoring, titled Deadline Exceeded Jobs and Disrupted Jobs. As the names suggest, they show in more detail why a compaction job failed.

The druid PR#1039 introduced a new Prometheus label named failureReason alongside the existing succeeded label. It can take the following values:

preempted indicates that the compaction pod has been preempted by the scheduler.
evicted indicates that the compaction pod has been evicted for one of the eviction reasons outlined in Project controller is reconciling on gcm update #1037.
deadlineExceeded indicates that compaction could not finish before the activeDeadlineSeconds of the job.
processFailure indicates that the compaction process has failed.
unknown indicates that the failure reason is not known.
none is used when the job succeeded, i.e. in combination with the label:value pair {succeeded="true"}.
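As a rough illustration of how these values relate to the dashboard panes, the following Python sketch maps a {succeeded, failureReason} pair to a pane title. The grouping of preempted and evicted under Disrupted Jobs is an assumption based on the pane names, and `pane_for` is a hypothetical helper, not code from etcd-druid or this PR.

```python
from enum import Enum
from typing import Optional


class FailureReason(str, Enum):
    """Values of the failureReason label as listed in this description."""
    PREEMPTED = "preempted"
    EVICTED = "evicted"
    DEADLINE_EXCEEDED = "deadlineExceeded"
    PROCESS_FAILURE = "processFailure"
    UNKNOWN = "unknown"
    NONE = "none"


# Assumed grouping of failure reasons into dashboard panes.
PANE_BY_REASON = {
    FailureReason.PREEMPTED: "Disrupted Jobs",
    FailureReason.EVICTED: "Disrupted Jobs",
    FailureReason.DEADLINE_EXCEEDED: "Deadline Exceeded Jobs",
    FailureReason.PROCESS_FAILURE: "Failed Jobs",
    FailureReason.UNKNOWN: "Failed Jobs",
}


def pane_for(succeeded: bool, failure_reason: str) -> Optional[str]:
    """Return the pane a job would be counted in, or None for successful jobs."""
    reason = FailureReason(failure_reason)  # raises ValueError for unlisted values
    if succeeded:
        return None
    return PANE_BY_REASON.get(reason)
```

For example, `pane_for(False, "evicted")` yields "Disrupted Jobs" under the assumed grouping, while `pane_for(True, "none")` yields `None`.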

Consequently, this PR also adapts the existing Failed Jobs board and the TooManyEtcdSnapshotCompactionJobsFailing alert, which fires when too many control-plane namespaces have failing compaction jobs.
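To make the idea behind the alert concrete, here is a minimal Python sketch that computes the share of namespaces with at least one failed job and compares it to a threshold. It only mirrors the intent described above: the sample tuple layout, the function names, and the default threshold of 0.1 are assumptions for illustration, not the actual alert expression.

```python
# Reasons that count as "Failed" according to this PR.
FAILED_REASONS = {"processFailure", "unknown"}


def failing_namespace_ratio(samples):
    """Return the fraction of namespaces with at least one failed job.

    samples: iterable of (namespace, succeeded, failure_reason, count) tuples,
    a hypothetical flattened view of the compaction job metric.
    """
    seen = set()
    failing = set()
    for namespace, succeeded, reason, count in samples:
        seen.add(namespace)
        if not succeeded and reason in FAILED_REASONS and count > 0:
            failing.add(namespace)
    return len(failing) / len(seen) if seen else 0.0


def should_alert(samples, threshold=0.1):
    """True when the failing share exceeds the (assumed) threshold."""
    return failing_namespace_ratio(samples) > threshold
```

With this definition, a namespace whose jobs were only evicted or hit the deadline does not count towards the alert.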

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Earlier, a Job was considered Failed even when the failure was caused by the pod getting disrupted or the deadline being exceeded, and failed jobs were queried using the label:value pair {succeeded="false"}. Now that the extra label identifies the reason for failure, we consider a Job as Failed only when the cause is a process failure or an unidentified reason. Failed jobs are therefore queried using the label:value pairs {succeeded="false", failureReason=~"processFailure|unknown"}. This is reflected in both the Failed Jobs board and the TooManyEtcdSnapshotCompactionJobsFailing alert.
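The selector change can be sketched with a small Python helper that assembles the label matchers. The metric name `compaction_jobs_total` is a hypothetical placeholder, `build_selector` is an illustrative helper, and only the Failed Jobs selector is taken from this description; the selectors for the two new panes are assumptions.

```python
def build_selector(metric, succeeded, reasons):
    """Build a PromQL-style selector for the given failureReason values.

    A single reason uses an equality matcher, several reasons use a regex matcher.
    """
    matchers = ['succeeded="{}"'.format(str(succeeded).lower())]
    if len(reasons) == 1:
        matchers.append('failureReason="{}"'.format(reasons[0]))
    elif reasons:
        matchers.append('failureReason=~"{}"'.format("|".join(reasons)))
    return "{}{{{}}}".format(metric, ", ".join(matchers))


METRIC = "compaction_jobs_total"  # hypothetical placeholder metric name

# Selector described in this PR for the Failed Jobs board and the alert.
failed = build_selector(METRIC, False, ["processFailure", "unknown"])
# Assumed selectors for the two new panes.
deadline_exceeded = build_selector(METRIC, False, ["deadlineExceeded"])
disrupted = build_selector(METRIC, False, ["preempted", "evicted"])

print(failed)
```

Keeping the reasons in one place like this makes it harder for the board and the alert to drift apart if the set of failure reasons changes.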

Release note:

Added new monitoring dashboard panes for the Etcd Compaction Job with detailed failure reasons, and updated the existing alerts and boards accordingly.

anveshreddy18 · 2025-04-01T05:17:01Z

/retest

anveshreddy18 · 2025-04-01T06:32:30Z

/hold till etcd-druid v0.29.0 is released and integrated into g/g

anveshreddy18 · 2025-04-07T07:56:44Z

@gardener/monitoring-maintainers can you guys pls take a look at this PR

anveshreddy18 · 2025-04-10T09:18:08Z

/test pull-gardener-e2e-kind-ha-multi-zone-upgrade

chrkl · 2025-04-11T10:44:24Z

/lgtm

The PR and the changes to the PromQL queries look good to me. Unfortunately, in my tests I was not able to make a compaction job fail so that it would set the failureReason label. Please merge if you are confident that etcd-druid exposes the metric with the expected label correctly.

gardener-prow · 2025-04-11T10:44:30Z

LGTM label has been added.

Git tree hash: 44fbff2ba6f68623e6a83a7cf05f72eb867bde34

chrkl · 2025-04-15T07:53:28Z

The presence of the failureReason label was successfully tested with @anveshreddy18. The PR is good to merge for me.

gardener-prow · 2025-04-15T08:39:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ary1992

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ary1992]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

anveshreddy18 · 2025-04-15T08:41:37Z

/unhold

anveshreddy18 · 2025-04-15T10:27:07Z

/retest

gardener-prow bot added area/control-plane area/monitoring kind/enhancement cla: yes labels Mar 28, 2025

gardener-prow bot requested review from ary1992 and ialidzhikov March 28, 2025 18:35

gardener-prow bot added the size/L label Mar 28, 2025

gardener-prow bot added the do-not-merge/hold label Apr 1, 2025

anveshreddy18 added 2 commits April 10, 2025 14:05

edit etcd snapshot compaction job failure alert to include the new failureReason label

ab4de73

add two new plutono panes to Etcd compaction job dashboards with the reason for job failure

2b43856

anveshreddy18 force-pushed the adapt-compaction-metrics branch from 1354b4a to 2b43856 Compare April 10, 2025 08:37

gardener-prow bot assigned chrkl Apr 11, 2025

gardener-prow bot added the lgtm label Apr 11, 2025

ary1992 approved these changes Apr 15, 2025

gardener-prow bot assigned ary1992 Apr 15, 2025

gardener-prow bot added the approved label Apr 15, 2025

gardener-prow bot removed the do-not-merge/hold label Apr 15, 2025

gardener-prow bot merged commit 20dbe2b into gardener:master Apr 15, 2025
19 checks passed
