Enhance Etcd Compaction Job Monitoring with Detailed Failure Reasons #11771
/retest
/hold till etcd-druid v0.29.0 is released and integrated into g/g
@gardener/monitoring-maintainers can you guys pls take a look at this PR
…ilureReason label
…reason for job failure
1354b4a to 2b43856
/test pull-gardener-e2e-kind-ha-multi-zone-upgrade
/lgtm The PR and the changes to the PromQL queries look good to me. Unfortunately, in my tests I was not able to make a compaction job fail so that it would set the
LGTM label has been added. Git tree hash: 44fbff2ba6f68623e6a83a7cf05f72eb867bde34
The presence of the |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ary1992
/unhold
/retest
How to categorize this PR?
/area control-plane
/area monitoring
/kind enhancement
What this PR does / why we need it:
This PR adds two new Plutono dashboard panels for Etcd compaction job monitoring, titled `Deadline Exceeded Jobs` and `Disrupted Jobs`. As their names suggest, they show in more detail why a compaction job failed.
The druid PR#1039 introduced a new Prometheus label named `failureReason` alongside the existing `succeeded` label. It can take values such as:
- `preempted`: the compaction pod has been preempted by the scheduler.
- `evicted`: the compaction pod has been evicted due to one of the eviction reasons outlined in Project controller is reconciling on gcm update #1037.
- `deadlineExceeded`: the compaction could not finish before the `activeDeadlineSeconds` of the job.
- `processFailure`: the compaction process has failed.
- `unknown`: the failure reason is not known.
- `none`: used in combination with the label/value pair `succeeded`: `true`.
Consequently, this PR also adapts the existing `Failed Jobs` board and the alert `TooManyEtcdSnapshotCompactionJobsFailing`, which we use to alert when too many control-plane namespaces have failing compaction jobs.
Which issue(s) this PR fixes:
Fixes #
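As a rough illustration of how the `failureReason` values could be grouped per dashboard panel, here is a minimal Python sketch. Only the `Failed Jobs` matcher (`processFailure|unknown`) is taken from this PR's description. The matchers for `Disrupted Jobs` and `Deadline Exceeded Jobs` are assumptions inferred from the panel names, and the helper names `selector` and `panel_for` are hypothetical and not part of Gardener or etcd-druid.

```python
import re

# Regex matcher on failureReason for each dashboard panel.
# Only the "Failed Jobs" entry is quoted from the PR description;
# the other two are assumptions inferred from the panel names.
PANEL_REASONS = {
    "Failed Jobs": "processFailure|unknown",
    "Disrupted Jobs": "preempted|evicted",         # assumption
    "Deadline Exceeded Jobs": "deadlineExceeded",  # assumption
}


def selector(panel: str) -> str:
    """Build the PromQL label selector for a panel (hypothetical helper)."""
    return '{succeeded="false", failureReason=~"%s"}' % PANEL_REASONS[panel]


def panel_for(succeeded: str, failure_reason: str):
    """Return the panel a sample would count towards, or None if no panel matches."""
    if succeeded == "true":
        return None  # successful jobs carry failureReason="none"
    for panel, pattern in PANEL_REASONS.items():
        # PromQL regex matchers are fully anchored, so fullmatch mirrors them.
        if re.fullmatch(pattern, failure_reason):
            return panel
    return None


if __name__ == "__main__":
    print(selector("Failed Jobs"))
    print(panel_for("false", "evicted"))
```

The selector printed for `Failed Jobs` matches the one quoted in the reviewer notes. The metric name is left out on purpose, since it is not stated in this description.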
Special notes for your reviewer:
Earlier, a Job was considered `Failed` even when the failure was caused by a pod disruption or an exceeded deadline, and failed jobs were queried with the label/value pair `{succeeded="false"}`. Now that the extra label identifies the reason for failure, we consider a Job `Failed` only when the cause is a process failure or an unidentified reason. Failed jobs are therefore queried with `{succeeded="false", failureReason=~"processFailure|unknown"}`. This is reflected in both the `Failed Jobs` board and the `TooManyEtcdSnapshotCompactionJobsFailing` alert.
Release note: