What would be the behaviour you'd like to see here, exactly?
Maybe remove/disable the KubeJobCompletion alert, and instead trigger an alert when the number of failed jobs is nonzero? It would then resolve automatically only if that number drops to zero again, which requires an operator to delete the failed jobs. In particular, so long as no-one intervenes, the alert would continue to fire, but it would only be a single alert so you wouldn't get an email per failed job.
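Something like this could be a starting point (untested sketch; the alert name `FailedJobsPresent` and the group name are just placeholders, and it assumes kube-state-metrics is scraped so that `kube_job_status_failed` is available):

```yaml
# Untested sketch: a single alert that fires while any failed Job objects
# still exist, and resolves once an operator deletes them.
additionalPrometheusRulesMap:
  failed-jobs:
    groups:
      - name: failed_jobs
        rules:
          - alert: FailedJobsPresent   # placeholder name
            expr: sum(kube_job_status_failed{job="kube-state-metrics"}) > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: One or more failed Jobs have not been cleaned up
```

Because the expression is summed over all jobs, it stays a single alert no matter how many failed Jobs pile up.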
I would like that, but if the admin doesn't really use Prometheus and only reads the email notifications, they might not realise that the alert keeps firing, never delete the failed jobs, and then not notice if an unrelated failure later occurs with the cronjob. But maybe that's not really a problem; I guess it depends on how the admin uses Prometheus.
Oh, and also: I seem to recall we were planning to maybe replace the nextcloud chart's Kubernetes CronJob with a more lightweight solution using a persistent sidecar container. If we do, we probably shouldn't invest much in this issue.
From what I see, we are monitoring KubeJobFailed, KubeJobCompletion and KubePodNotReady.
What's interesting is whether a Job failed, not whether it didn't complete in time, because in that case it will automatically try again, right? We only want to receive alerts for things we actually have to act upon.
Not completing in 12 hours is a form of failing in this case. Kubernetes will retry it a few times (the number is configurable) but not indefinitely; that wouldn't really make sense for a job which is typically timebound by nature.
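For context, the retry and runtime limits live on the Job spec itself. A rough illustration of the standard fields (the name, schedule, image and numbers are just example values, not what the nextcloud chart actually uses):

```yaml
# Illustration only: standard Kubernetes fields that bound retries and runtime.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nextcloud-cron            # placeholder name
spec:
  schedule: "*/15 * * * *"        # example schedule
  concurrencyPolicy: Forbid       # don't start a new run while one is still going
  jobTemplate:
    spec:
      backoffLimit: 3               # retries before the Job is marked as failed
      activeDeadlineSeconds: 43200  # hard 12h cap on a single run
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cron
              image: nextcloud    # placeholder image
              command: ["php", "-f", "/var/www/html/cron.php"]
```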
I agree that we don't need to be notified if a job fails on the first attempt but a retry succeeds. However, I would hope that the default rules are already written like that; if not, then indeed we should change that.
I think we can change the KubeJobCompletion alert by adding something like this to the values.yaml. But I'm not sure what exactly we should change to make sure this alert only fires once per day.
I copy-pasted this from the helm chart, so we should also figure out the correct values for {{ .Values.defaultRules.runbookUrl }} and {{ $targetNamespace }} (earlier set as {{- $targetNamespace := .Values.defaultRules.appNamespacesTarget }}).
I don't really have an idea what I'm doing here, so I was hoping @varac would like to take a look when he's back.
```yaml
additionalPrometheusRulesMap:
  override-KubeJobCompletion:
    groups:
      - name: my_group
        rules:
          - alert: KubeJobCompletion
            annotations:
              description: Job {{`{{`}} $labels.namespace {{`}}`}}/{{`{{`}} $labels.job_name {{`}}`}} is taking more than 12 hours to complete.
              runbook_url: {{ .Values.defaultRules.runbookUrl }}alert-name-kubejobcompletion
              summary: Job did not complete in time
            expr: kube_job_spec_completions{job="kube-state-metrics", namespace=~"{{ $targetNamespace }}"} - kube_job_status_succeeded{job="kube-state-metrics", namespace=~"{{ $targetNamespace }}"} > 0
            for: 12h
            labels:
              severity: warning
```
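One guess (untested) about the "once per day" part: how often we get emailed about a still-firing alert is controlled by Alertmanager rather than by the rule itself, so it might just be a `repeat_interval` on a route matching this alert, something like:

```yaml
# Untested guess: re-notify a still-firing KubeJobCompletion alert at most
# once every 24h; the child route inherits the parent route's receiver.
alertmanager:
  config:
    route:
      routes:
        - match:
            alertname: KubeJobCompletion
          repeat_interval: 24h
```

And since values.yaml itself isn't run through Helm templating, I think the {{ .Values.defaultRules.runbookUrl }} and {{ $targetNamespace }} expressions would need to be replaced with literal values there (and the backtick escaping around the $labels parts dropped), but that's exactly the kind of thing I'd like @varac to confirm.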
Well, we haven't had an alert storm like before recently, but we did just get an alert because of a failing nextcloud cronjob on stackspin.net, so it's not fully solved either.