What would be the behaviour you'd like to see here, exactly?
Maybe remove/disable the KubeJobCompletion alert, and instead trigger an alert when the number of failed jobs is nonzero? It would then resolve automatically only if that number drops to zero again, which requires an operator to delete the failed jobs. In particular, so long as no-one intervenes, the alert would continue to fire, but it would only be a single alert so you wouldn't get an email per failed job.
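Something like this could be a starting point (untested sketch; the alert name `FailedJobsPresent` and the group name are just placeholders, and it assumes kube-state-metrics is scraped so that `kube_job_status_failed` is available):

```yaml
# Untested sketch: a single alert that fires while any failed Job objects
# still exist, and resolves once an operator deletes them.
additionalPrometheusRulesMap:
  failed-jobs:
    groups:
      - name: failed_jobs
        rules:
          - alert: FailedJobsPresent   # placeholder name
            expr: sum(kube_job_status_failed{job="kube-state-metrics"}) > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: One or more failed Jobs have not been cleaned up
```

Because the expression is summed over all jobs, it stays a single alert no matter how many failed Jobs pile up.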
I would like that, but if the admin doesn't really use Prometheus and only reads the email notifications, they might not realise that the alert keeps firing, never delete the failed jobs, and then not notice if an unrelated failure later occurs with the cronjob. But maybe that's not really a problem; I guess it depends on how the admin uses Prometheus.
Oh, and also: I seem to recall we were planning to maybe replace the nextcloud chart's Kubernetes CronJob with a more lightweight solution using a persistent sidecar container. If we do, we probably shouldn't invest much in this issue.
From what I see, we are monitoring KubeJobFailed, KubeJobCompletion and KubePodNotReady.
What's interesting is whether a Job failed, not whether it didn't complete in time, because in that case it will automatically try again, right? We only want to receive alerts for things we actually have to act upon.
Not completing in 12 hours is a form of failing in this case. Kubernetes will retry it a few times (the number is configurable) but not indefinitely; that wouldn't really make sense for a job which is typically timebound by nature.
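For context, the retry and runtime limits live on the Job spec itself. A rough illustration of the standard fields (the name, schedule, image and numbers are just example values, not what the nextcloud chart actually uses):

```yaml
# Illustration only: standard Kubernetes fields that bound retries and runtime.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nextcloud-cron            # placeholder name
spec:
  schedule: "*/15 * * * *"        # example schedule
  concurrencyPolicy: Forbid       # don't start a new run while one is still going
  jobTemplate:
    spec:
      backoffLimit: 3               # retries before the Job is marked as failed
      activeDeadlineSeconds: 43200  # hard 12h cap on a single run
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cron
              image: nextcloud    # placeholder image
              command: ["php", "-f", "/var/www/html/cron.php"]
```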
I agree that we don't need to be notified if a job fails on the first attempt but a retry succeeds. However, I would hope that the default rules are already written like that; if not, then indeed we should change that.
I think we can change the KubeJobCompletion alert by adding something like this to the values.yaml. But I'm not sure what exactly we should change to make sure this alert only fires once per day.
I copy-pasted this from the helm chart, so we should also figure out the correct values for {{ .Values.defaultRules.runbookUrl }} and {{ $targetNamespace }} (earlier set as {{- $targetNamespace := .Values.defaultRules.appNamespacesTarget }}).
I don't really have an idea what I'm doing here, so I was hoping @varac would like to take a look when he's back.
```yaml
additionalPrometheusRulesMap:
  override-KubeJobCompletion:
    groups:
      - name: my_group
        rules:
          - alert: KubeJobCompletion
            annotations:
              description: Job {{`{{`}} $labels.namespace {{`}}`}}/{{`{{`}} $labels.job_name {{`}}`}} is taking more than 12 hours to complete.
              runbook_url: {{ .Values.defaultRules.runbookUrl }}alert-name-kubejobcompletion
              summary: Job did not complete in time
            expr: kube_job_spec_completions{job="kube-state-metrics", namespace=~"{{ $targetNamespace }}"} - kube_job_status_succeeded{job="kube-state-metrics", namespace=~"{{ $targetNamespace }}"} > 0
            for: 12h
            labels:
              severity: warning
```
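One guess (untested) about the "once per day" part: how often we get emailed about a still-firing alert is controlled by Alertmanager rather than by the rule itself, so it might just be a `repeat_interval` on a route matching this alert, something like:

```yaml
# Untested guess: re-notify a still-firing KubeJobCompletion alert at most
# once every 24h; the child route inherits the parent route's receiver.
alertmanager:
  config:
    route:
      routes:
        - match:
            alertname: KubeJobCompletion
          repeat_interval: 24h
```

And since values.yaml itself isn't run through Helm templating, I think the {{ .Values.defaultRules.runbookUrl }} and {{ $targetNamespace }} expressions would need to be replaced with literal values there (and the backtick escaping around the $labels parts dropped), but that's exactly the kind of thing I'd like @varac to confirm.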
Well, we haven't had an alert storm like before recently, but we did just get an alert because of a failing nextcloud cronjob on stackspin.net, so it's not fully solved either.