Frequent Kustomization ReconciliationFailures (failed to call webhook) after 1.0 release
After upgrading staging and varac-test to 1.0 (and with it flux 0.31.3 -> 0.33.0 and k3s v1.23.3+k3s1 -> v1.24.4+k3s1) we get frequent (as in 3-4 per day) Kustomization ReconciliationFailures which fire out of the blue without any update to Stackspin, and which resolve by themselves after ~1h.
For example:
alertname = ReconciliationFailure
kind = Kustomization
name = monitoring-config
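
When one of these alerts fires, the failure can be reproduced on demand instead of waiting for the next scheduled reconciliation, which shows whether the webhook is still unreachable at that moment. A minimal sketch with the flux CLI (assuming a kubeconfig pointed at the affected cluster):

```sh
# Force an immediate reconciliation of the failing Kustomization
flux reconcile kustomization monitoring-config -n flux-system
```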
We get those alerts from all clusters that have been updated to 1.0, and for these Kustomizations (a status check is sketched below the list):
- monitoring-config
- kube-system-config
- letsencrypt-issuer
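
All three can be checked in one go; their Ready condition carries the same error message as the alert. A sketch:

```sh
# Show Ready status and the last error for all Kustomizations
flux get kustomizations -n flux-system

# Or pull the Ready condition message of a single object via kubectl
kubectl -n flux-system get kustomization monitoring-config \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
```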
For comparison, these are the non-Normal events from the flux-system namespace on staging:
❯ kubectl -n flux-system get events --no-headers=true -o custom-columns=FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,Component:.source.component,Object:.involvedObject.name,Type:.type,Reason:.reason,Message:.message --field-selector type!=Normal
2022-09-11T12:17:15Z 2022-09-11T12:17:15Z 1 kustomize-controller monitoring-config Warning ReconciliationFailed PrometheusRule/kube-system/metallb dry-run failed, reason: InternalError, error: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://kube-prometheus-stack-operator.stackspin.svc:443/admission-prometheusrules/mutate?timeout=10s": EOF
2022-09-11T13:15:51Z 2022-09-12T20:15:55Z 2 kustomize-controller kube-system-config Warning ReconciliationFailed IPAddressPool/kube-system/external-ip dry-run failed, reason: InternalError, error: Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://metallb-webhook-service.kube-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": EOF
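
Both failures are an EOF on the Post to the webhook service, which points at the connection being dropped rather than the webhook rejecting the request. At failure time it would be worth checking whether the webhook Services still have ready endpoints and whether the port answers at all. A diagnostic sketch (the curlimages/curl image is just an assumption, any image with curl works):

```sh
# Do the webhook Services have ready endpoints backing them?
kubectl -n stackspin get endpoints kube-prometheus-stack-operator
kubectl -n kube-system get endpoints metallb-webhook-service

# Probe the prometheus-operator admission webhook from inside the cluster
kubectl run webhook-probe --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -ksv https://kube-prometheus-stack-operator.stackspin.svc:443/admission-prometheusrules/mutate
```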
But there are no related non-Normal events in the stackspin namespace when we investigate the first monitoring-config issue above:
❯ kubectl -n stackspin get events --no-headers=true -o custom-columns=FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,Component:.source.component,Object:.involvedObject.name,Type:.type,Reason:.reason,Message:.message --field-selector type!=Normal
2022-09-12T04:00:54Z 2022-09-12T04:01:04Z 2 kubelet loki-0 Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
2022-09-12T04:00:54Z 2022-09-12T04:01:04Z 2 kubelet loki-0 Warning Unhealthy Readiness probe failed: HTTP probe failed with statuscode: 503
2022-09-13T04:00:55Z 2022-09-13T04:01:05Z 2 kubelet loki-0 Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
2022-09-13T04:00:55Z 2022-09-13T04:01:09Z 3 kubelet loki-0 Warning Unhealthy Readiness probe failed: HTTP probe failed with statuscode: 503
2022-09-11T04:00:17Z 2022-09-13T04:01:37Z 95 kubelet promtail-s7d8n Warning Unhealthy Readiness probe failed: Get "http://10.42.0.198:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
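
The loki/promtail probe failures don't obviously line up with the webhook EOFs, so the next place to look is probably the node itself, i.e. whether the EOFs correlate with apiserver or networking hiccups in the k3s logs around a failure window. A sketch (assuming k3s runs as the default systemd unit named k3s):

```sh
# On the affected node: grep the k3s journal around the first failure
journalctl -u k3s --since "2022-09-11 12:10" --until "2022-09-11 12:25" \
  | grep -iE 'webhook|EOF|timeout'
```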