Fix rabbitmq mem alert
Tonight at 3:13 CEST we got this alert from staging.s.n; it resolved 5 minutes later:
alertname = ContainerMemoryUsage
container = rabbitmq
prometheus = stackspin/kube-prometheus-stack-prometheus
severity = warning
Annotations
description = Container Memory usage is above 80% VALUE = 82.97882080078125 LABELS = map[container:rabbitmq]
summary = Container Memory usage (rabbitmq)
RabbitMQ runs twice on the cluster, once for NC and once for Zulip. However, the Zulip pod doesn't have any memory limit set, which is why Prometheus only shows the NC rabbitmq pod here:
The original alert source is this:
The problem is that the alert divides the summed memory usage of all rabbitmq containers (NC + Zulip) by the sum of all rabbitmq container limits (which is only the NC rabbitmq limit). That's why we get this misleading alert.
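To make the arithmetic concrete, here is a minimal Python sketch of that division. The pod names and MiB figures are made up for illustration (they are not measurements from the cluster), and the helper functions are hypothetical stand-ins for what the alert expression appears to compute.

```python
# Hypothetical per-pod memory usage in MiB; NOT measured values from the cluster.
usage_mib = {"nc-rabbitmq": 150.0, "zulip-rabbitmq": 62.0}
# Only the NC pod has a memory limit; the Zulip pod has none, so it is absent here.
limit_mib = {"nc-rabbitmq": 256.0}


def aggregated_usage_percent(usage, limits):
    """Sum the usage of ALL containers and divide by the sum of the limits that exist."""
    return sum(usage.values()) / sum(limits.values()) * 100


def per_pod_usage_percent(usage, limits):
    """Compute usage/limit per pod, skipping pods that have no limit."""
    return {pod: usage[pod] / limits[pod] * 100 for pod in usage if pod in limits}


alert_value = aggregated_usage_percent(usage_mib, limit_mib)
nc_value = per_pod_usage_percent(usage_mib, limit_mib)["nc-rabbitmq"]

print(f"aggregated (what the alert sees): {alert_value:.1f}%")
print(f"NC rabbitmq alone:                {nc_value:.1f}%")
```

With these made-up numbers the aggregated value crosses the 80% threshold even though the NC pod on its own is well below it, because the Zulip usage is counted in the numerator but has nothing to contribute to the denominator.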
So we have two options:
- Set memory requests and limits for the Zulip rabbitmq pod.
- Rename one of the rabbitmq containers/pods so that they don't share the same container name.
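
As a rough sketch of how each option would change the computed value, the following Python snippet groups samples by container name, which is what the `LABELS = map[container:rabbitmq]` in the alert suggests the rule does. All numbers, the 256 MiB Zulip limit, and the `zulip-rabbitmq` name are hypothetical, and the function is an illustration, not the actual alert rule.

```python
from collections import defaultdict


def percent_by_container(samples):
    """Group (container_name, usage_mib, limit_mib_or_None) samples by name.

    Returns usage/limit in percent per container name. Names without any
    limit are dropped, since there is nothing to divide by.
    """
    usage = defaultdict(float)
    limit = defaultdict(float)
    for name, used, lim in samples:
        usage[name] += used
        if lim is not None:
            limit[name] += lim
    return {name: usage[name] / limit[name] * 100 for name in usage if limit[name] > 0}


# Current state: both containers are called "rabbitmq", only NC has a limit.
current = [("rabbitmq", 150.0, 256.0), ("rabbitmq", 62.0, None)]
# Option 1: the Zulip rabbitmq also gets a (hypothetical) 256 MiB limit.
with_zulip_limit = [("rabbitmq", 150.0, 256.0), ("rabbitmq", 62.0, 256.0)]
# Option 2: the Zulip container is renamed (hypothetical name), still without a limit.
renamed = [("rabbitmq", 150.0, 256.0), ("zulip-rabbitmq", 62.0, None)]

for label, samples in [("current", current),
                       ("option 1", with_zulip_limit),
                       ("option 2", renamed)]:
    print(label, percent_by_container(samples))
```

Under these assumptions both options bring the `rabbitmq` value back below 80%. Option 1 still reports one combined value for both pods, while option 2 reports NC on its own and produces no value at all for the unlimited Zulip container.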