Fine-tune container memory limits
Todo
- Look at specific issues linked below
Context
We should tune the existing container limits. Some weekly mem usage averages are pretty close to the limit (e.g. loki, promtail, see #1013 (closed)), while others are pretty low.
From #1013 (comment 31292):
Varac: Maybe it's a good idea to agree on a % threshold (mem usage / mem limits, like shown in this graph) which we want all apps to limbo under (60%? 70%?)
@maarten: I'd say:
- Request: the median memory usage of an app (if it uses 100M 90% of the time, that's the request)
- Limit: 150% of the max memory usage we've seen in a few days?
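As a rough sketch of that proposal, both candidate values can be read straight from Prometheus; the 3d window here is an assumption for "a few days" and can be adjusted.
Request candidate, the median (p50) working-set memory per container:
❯ prom_query 'quantile_over_time(0.5, container_memory_working_set_bytes{container!=""}[3d]) / 1024 / 1024'
Limit candidate, 150% of the maximum working-set memory seen over the same window:
❯ prom_query 'max_over_time(container_memory_working_set_bytes{container!=""}[3d]) * 1.5 / 1024 / 1024'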
Define a shell alias:
❯ alias prom_query='kubectl -n oas exec -it -c prometheus statefulset/prometheus-kube-prometheus-stack-prometheus -- promtool query instant http://localhost:9090'
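To confirm the alias works (the namespace and statefulset name above are from this cluster and may differ elsewhere), any simple query such as up should return targets:
❯ prom_query 'up{job="kubelet"}'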
One-week average of mem usage / mem limit (%)
Find which containers' mem usage is close to their limits (a variant sorted the other way, for under-used containers, follows the output below):
❯ prom_query 'sort_desc(sum(sum_over_time(container_memory_working_set_bytes[1w])) BY (instance, container) / sum(sum_over_time(container_spec_memory_limit_bytes[1w]) > 0) BY (instance, container) * 100)'
{container="loki", instance="213.108.108.57:10250"} => 75.596526202572 @[1636473020.353]
{container="promtail", instance="213.108.108.57:10250"} => 71.10407554109295 @[1636473020.353]
{container="rocketchat", instance="213.108.108.57:10250"} => 70.33107862832951 @[1636473020.353]
...
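Conversely, the same query with sort instead of sort_desc surfaces the containers whose limits sit far above their actual usage and could probably be lowered:
❯ prom_query 'sort(sum(sum_over_time(container_memory_working_set_bytes[1w])) BY (instance, container) / sum(sum_over_time(container_spec_memory_limit_bytes[1w]) > 0) BY (instance, container) * 100)'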
Current mem usage / mem limit (%)
Not useful for the weekly average, but handy for ad-hoc Prometheus graphs:
❯ prom_query 'sort_desc(sum(container_memory_working_set_bytes) BY (instance, container) / sum(container_spec_memory_limit_bytes > 0) BY (instance, container) * 100)'
{container="promtail", instance="213.108.108.57:10250"} => 84.61151123046875 @[1636473824.531]
{container="rocketchat", instance="213.108.108.57:10250"} => 71.17919921875 @[1636473824.531]
{container="loki", instance="213.108.108.57:10250"} => 62.26374308268229 @[1636473824.531]
One-week average of absolute mem usage (MiB)
❯ prom_query 'sort_desc(avg_over_time(container_memory_working_set_bytes{container!=""}[1w]) /1024/1024)'
{container="nextcloud-onlyoffice", endpoint="https-metrics", id="/kubepods/burstable/podf964344a-511f-4381-bd51-ee2788d97ac9/22849f1150210a256b6e699fc27da64d22858c8b5bf36995c4168654dd9b2be7", image="sha256:731f9669f88e8d8887d4431426c473e05037a7c797162916e98422545ce0a620", instance="213.108.105.236:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="22849f1150210a256b6e699fc27da64d22858c8b5bf36995c4168654dd9b2be7", namespace="oas-apps", node="staging.stackspin.net", pod="nc-onlyoffice-documentserver-66458d76dd-dc4nf", service="kube-prometheus-stack-kubelet"} => 937.1429582868284 @[1636469104.367]
{container="prometheus", endpoint="https-metrics", id="/kubepods/burstable/podf39fffbb-79df-46d7-81fd-05ede5c8811a/d811ac67146fbf8751a1b7dd2e7de13a2991ebbfdc3b7e6dc3e7c85c61bd1710", image="quay.io/prometheus/prometheus@sha256:5c030438c1e4c86bdc7428f24ee1ad18476eefdfa8a7f76a8ccc9b74f1970d81", instance="213.108.105.236:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="d811ac67146fbf8751a1b7dd2e7de13a2991ebbfdc3b7e6dc3e7c85c61bd1710", namespace="oas", node="staging.stackspin.net", pod="prometheus-kube-prometheus-stack-prometheus-0", service="kube-prometheus-stack-kubelet"} => 653.7031982421885 @[1636469104.367]
...
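The raw cadvisor labels above are noisy; the same average can be aggregated down to namespace/pod/container (a sketch, using max to pick one series per container):
❯ prom_query 'sort_desc(max by (namespace, pod, container) (avg_over_time(container_memory_working_set_bytes{container!=""}[1w])) /1024/1024)'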
Current absolute mem usage (MiB)
Not useful for the weekly average, but handy for ad-hoc Prometheus graphs:
❯ prom_query 'sort_desc(container_memory_working_set_bytes{container!=""} /1024/1024)'
{container="loki", endpoint="https-metrics", id="/kubepods/burstable/podec2a7743-d14d-4a62-b1d7-1b2d82789334/ccdf206c2f89c59fef14b842d0a9e4765fbe99e8266420677a4c4a8eec8997e6", image="docker.io/grafana/loki:2.3.0", instance="213.108.108.57:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="ccdf206c2f89c59fef14b842d0a9e4765fbe99e8266420677a4c4a8eec8997e6", namespace="oas", node="oas.greenhost.net", pod="loki-0", service="kube-prometheus-stack-kubelet"} => 956.37109375 @[1636473738.91]
{container="prometheus", endpoint="https-metrics", id="/kubepods/burstable/podbe60f291-9f6e-46e5-a4d0-59ec656a62d4/a569bbbfee73e1590ff27ae3dec180f6751e5dd251cc7c5e1c0d05a34739cea2", image="quay.io/prometheus/prometheus@sha256:5c030438c1e4c86bdc7428f24ee1ad18476eefdfa8a7f76a8ccc9b74f1970d81", instance="213.108.108.57:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="a569bbbfee73e1590ff27ae3dec180f6751e5dd251cc7c5e1c0d05a34739cea2", namespace="oas", node="oas.greenhost.net", pod="prometheus-kube-prometheus-stack-prometheus-0", service="kube-prometheus-stack-kubelet"} => 847.05078125 @[1636473738.91]
...
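For comparison, the currently configured limits can be listed with the same metric used above (the > 0 match filters out containers that have no limit set):
❯ prom_query 'sort_desc((container_spec_memory_limit_bytes{container!=""} > 0) /1024/1024)'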