Fine-tune container memory limits
Todo
- Look at specific issues linked below
Context
We should tune the existing container limits. Some weekly mem usage averages are pretty close to the limit (e.g. loki, promtail, see #1013 (closed)), while others are pretty low.
From #1013 (comment 31292):
Varac: Maybe it's a good idea to agree on a % threshold (mem usage / mem limits, like shown in this graph) which we want all apps to limbo under (60%? 70%?)
@maarten: I'd say:
- Request: the median memory usage of an app (if it uses 100M 90% of the time, that's the request)
- Limit: 150% of the max memory usage we've seen in a few days?
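As a rough sketch of that proposal, both candidate values can be read straight from Prometheus; the 3d window here is an assumption for "a few days" and can be adjusted.
Request candidate, the median (p50) working-set memory per container:
❯ prom_query 'quantile_over_time(0.5, container_memory_working_set_bytes{container!=""}[3d]) / 1024 / 1024'
Limit candidate, 150% of the maximum working-set memory seen over the same window:
❯ prom_query 'max_over_time(container_memory_working_set_bytes{container!=""}[3d]) * 1.5 / 1024 / 1024'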
Define a shell alias:
❯ alias prom_query='kubectl -n oas exec -it -c prometheus statefulset/prometheus-kube-prometheus-stack-prometheus -- promtool query instant http://localhost:9090'
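To confirm the alias works (the namespace and statefulset name above are from this cluster and may differ elsewhere), any simple query such as up should return targets:
❯ prom_query 'up{job="kubelet"}'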
One-week average of mem usage / mem limit (%)
Find which containers' mem usage is close to their limits (a variant sorted the other way, for under-used containers, follows the output below):
❯ prom_query 'sort_desc(sum(sum_over_time(container_memory_working_set_bytes[1w])) BY (instance, container) / sum(sum_over_time(container_spec_memory_limit_bytes[1w]) > 0) BY (instance, container) * 100)'
{container="loki", instance="213.108.108.57:10250"} => 75.596526202572 @[1636473020.353]
{container="promtail", instance="213.108.108.57:10250"} => 71.10407554109295 @[1636473020.353]
{container="rocketchat", instance="213.108.108.57:10250"} => 70.33107862832951 @[1636473020.353]
...
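Conversely, the same query with sort instead of sort_desc surfaces the containers whose limits sit far above their actual usage and could probably be lowered:
❯ prom_query 'sort(sum(sum_over_time(container_memory_working_set_bytes[1w])) BY (instance, container) / sum(sum_over_time(container_spec_memory_limit_bytes[1w]) > 0) BY (instance, container) * 100)'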
Current mem usage / mem limit (%)
Not useful for the weekly average, but handy for ad-hoc Prometheus graphs:
❯ prom_query 'sort_desc(sum(container_memory_working_set_bytes) BY (instance, container) / sum(container_spec_memory_limit_bytes > 0) BY (instance, container) * 100)'
{container="promtail", instance="213.108.108.57:10250"} => 84.61151123046875 @[1636473824.531]
{container="rocketchat", instance="213.108.108.57:10250"} => 71.17919921875 @[1636473824.531]
{container="loki", instance="213.108.108.57:10250"} => 62.26374308268229 @[1636473824.531]
One-week average of absolute mem usage (MiB)
❯ prom_query 'sort_desc(avg_over_time(container_memory_working_set_bytes{container!=""}[1w]) /1024/1024)'
{container="nextcloud-onlyoffice", endpoint="https-metrics", id="/kubepods/burstable/podf964344a-511f-4381-bd51-ee2788d97ac9/22849f1150210a256b6e699fc27da64d22858c8b5bf36995c4168654dd9b2be7", image="sha256:731f9669f88e8d8887d4431426c473e05037a7c797162916e98422545ce0a620", instance="213.108.105.236:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="22849f1150210a256b6e699fc27da64d22858c8b5bf36995c4168654dd9b2be7", namespace="oas-apps", node="staging.stackspin.net", pod="nc-onlyoffice-documentserver-66458d76dd-dc4nf", service="kube-prometheus-stack-kubelet"} => 937.1429582868284 @[1636469104.367]
{container="prometheus", endpoint="https-metrics", id="/kubepods/burstable/podf39fffbb-79df-46d7-81fd-05ede5c8811a/d811ac67146fbf8751a1b7dd2e7de13a2991ebbfdc3b7e6dc3e7c85c61bd1710", image="quay.io/prometheus/prometheus@sha256:5c030438c1e4c86bdc7428f24ee1ad18476eefdfa8a7f76a8ccc9b74f1970d81", instance="213.108.105.236:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="d811ac67146fbf8751a1b7dd2e7de13a2991ebbfdc3b7e6dc3e7c85c61bd1710", namespace="oas", node="staging.stackspin.net", pod="prometheus-kube-prometheus-stack-prometheus-0", service="kube-prometheus-stack-kubelet"} => 653.7031982421885 @[1636469104.367]
...
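The raw cadvisor labels above are noisy; the same average can be aggregated down to namespace/pod/container (a sketch, using max to pick one series per container):
❯ prom_query 'sort_desc(max by (namespace, pod, container) (avg_over_time(container_memory_working_set_bytes{container!=""}[1w])) /1024/1024)'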
Current absolute mem usage (MiB)
Not useful for the weekly average, but handy for ad-hoc Prometheus graphs:
❯ prom_query 'sort_desc(container_memory_working_set_bytes{container!=""} /1024/1024)'
{container="loki", endpoint="https-metrics", id="/kubepods/burstable/podec2a7743-d14d-4a62-b1d7-1b2d82789334/ccdf206c2f89c59fef14b842d0a9e4765fbe99e8266420677a4c4a8eec8997e6", image="docker.io/grafana/loki:2.3.0", instance="213.108.108.57:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="ccdf206c2f89c59fef14b842d0a9e4765fbe99e8266420677a4c4a8eec8997e6", namespace="oas", node="oas.greenhost.net", pod="loki-0", service="kube-prometheus-stack-kubelet"} => 956.37109375 @[1636473738.91]
{container="prometheus", endpoint="https-metrics", id="/kubepods/burstable/podbe60f291-9f6e-46e5-a4d0-59ec656a62d4/a569bbbfee73e1590ff27ae3dec180f6751e5dd251cc7c5e1c0d05a34739cea2", image="quay.io/prometheus/prometheus@sha256:5c030438c1e4c86bdc7428f24ee1ad18476eefdfa8a7f76a8ccc9b74f1970d81", instance="213.108.108.57:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="a569bbbfee73e1590ff27ae3dec180f6751e5dd251cc7c5e1c0d05a34739cea2", namespace="oas", node="oas.greenhost.net", pod="prometheus-kube-prometheus-stack-prometheus-0", service="kube-prometheus-stack-kubelet"} => 847.05078125 @[1636473738.91]
...
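For comparison, the currently configured limits can be listed with the same metric used above (the > 0 match filters out containers that have no limit set):
❯ prom_query 'sort_desc((container_spec_memory_limit_bytes{container!=""} > 0) /1024/1024)'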