Monitoring
==========
For monitoring your Stackspin cluster we include the kube-prometheus-stack_
helm chart, which bundles Grafana_, Prometheus_ and Alertmanager_ and also
includes pre-configured Prometheus alerts and Grafana dashboards.
Grafana
-------
Grafana can be accessed by clicking the ``Monitoring`` icon in the ``Utilities``
section of the dashboard. Use Stackspin single sign-on to log in.
Browse through the pre-configured dashboards to explore the metrics of your
Stackspin cluster. Describing every dashboard would go beyond the scope of this
page; reach out to us if you don't find what you are looking for.
Browse aggregated logs in Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See :ref:`logging:Viewing logs in Grafana` for how to do this.
Prometheus
----------
Prometheus can be reached by adding ``prometheus.`` in front of your cluster
domain, e.g. ``https://prometheus.stackspin.example.org``. Until we `configure single
sign-on for prometheus`_ you need to log in using basic auth.
The user name is ``admin``; the password can be retrieved by running:

.. code::

   python -m stackspin CLUSTERNAME secrets | grep prometheus-basic-auth
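With these credentials you can also query the Prometheus HTTP API directly, for
example to check that all scrape targets are healthy. A minimal sketch, assuming
the example domain above and with the retrieved password substituted for
``PASSWORD``:

.. code:: bash

   # Query the Prometheus HTTP API using basic auth. The ``up`` metric is 1
   # for every target Prometheus can currently scrape, 0 for targets that
   # are down.
   curl -s -u "admin:PASSWORD" \
     "https://prometheus.stackspin.example.org/api/v1/query?query=up"

Any target reporting ``0`` here is a good starting point for further
investigation in Grafana or with ``kubectl``.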
Alertmanager
------------
Alertmanager can be reached by adding ``alertmanager.`` in front of your cluster
domain, e.g. ``https://alertmanager.stackspin.example.org``. Until we `configure single
sign-on for prometheus`_ you need to log in using basic auth.
The user name is ``admin``; the password can be retrieved by running:

.. code::

   python -m stackspin CLUSTERNAME secrets | grep alertmanager-basic-auth
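The same credentials work for the Alertmanager API, which can be handy for
checking which alerts are currently firing without opening the web interface. A
sketch, assuming the example domain above and ``PASSWORD`` substituted as
before:

.. code:: bash

   # List currently active alerts via the Alertmanager v2 API.
   curl -s -u "admin:PASSWORD" \
     "https://alertmanager.stackspin.example.org/api/v2/alerts"

The response is a JSON list of alerts, each with its labels (including
``alertname``) and annotations.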
Email alerts
------------
From time to time you might get email alerts sent by Alertmanager_ to the email
address you have set in the cluster configuration.
Common alerts include (listed by the ``alertname`` referenced in the email
body):
* **KubeJobCompletion**: A job did not complete successfully. This often happens
  during the initial setup phase. If the alert persists, use e.g.
  ``kubectl -n stackspin-apps get jobs`` to list all jobs in the
  ``stackspin-apps`` namespace, and delete the failed job
  to silence the alert, e.g. with
  ``kubectl -n stackspin-apps delete job nc-nextcloud-cron-27444460``.
* **ReconciliationFailure**: A `flux helmRelease`_ could not be reconciled
  successfully. This also happens often during the initial setup phase. Use e.g.
  ``flux -n stackspin-apps get helmreleases`` to view the current state of
  all ``helmReleases`` in the ``stackspin-apps`` namespace.
  In case the ``helmRelease`` in question is stuck in an ``install retries exhausted``
  or ``upgrade retries exhausted`` state, you can force a reconciliation with

  .. code::

     flux -n stackspin-apps suspend helmrelease zulip
     flux -n stackspin-apps resume helmrelease zulip

  For more information on this issue see `helmrelease upgrade retries exhausted regression`_.
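If several jobs have failed, deleting them one by one gets tedious. A sketch for
cleaning them up in bulk, assuming every failed job in the ``stackspin-apps``
namespace may safely be deleted (jobs are recreated by their CronJobs, so this
only clears the alert):

.. code:: bash

   # Print the name of every job with at least one failed pod, one per line,
   # then delete each of them. ``xargs -r`` skips the delete when no job
   # matches.
   kubectl -n stackspin-apps get jobs \
     -o jsonpath='{range .items[?(@.status.failed>0)]}{.metadata.name}{"\n"}{end}' \
     | xargs -r -n1 kubectl -n stackspin-apps delete job

Review the printed job names first (run only the ``kubectl get`` part) if you
are unsure whether a failed job should be kept for debugging.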
.. _kube-prometheus-stack: https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack
.. _Grafana: https://grafana.com
.. _Prometheus: https://prometheus.io
.. _Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager
.. _configure single sign-on for prometheus: https://open.greenhost.net/stackspin/stackspin/-/issues/371
.. _flux helmRelease: https://fluxcd.io/docs/guides/helmreleases
.. _helmrelease upgrade retries exhausted regression: https://github.com/fluxcd/flux2/issues/1878