Monitoring
==========
For monitoring your Stackspin cluster we include the kube-prometheus-stack_
helm chart, which bundles Grafana_, Prometheus_ and Alertmanager_ and also
includes pre-configured Prometheus alerts and Grafana dashboards.
Grafana
-------
Grafana can be accessed by clicking the ``Monitoring`` icon in the ``Utilities``
section of the dashboard. Use Stackspin single sign-on to log in.
Browse through the pre-configured dashboards to explore the metrics of your
Stackspin cluster. Describing every dashboard would go beyond the scope of this
page; reach out to us if you don't find what you are looking for.
Browse aggregated logs in Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See :ref:`logging:Viewing logs in Grafana` for how to do this.
Prometheus
----------
Prometheus can be reached by adding ``prometheus.`` in front of your cluster
domain, e.g. ``https://prometheus.stackspin.example.org``. Until we `configure single
sign-on for prometheus`_ you need to log in using basic auth.
The user name is ``admin``; the password can be retrieved by running:

.. code::

   python -m stackspin CLUSTERNAME secrets | grep prometheus-basic-auth
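With these credentials you can also query the Prometheus HTTP API directly, for
example to check that all scrape targets are healthy. A minimal sketch, assuming
the example domain above and with the retrieved password substituted for
``PASSWORD``:

.. code:: bash

   # Query the Prometheus HTTP API using basic auth. The ``up`` metric is 1
   # for every target Prometheus can currently scrape, 0 for targets that
   # are down.
   curl -s -u "admin:PASSWORD" \
     "https://prometheus.stackspin.example.org/api/v1/query?query=up"

Any target reporting ``0`` here is a good starting point for further
investigation in Grafana or with ``kubectl``.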
Alertmanager
------------
Alertmanager can be reached by adding ``alertmanager.`` in front of your cluster
domain, e.g. ``https://alertmanager.stackspin.example.org``. Until we `configure single
sign-on for prometheus`_ you need to log in using basic auth.
The user name is ``admin``; the password can be retrieved by running:

.. code::

   python -m stackspin CLUSTERNAME secrets | grep alertmanager-basic-auth
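The same credentials work for the Alertmanager API, which can be handy for
checking which alerts are currently firing without opening the web interface. A
sketch, assuming the example domain above and ``PASSWORD`` substituted as
before:

.. code:: bash

   # List currently active alerts via the Alertmanager v2 API.
   curl -s -u "admin:PASSWORD" \
     "https://alertmanager.stackspin.example.org/api/v2/alerts"

The response is a JSON list of alerts, each with its labels (including
``alertname``) and annotations.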
Email alerts
------------
From time to time you might get email alerts sent by Alertmanager_ to the email
address you have set in the cluster configuration.
Common alerts include (listed by the ``alertname`` referenced in the email
body):
* **KubeJobCompletion**: A job did not complete successfully. This often happens
  during the initial setup phase. If the alert persists, use e.g.
  ``kubectl -n stackspin-apps get jobs`` to list all jobs in the
  ``stackspin-apps`` namespace, and delete the failed job
  to silence the alert, e.g. with
  ``kubectl -n stackspin-apps delete job nc-nextcloud-cron-27444460``.
* **ReconciliationFailure**: A `flux helmRelease`_ could not be reconciled
  successfully. This also happens often during the initial setup phase. Use e.g.
  ``flux -n stackspin-apps get helmreleases`` to view the current state of
  all ``helmReleases`` in the ``stackspin-apps`` namespace.
  In case the ``helmRelease`` in question is stuck in an ``install retries exhausted``
  or ``upgrade retries exhausted`` state, you can force a reconciliation with

  .. code::

     flux -n stackspin-apps suspend helmrelease zulip
     flux -n stackspin-apps resume helmrelease zulip

  For more information on this issue see `helmrelease upgrade retries exhausted regression`_.
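If several jobs have failed, deleting them one by one gets tedious. A sketch for
cleaning them up in bulk, assuming every failed job in the ``stackspin-apps``
namespace may safely be deleted (jobs are recreated by their CronJobs, so this
only clears the alert):

.. code:: bash

   # Print the name of every job with at least one failed pod, one per line,
   # then delete each of them. ``xargs -r`` skips the delete when no job
   # matches.
   kubectl -n stackspin-apps get jobs \
     -o jsonpath='{range .items[?(@.status.failed>0)]}{.metadata.name}{"\n"}{end}' \
     | xargs -r -n1 kubectl -n stackspin-apps delete job

Review the printed job names first (run only the ``kubectl get`` part) if you
are unsure whether a failed job should be kept for debugging.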
.. _kube-prometheus-stack: https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack
.. _Grafana: https://grafana.com
.. _Prometheus: https://prometheus.io
.. _Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager
.. _configure single sign-on for prometheus: https://open.greenhost.net/stackspin/stackspin/-/issues/371
.. _flux helmRelease: https://fluxcd.io/docs/guides/helmreleases
.. _helmrelease upgrade retries exhausted regression: https://github.com/fluxcd/flux2/issues/1878