Merge branch '1169-document-alertmanager-emails-and-what-they-mean' into 'main'

Resolve "Document alertmanager emails and what they mean" Closes #1169 See merge request stackspin/stackspin!864

Commit 582a30ee authored 3 years ago by Arie Peterson
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -38,6 +38,7 @@ For more information, go to `the Stackspin website`_.
   :caption: System Administration

   logging
+   monitoring
   maintenance
   upgrading
   customizing

--- a/docs/installation/install_stackspin.rst
+++ b/docs/installation/install_stackspin.rst
@@ -41,8 +41,11 @@ Outgoing email

 Stackspin uses SMTP to send emails. This is essential for finishing account
 setups with password recovery links. Additionally, apps like Nextcloud, Zulip
-and Alertmanager will be able to send email notifications from the email address
+and WordPress will be able to send email notifications from the email address
 configured here.
+You may also receive alert notification emails from Stackspin's
+monitoring system. See :ref:`monitoring:Email alerts` for more information about
+those alerts, especially during installation.

 Because Stackspin does not include an email server, you need to search your
 (external) email provider's helpdesk for SMTP configuration details.

--- a/docs/monitoring.rst
+++ b/docs/monitoring.rst
+Monitoring
+==========
+
+To monitor your Stackspin cluster, we include the kube-prometheus-stack_
+Helm chart, which bundles Grafana_, Prometheus_ and Alertmanager_
+and ships with pre-configured Prometheus alerts and Grafana dashboards.
+
+Grafana
+-------
+
+Grafana can be accessed by clicking on the ``Monitoring`` icon in the ``Utilities``
+section of the dashboard. Use Stackspin single sign-on to log in.
+
+Dashboards
+~~~~~~~~~~
+
+Browse through the pre-configured dashboards to explore metrics of your
+Stackspin cluster. Describing every dashboard is beyond the scope of this
+page; reach out to us if you can't find what you are looking for.
+
+Browse aggregated logs in Grafana
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+See :ref:`logging:Viewing logs in Grafana` for how to do this.
+
+Prometheus
+----------
+
+Prometheus can be reached by adding ``prometheus.`` in front of your cluster
+domain, e.g. ``https://prometheus.stackspin.example.org``. Until we `configure single
+sign-on for prometheus`_ you need to log in using basic auth.
+The user name is ``admin``; the password can be retrieved by running:
+
+.. code::
+
+   python -m stackspin CLUSTERNAME secrets | grep prometheus-basic-auth
+
+Alertmanager
+------------
+
+Alertmanager can be reached by adding ``alertmanager.`` in front of your cluster
+domain, e.g. ``https://alertmanager.stackspin.example.org``. Until we `configure single
+sign-on for prometheus`_ you need to log in using basic auth.
+The user name is ``admin``; the password can be retrieved by running:
+
+.. code::
+
+   python -m stackspin CLUSTERNAME secrets | grep alertmanager-basic-auth
+
+
+Email alerts
+------------
+
+From time to time you might get email alerts sent by Alertmanager_ to the email
+address you have set in the cluster configuration.
+Common alerts include (listed by the ``alertname`` referenced in the email
+body):
+
+* **KubeJobCompletion**: A job did not complete successfully. This often happens
+  during the initial setup phase. If the alert persists, list all jobs in the
+  ``stackspin-apps`` namespace with, e.g.,
+  ``kubectl -n stackspin-apps get jobs``, then delete the failed job
+  to silence the alert with, e.g.,
+  ``kubectl -n stackspin-apps delete job nc-nextcloud-cron-27444460``.
+
+* **ReconciliationFailure**: A `flux helmRelease`_ could not be reconciled
+  successfully. This also often happens during the initial setup phase, but it
+  can have different root causes. Use
+  ``flux -n stackspin-apps get helmreleases`` to view the current state of
+  all ``helmReleases`` in the ``stackspin-apps`` namespace.
+  If the ``helmRelease`` in question is stuck in an ``install retries exhausted``
+  or ``upgrade retries exhausted`` state, you can force a reconciliation with:
+
+   .. code::
+
+      flux -n stackspin-apps suspend helmrelease zulip
+      flux -n stackspin-apps resume helmrelease zulip
+
+  Depending on the underlying cause, this may or may not fix the
+  ``helmRelease`` state.
+  For more information on this issue, see `helmrelease upgrade retries exhausted regression`_.
+
+
+
+.. _kube-prometheus-stack: https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack
+.. _Grafana: https://grafana.com
+.. _Prometheus: https://prometheus.io
+.. _Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager
+.. _configure single sign-on for prometheus: https://open.greenhost.net/stackspin/stackspin/-/issues/371
+.. _flux helmRelease: https://fluxcd.io/docs/guides/helmreleases
+.. _helmrelease upgrade retries exhausted regression: https://github.com/fluxcd/flux2/issues/1878
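
Not part of the commit above: a minimal sketch of how the KubeJobCompletion advice in the new ``monitoring.rst`` could be narrowed down to the jobs that actually need deleting. It assumes that a "failed job" is one whose ``.status.failed`` count is greater than zero, which may not cover every cause of the alert. The helper name ``failed_jobs`` is hypothetical; it only filters text on stdin and does not touch the cluster.

```shell
#!/bin/sh
# failed_jobs (hypothetical helper): read a two-column "NAME FAILED" table
# on stdin and print the names of jobs whose FAILED value is a number
# greater than zero. kubectl prints "<none>" when .status.failed is unset;
# those rows, and the header row, are skipped.
failed_jobs() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Assumed usage against a live cluster (not executed here):
#   kubectl -n stackspin-apps get jobs \
#     -o custom-columns=NAME:.metadata.name,FAILED:.status.failed | failed_jobs
```

Review the printed names before passing any of them to ``kubectl -n stackspin-apps delete job``, since deleting a job also removes its history.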