Maintenance
===========

Logging
-------

Logs from pods and containers can be read in different ways:

-  In the cluster filesystem at ``/var/log/pods/`` or
   ``/var/log/containers/``.
-  Using `kubectl logs`_ (see the example below).
-  Querying aggregated logs with Grafana, see below.
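
For example, a minimal ``kubectl logs`` invocation (the namespace and
pod name are hypothetical; adjust them to your cluster):

.. code:: bash

   # Follow the logs of a single pod in the "oas" namespace:
   kubectl logs --follow --namespace oas pod/nextcloud-0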

Central log aggregation
-----------------------

We use `Promtail`_, `Loki`_ and `Grafana`_ for easy access to aggregated
logs. The `Loki documentation`_ is a good starting point for
understanding how this setup works, and `Using Loki in Grafana`_ gets
you started with querying your cluster logs with Grafana.

You will find the Loki Grafana integration on your cluster at
https://grafana.oas.example.org/explore together with some generic query
examples.
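
The examples below use ``logcli``, Loki’s command line client. A
minimal sketch for pointing it at your cluster, assuming Loki runs as a
service named ``loki`` on port 3100 in the ``oas`` namespace (adjust
the names to your setup):

.. code:: bash

   # Forward the Loki API to your machine and tell logcli where to find it:
   kubectl port-forward --namespace oas service/loki 3100:3100 &
   export LOKI_ADDR=http://localhost:3100
   # List the values of the "app" label to see what you can query:
   logcli labels app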

LogQL query examples
~~~~~~~~~~~~~~~~~~~~

Please also refer to the `LogQL documentation`_.

Query all aggregated logs (unfortunately we can’t find a better way of
doing this, since LogQL always expects a stream label to be queried):

.. code:: bash

   logcli query '{foo!="bar"}'

Query all logs for a keyword:

.. code:: bash

   logcli query '{foo!="bar"} |= "error"'

Query all k8s apps for errors using a regular expression:

.. code:: bash

   logcli query '{job=~".*"} |~ "error|fail|exception|fatal"'

Flux
^^^^

`Flux`_ is responsible for installing applications. It uses four
controllers (a way to inspect them is sketched after this list):

-  ``source-controller`` that tracks Helm and Git repositories like
   https://open.greenhost.net/openappstack/openappstack for updates.
-  ``kustomize-controller`` to deploy ``kustomizations`` that often
   install ``helmreleases``.
-  ``helm-controller`` to deploy the ``helmreleases``.
-  ``notification-controller`` that is responsible for inbound and
   outbound Flux messages.
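
A sketch of the inspection mentioned above, assuming the ``flux``
command line tool is installed and your kube config points at the
cluster:

.. code:: bash

   # List the kustomizations and helmreleases Flux reconciles:
   flux get kustomizations
   flux get helmreleases --all-namespaces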

Query all messages from the ``source-controller``:

.. code:: bash

   {app="source-controller"}

Query all messages from ``source-controller`` and ``helm-controller``:

.. code:: bash

   {app=~"(source-controller|helm-controller)"}

``helm-controller`` messages containing ``wordpress``:

.. code:: bash

   {app = "helm-controller"} |= "wordpress"

``helm-controller`` messages containing ``wordpress`` without
``unchanged`` events (to only show the installation messages):

.. code:: bash

   {app = "helm-controller"} |= "wordpress" != "unchanged"

Filter out redundant ``helm-controller`` messages:

.. code:: bash

{ app = "helm-controller" } !~ "(unchanged|event=refreshed|method=Sync|component=checkpoint)"

Debug OAuth2 single sign-on with Rocket.Chat:

.. code:: bash

   {container_name=~"(hydra|rocketchat)"}

Query Kubernetes events processed by the ``eventrouter`` app that
contain ``warning``:

.. code:: bash

   logcli query '{app="eventrouter"} |~ "warning"'

Cert-manager
^^^^^^^^^^^^

Cert-manager is responsible for requesting Let’s Encrypt TLS
certificates.

Query ``cert-manager`` messages containing ``chat``:

.. code:: bash

   {app="cert-manager"} |= "chat"

Hydra
^^^^^

Hydra is the single sign-on system.

Show only warnings and errors from ``hydra``:

.. code:: bash

   {container_name="hydra"} != "level=info"

Backup
------

On your provisioning machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During the installation process, a cluster config directory is created
on your provisioning machine, located in the top-level sub-directory
``clusters`` in your clone of the openappstack git repository. Although
these files are not essential for your OpenAppStack cluster to continue
functioning, you may want to back this folder up because it allows easy
access to your cluster.
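
A minimal sketch of such a backup with ``tar``, assuming your cluster
directory is named ``clusters/oas.example.org`` (adjust the path to
your cluster):

.. code:: bash

   # Run from your clone of the openappstack repository:
   tar czf oas-cluster-config-$(date +%F).tar.gz clusters/oas.example.org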

On your cluster
~~~~~~~~~~~~~~~

OpenAppStack supports using the program Velero to make backups of your
OpenAppStack instance to external storage via the S3 API. See
:ref:`backups-with-velero` in the installation instructions for setup details.
By default this will make nightly backups of the entire cluster (minus
Prometheus data). To make a manual backup, run

.. code:: bash

   cluster$ velero create backup BACKUP_NAME --exclude-namespaces velero --wait

from your VPS. See ``velero --help`` for other commands, and `Velero’s
documentation`_ for more information.
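
To check on your backups afterwards, both of these are standard Velero
subcommands:

.. code:: bash

   cluster$ velero backup get
   cluster$ velero backup describe BACKUP_NAME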

Note: in case you want to make an (additional) backup of application
data via alternate means, all persistent volume data of the cluster is
stored in directories under ``/var/lib/OpenAppStack/local-storage``.
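
For example, a sketch of archiving that data with ``tar`` (run on the
VPS; the archive path is an example, and running applications may still
write during the backup):

.. code:: bash

   cluster$ tar czf /tmp/oas-data-$(date +%F).tar.gz /var/lib/OpenAppStack/local-storage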

Restore
-------

Restore instructions will follow, please `reach out to us`_ if you need
assistance.

Change the IP of your cluster
-----------------------------

In case your cluster needs to migrate to another IP address, make sure
to update the IP address in ``/etc/rancher/k3s/k3s.yaml`` and, if
applicable, in your local kube config and in ``inventory.yml`` in the
cluster directory ``clusters/oas.example.org``.
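
For example, a sketch with ``sed``; the IP addresses are placeholders,
so substitute your old and new address (and back up the file first):

.. code:: bash

   sed -i 's/203.0.113.10/203.0.113.20/g' /etc/rancher/k3s/k3s.yaml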

Delete evicted pods
-------------------

In case your cluster disk is full, Kubernetes `taints`_ the node with
``DiskPressure``. It then tries to evict pods, which is pointless in a
single-node setup but can still happen. We have seen hundreds of pods in
``Evicted`` state that still showed up after the ``DiskPressure``
condition was resolved. See also the
`out of resource handling with kubelet`_ documentation.

You can delete all evicted pods with this command:

.. code:: bash

kubectl get pods --all-namespaces -o json |
  jq -r '.items[]
    | select(.status.reason != null)
    | select(.status.reason | contains("Evicted"))
    | .metadata.name + " " + .metadata.namespace' |
  xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'
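
A shorter alternative uses a field selector; note that this deletes
*all* pods in ``Failed`` state, not only evicted ones:

.. code:: bash

   kubectl delete pods --all-namespaces --field-selector=status.phase=Failed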


.. _kubectl logs: https://kubernetes.io/docs/concepts/cluster-administration/logging
.. _Promtail: https://grafana.com/docs/loki/latest/clients/promtail/
.. _Loki: https://grafana.com/oss/loki/
.. _Grafana: https://grafana.com/
.. _Loki documentation: https://grafana.com/docs/loki/latest/
.. _Using Loki in Grafana: https://grafana.com/docs/grafana/latest/datasources/loki
.. _LogQL documentation: https://grafana.com/docs/loki/latest/logql
.. _Flux: https://fluxcd.io/
.. _Velero’s documentation: https://velero.io/docs/v1.4/
.. _reach out to us: https://openappstack.net/contact.html
.. _taints: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _out of resource handling with kubelet: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/