diff --git a/.gitlab/issue_templates/new_app.md b/.gitlab/issue_templates/new_app.md
index fef04657381c009c83540ad40de2e07075537ab2..5dd13b97f78a309df35d890ecb2b61c09268acd3 100644
--- a/.gitlab/issue_templates/new_app.md
+++ b/.gitlab/issue_templates/new_app.md
@@ -28,6 +28,28 @@
 * [ ] Admin-login should grant admin privileges
 * [ ] Non-admin should not grant admin privileges
 
+### Backup/restore
+
+This applies if the app has any persistent storage that needs to be part of
+backups.
+
+* [ ] Add the label `stackspin.net/backupSet=myApp` to several Kubernetes
+  resources. This label is used by Velero when it is instructed to restore a
+  single app. Typically you should add it to:
+  * [ ] the PVC(s) in `flux2/apps/myApp/pvc*.yaml`;
+  * [ ] any pod(s) that use those PVC(s); this goes in the chart's helm values,
+    typically under a value called `podLabels`, or, if the chart lacks that,
+    possibly `commonLabels`;
+  * [ ] the Kubernetes objects controlling those pods, typically a deployment
+    (`deploymentLabels` or `commonLabels`) or statefulset (`statefulSetLabels`
+    or `commonLabels`).
+* [ ] To the same pods, i.e., the ones that use the PVCs that need to be backed
+  up, add the annotation `backup.velero.io/backup-volumes: "volume-name"`,
+  where `volume-name` is the name of the volume inside the pod spec, as shown
+  for example in `kubectl describe pod` output. See the example below.
+* [ ] Add app-specific backup/restore instructions to `docs/maintenance.rst` if
+  necessary.
+
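+For illustration, these helm values typically end up looking like the
+following (`myApp` and `myapp-data` are placeholders; compare the actual
+changes for Nextcloud, Wekan, WordPress and Zulip in this merge request):
+
+```yaml
+deploymentLabels:
+  stackspin.net/backupSet: "myApp"
+podLabels:
+  stackspin.net/backupSet: "myApp"
+podAnnotations:
+  # "myapp-data" must match the volume name in the pod spec.
+  backup.velero.io/backup-volumes: "myapp-data"
+```
+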
 ### Etc
 
 * [ ] Add app to `dump_secrets()` in `stackspin/cluster.py`
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 1535a95bc9bd8d8ff9945d161cd95ba62f1ea8f2..72cdcc996065bd68de95a1c06d156507053075d1 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -6,6 +6,8 @@ repos:
       - id: check-ast
       - id: check-merge-conflict
       - id: check-yaml
+        args:
+          - --allow-multiple-documents
       - id: detect-private-key
       - id: end-of-file-fixer
       - id: trailing-whitespace
diff --git a/docs/installation/install_stackspin.rst b/docs/installation/install_stackspin.rst
index d4f841509e03c15c23226cb4b67e3878b80d7180..24e2235c0673387c02a2f929c705f555869329b5 100644
--- a/docs/installation/install_stackspin.rst
+++ b/docs/installation/install_stackspin.rst
@@ -90,9 +90,8 @@ If enabled, Velero will create a backup of your cluster once every night and
 upload it to the S3 storage you configure. This includes:
 
 - your cluster state. Technically speaking, it will back up all Kubernetes
-  namespaces in your cluster, except ``velero`` itself; this includes things
-  like which applications are installed, including their version number and
-  installation-time settings;
+  resources in your cluster; this includes things like which applications are
+  installed, including their version number and installation-time settings;
 - persistent data of all applications: for example, single sign-on users that
   you created, Nextcloud files and metadata, WordPress site data and comments,
   Zulip chat history, etc. A single exception to this is Prometheus data
@@ -109,6 +108,8 @@ and configure the settings with the ``backup_s3_`` prefix. Then continue with
 the installation procedure as described below. At the end of the installation
 procedure, you have to install the ``velero`` application.
 
+For information on how to use Velero with Stackspin, please see :ref:`backup`.
+
 .. _install_core_apps:
 
 Step 2: Install core applications
diff --git a/docs/maintenance.rst b/docs/maintenance.rst
index 544fb0c92e13d358daa1329b2a6bd7861e3bc66d..dfc3be9880d341d4113674831cb71d61ad6fc656 100644
--- a/docs/maintenance.rst
+++ b/docs/maintenance.rst
@@ -1,6 +1,8 @@
 Maintenance
 ===========
 
+.. _backup:
+
 Backup
 ------
 
@@ -20,7 +22,13 @@ On your cluster
 Stackspin supports using the program Velero to make backups of your Stackspin
 instance to external storage via the S3 API. See :ref:`backups-with-velero`
 in the installation instructions for setup details.
-By default this will make nightly backups of the entire cluster (minus
+
+For the maintenance operations described below -- in particular, restoring
+backups -- you need the ``velero`` client program installed, typically on your
+provisioning machine, although you can also run it on the VPS if preferred.
+You can download it from `Velero's GitHub release page`_.
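+
+For example, on a Linux amd64 machine the installation could look as follows
+(a sketch only; substitute the current version number from that page):
+
+.. code:: bash
+
+   wget https://github.com/vmware-tanzu/velero/releases/download/v1.9.0/velero-v1.9.0-linux-amd64.tar.gz
+   tar -xzf velero-v1.9.0-linux-amd64.tar.gz
+   sudo mv velero-v1.9.0-linux-amd64/velero /usr/local/bin/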
+
+By default Velero will make nightly backups of the entire cluster (minus
 Prometheus data). To make a manual backup, run
 
 .. code:: bash
@@ -37,8 +45,85 @@
 stored in directories under ``/var/lib/Stackspin/local-storage``.
 
 Restore
 -------
 
-Restore instructions will follow, please `reach out to us`_ if you need
-assistance.
+Restoring from backups currently has to be done via the command line; we
+intend to make this possible from the Stackspin dashboard in the near future.
+
+These instructions explain how to restore the persistent data of an individual
+app (such as Nextcloud or Zulip) to a previous point in time, from a backup
+made with Velero to S3-compatible storage, on a Stackspin cluster that is in a
+healthy state. Using backups to recover from more severe problems, like a
+broken or completely destroyed Stackspin cluster, is also possible: you would
+reinstall the cluster from scratch and restore individual app data on top of
+that. However, that procedure is less streamlined and is not documented here.
+If you are in that situation, please `reach out to us`_ for advice or
+assistance.
+
+Select backup
+~~~~~~~~~~~~~
+
+To show a list of available backups, run the following command on your VPS:
+
+.. code:: bash
+
+   kubectl get backup -A
+
+Once you have chosen a backup to restore from, record its name as written in
+the ``kubectl`` output.
+
+.. note::
+
+   Please be aware that for technical reasons the restore operation will
+   restore not only the persistent data from this backup, but also the app's
+   software version that was running at that time. The auto-update mechanism
+   should in turn update the app to a recent version, and the recent app
+   version should be able to automatically perform any necessary data format
+   migrations on the old data. However, this has not been well tested for
+   older backups, so please proceed carefully. As an example of what could go
+   wrong: Nextcloud requires upgrades to be done serially, never skipping a
+   major version, so if your backup is from two or more major Nextcloud
+   versions ago, some manual intervention is required. If you have any doubts,
+   please `reach out to us`_.
+
+Restore app data
+~~~~~~~~~~~~~~~~
+
+.. warning::
+
+   Please note that restoring data is a destructive operation! It will replace
+   the app's current data, and there is no way to undo a restore operation
+   unless you have a copy of the current app data, in the form of an
+   up-to-date Stackspin backup or an app-specific data export. For that
+   reason, we recommend making another backup right before beginning a restore
+   operation.
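+
+For example, such a pre-restore backup could be created like this (the backup
+name is arbitrary; ``--wait`` makes the command block until the backup has
+completed):
+
+.. code:: bash
+
+   velero backup create pre-restore-$app --wait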
backup.velero.io/backup-volumes: "data" persistence: @@ -97,6 +101,8 @@ data: requests: cpu: 100m memory: 256Mi + commonLabels: + stackspin.net/backupSet: "nextcloud" apps: - name: sociallogin @@ -138,10 +144,15 @@ data: - "office.${domain}" secretName: stackspin-nextcloud-office jwtSecret: "${onlyoffice_jwt_secret}" - persistence: enabled: true existingClaim: "nextcloud-onlyoffice-data" + deploymentLabels: + stackspin.net/backupSet: "nextcloud" + podLabels: + stackspin.net/backupSet: "nextcloud" + podAnnotations: + backup.velero.io/backup-volumes: "onlyoffice-data" postgresql: postgresqlPassword: "${onlyoffice_postgresql_password}" @@ -155,6 +166,11 @@ data: persistence: enabled: true existingClaim: "nextcloud-postgresql" + primary: + podAnnotations: + backup.velero.io/backup-volumes: "data" + commonLabels: + stackspin.net/backupSet: "nextcloud" rabbitmq: auth: diff --git a/flux2/apps/nextcloud/pvc.yaml b/flux2/apps/nextcloud/pvc.yaml index 76adebaebe8cb9924530e5b865d268698368a944..c64feb121c5909f076768c116903f5e1b68a50e4 100644 --- a/flux2/apps/nextcloud/pvc.yaml +++ b/flux2/apps/nextcloud/pvc.yaml @@ -4,6 +4,8 @@ kind: PersistentVolumeClaim metadata: name: nextcloud-files namespace: stackspin-apps + labels: + stackspin.net/backupSet: "nextcloud" spec: accessModes: - ReadWriteOnce @@ -18,6 +20,8 @@ kind: PersistentVolumeClaim metadata: name: nextcloud-mariadb namespace: stackspin-apps + labels: + stackspin.net/backupSet: "nextcloud" spec: accessModes: - ReadWriteOnce @@ -32,6 +36,8 @@ kind: PersistentVolumeClaim metadata: name: nextcloud-postgresql namespace: stackspin-apps + labels: + stackspin.net/backupSet: "nextcloud" spec: accessModes: - ReadWriteOnce @@ -46,6 +52,8 @@ kind: PersistentVolumeClaim metadata: name: nextcloud-onlyoffice-data namespace: stackspin-apps + labels: + stackspin.net/backupSet: "nextcloud" spec: accessModes: - ReadWriteOnce diff --git a/flux2/apps/wekan/pvc.yaml b/flux2/apps/wekan/pvc.yaml index 7114d1ec8f939b52598a271fdc1f01d9f73d7100..397b8920a43f569af9f9ccba1eda0848fd183a08 100644 --- a/flux2/apps/wekan/pvc.yaml +++ b/flux2/apps/wekan/pvc.yaml @@ -4,6 +4,8 @@ kind: PersistentVolumeClaim metadata: name: wekan namespace: stackspin-apps + labels: + stackspin.net/backupSet: "wekan" spec: accessModes: - ReadWriteOnce diff --git a/flux2/apps/wekan/wekan-values-configmap.yaml b/flux2/apps/wekan/wekan-values-configmap.yaml index 521243b7cdd5f275b2179bcb8b09c417a0015758..6e2018e7dbd4cd7215acb5376cdf48084fff5f8a 100644 --- a/flux2/apps/wekan/wekan-values-configmap.yaml +++ b/flux2/apps/wekan/wekan-values-configmap.yaml @@ -78,6 +78,10 @@ data: secretName: wekan autoscaling: enabled: false + deploymentLabels: + stackspin.net/backupSet: "wekan" + podLabels: + stackspin.net/backupSet: "wekan" # https://docs.bitnami.com/kubernetes/infrastructure/mongodb/ # https://github.com/bitnami/charts/tree/master/bitnami/mongodb#parameters mongodb: @@ -96,11 +100,15 @@ data: rootPassword: ${mongodb_root_password} password: ${mongodb_password} podAnnotations: - # Let the backup system include data stored in persistant volumes. + # Let the backup system include wekan data stored in mongodb. backup.velero.io/backup-volumes: "datadir" + podLabels: + stackspin.net/backupSet: "wekan" + # `labels` are applied by the mongodb chart to the statefulset/deployment. 
 
 Change the IP of your cluster
 -----------------------------
@@ -77,6 +162,7 @@ following command that will apply the changes to all installed kustomizations:
 
    flux get -A kustomizations --no-header | awk -F' ' '{system("flux reconcile -n " $1 " kustomization " $2)}'
 
+.. _Velero's GitHub release page: https://github.com/vmware-tanzu/velero/releases/latest
 .. _Velero’s documentation: https://velero.io/docs/v1.4/
 .. _reach out to us: https://stackspin.net/contact.html
 .. _taints: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
diff --git a/flux2/apps/nextcloud/nextcloud-values-configmap.yaml b/flux2/apps/nextcloud/nextcloud-values-configmap.yaml
index 63357ab60299f63624023d6950041de15b5e0e73..ae60f00430ce79570ee3f8449a9d40d1dd759ff8 100644
--- a/flux2/apps/nextcloud/nextcloud-values-configmap.yaml
+++ b/flux2/apps/nextcloud/nextcloud-values-configmap.yaml
@@ -55,9 +55,13 @@ data:
       enabled: true
       existingClaim: "nextcloud-files"
 
+    deploymentLabels:
+      stackspin.net/backupSet: "nextcloud"
+    podLabels:
+      stackspin.net/backupSet: "nextcloud"
     podAnnotations:
       # Let the backup system include nextcloud data.
-      backup.velero.io/backup-volumes: "nextcloud-data"
+      backup.velero.io/backup-volumes: "nextcloud-main"
 
     # Explicitly disable use of internal database
     internalDatabase:
@@ -83,7 +87,7 @@ data:
       enabled: true
       architecture: standalone
       primary:
-        annotations:
+        podAnnotations:
           # Let the backup system include nextcloud database data.
           backup.velero.io/backup-volumes: "data"
         persistence:
@@ -97,6 +101,8 @@ data:
         requests:
           cpu: 100m
           memory: 256Mi
+      commonLabels:
+        stackspin.net/backupSet: "nextcloud"
 
     apps:
       - name: sociallogin
@@ -138,10 +144,15 @@ data:
           - "office.${domain}"
         secretName: stackspin-nextcloud-office
       jwtSecret: "${onlyoffice_jwt_secret}"
-
     persistence:
       enabled: true
       existingClaim: "nextcloud-onlyoffice-data"
+    deploymentLabels:
+      stackspin.net/backupSet: "nextcloud"
+    podLabels:
+      stackspin.net/backupSet: "nextcloud"
+    podAnnotations:
+      backup.velero.io/backup-volumes: "onlyoffice-data"
 
     postgresql:
       postgresqlPassword: "${onlyoffice_postgresql_password}"
@@ -155,6 +166,11 @@ data:
       persistence:
         enabled: true
         existingClaim: "nextcloud-postgresql"
+      primary:
+        podAnnotations:
+          backup.velero.io/backup-volumes: "data"
+      commonLabels:
+        stackspin.net/backupSet: "nextcloud"
 
     rabbitmq:
       auth:
diff --git a/flux2/apps/nextcloud/pvc.yaml b/flux2/apps/nextcloud/pvc.yaml
index 76adebaebe8cb9924530e5b865d268698368a944..c64feb121c5909f076768c116903f5e1b68a50e4 100644
--- a/flux2/apps/nextcloud/pvc.yaml
+++ b/flux2/apps/nextcloud/pvc.yaml
@@ -4,6 +4,8 @@ kind: PersistentVolumeClaim
 metadata:
   name: nextcloud-files
   namespace: stackspin-apps
+  labels:
+    stackspin.net/backupSet: "nextcloud"
 spec:
   accessModes:
     - ReadWriteOnce
@@ -18,6 +20,8 @@ kind: PersistentVolumeClaim
 metadata:
   name: nextcloud-mariadb
   namespace: stackspin-apps
+  labels:
+    stackspin.net/backupSet: "nextcloud"
 spec:
   accessModes:
     - ReadWriteOnce
@@ -32,6 +36,8 @@ kind: PersistentVolumeClaim
 metadata:
   name: nextcloud-postgresql
   namespace: stackspin-apps
+  labels:
+    stackspin.net/backupSet: "nextcloud"
 spec:
   accessModes:
     - ReadWriteOnce
@@ -46,6 +52,8 @@ kind: PersistentVolumeClaim
 metadata:
   name: nextcloud-onlyoffice-data
   namespace: stackspin-apps
+  labels:
+    stackspin.net/backupSet: "nextcloud"
 spec:
   accessModes:
     - ReadWriteOnce
diff --git a/flux2/apps/wekan/pvc.yaml b/flux2/apps/wekan/pvc.yaml
index 7114d1ec8f939b52598a271fdc1f01d9f73d7100..397b8920a43f569af9f9ccba1eda0848fd183a08 100644
--- a/flux2/apps/wekan/pvc.yaml
+++ b/flux2/apps/wekan/pvc.yaml
@@ -4,6 +4,8 @@ kind: PersistentVolumeClaim
 metadata:
   name: wekan
   namespace: stackspin-apps
+  labels:
+    stackspin.net/backupSet: "wekan"
 spec:
   accessModes:
     - ReadWriteOnce
diff --git a/flux2/apps/wekan/wekan-values-configmap.yaml b/flux2/apps/wekan/wekan-values-configmap.yaml
index 521243b7cdd5f275b2179bcb8b09c417a0015758..6e2018e7dbd4cd7215acb5376cdf48084fff5f8a 100644
--- a/flux2/apps/wekan/wekan-values-configmap.yaml
+++ b/flux2/apps/wekan/wekan-values-configmap.yaml
@@ -78,6 +78,10 @@ data:
         secretName: wekan
     autoscaling:
       enabled: false
+    deploymentLabels:
+      stackspin.net/backupSet: "wekan"
+    podLabels:
+      stackspin.net/backupSet: "wekan"
     # https://docs.bitnami.com/kubernetes/infrastructure/mongodb/
     # https://github.com/bitnami/charts/tree/master/bitnami/mongodb#parameters
     mongodb:
@@ -96,11 +100,15 @@ data:
         rootPassword: ${mongodb_root_password}
         password: ${mongodb_password}
       podAnnotations:
-        # Let the backup system include data stored in persistant volumes.
+        # Let the backup system include wekan data stored in mongodb.
         backup.velero.io/backup-volumes: "datadir"
+      podLabels:
+        stackspin.net/backupSet: "wekan"
+      # `labels` are applied by the mongodb chart to the statefulset/deployment.
+      labels:
+        stackspin.net/backupSet: "wekan"
       persistence:
         enabled: true
-        # FIXME: This value is ignored by the chart currently in use
         existingClaim: "wekan"
     resources:
       limits:
backup.velero.io/backup-volumes: "data" resources: @@ -74,7 +79,8 @@ data: requests: cpu: 100m memory: 256Mi - architecture: "standalone" + commonLabels: + stackspin.net/backupSet: "wordpress" # It's advisable to set resource limits to prevent your K8s cluster from # crashing diff --git a/flux2/apps/zulip/zulip-data-pvc.yaml b/flux2/apps/zulip/zulip-data-pvc.yaml index 19fb676fc2666855d407fde9f289853faab5f572..5d98058d49c7d70aca0d3065559157aefc5b9d37 100644 --- a/flux2/apps/zulip/zulip-data-pvc.yaml +++ b/flux2/apps/zulip/zulip-data-pvc.yaml @@ -4,6 +4,8 @@ kind: PersistentVolumeClaim metadata: name: zulip-data namespace: stackspin-apps + labels: + stackspin.net/backupSet: "zulip" spec: accessModes: - ReadWriteOnce diff --git a/flux2/apps/zulip/zulip-postgres-pvc.yaml b/flux2/apps/zulip/zulip-postgres-pvc.yaml index 34e56936996f4c0ac05f6b57ac854265243e099d..65f1f2c87a3923ce031be646732a5dc2934dbb40 100644 --- a/flux2/apps/zulip/zulip-postgres-pvc.yaml +++ b/flux2/apps/zulip/zulip-postgres-pvc.yaml @@ -4,6 +4,8 @@ kind: PersistentVolumeClaim metadata: name: zulip-postgres namespace: stackspin-apps + labels: + stackspin.net/backupSet: "zulip" spec: accessModes: - ReadWriteOnce diff --git a/flux2/apps/zulip/zulip-values-configmap.yaml b/flux2/apps/zulip/zulip-values-configmap.yaml index d09975a0c078d239aa8594d37d75a9e3bac6c416..bcaa309444eb9fe1ef09b3ae1528bb14b3d32bce 100644 --- a/flux2/apps/zulip/zulip-values-configmap.yaml +++ b/flux2/apps/zulip/zulip-values-configmap.yaml @@ -25,6 +25,13 @@ data: - "zulip.${domain}" secretName: stackspin-zulip + statefulSetLabels: + stackspin.net/backupSet: "zulip" + podLabels: + stackspin.net/backupSet: "zulip" + podAnnotations: + backup.velero.io/backup-volumes: "zulip-persistent-storage" + memcached: memcachedPassword: "${memcached_password}" resources: @@ -50,6 +57,8 @@ data: requests: cpu: 100m memory: 32Mi + commonLabels: + stackspin.net/backupSet: "zulip" postgresql: persistence: @@ -62,6 +71,11 @@ data: requests: cpu: 200m memory: 128Mi + primary: + podAnnotations: + backup.velero.io/backup-volumes: "data" + commonLabels: + stackspin.net/backupSet: "zulip" zulip: password: "${zulip_password}" diff --git a/flux2/core/base/single-sign-on/pvc-database.yaml b/flux2/core/base/single-sign-on/pvc-database.yaml index d75e6df3af49a929350a4d37f29ef801635fad50..f92e37c8d210ddb7d05a1996e940b966c5d3c4f6 100644 --- a/flux2/core/base/single-sign-on/pvc-database.yaml +++ b/flux2/core/base/single-sign-on/pvc-database.yaml @@ -3,6 +3,8 @@ apiVersion: v1 kind: PersistentVolumeClaim metadata: name: single-sign-on-database + labels: + stackspin.net/backupSet: "single-sign-on" spec: accessModes: - ReadWriteOnce diff --git a/flux2/core/base/single-sign-on/single-sign-on-database-values-configmap.yaml b/flux2/core/base/single-sign-on/single-sign-on-database-values-configmap.yaml index 8e0ff3551998b3b4e104ab00edb0383ad3f5f5bc..29d1c52e343401a0de8452c7056e42cddff09659 100644 --- a/flux2/core/base/single-sign-on/single-sign-on-database-values-configmap.yaml +++ b/flux2/core/base/single-sign-on/single-sign-on-database-values-configmap.yaml @@ -16,3 +16,8 @@ data: CREATE DATABASE kratos WITH OWNER kratos; CREATE DATABASE hydra WITH OWNER hydra; CREATE DATABASE stackspin WITH OWNER stackspin; + primary: + podAnnotations: + backup.velero.io/backup-volumes: "data" + commonLabels: + stackspin.net/backupSet: "single-sign-on"