# Troubleshooting

Note: `cluster$` indicates that the commands should be run as root on your OAS cluster.

## Run the cli tests

To get an overall status of your cluster you can run the tests from the
command line.

There are two types of tests: [testinfra](https://testinfra.readthedocs.io/en/latest/)
tests, and [behave](https://behave.readthedocs.io/en/latest/) tests.

### Testinfra tests

Testinfra tests are split into two groups, let's call them `blackbox` and
`whitebox` tests. The blackbox tests run on your provisioning machine and test
the OAS cluster from the outside. For example, the certificate check verifies
that OAS returns valid certificates for the provided services.
The whitebox tests run on the OAS host and check, for example, whether docker
is installed in the right version.
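
As an illustration of what a blackbox check does, here is a minimal sketch that fetches a host's certificate from the outside and computes how many days remain until it expires. It is not the actual OAS test code: the function names `fetch_peer_cert` and `days_until_expiry` are hypothetical, and only the Python standard library is assumed.

```python
import socket
import ssl
import time


def days_until_expiry(cert, now=None):
    """Return the number of whole days until the certificate's notAfter date."""
    # `cert` is a dict as returned by SSLSocket.getpeercert().
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    now = time.time() if now is None else now
    return int((expires - now) // 86400)


def fetch_peer_cert(hostname, port=443, timeout=10):
    """Connect to hostname:port and return the validated peer certificate."""
    # The default context verifies the certificate chain and the hostname,
    # so an untrusted certificate makes the handshake fail with an exception.
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()
```

For example, `days_until_expiry(fetch_peer_cert("example.openappstack.net"))` would return the remaining validity in days, while the connection itself fails with an `ssl.SSLCertVerificationError` if the certificate cannot be validated.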

To run the tests against your cluster, first export the `CLUSTER_DIR` environment
variable with the location of your cluster config directory:

    export CLUSTER_DIR="../clusters/CLUSTERNAME"

Run all tests:

    py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*'

#### Advanced usage

Specify host manually:

    py.test -s --hosts='ssh://root@example.openappstack.net'

Run only tests tagged with `prometheus`:

    py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m prometheus

Run cert test manually using the ansible inventory file:

    py.test -s --ansible-inventory=${CLUSTER_DIR}/inventory.yml --hosts='ansible://*' -m certs

Run cert test manually against a different cluster, not configured in any
ansible inventory file, either by using pytest:

    FQDN='example.openappstack.net' py.test -sv -m 'certs'

or directly:

    FQDN='example.openappstack.net' pytest/test_certs.py

#### Known Issues

- The default ssh backend for testinfra tests is `paramiko`, which doesn't work
  out of the box. It fails to connect to the host because the `ed25519` host key
  was not verified. Therefore we need to force plain ssh:// with either
  `connection=ssh` or `--hosts=ssh://…`

#### Running tests with local gitlab-runner docker executor

Export the following environment variables like this:

    export CI_REGISTRY_IMAGE='open.greenhost.net:4567/openappstack/openappstack'
    export SSH_PRIVATE_KEY="$(cat ~/.ssh/id_ed25519_oas_ci)"
    export COSMOS_API_TOKEN='…'

then:

    gitlab-runner exec docker --env CI_REGISTRY_IMAGE="$CI_REGISTRY_IMAGE" --env SSH_PRIVATE_KEY="$SSH_PRIVATE_KEY" --env COSMOS_API_TOKEN="$COSMOS_API_TOKEN" bootstrap


### Behave tests

Behave tests run in a headless browser and test if all the interfaces are up
and running and correctly connected to each other. They are integrated in the
`openappstack` CLI command suite.
To run the behave tests, run the following command in this repository:

    python -m openappstack CLUSTERNAME test

In the future, this command will run all tests, but for now only *behave* is
implemented. To learn more about the `test` subcommand, run:

    python -m openappstack CLUSTERNAME test --help


If you encounter problems when you upgrade your cluster, please first make sure
to add any new values from `ansible/group_vars/all/settings.yml.example`
to your `clusters/YOUR_CLUSTERNAME/group_vars/all/settings.yml`, and rerun the installation.
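
To spot values that exist in the example file but not yet in your own settings, a comparison along these lines may help. This is only a sketch: the helper names `top_level_keys` and `missing_keys` are hypothetical, and it assumes that settings are written as unindented `key:` lines. It does not use a real YAML parser, so nested keys are ignored.

```python
import re

# Matches an unindented "key:" at the start of a line (a top-level YAML key).
_TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def top_level_keys(yaml_text):
    """Return the set of top-level keys found in a YAML document's text."""
    keys = set()
    for line in yaml_text.splitlines():
        match = _TOP_LEVEL_KEY.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_keys(example_text, settings_text):
    """Return keys present in the example file but absent from your settings."""
    return sorted(top_level_keys(example_text) - top_level_keys(settings_text))
```

Reading both files with `pathlib.Path(...).read_text()` and passing the results to `missing_keys` yields a sorted list of keys you may still need to add.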
## HTTPS Certificates

OAS uses [cert-manager](http://docs.cert-manager.io/en/latest/) to automatically
fetch [Let's Encrypt](https://letsencrypt.org/) certificates for all deployed
services. If you experience invalid SSL certificates, e.g. your browser warns you
when visiting Nextcloud (`https://files.YOUR.CLUSTER.DOMAIN`), here's how to
debug this:

Did you create your cluster using the `--acme-staging` argument?
Please check the resulting value of the `acme_staging` key in
`clusters/YOUR_CLUSTERNAME/group_vars/all/settings.yml`. If this is set to `true`, certificates
are fetched from the [Let's Encrypt staging API](https://letsencrypt.org/docs/staging-environment/),
which can't be validated by default in your browser.
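
The check of the `acme_staging` key could be scripted roughly as follows. This is a sketch under assumptions: the function name `acme_staging_enabled` is hypothetical, the key is assumed to be an unindented top-level entry, and the values `true`, `yes` and `on` are treated as enabled.

```python
import re


def acme_staging_enabled(settings_text):
    """Return True/False for the top-level `acme_staging` key, or None if absent."""
    match = re.search(r"^acme_staging\s*:\s*(\S+)", settings_text, re.MULTILINE)
    if match is None:
        return None
    # Strip optional quotes and compare case-insensitively.
    value = match.group(1).strip("'\"").lower()
    return value in ("true", "yes", "on")
```

Passing the contents of `clusters/YOUR_CLUSTERNAME/group_vars/all/settings.yml` to this function returns `True` when staging certificates are in use.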

Are all cert-manager pods in the `oas` namespace in the `READY` state?

    cluster$ kubectl -n oas get pods | grep cert-manager
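
If you want to evaluate that output automatically, the following sketch parses the text printed by `kubectl -n oas get pods` and lists pods that are not fully ready. The function name `not_ready_pods` is hypothetical, and the code assumes the default column layout with `NAME` first and `READY` (for example `1/1`) second.

```python
def not_ready_pods(kubectl_output, name_filter="cert-manager"):
    """Return names of matching pods whose READY column is not complete."""
    pods = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip the header row, blank lines and pods that do not match the filter.
        if len(fields) < 2 or fields[0] == "NAME" or name_filter not in fields[0]:
            continue
        # READY looks like "ready/total"; the pod is ready when both are equal.
        ready, _, total = fields[1].partition("/")
        if ready != total:
            pods.append(fields[0])
    return pods
```

An empty list means every matching pod reports all of its containers as ready. Note that pods of finished jobs also show `0/1` and would be listed.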

Are there any `cm-acme-http-solver-*` pods still running, indicating that there
are unfinished certificate requests?

    cluster$ kubectl get pods --all-namespaces | grep cm-acme-http-solver

Show the logs of the main `cert-manager` pod:

    cluster$ kubectl -n oas logs -l "app.kubernetes.io/name=cert-manager"

You can `grep` for your cluster domain or for any specific subdomain to narrow
down results.
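
The same narrowing can be scripted. The sketch below wraps the `kubectl` logs command shown above and filters its output for a domain. The function names `cert_manager_logs` and `filter_lines` are hypothetical, and the code assumes `kubectl` is available on the machine where it runs.

```python
import subprocess


def filter_lines(log_text, needle):
    """Return the log lines that contain `needle` (case-insensitive)."""
    needle = needle.lower()
    return [line for line in log_text.splitlines() if needle in line.lower()]


def cert_manager_logs(namespace="oas"):
    """Fetch the cert-manager logs using the kubectl command from this section."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "logs", "-l", "app.kubernetes.io/name=cert-manager"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `filter_lines(cert_manager_logs(), "files.YOUR.CLUSTER.DOMAIN")` keeps only the lines that mention the Nextcloud subdomain.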


## Purge OAS and install from scratch

If ever things fail beyond possible recovery, here's how to completely purge an OAS installation in order to start from scratch:

    cluster$ apt purge docker-ce-cli containerd.io
    cluster$ mount | grep -E '^(tmpfs.*kubelet|nsfs.*docker)' | cut -d' ' -f 3 | xargs umount
    cluster$ systemctl reboot
    cluster$ rm -rf /var/lib/docker /var/lib/OpenAppStack /etc/kubernetes /var/lib/etcd /var/lib/rancher /var/lib/kubelet /var/log/OpenAppStack /var/log/containers /var/log/pods