OpenAppStack CI machines break after a long time of re-use
The master CI machine sometimes ends up in a broken state that it never recovers from. When this happens, all the hourly CI pipelines start failing.
Some failed pipelines:
- recent:
- https://open.greenhost.net/openappstack/openappstack/pipelines/2244/failures
- https://open.greenhost.net/openappstack/openappstack/pipelines/2243/failures
- https://open.greenhost.net/openappstack/openappstack/pipelines/2240/failures
- https://open.greenhost.net/openappstack/openappstack/pipelines/2241/failures
- older:
I'm not sure whether the "recent" problem is the same as the "older" problem, because they fail at different stages. However, both involve failed helm releases.
For the recent one, I have done some research. Here's the output of kubectl get helmreleases.helm.fluxcd.io single-sign-on -n oas:
# kubectl get helmreleases.helm.fluxcd.io single-sign-on -n oas
NAME RELEASE STATUS MESSAGE AGE
single-sign-on single-sign-on FAILED rpc error: code = Unknown desc = Job.batch "single-sign-on-create-admin-user" is invalid: spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app.kubernetes.io/instance":"single-sign-on", "app.kubernetes.io/managed-by":"Tiller", "controller-uid":"2d282fca-ced6-4494-a3ee-9480870add30", "helm.sh/chart":"single-sign-on-0.2.0", "job-name":"single-sign-on-create-admin-user"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:"create-admin-user", Image:"open.greenhost.net:4567/openappstack/user-panel/backend:1.2.0", Command:[]string{"/bin/bash", "-c"}, Args:[]string{"/bin/bash ./utils/create-user.bash \"$USERNAME\" \"$PASSWORD\" \"$EMAIL\" http://single-sign-on-userbackend:80 && /bin/bash ./utils/create-application.bash user-panel 'Administration interface to manage user accounts' http://single-sign-on-userbackend:80 && /bin/bash ./utils/grant-access.bash \"$USERNAME\" user-panel http://single-sign-on-userbackend:80 && /bin/bash ./utils/create-application.bash nextcloud 'Nextcloud Files offers an on-premise Universal File Access and sync platform with powerful collaboration capabilities and desktop, mobile and web interfaces.' 
http://single-sign-on-userbackend:80 && /bin/bash ./utils/grant-access.bash \"$USERNAME\" nextcloud http://single-sign-on-userbackend:80 && /bin/bash ./utils/create-role.bash admin http://single-sign-on-userbackend:80 && /bin/bash ./utils/assign-role.bash \"$USERNAME\" admin http://single-sign-on-userbackend:80"}, WorkingDir:"", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:"USERNAME", Value:"admin", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:"PASSWORD", Value:"oqkkwSlGnmYzVXvgcCcr", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:"EMAIL", Value:"admin@master.ci.openappstack.net", ValueFrom:(*core.EnvVarSource)(nil)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"Always", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, RestartPolicy:"Never", TerminationGracePeriodSeconds:(*int64)(0xc01bfde290), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", SecurityContext:(*core.PodSecurityContext)(0xc012c399d0), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*core.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(nil)}}: field is immutable 4h39m
For some reason, the single-sign-on helm release has stayed in this state for over four hours.
The helm operator does not recover from this on its own, because it wants the release to be rolled back first:
ts=2020-01-07T14:16:53.380739122Z caller=chartsync.go:328 component=chartsync warning="unable to proceed with release" resource=oas:helmrelease/single-sign-on release=single-sign-on err="release requires a rollback before it can be upgraded (FAILED)"
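To spot this condition without reading the operator log by hand, the warning above could be picked apart with a small script. This is only a sketch: it assumes the operator keeps logging simple key=value lines like the one quoted above, and the helper names parse_logfmt and needs_rollback are made up for illustration. It is not part of OpenAppStack or the helm operator, and it has only been checked against the short warning line, not the much longer upgrade error.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict.

    shlex handles the double-quoted values (warning="...", err="...").
    Lines with unbalanced quotes raise ValueError.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def needs_rollback(line: str) -> bool:
    """True if the line reports a release that is stuck until it is rolled back."""
    return "requires a rollback" in parse_logfmt(line).get("err", "")


# The warning quoted above, copied verbatim from the operator log.
sample = (
    "ts=2020-01-07T14:16:53.380739122Z caller=chartsync.go:328 component=chartsync "
    'warning="unable to proceed with release" '
    "resource=oas:helmrelease/single-sign-on release=single-sign-on "
    'err="release requires a rollback before it can be upgraded (FAILED)"'
)

fields = parse_logfmt(sample)
if needs_rollback(sample):
    print("stuck release:", fields["release"], "since", fields["ts"])
```

A check like this could run at the start of a pipeline, so a stuck release is reported as such instead of surfacing as an unrelated test failure.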
Here's the helm operator error from 4.5 hours ago:
ts=2020-01-07T10:28:40.803252008Z caller=release.go:246 component=release error="Chart upgrade release failed: single-sign-on: &status.statusError{Code:2, Message:\"Job.batch \\\"single-sign-on-create-admin-user\\\" is invalid: spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\\\"\\\", GenerateName:\\\"\\\", Namespace:\\\"\\\", SelfLink:\\\"\\\", UID:\\\"\\\", ResourceVersion:\\\"\\\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\\\"app.kubernetes.io/instance\\\":\\\"single-sign-on\\\", \\\"app.kubernetes.io/managed-by\\\":\\\"Tiller\\\", \\\"controller-uid\\\":\\\"2d282fca-ced6-4494-a3ee-9480870add30\\\", \\\"helm.sh/chart\\\":\\\"single-sign-on-0.2.0\\\", \\\"job-name\\\":\\\"single-sign-on-create-admin-user\\\"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:\\\"\\\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:\\\"create-admin-user\\\", Image:\\\"open.greenhost.net:4567/openappstack/user-panel/backend:1.2.0\\\", Command:[]string{\\\"/bin/bash\\\", \\\"-c\\\"}, Args:[]string{\\\"/bin/bash ./utils/create-user.bash \\\\\\\"$USERNAME\\\\\\\" \\\\\\\"$PASSWORD\\\\\\\" \\\\\\\"$EMAIL\\\\\\\" http://single-sign-on-userbackend:80 && /bin/bash ./utils/create-application.bash user-panel 'Administration interface to manage user accounts' http://single-sign-on-userbackend:80 && /bin/bash ./utils/grant-access.bash \\\\\\\"$USERNAME\\\\\\\" user-panel http://single-sign-on-userbackend:80 && /bin/bash ./utils/create-application.bash nextcloud 'Nextcloud Files offers an on-premise Universal File Access and sync platform with powerful collaboration 
capabilities and desktop, mobile and web interfaces.' http://single-sign-on-userbackend:80 && /bin/bash ./utils/grant-access.bash \\\\\\\"$USERNAME\\\\\\\" nextcloud http://single-sign-on-userbackend:80 && /bin/bash ./utils/create-role.bash admin http://single-sign-on-userbackend:80 && /bin/bash ./utils/assign-role.bash \\\\\\\"$USERNAME\\\\\\\" admin http://single-sign-on-userbackend:80\\\"}, WorkingDir:\\\"\\\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:\\\"USERNAME\\\", Value:\\\"admin\\\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\\\"PASSWORD\\\", Value:\\\"oqkkwSlGnmYzVXvgcCcr\\\", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:\\\"EMAIL\\\", Value:\\\"admin@master.ci.openappstack.net\\\", ValueFrom:(*core.EnvVarSource)(nil)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\\\"/dev/termination-log\\\", TerminationMessagePolicy:\\\"File\\\", ImagePullPolicy:\\\"Always\\\", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, RestartPolicy:\\\"Never\\\", TerminationGracePeriodSeconds:(*int64)(0xc01bfde290), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\\\"ClusterFirst\\\", NodeSelector:map[string]string(nil), ServiceAccountName:\\\"\\\", AutomountServiceAccountToken:(*bool)(nil), NodeName:\\\"\\\", SecurityContext:(*core.PodSecurityContext)(0xc012c399d0), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\\\"\\\", Subdomain:\\\"\\\", Affinity:(*core.Affinity)(nil), SchedulerName:\\\"default-scheduler\\\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\\\"\\\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), 
DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(nil)}}: field is immutable\", Details:[]*any.Any(nil), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}"