Release Manager Runbook
Pointers for the release manager
Emergency Shutdown (Red Button)
If ingest needs to be shut down immediately, scale all deployments in the environment to 0 replicas.
This procedure can also be used as a reset if the cluster is overloaded and pods are failing en masse.
kubectl scale --replicas=0 deployments --all
To restore, reapply the deployments and allow the pods to scale back to their previous level.
kubectl apply -f ./deployments/
WARNING - Avoid scaling down the StatefulSets; this should not be necessary. If you do scale down the StatefulSets, also scale down the deployments, and make sure the StatefulSets are restored and fully running before restoring the deployments.
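Scaling to 0 discards the replica counts the deployments were running with. As a fallback for the restore step, the counts can be recorded before the shutdown. This is a minimal sketch, not part of the established procedure: the function name `replicas_to_scale_commands` and the file `replicas.txt` are hypothetical, and it assumes the deployments live in the current namespace.

```shell
# Sketch only: function and file names are illustrative, not part of the runbook.

# Reads "NAME REPLICAS" lines on stdin and prints one `kubectl scale`
# command per deployment, skipping lines without a numeric replica count.
replicas_to_scale_commands() {
  awk '$2 ~ /^[0-9]+$/ { printf "kubectl scale deployment/%s --replicas=%s\n", $1, $2 }'
}

# Before the shutdown, record the current replica counts:
#   kubectl get deployments --no-headers \
#     -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas > replicas.txt
# Later, print the restore commands and review them before running any:
#   replicas_to_scale_commands < replicas.txt
```

The function only prints commands, so nothing is changed on the cluster until a reviewed command is run by hand.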
Cluster Failure
If nodes in the cluster fail, we are likely to see pods with the status Unknown, e.g.
kubectl get pods -o wide | grep Unknown
ingest-accessioner-5b898964d7-qlflf 0/1 Unknown 2 22h 100.96.4.5 ip-172-20-110-97.ec2.internal
ingest-broker-69b4447778-t7w88 0/1 Unknown 0 22h <none> ip-172-20-110-97.ec2.internal
This appears to be a known bug attributed to the Docker daemon failing on a node; the suggested solution is to restart all nodes. The suggested cause is too many pods restarting at the same time, which leaves a node out of resources.
Solution
In the AWS EC2 console, stop the failing node or, if it cannot be identified, stop all nodes. The autoscaling group will then create new nodes. It will take approximately 10 minutes for the cluster to become ready and for the pods to be created.
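To help identify the failing node, the nodes hosting pods in status Unknown can be listed from the `kubectl get pods -o wide` output. This is a rough sketch: the function name `unknown_pod_nodes` is hypothetical, and it assumes the column layout of the sample output above, with STATUS in column 3 and NODE in column 7.

```shell
# Sketch only: the function name is illustrative.

# Reads `kubectl get pods -o wide` output on stdin and prints each node
# that hosts at least one pod in status Unknown (one node per line).
# Assumes STATUS is column 3 and NODE is column 7, as in the sample above.
unknown_pod_nodes() {
  awk '$3 == "Unknown" { print $7 }' | sort -u
}

# Usage:
#   kubectl get pods -o wide | unknown_pod_nodes
```

If several nodes are listed, or none can be singled out, fall back to stopping all nodes as described above.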
Debugging Failure
Look here if you experience an error, then check the state of the Kubernetes cluster.
These are the errors to expect if services are unavailable.
Below is a flowchart that helps debug errors.
[[/images/debug-flowchart.png]]
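One way to check the state of the cluster is to list the deployments that have no available replicas. This is a minimal sketch under assumptions: the function name `unavailable_deployments` is hypothetical, and it relies on `kubectl` printing `<none>` in a custom column when `.status.availableReplicas` is unset.

```shell
# Sketch only: the function name is illustrative.

# Reads "NAME AVAILABLE" lines on stdin and prints the deployments that
# report no available replicas (AVAILABLE is 0 or <none>).
unavailable_deployments() {
  awk '$2 == "<none>" || $2 == "0" { print $1 }'
}

# Usage:
#   kubectl get deployments --no-headers \
#     -o custom-columns=NAME:.metadata.name,AVAILABLE:.status.availableReplicas \
#     | unavailable_deployments
```

Any deployment it prints can then be matched against the per-service symptoms in the sections that follow.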
Ingest Accessioner Unavailable
UI
The submission and all metadata will appear “stuck” in draft.
Integration Test
Test will time out with submission in draft.
0:00:41 WAIT FOR VALIDATION...
0:00:41 envelope status is Draft
Ingest Broker Unavailable
UI
Uploading a spreadsheet will take over 30 seconds and fail with the message:
An error occurred in uploading spreadsheet
HttpErrorResponse: Http failure response for (unknown url): 0 Unknown Error
Integration Test
Uploading spreadsheet will fail:
0:00:00 CREATING SUBMISSION with Q4DemoSS2Metadata_v5_plainHeaders_new.xlsx...
...
http.client.RemoteDisconnected: Remote end closed connection without response
Ingest Core Unavailable
UI
The list of submissions on the welcome page will be stuck at:
Loading your submissions...
Attempting to upload a spreadsheet will fail quickly with the message:
We experienced a problem while uploading your spreadsheet
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
Integration Test
Uploading spreadsheet will fail:
RuntimeError: POST http://ingest.dev.data.humancellatlas.org/api_upload response was 500: b'{"details": "(\'Connection aborted.\', RemoteDisconnected(\'Remote end closed connection without response\',))", "message": "We experienced a problem while uploading your spreadsheet"}'
Ingest Exporter Unavailable
UI
TBD
Integration Test
TBD
Ingest Ontology Unavailable
UI
No symptoms.
Integration Test
No symptoms.
Ingest Staging Manager Unavailable
UI
The submission will remain in draft. ‘Upload Area Location’ in the UI remains blank.
Integration Test
Test will fail when waiting for staging area:
0:00:00 CREATING SUBMISSION with Q4DemoSS2Metadata_v5_plainHeaders_new.xlsx...
0:00:02 submission ID is 5b1a845684eb570008dca3e0
0:00:02 WAITING FOR STAGING AREA...
...
RuntimeError: Function _get_upload_area_credentials did not return a non-None value within 60 seconds
Ingest State Tracking Unavailable
UI
The submission will appear “stuck” in Pending. Metadata will continue to move from draft into valid or invalid.
Integration Test
The test will time out waiting for the submission to come out of Pending
0:02:40 envelope status is Pending
0:02:40 .
Ingest Validator Unavailable
UI
The submission will appear “stuck” in Pending. Metadata will remain “stuck” in Draft.
Integration Test
The test will time out waiting for the submission to come out of Pending
0:02:40 envelope status is Pending
0:02:40 .
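The integration test signatures listed in the sections above can be sketched as a small triage helper. This is a heuristic built only from the sample outputs in this runbook, not an exhaustive diagnosis, and the function name `guess_unavailable_service` is hypothetical.

```shell
# Sketch only: heuristics are taken from the sample outputs in this runbook.

# Reads an integration test log on stdin and prints the service most
# likely to be unavailable. The body runs in a subshell, so `log` does
# not leak into the calling shell.
guess_unavailable_service() (
  log="$(cat)"
  case "$log" in
    *"_get_upload_area_credentials did not return"*)
      echo "Ingest Staging Manager" ;;
    *"response was 500"*"Connection aborted"*)
      echo "Ingest Core" ;;
    *"http.client.RemoteDisconnected"*)
      echo "Ingest Broker" ;;
    # Checked before Draft: a run stuck in Pending may also log Draft earlier.
    *"envelope status is Pending"*)
      echo "Ingest State Tracking or Ingest Validator" ;;
    *"envelope status is Draft"*)
      echo "Ingest Accessioner" ;;
    *)
      echo "unknown" ;;
  esac
)
```

The test log alone cannot separate State Tracking from Validator. Use the UI symptoms for that: with State Tracking down, metadata still moves to valid or invalid, whereas with Validator down it stays in Draft.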
Suggestions to make the Release Manager role easier
- It is difficult to track which issues have been resolved in which environments. Consider using ZenHub to track which features have reached each environment.
- Centralise configuration with parameterisation (see issue #5).
- Consider tagging the master in quay.io with integration instead of building a new container.
  - This would mean we would not need to wait for quay.io to build.
  - This may mean that we need to filter the branches in quay.io that trigger builds.
- Consider pushing binaries such as core to a package repo before creating containers. We could use multi-stage Docker builds to achieve this.
  - This would speed up deployment.