Data retraction SOP
Purpose of this document
This document captures the steps needed to perform a full cleanup of a dataset brokered by the Data Wrangling team.
Overview
Before you start
Most of the steps only need wrangler credentials, but some must be carried out by a developer from the ingest team. If a wrangler is responsible for this task, please contact a dev and assign them to help with those specific steps.
Needed permissions
Please make sure you have the proper permissions, or that you are paired with someone who has. The needed permissions are:
- Access to the EC2 instance: any wrangler/dev should have access.
- Admin access to S3 buckets: any wrangler/dev should have access. If you don't have access, set up your credentials by running
aws configure
and providing your wrangler/ingest S3 access key and secret access key.
- Dev access to Ingest Core: every dev should have access.
- Access to the subs.team-2 AAP domain: retrieve the ingest DSP user and password with the following command:
aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/dev/secrets --query SecretString --output text | jq -jr .archiver_api_key
Only developers are able to access Secrets Manager.
Set up gsutil to access staging area
If you already have gsutil set up, skip this step. Otherwise, follow the instructions here:
https://ebi-ait.github.io/hca-ebi-dev-team/admin_setup/Setting-up-access-to-Terra-staging-area.html
Set up hca-util with wrangler credentials
Please also make sure that you have the hca-util set up with the wrangler credentials:
hca-util configure
And input the wrangler or ingest credentials.
Open an issue in GitHub
Finally, please open an issue in the wrangler repo using the Dataset Retraction template, and tick the boxes as you go through the retraction process.
Write down accessions/UUIDs/GitHub issue:
You will need to locate the following:
- DCP project UUID and project shortname: locate the project in the UI and copy the project UUID and shortname.
- DCP submission UUID: locate the submission in the UI and copy the submission UUID.
- ENA/BioStudies study/project accession: inside the submission in the UI, open the accessions tab and copy the study and project accessions.
- DSP submission UUID: locate the DSP submission UUID following the instructions in point three of the corresponding section of this document.
- GH issue: locate the GitHub issue in the hca-ebi-wrangler-central repository and write down the URL. Note that it may be closed.
Internal
This section covers data and metadata stored within our immediate reach, where deletion has no downstream consequences.
Amazon buckets/EC2 instance
Currently, we store contributor data in the cloud before and while brokering it. The places it needs to be retracted from are:
- hca-util contributor areas: locate the area and delete it:
hca-util list -b
hca-util select <area>
hca-util delete -d
- Ingest upload area: this task can only be carried out by a developer. Send a DELETE request to the Upload Service as follows:
# retrieve the Upload Service Api-Key (dev)
aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/dev/secrets --query SecretString --output text | jq -jr .staging_api_key
# delete an upload area (dev)
curl -X DELETE "https://upload.dev.archive.data.humancellatlas.org/v1/area/<upload_area_uuid>" -H "accept: application/json" -H "Api-Key: <API-KEY>"
Please make sure that the submission was not tested in other environments. If in doubt, ask the primary wrangler (stated in the GH ticket/dataset tracking sheet).
- EC2 instance: if the data has been downloaded to the EC2 instance for some reason (e.g. validation), make sure to remove it. If the person tasked with the retraction is not the primary/secondary wrangler, the wrangler-team email later in this document covers this.
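If you are unsure where the data was downloaded on the EC2 instance, a minimal search sketch follows; the shortname pattern and the search root are assumptions, not a fixed convention, so adjust them to your setup.

```shell
# Hypothetical: search the home directory for folders matching the project
# shortname; adjust the root and depth to where data is usually downloaded.
PROJECT_SHORTNAME="<project_shortname>"
find "$HOME" -maxdepth 3 -type d -iname "*${PROJECT_SHORTNAME}*" 2>/dev/null
# Once you have confirmed the directory is the right one, remove it:
# rm -r "$HOME/<path_to_project_data>"
```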
EBI Cluster
When archiving, if the data had to be converted to BAM files, the converted files will have been saved in the folder used for that purpose. To delete this data:
ssh noah-login
rm -r /nfs/production/hca/<name_of_the_folder>/
Tips for finding the folder: best practice is to name the folder with the project's shortname. If you know this dataset was archived but there is no apparent folder, contact the primary wrangler for the project; the primary wrangler's name can be found in the GitHub ticket.
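When scanning /nfs/production/hca/ for the folder, a case-insensitive match helps catch naming variations. A sketch (the listing path is from the command above; the toy demo just illustrates the grep behaviour):

```shell
# After `ssh noah-login`, list candidate folders case-insensitively:
#   ls /nfs/production/hca/ | grep -i "<project_shortname>"
# grep -i matches regardless of capitalisation; a toy demonstration:
printf 'HeartAtlas\nRetinaOrganoid\n' | grep -i 'heartatlas'
# -> HeartAtlas
```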
Staging area
- Find the directory of the project in the Terra staging area GCP bucket. The GCP bucket locations are configured in the <env>.yaml files in ingest-kube-deployment/apps.
# Example project directory in dev: gs://broad-dsp-monster-hca-dev-ebi-staging/dev/<project-uuid>
- Remove the project UUID subdirectory:
gsutil -m rm -r gs://<bucket-parent-directory>/<project-uuid>
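Before deleting, it is worth listing the directory to confirm it is the right one. A sketch using the dev bucket from the example above (swap in the bucket for your environment):

```shell
# Build the target path; the bucket below is the dev example, adjust as needed.
STAGING_BUCKET="gs://broad-dsp-monster-hca-dev-ebi-staging/dev"
PROJECT_UUID="<project-uuid>"
TARGET="${STAGING_BUCKET}/${PROJECT_UUID}"
echo "Target: ${TARGET}"
# 1. List first to confirm the directory contents:
#    gsutil ls "${TARGET}/"
# 2. Then remove it recursively:
#    gsutil -m rm -r "${TARGET}"
```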
Spreadsheet
The metadata spreadsheet can be in many different locations. Please make sure you delete each and every copy that you have:
- Google Drive folder: locate the project folder under brokering and delete it in Google Drive.
- Email:
  - Head to the Wrangler team Google group and delete any email thread that may contain the spreadsheet.
  - Send an email to the wrangler team.
  - Delete your own local copies of the thread in your mail apps.
- Please ask a dev to delete the copy of the spreadsheet associated with the submission in the ingest database.
Instructions for dev to follow:
The spreadsheets are saved in the ingest-broker pod under /data/spreadsheets. Log into the pod as follows and delete the required spreadsheet(s):
# switch to the environment's (e.g. prod) context
kubectx ingest-eks-<env>
# get the ingest-broker pod name
kubectl get pods | grep broker | awk '{print $1}'
# open a shell in the pod
kubectl exec -it <ingest-broker-pod> -- /bin/sh
# inside the pod, list and delete the spreadsheet(s), e.g.:
# ls /data/spreadsheets
# rm /data/spreadsheets/<spreadsheet_filename>
Ingest
UI
Metadata and data can be found in the ingest submission. If you have followed the steps above, the data files and the metadata spreadsheet should already be removed, but the metadata is still in the database.
Before deleting the submission, please remember to retrieve the following:
- BioStudies accession
- ENA study/project accession
- Ingest project/submission UUID
- DSP submission UUID
Also, please run the deletion of files in the staging area first: those files can only be mapped by crawling the ingest API.
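Crawling the ingest API for the file list could look roughly like the sketch below. The search endpoint and HAL field names follow ingest's usual Spring Data REST conventions but are assumptions here, so verify them against the live API before relying on them:

```shell
# Hypothetical crawl: find the submission envelope by UUID, then follow its
# "files" link to list the staged file names.
INGEST_API_URL="<INGEST_API_URL>"
SUBMISSION_UUID="<submission-uuid>"
SEARCH_URL="${INGEST_API_URL}/submissionEnvelopes/search/findByUuidUuid?uuid=${SUBMISSION_UUID}"
echo "Search: ${SEARCH_URL}"
# FILES_URL=$(curl -s "${SEARCH_URL}" | jq -r '._links.files.href')
# curl -s "${FILES_URL}" | jq -r '._embedded.files[].fileName'
```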
Once these have been retrieved, you can proceed to ask a dev to delete the submission and the project, where applicable.
Instructions for dev to follow:
# Force delete a submitted submission
curl -X DELETE -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/submissionEnvelopes/<SUBMISSION-ID>?force=true
Metadata archiver
The metadata archived through ingest's archiver is stored in the archiver endpoints. A developer is needed for the following steps:
1. Check the _links.self.href field in the returned JSON:
curl -X GET -H "Accept: application/hal+json" <INGEST_API_URL>/archiveSubmissions/search/findByDspUuid?dspUuid=<DSP-SUBMISSION-UUID>
2. Send a delete request to delete the DSP metadata being tracked in Ingest DB
curl -X DELETE -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/archiveSubmissions/<ARCHIVE-SUBMISSION-ID>
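The two steps above can be chained with jq, extracting _links.self.href from the search response before issuing the DELETE. Treat this as a sketch and check the response shape first:

```shell
# Hypothetical chaining of the lookup and delete steps.
INGEST_API_URL="<INGEST_API_URL>"
DSP_UUID="<DSP-SUBMISSION-UUID>"
SEARCH_URL="${INGEST_API_URL}/archiveSubmissions/search/findByDspUuid?dspUuid=${DSP_UUID}"
echo "Search: ${SEARCH_URL}"
# SELF_HREF=$(curl -s -H "Accept: application/hal+json" "${SEARCH_URL}" \
#   | jq -r '._links.self.href')
# curl -X DELETE -H "Authorization: Bearer <TOKEN>" "${SELF_HREF}"
```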
External
Depending on the route the dataset took through the system, there might be data and metadata to retract from sources external to the ingestion team.
DCP2
- Pipelines team
- Terra repository
- Data browser
Please contact them using the email template provided below.
Archives
Depending on the archiving procedures followed, the dataset might be in several different places:
- ENA: Please use the email template provided to contact ENA helpdesk.
- ArrayExpress: Please use the email template provided to contact AE directly. If this dataset has been included in SCEA, please also add a line for them to contact SCEA.
- BioSamples: follow the DSP section below to remove all the metadata from the BioSamples entities associated with this project.
- BioStudies: Please contact BioStudies Helpdesk with the email template provided
DSP
While DSP is not completely internal, we do have the ability to update the BioSamples submission within it.
The steps are:
- Remove all metadata within the BioSamples entities, leaving only the description field with the following text:
Unfortunately, this metadata can't be reached as it has been withdrawn from public domain.
Also update the release date to 100 years into the future.
- Update the submission.
- Clone the hca-to-dsp-tools repository and install the requirements:
git clone https://github.com/ebi-ait/hca-to-dsp-tools.git
cd hca-to-dsp-tools
pip install -r requirements.txt
- Set up your credentials in a text file called cred.txt at the root of the repository. The file must have the following structure:
USER=<dsp_username>
PASSWORD=<dsp_password>
ROOT=https://submission.ebi.ac.uk/api/
- Open a Python console at the root of the repository and run:
import dsp_cli.DSP_submission as ds
dsp = ds.DspCLI()
dsp.select_submission()
It will ask which team you want to use and which submission should be selected. The team is subs.team-2 and the submission is the one with the ID you gathered before starting. Please keep this session open, as all the following steps are carried out in it.
- Once the submission is selected, retrieve the samples:
samples = dsp.show_submittable_names('samples')
It may take a while, depending on the number of samples. When you're asked if you want to print the names, input n.
- Create a new submission and select it:
dsp.create_submission()
dsp.select_submission()
Pro-tip: Newly-created submissions are always listed last.
- Modify the metadata to delete all the information and set the release date to 100 years into the future:
import datetime
hundred_years = (datetime.datetime.now() + datetime.timedelta(days=100*365)).date()
for sample in samples:
    sample['attributes'] = {}
    sample['description'] = "Unfortunately, this metadata can't be reached as it has been withdrawn from public domain."
    sample['releaseDate'] = str(hundred_years)
- Load that data in the new submission:
for sample in samples:
    dsp.create_submittable(sample, 'samples')
- Finish the submission:
dsp.finish_submission()
- Close your Python session.
- Ask DSP to delete the original submission from their records using the email template provided below.
Other
Email templates
Please modify the templates with the dataset-specific details before sending them.
Wrangler-team email
Click on the following link to send the email: Email the wrangler team
Hello
We have been asked to retract the dataset "<shortname_of_dataset>".
Please delete your own copies of the following emails:
- <title_of_email_thread>
(more if necessary)
Please remove any local copies you may have of spreadsheets associated with "<shortname_of_dataset>".
If you were involved in wrangling this dataset, please ensure no local copies of the data/metadata are left on the EC2 instance.
Many thanks for your cooperation.
Best regards,
<Wrangler name>
DCP team
Click on the following link to send the email: Email the DCP
Hello,
We have been asked to retract the dataset "<shortname_of_dataset>", with project UUID "<project_uuid>".
Please delete any copies of the data and metadata that you might have on your system.
Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.
Best regards,
<Wrangler name>
ENA
Click on the following link to send the email: Email ENA
Hello,
We have been asked to retract the dataset "<ENA_study_accession>"
We were responsible for brokering the dataset into the ENA database, so please delete any copies of the data and metadata that you might have on your system.
Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.
Best regards,
<Wrangler name>
ArrayExpress
Click on the following link to send the email: Email AE
Hello,
We have been asked to retract the dataset "<ArrayExpress_project_accession>"
Please delete any copies of the data and metadata that you might have on your system.
Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.
Best regards,
<Wrangler name>
BioStudies
Click on the following link to send the email: Email BioStudies
Hello,
We have been asked to retract the dataset "<BioStudies_project_accession>"
Please delete any copies of the data and metadata that you might have on your system.
Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.
Best regards,
<Wrangler name>
DSP
Click on the following link to send the email: Email DSP
Hello,
We have been asked to retract the submission "<DSP_submission_uuid>"
Please delete any copies of the data and metadata that you might have on your system.
Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.
Best regards,
<Wrangler name>