Data retraction SOP

Table of contents

  1. Purpose of this document
  2. Overview
  3. Before you start
    1. Needed permissions
    2. Set up gsutil to access staging area
    3. Set up hca-util with wrangler credentials
    4. Open an issue in GitHub
    5. Write down accessions/UUIDs/GitHub issue
  4. Internal
    1. Amazon buckets/EC2 instance
    2. EBI Cluster
    3. Staging area
    4. Spreadsheet
    5. Ingest
      1. UI
      2. Metadata archiver
  5. External
    1. DCP2
    2. Archives
    3. DSP
  6. Other
    1. Email templates
      1. Wrangler-team email
      2. DCP team
    2. ENA
    3. ArrayExpress
    4. BioStudies
    5. DSP

Purpose of this document

This document captures the steps needed to fully clean up a dataset brokered by the Data Wrangling team.

Overview

Retracting a dataset involves two kinds of cleanup: internal (AWS buckets and the EC2 instance, the EBI cluster, the Terra staging area, metadata spreadsheets, and Ingest itself) and external (DCP2 components, the public archives, and DSP). The sections below walk through each location in turn; email templates for contacting the external parties are provided at the end.

Before you start

To follow all the steps, a developer working for the ingest team is needed. Most of the steps need only wrangler credentials, but some have to be carried out by a developer. If a wrangler is responsible for this task, please contact a dev and assign them to help with those specific steps.

Needed permissions

Please make sure you have the proper permissions or are paired with someone who does. The needed permissions are:

  1. Access to the EC2 instance: Any wrangler/dev should have access.
  2. Admin access to s3 buckets: Any wrangler/dev should have access. If you don’t have access, please set up your credentials by running aws configure (see the example after this list) and providing your wrangler/ingest s3 access key and secret access key.
  3. Dev access to Ingest Core: Every dev should have access.
  4. Access to subs.team-2 AAP domain: Please retrieve the ingest DSP username and password with the following command:
    aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/dev/secrets --query SecretString --output text | jq -jr .archiver_api_key
    

    Only developers are able to access the secrets manager.
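If you need to (re)configure the AWS credentials from item 2, the AWS CLI prompts for each value interactively; a minimal sketch (the region and output format shown are sensible defaults, not requirements):

aws configure
AWS Access Key ID [None]: <wrangler-or-ingest-access-key-id>
AWS Secret Access Key [None]: <wrangler-or-ingest-secret-access-key>
Default region name [None]: us-east-1
Default output format [None]: json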

Set up gsutil to access staging area

If you already have gsutil set up, please skip this step.

https://ebi-ait.github.io/hca-ebi-dev-team/admin_setup/Setting-up-access-to-Terra-staging-area.html

Set up hca-util with wrangler credentials

Please also make sure that you have hca-util set up with the wrangler credentials:

hca-util configure

and enter the wrangler or ingest credentials when prompted.
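To sanity-check the configuration, you can list the upload areas visible to you (the same command is used in the Amazon buckets section later in this document):

hca-util list -b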

Open an issue in GitHub

Finally, please open an issue in the wrangler repo using the template named Dataset Retraction, and tick the boxes as you go through the retraction process.

Write down accessions/UUIDs/GitHub issue

You will need to locate the following UUIDs:

  • DCP project UUID and project shortname: Locate the project in the UI and copy the project UUID and shortname.
  • DCP submission UUID: Locate the submission in the UI and copy the submission UUID.
  • ENA/BioStudies study/project accession: Inside the submission in the UI, locate the accessions tab and copy the study and project accessions.
  • DSP submission UUID: Locate the DSP submission UUID following the instructions in the DSP section of this document.
  • GitHub issue: Locate the GitHub issue in the hca-ebi-wrangler-central repository and write down the URL. Note that it may be closed.
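For convenience, you can keep these identifiers in a small notes file so they are at hand for the commands later in this SOP; a purely illustrative sketch (every value is a placeholder):

# retraction-notes.sh — all values are placeholders
export PROJECT_UUID="<dcp-project-uuid>"
export PROJECT_SHORTNAME="<project-shortname>"
export SUBMISSION_UUID="<dcp-submission-uuid>"
export ENA_STUDY_ACCESSION="<ena-study-or-project-accession>"
export BIOSTUDIES_ACCESSION="<biostudies-accession>"
export DSP_SUBMISSION_UUID="<dsp-submission-uuid>"
export GH_ISSUE_URL="<github-issue-url>"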

Internal

This section covers data and metadata stored within our immediate reach, where deletion has no downstream consequences.

Amazon buckets/EC2 instance

Currently, we store contributor data in the cloud before and while brokering it. The places it needs to be retracted from are:

  1. hca-util contributor areas: Locate the area and delete it:
     hca-util list -b
     hca-util select <area>
     hca-util delete -d
    
  2. Ingest upload area: This task can only be carried out by a developer. Send a DELETE request to the Upload Service as follows:
# retrieve Upload Service Api-Key (dev)
aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/dev/secrets --query SecretString --output text | jq -jr .staging_api_key

# delete an upload area (dev)
curl -X DELETE "https://upload.dev.archive.data.humancellatlas.org/v1/area/<upload_area_uuid>" -H  "accept: application/json" -H  "Api-Key: <API-KEY>"

Please make sure that the submission was not tested in other environments; if in doubt, please ask the primary wrangler (stated in the GitHub ticket/dataset tracking sheet). A sketch for cleaning up another environment follows this list.

  3. EC2 instance: If this data has been downloaded to the EC2 instance for some reason (e.g. validation), please make sure to remove it. If the person tasked with the retraction is not the primary/secondary wrangler, the email template later in this document covers this.
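If the submission was also uploaded in another environment, the same two calls apply against that environment. A sketch for prod, assuming the secret id and hostname follow the same pattern as dev (please verify both before use):

# retrieve Upload Service Api-Key (prod) — secret id assumed analogous to dev
aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/prod/secrets --query SecretString --output text | jq -jr .staging_api_key

# delete an upload area (prod) — hostname assumed analogous to dev
curl -X DELETE "https://upload.archive.data.humancellatlas.org/v1/area/<upload_area_uuid>" -H "accept: application/json" -H "Api-Key: <API-KEY>"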

EBI Cluster

When archiving, if the data needed to be converted to bam files, it will have been saved in the folder used for that purpose. To delete this data:

ssh noah-login
rm -r /nfs/production/hca/<name_of_the_folder>/

Tips to find the folder: it is usually named after the project shortname. If you know this dataset was archived but there is no apparent folder, please contact the primary wrangler for this project; their name can be found in the GitHub ticket.
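If the exact folder name is unclear, a case-insensitive listing can help narrow it down (a sketch, assuming the parent path used above):

ssh noah-login
ls /nfs/production/hca/ | grep -i "<project_shortname>"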

Staging area

  1. Find the directory of the project in the Terra staging area GCP bucket. The GCP bucket locations are configured in the <env>.yaml files in ingest-kube-deployment/apps.

     # Example project directory in dev:
       
     gs://broad-dsp-monster-hca-dev-ebi-staging/dev/<project-uuid> 
    
  2. Remove the project uuid subdirectory (note the -r flag, needed to remove a directory recursively):

    gsutil -m rm -r gs://<bucket-parent-directory>/<project-uuid>
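Tip: before removing, you can list the directory to confirm it belongs to the right project:

    gsutil ls gs://<bucket-parent-directory>/<project-uuid>/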
    

Spreadsheet

The metadata spreadsheet can be in many different locations. Please make sure you delete each and every copy that you have:

  1. Google Drive folder: Locate the project folder under brokering and delete it in Google Drive.
  2. Email
    1. Head to the Wrangler team Google group and delete any email thread that may contain the spreadsheet.
    2. Send an email to the wrangler team using the template at the end of this document.
    3. Delete your own local copies of the thread in your mail apps.
  3. Ask a dev to delete the copy of the spreadsheet associated with the submission in the ingest database.

Instructions for dev to follow:

The spreadsheets are saved in the ingest-broker pod under /data/spreadsheets. Log into the pod as follows and delete the required spreadsheet(s).

# switch to env (e.g. prod) context
kubectx ingest-eks-<env>
# get ingest-broker-pod name
kubectl get pods | grep broker | awk '{print $1}'
kubectl exec -it <ingest-broker-pod> -- /bin/sh 
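Once inside the pod, locate and delete the spreadsheet(s); a sketch, assuming the filename can be matched by project shortname or submission UUID (the exact naming may differ):

# inside the ingest-broker pod
ls /data/spreadsheets | grep -i <shortname-or-uuid>
rm /data/spreadsheets/<spreadsheet-file>.xlsx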

Ingest

UI

Metadata and data can be found in the ingest submission. If you have followed the steps so far, the data and the metadata spreadsheet should already be removed, but the metadata is still in the database.

Before deleting the submission, please remember to retrieve the following:

  • BioStudies accession
  • ENA study/project accession
  • Ingest project/submission UUID
  • DSP submission UUID

Also, please run the deletion of files in the staging area before deleting the submission: the files in the staging area can only be mapped by crawling the ingest API, which is no longer possible once the submission is gone.

Once these have been retrieved, you can proceed to ask a dev to delete the submission and the project, where applicable.

Instructions for dev to follow:

# Force delete a submitted submission
curl -X DELETE -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/submissionEnvelopes/<SUBMISSION-ENVELOPE-ID>?force=true
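To verify the deletion, a follow-up GET on the same resource should return 404; a quick check (assuming the same token is valid for reads):

curl -o /dev/null -s -w "%{http_code}\n" -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/submissionEnvelopes/<SUBMISSION-ENVELOPE-ID>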

Metadata archiver

The metadata that is archived through ingest’s archiver is tracked under the archiveSubmissions endpoints. A developer is needed for the following steps:

1. Find the archive submission and check the _links.self.href from the JSON response:

curl -X GET -H "Accept: application/hal+json" <INGEST_API_URL>/archiveSubmissions/search/findByDspUuid?dspUuid=<DSP-SUBMISSION-UUID>

2. Send a DELETE request to remove the DSP metadata being tracked in the Ingest DB:

curl -X DELETE -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/archiveSubmissions/<ARCHIVE-SUBMISSION-ID>
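To confirm, re-run the search from step 1; it should no longer return the archive submission:

curl -X GET -H "Accept: application/hal+json" <INGEST_API_URL>/archiveSubmissions/search/findByDspUuid?dspUuid=<DSP-SUBMISSION-UUID>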

External

Depending on the route the dataset took through the system, there might be data and metadata that needs to be retracted from sources external to the ingestion team.

DCP2

  • Pipelines team
  • Terra repository
  • Data browser

Please contact them using the DCP team email template provided below.

Archives

Depending on the archiving procedures followed, the dataset might be in several different places:

  • ENA: Please use the email template provided to contact the ENA helpdesk.
  • ArrayExpress: Please use the email template provided to contact AE directly. If this dataset has been included in SCEA, please also add a line asking them to contact SCEA.
  • BioSamples: Remove all the metadata from the BioSamples entities associated with this project by following the steps listed in the DSP section.
  • BioStudies: Please contact the BioStudies helpdesk with the email template provided.

DSP

While DSP is not completely internal, we have the ability to update the BioSamples submission within it.

The steps are:

  1. Remove all metadata within the BioSamples entities, leaving only the description field with an update notice:
    Unfortunately, this metadata can't be reached as it has been withdrawn from the public domain.
    

    Also set the release date to 100 years in the future.

  2. Update the submission.

  3. Clone the hca-to-dsp-tools repository and install its requirements:
    git clone https://github.com/ebi-ait/hca-to-dsp-tools.git
    cd hca-to-dsp-tools
    pip install -r requirements.txt
    
  4. Set up your credentials in a txt file called cred.txt at the root of the repository. This file has to have the following structure:
    USER=<dsp_username>
    PASSWORD=<dsp_password>
    ROOT=https://submission.ebi.ac.uk/api/
    
  5. Open a Python console at the root of the repository and run the following:
    import dsp_cli.DSP_submission as ds
    dsp = ds.DspCLI()
    dsp.select_submission()
    

    It will ask which team you want to use and which submission should be selected. The team is subs.team-2 and the submission is the one whose ID you gathered before starting. Please keep this session open, as all the following steps will be carried out in it.

  6. Once the submission is selected, retrieve the samples:
    samples = dsp.show_submittable_names('samples')
    

    It may take a while, depending on the number of samples. When you’re asked if you want to print the names, input n.

  7. Create a new submission and select it:
    dsp.create_submission()
    dsp.select_submission()
    

    Pro-tip: Newly-created submissions are always listed last.

  8. Modify the metadata to delete all the information and set the release date to 100 years in the future:
    import datetime
    hundred_years = (datetime.datetime.now() + datetime.timedelta(days=100*365)).date()
    for sample in samples:
        sample['attributes'] = {}
        sample['description'] = "Unfortunately, this metadata can't be reached as it has been withdrawn from the public domain."
        sample['releaseDate'] = str(hundred_years)
    
  9. Load that data in the new submission:
    for sample in samples:
        dsp.create_submittable(sample, 'samples')
    
  10. Finish the submission:
    dsp.finish_submission()
    
  11. Close your Python session.

  12. Ask DSP to delete the original submission from their records using the DSP email template provided below.

Other

Email templates

Please modify the templates with the dataset-specific details before sending them.

Wrangler-team email

Click on the following link to send the email: Email the wrangler team

Hello,

We have been asked to retract the dataset "<shortname_of_dataset>".

Please delete your own copies of the following emails:

- <title_of_email_thread>
(more if necessary)

Please remove any local copies you may have of spreadsheets associated with "<shortname_of_dataset>".

If you were involved in wrangling this dataset, please ensure no local copies of the data/metadata are left on the EC2 instance.

Many thanks for your cooperation.

Best regards,

<Wrangler name>

DCP team

Click on the following link to send the email: Email the DCP

Hello,

We have been asked to retract the dataset "<shortname_of_dataset>", with project UUID "<project_uuid>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

ENA

Click on the following link to send the email: Email ENA

Hello,

We have been asked to retract the dataset "<ENA_study_accession>".

We were responsible for brokering the dataset into the ENA database, so please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

ArrayExpress

Click on the following link to send the email: Email AE

Hello,

We have been asked to retract the dataset "<ArrayExpress_project_accession>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

BioStudies

Click on the following link to send the email: Email BioStudies

Hello,

We have been asked to retract the dataset "<BioStudies_project_accession>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

DSP

Click on the following link to send the email: Email DSP

Hello,

We have been asked to retract the submission "<DSP_submission_uuid>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>