Data retraction SOP

Table of contents

  1. Purpose of this document
  2. Overview
  3. Before you start
    1. Needed permissions
    2. Set up gsutil to access staging area
    3. Set up hca-util with wrangler credentials
    4. Open an issue in GitHub
    5. Write down accessions/UUIDs/GitHub issue
  4. Internal
    1. Amazon buckets/EC2 instance
    2. EBI Cluster
    3. Staging area
    4. Spreadsheet
    5. Ingest
      1. UI
      2. Metadata archiver
  5. External
    1. DCP2
    2. Archives
    3. DSP
  6. Other
    1. Email templates
      1. Wrangler-team email
      2. DCP team
    2. ENA
    3. ArrayExpress
    4. BioStudies
    5. DSP

Purpose of this document

This document captures the steps needed to fully clean up a dataset brokered by the Data Wrangling team.

Overview

Retracting a dataset involves two kinds of cleanup: internal (AWS buckets and the EC2 instance, the EBI cluster, the Terra staging area, metadata spreadsheets, and Ingest itself) and external (DCP2 components, the public archives, and DSP). The sections below walk through each location in turn; email templates for contacting the external parties are provided at the end.

Before you start

To follow all the steps, a developer working for the ingest team is needed. Most of the steps need only wrangler credentials, but some have to be carried out by a developer. If a wrangler is responsible for this task, please contact a dev and assign them to help with those specific steps.

Needed permissions

Please make sure you have the proper permissions or are paired with someone who does. The needed permissions are:

  1. Access to the EC2 instance: Any wrangler/dev should have access.
  2. Admin access to s3 buckets: Any wrangler/dev should have access. If you don’t have access, please set up your credentials by running aws configure (see the example after this list) and providing your wrangler/ingest s3 access key and secret access key.
  3. Dev access to Ingest Core: Every dev should have access.
  4. Access to subs.team-2 AAP domain: Please retrieve the ingest DSP username and password with the following command:
    aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/dev/secrets --query SecretString --output text | jq -jr .archiver_api_key
    

    Only developers are able to access the secrets manager.
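If you need to (re)configure the AWS credentials from item 2, the AWS CLI prompts for each value interactively; a minimal sketch (the region and output format shown are sensible defaults, not requirements):

aws configure
AWS Access Key ID [None]: <wrangler-or-ingest-access-key-id>
AWS Secret Access Key [None]: <wrangler-or-ingest-secret-access-key>
Default region name [None]: us-east-1
Default output format [None]: json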

Set up gsutil to access staging area

If you already have gsutil set up, please skip this step.

https://ebi-ait.github.io/hca-ebi-dev-team/admin_setup/Setting-up-access-to-Terra-staging-area.html

Set up hca-util with wrangler credentials

Please also make sure that you have hca-util set up with the wrangler credentials:

hca-util configure

and enter the wrangler or ingest credentials when prompted.
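To sanity-check the configuration, you can list the upload areas visible to you (the same command is used in the Amazon buckets section later in this document):

hca-util list -b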

Open an issue in GitHub

Finally, please open an issue in the wrangler repo using the template named Dataset Retraction, and tick the boxes as you go through the retraction process.

Write down accessions/UUIDs/GitHub issue

You will need to locate the following UUIDs:

  • DCP project UUID and project shortname: Locate the project in the UI and copy the project UUID and shortname.
  • DCP submission UUID: Locate the submission in the UI and copy the submission UUID.
  • ENA/BioStudies study/project accession: Inside the submission in the UI, locate the accessions tab and copy the study and project accessions.
  • DSP submission UUID: Locate the DSP submission UUID following the instructions in the DSP section of this document.
  • GitHub issue: Locate the GitHub issue in the hca-ebi-wrangler-central repository and write down the URL. Note that it may be closed.
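For convenience, you can keep these identifiers in a small notes file so they are at hand for the commands later in this SOP; a purely illustrative sketch (every value is a placeholder):

# retraction-notes.sh — all values are placeholders
export PROJECT_UUID="<dcp-project-uuid>"
export PROJECT_SHORTNAME="<project-shortname>"
export SUBMISSION_UUID="<dcp-submission-uuid>"
export ENA_STUDY_ACCESSION="<ena-study-or-project-accession>"
export BIOSTUDIES_ACCESSION="<biostudies-accession>"
export DSP_SUBMISSION_UUID="<dsp-submission-uuid>"
export GH_ISSUE_URL="<github-issue-url>"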

Internal

This section covers data and metadata stored within our immediate reach, where deletion has no downstream consequences.

Amazon buckets/EC2 instance

Currently, we store contributor data in the cloud before and while brokering it. The places it needs to be retracted from are:

  1. hca-util contributor areas: Locate the area and delete it:
     hca-util list -b
     hca-util select <area>
     hca-util delete -d
    
  2. Ingest upload area: This task can only be carried out by a developer. Send a DELETE request to the Upload Service as follows:
# retrieve Upload Service Api-Key (dev)
aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/dev/secrets --query SecretString --output text | jq -jr .staging_api_key

# delete an upload area (dev)
curl -X DELETE "https://upload.dev.archive.data.humancellatlas.org/v1/area/<upload_area_uuid>" -H  "accept: application/json" -H  "Api-Key: <API-KEY>"

Please make sure that the submission was not tested in other environments; if in doubt, please ask the primary wrangler (stated in the GitHub ticket/dataset tracking sheet). A sketch for cleaning up another environment follows this list.

  3. EC2 instance: If this data has been downloaded to the EC2 instance for some reason (e.g. validation), please make sure to remove it. If the person tasked with the retraction is not the primary/secondary wrangler, the email template later in this document covers this.
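If the submission was also uploaded in another environment, the same two calls apply against that environment. A sketch for prod, assuming the secret id and hostname follow the same pattern as dev (please verify both before use):

# retrieve Upload Service Api-Key (prod) — secret id assumed analogous to dev
aws --region us-east-1 secretsmanager get-secret-value --secret-id ingest/prod/secrets --query SecretString --output text | jq -jr .staging_api_key

# delete an upload area (prod) — hostname assumed analogous to dev
curl -X DELETE "https://upload.archive.data.humancellatlas.org/v1/area/<upload_area_uuid>" -H "accept: application/json" -H "Api-Key: <API-KEY>"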

EBI Cluster

When archiving, if the data needed to be converted to bam files, it will have been saved in the folder used for that purpose. To delete this data:

ssh noah-login
rm -r /nfs/production/hca/<name_of_the_folder>/

Tips to find the folder: it is usually named after the project shortname. If you know this dataset was archived but there is no apparent folder, please contact the primary wrangler for this project; their name can be found in the GitHub ticket.
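If the exact folder name is unclear, a case-insensitive listing can help narrow it down (a sketch, assuming the parent path used above):

ssh noah-login
ls /nfs/production/hca/ | grep -i "<project_shortname>"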

Staging area

  1. Find the directory of the project in the Terra staging area GCP bucket. The GCP bucket locations are configured in the <env>.yaml files in ingest-kube-deployment/apps.

     # Example project directory in dev:
       
     gs://broad-dsp-monster-hca-dev-ebi-staging/dev/<project-uuid> 
    
  2. Remove the project uuid subdirectory (note the -r flag, needed to remove a directory recursively):

    gsutil -m rm -r gs://<bucket-parent-directory>/<project-uuid>
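Tip: before removing, you can list the directory to confirm it belongs to the right project:

    gsutil ls gs://<bucket-parent-directory>/<project-uuid>/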
    

Spreadsheet

The metadata spreadsheet can be in many different locations. Please make sure you delete each and every copy that you have:

  1. Google Drive folder: Locate the project folder under brokering and delete it in Google Drive.
  2. Email
    1. Head to the Wrangler team Google group and delete any email thread that may contain the spreadsheet.
    2. Send an email to the wrangler team using the template at the end of this document.
    3. Delete your own local copies of the thread in your mail apps.
  3. Ask a dev to delete the copy of the spreadsheet associated with the submission in the ingest database.

Instructions for dev to follow:

The spreadsheets are saved in the ingest-broker pod under /data/spreadsheets. Log into the pod as follows and delete the required spreadsheet(s).

# switch to env (e.g. prod) context
kubectx ingest-eks-<env>
# get ingest-broker-pod name
kubectl get pods | grep broker | awk '{print $1}'
kubectl exec -it <ingest-broker-pod> -- /bin/sh 
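Once inside the pod, locate and delete the spreadsheet(s); a sketch, assuming the filename can be matched by project shortname or submission UUID (the exact naming may differ):

# inside the ingest-broker pod
ls /data/spreadsheets | grep -i <shortname-or-uuid>
rm /data/spreadsheets/<spreadsheet-file>.xlsx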

Ingest

UI

Metadata and data can be found in the ingest submission. If you have followed the steps so far, the data and the metadata spreadsheet should already be removed, but the metadata is still in the database.

Before deleting the submission, please remember to retrieve the following:

  • BioStudies accession
  • ENA study/project accession
  • Ingest project/submission UUID
  • DSP submission UUID

Also, please run the deletion of files in the staging area before deleting the submission: the files in the staging area can only be mapped by crawling the ingest API, which is no longer possible once the submission is gone.

Once these have been retrieved, you can proceed to ask a dev to delete the submission and the project, where applicable.

Instructions for dev to follow:

# Force delete a submitted submission
curl -X DELETE -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/submissionEnvelopes/<SUBMISSION-ENVELOPE-ID>?force=true
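To verify the deletion, a follow-up GET on the same resource should return 404; a quick check (assuming the same token is valid for reads):

curl -o /dev/null -s -w "%{http_code}\n" -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/submissionEnvelopes/<SUBMISSION-ENVELOPE-ID>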

Metadata archiver

The metadata that is archived through ingest’s archiver is tracked under the archiveSubmissions endpoints. A developer is needed for the following steps:

1. Find the archive submission and check the _links.self.href from the JSON response:

curl -X GET -H "Accept: application/hal+json" <INGEST_API_URL>/archiveSubmissions/search/findByDspUuid?dspUuid=<DSP-SUBMISSION-UUID>

2. Send a DELETE request to remove the DSP metadata being tracked in the Ingest DB:

curl -X DELETE -H "Authorization: Bearer <TOKEN>" <INGEST_API_URL>/archiveSubmissions/<ARCHIVE-SUBMISSION-ID>
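To confirm, re-run the search from step 1; it should no longer return the archive submission:

curl -X GET -H "Accept: application/hal+json" <INGEST_API_URL>/archiveSubmissions/search/findByDspUuid?dspUuid=<DSP-SUBMISSION-UUID>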

External

Depending on the route the dataset took through the system, there might be data and metadata that needs to be retracted from sources external to the ingestion team.

DCP2

  • Pipelines team
  • Terra repository
  • Data browser

Please contact them using the DCP team email template provided below.

Archives

Depending on the archiving procedures followed, the dataset might be in several different places:

  • ENA: Please use the email template provided to contact the ENA helpdesk.
  • ArrayExpress: Please use the email template provided to contact AE directly. If this dataset has been included in SCEA, please also add a line asking them to contact SCEA.
  • BioSamples: Remove all the metadata from the BioSamples entities associated with this project by following the steps listed in the DSP section.
  • BioStudies: Please contact the BioStudies helpdesk with the email template provided.

DSP

While DSP is not completely internal, we have the ability to update the BioSamples submission within it.

The steps are:

  1. Remove all metadata within the BioSamples entities, leaving only the description field with an update notice:
    Unfortunately, this metadata can't be reached as it has been withdrawn from the public domain.
    

    Also set the release date to 100 years in the future.

  2. Update the submission.

  3. Clone the hca-to-dsp-tools repository and install its requirements:
    git clone https://github.com/ebi-ait/hca-to-dsp-tools.git
    cd hca-to-dsp-tools
    pip install -r requirements.txt
    
  4. Set up your credentials in a txt file called cred.txt at the root of the repository. This file has to have the following structure:
    USER=<dsp_username>
    PASSWORD=<dsp_password>
    ROOT=https://submission.ebi.ac.uk/api/
    
  5. Open a Python console at the root of the repository and run the following:
    import dsp_cli.DSP_submission as ds
    dsp = ds.DspCLI()
    dsp.select_submission()
    

    It will ask which team you want to use and which submission should be selected. The team is subs.team-2 and the submission is the one whose ID you gathered before starting. Please keep this session open, as all the following steps will be carried out in it.

  6. Once the submission is selected, retrieve the samples:
    samples = dsp.show_submittable_names('samples')
    

    It may take a while, depending on the number of samples. When you’re asked if you want to print the names, input n.

  7. Create a new submission and select it:
    dsp.create_submission()
    dsp.select_submission()
    

    Pro-tip: Newly-created submissions are always listed last.

  8. Modify the metadata to delete all the information and set the release date to 100 years in the future:
    import datetime
    hundred_years = (datetime.datetime.now() + datetime.timedelta(days=100*365)).date()
    for sample in samples:
        sample['attributes'] = {}
        sample['description'] = "Unfortunately, this metadata can't be reached as it has been withdrawn from the public domain."
        sample['releaseDate'] = str(hundred_years)
    
  9. Load that data in the new submission:
    for sample in samples:
        dsp.create_submittable(sample, 'samples')
    
  10. Finish the submission:
    dsp.finish_submission()
    
  11. Close your Python session.

  12. Ask DSP to delete the original submission from their records using the DSP email template provided below.

Other

Email templates

Please modify the templates with the dataset-specific details before sending them.

Wrangler-team email

Click on the following link to send the email: Email the wrangler team

Hello,

We have been asked to retract the dataset "<shortname_of_dataset>".

Please delete your own copies of the following emails:

- <title_of_email_thread>
(more if necessary)

Please remove any local copies you may have of spreadsheets associated with "<shortname_of_dataset>".

If you were involved in wrangling this dataset, please ensure no local copies of the data/metadata are left on the EC2 instance.

Many thanks for your cooperation.

Best regards,

<Wrangler name>

DCP team

Click on the following link to send the email: Email the DCP

Hello,

We have been asked to retract the dataset "<shortname_of_dataset>", with project UUID "<project_uuid>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

ENA

Click on the following link to send the email: Email ENA

Hello,

We have been asked to retract the dataset "<ENA_study_accession>".

We were responsible for brokering the dataset into the ENA database, so please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

ArrayExpress

Click on the following link to send the email: Email AE

Hello,

We have been asked to retract the dataset "<ArrayExpress_project_accession>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

BioStudies

Click on the following link to send the email: Email BioStudies

Hello,

We have been asked to retract the dataset "<BioStudies_project_accession>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>

DSP

Click on the following link to send the email: Email DSP

Hello,

We have been asked to retract the submission "<DSP_submission_uuid>".

Please delete any copies of the data and metadata that you might have on your system.

Please contact wrangler-team@data.humancellatlas.org if you have any questions about this process.

Best regards,

<Wrangler name>