Table of contents
- TOC {:toc}
Overview
This is the SOP for fixing datasets in the issue: ebi-ait/hca-ebi-wrangler-central#250
Requirements
- Gain access to the EBI cluster
ssh ebi-cli.ebi.ac.uk
- Install gsutil in your environment on the EBI cluster and log in using your EMBL-EBI Google account. You can follow the instructions at https://cloud.google.com/storage/docs/gsutil_install to install gsutil. See more details here about setting up your access to the Terra staging area.
- Get DSP's Webin credentials. (Only an ingest developer has access to these at the moment.)
- Clone the ingest-archiver repository. The scripts that will be used are in the ena directory of that repo.
git clone https://github.com/ebi-ait/ingest-archiver.git
pip install -r requirements.txt
- Get your JWT token from the Ingest UI.
  - Log in to the Ingest UI at https://contribute.data.humancellatlas.org/login using an account that has the WRANGLER role.
  - In Chrome, right-click and select Inspect to open the developer console. Select the Network tab.
  - Refresh the page https://contribute.data.humancellatlas.org/home
  - Check the Authorization header of the request to https://api.ingest.archive.data.humancellatlas.org/auth/account
  - Copy the token after the Bearer prefix:
    Authorization: Bearer <copy the very long string of random characters>
  - The token is valid for 1 hour. It will be needed by the submitter script later.
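Because the token expires after about an hour, it can help to check how much time is left before starting a long submission. The following is a minimal sketch, assuming the Ingest token is a standard three-segment JWT whose payload carries an `exp` claim (this is an assumption, not something the SOP confirms). The function name `jwt_seconds_remaining` is hypothetical and is not part of ingest-archiver. The sketch only decodes the payload; it does not verify the signature.

```python
import base64
import json
import time


def jwt_seconds_remaining(token, now=None):
    """Return seconds until the token's `exp` claim, or None if there is no `exp`.

    Assumes a standard JWT layout: header.payload.signature (base64url segments).
    The signature is NOT verified; this is only a convenience check.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated segments")
    payload_b64 = parts[1]
    # base64url segments omit their padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = payload.get("exp")
    if exp is None:
        return None
    current = time.time() if now is None else now
    return exp - current
```

A negative result means the token has already expired and a fresh one should be copied from the browser.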
Steps
1 - Suppress sequencing runs
- Get the list of sequencing runs to be suppressed. This can be downloaded as TSV/JSON from the ENA Browser.
- File a ticket via the ENA helpdesk to suppress the old sequencing runs. Guide on answering the form questions:
Submitter: Broker
Query is related to: Suppression
I work on: Humans
Organisms classification: Not applicable
The work is: Other/not sure (Raw sequencing reads)
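To paste the runs into the helpdesk ticket, the accessions can be pulled out of the downloaded TSV. The sketch below assumes the file has a header row with a column named `run_accession`; that column name is an assumption about the ENA Browser export, so adjust it to match the actual file. The function name `read_run_accessions` is hypothetical.

```python
import csv


def read_run_accessions(tsv_path, column="run_accession"):
    """Return the unique, non-empty values of `column`, in file order.

    `column` defaults to "run_accession", which is an assumed header name.
    """
    accessions = []
    seen = set()
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {tsv_path}")
        for row in reader:
            # Short rows yield None for missing fields; treat them as empty
            value = (row[column] or "").strip()
            if value and value not in seen:
                seen.add(value)
                accessions.append(value)
    return accessions
```

For example, `print("\n".join(read_run_accessions("runs.tsv")))` would print one accession per line, where `runs.tsv` is a placeholder file name.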
2 - Clear sequencing runs
Clear the sequencing run accessions in the file metadata. The following should not be in the file metadata JSON:
"insdc_run_accessions": [ "ERR6449905" ]
Update clear_run_accession_from_files.py to have a JWT token from the Ingest UI, then run the following:
python clear_run_accession_from_files.py <submission-uuid>
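For illustration only, the effect of clearing the accessions on a single metadata document can be sketched as below. This is not the implementation of clear_run_accession_from_files.py, which also has to talk to the Ingest API. The helper name `strip_run_accessions` is hypothetical, and it removes the key wherever it appears rather than assuming where it sits in the document.

```python
def strip_run_accessions(content, key="insdc_run_accessions"):
    """Return a copy of a metadata document with every `key` entry removed.

    Walks dicts and lists recursively; the input is left unmodified.
    """
    if isinstance(content, dict):
        return {
            name: strip_run_accessions(value, key)
            for name, value in content.items()
            if name != key
        }
    if isinstance(content, list):
        return [strip_run_accessions(item, key) for item in content]
    return content


# Made-up example document; only the accession field comes from this SOP
example = {"file_name": "example_R1.fastq.gz", "insdc_run_accessions": ["ERR6449905"]}
cleaned = strip_run_accessions(example)
```

After the call, `cleaned` no longer contains `insdc_run_accessions`, while `example` is unchanged.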
3 - Submit new sequencing runs
- Make sure that the metadata in Ingest contains sequencing experiment accessions. The submitter script will raise an error if any assay process has no accession. Each assay process in the submission should have the following property:
"insdc_experiment": { "insdc_experiment_accession": "ERX4319109" }
- Download all files from the Ingest or Terra upload area to any directory inside /nfs/production/hca/ on the EBI cluster. gsutil can be used to download the files. The files may also be in the hca-util upload area, but we should make sure they are valid; files in the Ingest or Terra upload area have already been validated. Prefer downloading from the Terra upload area, as downloading from the Ingest upload area incurs costs on our AWS account.
- Checksum all the files.
gsutil hash -hm gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project_uuid>/data/* | grep -A1 "hex" | awk -F"/" '{printf $4 $1}' | awk -F"--" '{for (i=1;i<NF;i++)print $i}' | awk -F":" '{print $1 $3}' > <md-filename>.txt
- Upload the files to the Webin FTP upload area (this can be done in parallel with checksumming).
$ cd <directory where you downloaded the files>
$ lftp webin2.ebi.ac.uk -u <webin-user>
$ > # input webin-password
$ mkdir parent-dir
$ cd parent-dir
$ mput *
Please refer to the ENA documentation for more details.
- Run the submit_10x_fastq_files.py script. The receipt.xml and report.json files should be available after running the script. receipt.xml will contain the ENA REST API response. report.json will contain a report of which files were updated with the run accessions from the ENA response.
python submit_10x_fastq_files.py <submission-uuid> <md5-filename> <jwt-token-from-ingest-ui> [--ftp_dir <parent-dir>]
- Verify that the new runs were submitted. They should be visible in the Webin Portal, but it may take 48 hours before they become available in the ENA Browser.
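Before checking the Webin Portal, a quick look at receipt.xml can confirm whether ENA accepted the submission. The sketch below assumes the file follows the usual ENA receipt layout: a RECEIPT root element with a `success` attribute, RUN elements carrying an `accession` attribute, and ERROR elements for messages. That layout is an assumption here and should be checked against the actual file. The function name `summarise_receipt` is hypothetical.

```python
import xml.etree.ElementTree as ET


def summarise_receipt(xml_text):
    """Summarise an ENA-style receipt: success flag, run accessions, errors.

    Assumes <RECEIPT success="..."> with <RUN accession="..."/> children and
    <ERROR> message elements; element names are assumptions about the layout.
    """
    root = ET.fromstring(xml_text)
    return {
        "success": root.get("success") == "true",
        "runs": [run.get("accession") for run in root.iter("RUN")],
        "errors": [error.text for error in root.iter("ERROR")],
    }
```

It can be used as `summarise_receipt(open("receipt.xml").read())`; an empty `runs` list or a non-empty `errors` list suggests the submission needs another look.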