Wrangling Datasets SOP

Background

This SOP describes the process of wrangling datasets through Ingest for submission to the Data Portal.
Keep track of where you are in the wrangling process using the wrangling process diagram.

We’ll focus on submitting from a publication, but alternatives are described here.

Table of contents

  1. Consent type
  2. Project management
  3. Gathering data
    1. Data download
  4. Data upload
    1. Data upload - contributor
  5. Gather metadata
    1. For published datasets only
    2. Spreadsheet template tailoring
  6. Curating metadata
    1. General best practices
    2. Ontology curation
    3. General metadata curation tips
    4. Filling in metadata about the files
  7. Submission process in Ingest
    1. Create submission
    2. Metadata validation
    3. Sync data
    4. Experimental graph validation
    5. Secondary Review
  8. Completing the submission
  9. Export
    1. What to review in browser?
    2. Import Form Details for DCP data releases
  10. Brokering to SCEA

Consent type

Once a dataset that needs wrangling is assigned to us, the first thing to do is decide on the consent type of the donors.

Before any metadata and data can be received from contributors, they are required to sign the HCA Data Contributor Agreement (DCA). Two versions are available:

  • a light DCA, to be used for open access datasets, requiring no institutional signer - DCA is being finalised, todo: add link to pdf
  • a full DCA, to be used for managed access datasets.

After confirming what type of consent the contributor has obtained for their dataset, send them the appropriate DCA.

For more information on managed access SOPs, including how to sign the DCA, read here:

For open access template emails:

OA DCA SOPs might have to change soon.

Project management

When working with new projects, we need to keep track of everything in the same way, so that all team members can follow their status. Two main tracking systems exist:

  1. Ingest project
  2. Github project issue

Read more in the relevant SOP.

Gathering data

Collecting data files is essential: submissions without data are not allowed in the Data Portal, and a submission with no data will fail the validation step when ingested.

For published datasets, the data might already be deposited as open access in other archives or repositories; for a direct submission to the Data Portal, it might come directly from a contributor. When working on a direct submission, wranglers need to have a signed DCA in place before collecting data.

The EBI team has created hca-util, a very handy AWS-wrapper CLI tool for uploading data into the appropriate bucket. You can read more about how to use hca-util to create and manage upload areas for data here.

Once we have made sure that files are available for the project, we need to:

  • Create an hca-util upload area
  • Upload files to the upload area

Note that this step does not need to be completed now, and can wait until after the metadata spreadsheet has been gathered.
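
If it helps to see the shape of these two steps, here is a minimal hca-util session. This is a sketch only: the upload-area name and file names are hypothetical placeholders, and you should verify the exact subcommands against the hca-util documentation or `hca-util -h`.

```
# Create an upload area, select it, and upload files (names are placeholders).
hca-util create my-project-upload-area   # prints the UUID of the new area
hca-util select <upload-area-uuid>       # make it the active area
hca-util upload sample1_R1.fastq.gz sample1_R2.fastq.gz
hca-util list                            # confirm the files arrived
```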

Data download

If the data to be deposited is already in a public archive, the wrangler is in charge of transferring the data to the hca-util upload area.

Raw (fastq) data files: Once the upload area has been created, there are several ways to upload the files from ENA/SRA. We have multiple SOPs depending on where files are deposited:

GEO Matrix data files: For most publications with a GEO accession, gene expression matrix files are available for download. The matrix files can be downloaded directly, either locally to your desktop by clicking the link, or via wget in the terminal or on EC2.

There is an SOP to help with programmatic downloading of GEO supplementary files.
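
As a complement to that SOP, the supplementary files for a single GEO series can also be fetched from the GEO FTP site with wget. A sketch, using a hypothetical accession GSE123456 (GEO’s FTP layout masks the last three digits of the accession with “nnn” in the series path):

```
# Recursively fetch all supplementary files for one GEO series into the
# current directory (GSE123456 is a hypothetical accession).
wget --recursive --no-parent --no-directories \
  "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/"
```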

Matrix and Cell types: For each project, wranglers should endeavour to find an expression matrix and, if not embedded within that matrix, a listing of cell type annotations. These are generally linked from the publication, or present as a supplementary file on the publication, GEO or ArrayExpress submission.

The preferred formats for matrix files are:

  • loom
  • h5ad
  • rds or Robj
  • Cell Ranger format (csv & mtx), split into three files: _features.csv, _matrix.mtx and _barcodes.csv
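
As a quick sanity check for the Cell Ranger-style format, something like the following confirms each sample has its complete trio of files (the sample prefixes here are hypothetical; adjust to the contributor’s naming):

```
# Report any missing member of the features/matrix/barcodes trio per sample.
for sample in sampleA sampleB; do
  for suffix in features.csv matrix.mtx barcodes.csv; do
    [ -f "${sample}_${suffix}" ] || echo "missing: ${sample}_${suffix}"
  done
done
```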

Other repos: It is very common for contributors to upload their processed data to a custom website that is not an archive. In this case, if you are sure there is consent to download and redistribute the data, you should download the files locally.

Data upload

Once all files are downloaded locally (or to EC2 / S3), the data files need to be uploaded to an hca-util upload area. This is an intermediate bucket that allows us to transfer data quickly to the final upload area, even if the submission envelope changes.

You can follow the upload instructions.

Data upload - contributor

If the contributor is to upload their data, this should also be done via hca-util. In order to do that, an AWS account needs to be created using this SOP.

Then share this email template for the contributor to fill in the metadata. # TODO add link to email template for contributors on how to upload files

Gather metadata

First, start by getting some initial information about the dataset from reading the published paper, and checking if there is data that is publicly available.

Once you have an understanding of which biomaterials, protocols, and processes are used, it is time to generate a metadata spreadsheet.

For published datasets only

Since there is no contributor involved, do make the spreadsheet as comprehensive as you think is necessary. Instead of the iterative process of the contributor filling in what they can, the wrangler reviewing, curating, and asking questions, there is only you (the wrangler) working with the spreadsheet. It is easy to get stuck, so don’t forget that you’re working as a team and get a second opinion if necessary!

Then move onto the ‘curating metadata’ section.

Spreadsheet template tailoring

Once the DCA has been signed and you have some initial information about the dataset from a questionnaire, the publication or a meeting, it is time to tailor the spreadsheet for the contributor to fill in. You can use the schema-template-generator, but another option is to get the most recent full HCA metadata template from the geo-to-hca repo (download link).

The spreadsheet should be as simple as possible for the given experiment so that it is more easily understandable for the contributor, so only select schemas and properties that are relevant to their experiment.

After the spreadsheet is generated some manual steps can help contributors understand the spreadsheet including:

  • Ordering of tabs and ordering protocols between biomaterials can help contributors understand our graphical experiment model
  • Ordering of columns: Move linking columns e.g. INPUT DONOR ORGANISM to be in the first few columns of the sheet to be more visible
  • Ensure every tab that requires a link has a linking column (these are easy to miss)
  • Delete or hide columns that you know aren’t relevant
  • Pre-fill any metadata you already know (optional): if the dataset has a publication it is normally possible to gain information from the publication and prefill it into the spreadsheet

Once you have a customised and potentially pre-filled spreadsheet it can be sent to the contributor along with the contributor spreadsheet guide. It is generally an iterative process of the contributor filling in what they can, the wrangler reviewing, curating and asking questions before further curation until the metadata is complete.

Curating metadata

CAUTION!

When wrangling from a publication, wranglers need to take care to include only donor metadata fields that are mapped to Tier 1. Without a DCA we are unable to confirm that the dataset is consented for open access publication in compliance with local laws, so to protect potentially sensitive metadata we only collect metadata that can be released publicly:

  • donor_organism.biomaterial_core.biomaterial_id
  • donor_organism.biomaterial_core.ncbi_taxon_id
  • donor_organism.sex
  • donor_organism.organism_age - in 10 year bins (e.g. 20-29 years old, 70-79 years old)
  • donor_organism.development_stage - in 10 year bins (e.g. 20-29 years old, 70-79 years old)
  • specimen_from_organism.biomaterial_core.biomaterial_id
  • specimen_from_organism.organ

General best practices

For best practices on dataset wrangling, please refer to the Wrangling best practices document.

Ontology curation

For ontologised fields, wranglers need to assign an ontology term that best suits the text provided by the contributor. You can use EBI’s Ontology Lookup Service (OLS). The ontologised fields must be present within the latest version of the HCAO or OLS4 and meet the graph restriction present in the schema. The ontology filler script can help with this process, but its output should be reviewed once complete.
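
For a quick look-up outside the web UI, the OLS search API can also be queried directly. A sketch, assuming the OLS4 REST search endpoint with an example query term (“kidney”, restricted to UBERON); check the OLS API documentation for the exact parameters:

```
# Search OLS4 for candidate terms and print label, CURIE and source ontology.
curl -s "https://www.ebi.ac.uk/ols4/api/search?q=kidney&ontology=uberon&rows=5" \
  | jq '.response.docs[] | {label, obo_id, ontology_name}'
```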

If a wrangler cannot find an accurate ontology term and believes a term should be added to the relevant ontology, they should follow the Request ontology terms SOP.

General metadata curation tips

  • organ or model_organ should be the broad major organ type, not a system
  • The organ_part or model_organ_part should be the most specific organ part available
  • biomaterial_ids should be either consistent with the publication or the tissue provider to allow identification of biomaterials with shared donors
  • Try to encourage the contributor to provide a meaningful identifier, not just a single digit

Library preparation protocol
You can refer to the assay cheat sheet for existing, standard assays so that we stay consistent across Lattice, EBI and UCSC wrangling teams.

Filling in metadata about the files

For datasets with a large number of files, the ENA filename extractor tool can be of use. It requires the ‘INSDC Experiment Accession’ column to have already been filled in on the ‘Cell suspension’ and ‘Sequence file’ tabs. The wrangler has to manually download a JSON report from the corresponding project’s page at ENA. The script will then fill in the ‘File name’ column of the ‘Sequence file’ tab.
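
The same run-level file information can also be pulled programmatically from the ENA Portal API filereport endpoint, which may be easier than downloading the JSON report by hand. A sketch with a placeholder project accession:

```
# Fetch run accessions, experiment accessions and fastq FTP paths for a
# project (PRJEB00000 is a placeholder accession).
curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB00000&result=read_run&fields=run_accession,experiment_accession,fastq_ftp&format=json" \
  -o ena_filereport.json
```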

For each expression matrix or cell type annotation file that is found, a row needs to be filled in on the ‘Analysis file’ tab of the metadata spreadsheet. Analysis files can be linked to biomaterial entities via processes. Information related to the analysis protocol is captured in the Analysis_protocol entity (see the Analysis protocol tab) linked to the process.

The best practice is to link the analysis files to sequence file entities, if possible. Otherwise, link the analysis files to cell suspension entities. This is currently done by adding the ‘Input Cell Suspension ID’ column to the ‘Analysis File’ tab and filling in the linked cell suspension IDs.

The gene expression matrix and cell annotation files should be added to the S3 bucket in the ingest area together with the raw data files, using the hca-util tool.

Submission process in Ingest

Once the spreadsheet is considered complete by the primary wrangler, there are two phases of metadata validation that can be completed.

Create submission

The primary wrangler should upload the spreadsheet to the Ingest production UI to check the validity of the spreadsheet.

In order to upload a spreadsheet, you need to attach it as a submission to a project that already exists in the Ingest UI:

  1. Click Register Project on the homepage
  2. Fill in as much project metadata as possible
  3. Register the project
  4. In the project page, go to the 3. Data upload tab and click the Submit to Project button to upload the spreadsheet
    1. If you added project metadata in your spreadsheet (preferable), select: Update the project metadata. This will overwrite the existing project metadata.
    2. If the spreadsheet does not have project metadata, select: Do not update the project metadata. This will ignore the project worksheet in the spreadsheet.

Metadata validation

If any metadata is invalid, you will need to resolve any errors or problems. This can be done directly in the UI or by uploading a new fixed spreadsheet to the existing project.

  1. To upload a fixed spreadsheet to the project:
    1. Fix the errors in the spreadsheet and return to the existing project in the UI
    2. Click the 3. Upload tab to view the submissions
    3. Delete the submission with errors by clicking the trash icon next to the submission
    4. Go to the 3. Upload tab and click the Submit to Project button to upload the fixed spreadsheet. (If the button doesn’t appear, try refreshing the page.)
    5. Repeat these steps until you have a project with valid metadata
  2. To edit directly in the UI, click the pencil symbol next to the specific entity to open the editing form
    1. Change any fields that need to be edited
    2. Click save

Please note:

  • Once a project has been created in the UI, it is best practice to retain the project’s unique identifier throughout the submission and validation process, so please only delete the project if there are serious issues with project-level metadata that cannot be fixed easily in the UI.
  • There should never be duplicate projects in the production UI. If you do need to re-upload an entire project, please delete the existing project before re-uploading a spreadsheet.

Sync data

Transferring data from hca-util upload area to ingest upload area

Once the contributor has uploaded all the data specified for the project, or you have transferred the raw data files from the archive into an hca-util upload area, and you have a valid metadata submission in the Ingest UI, follow the hca-util guide to sync the data to the Upload Area Location specified at the bottom of the submission’s Data tab.
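
In outline, the sync step looks something like the following. This is a sketch with placeholder identifiers; follow the hca-util guide and `hca-util -h` for the exact syntax of the sync subcommand.

```
# Select the hca-util area holding the data, then sync it to the ingest
# upload area shown on the submission's Data tab (placeholder identifiers).
hca-util select <upload-area-uuid>
hca-util sync <ingest-upload-area>
```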

Experimental graph validation

The graph validator allows wranglers to run some tests that check against our current graph assumptions.

Validation currently happens automatically in Ingest. Once the metadata is validated against the schema, you will find a button in the “Validate” tab in Ingest. Click it and your experimental graph will be validated against the tests currently deployed to master.

Any test that fails will show a useful error message, alongside the entity that is related to the error.

You can see the list of rules in the ingest graph validator repository. If you want to run the tests locally, or suggest a new test or report a bug, please follow the documentation in that repository. Local deployment allows wranglers to visualise the experimental graph of the experiment.

Secondary Review

Once the spreadsheet has passed both phases of validation, the primary wrangler should ask another wrangler in the team to review the spreadsheet and suggest any required edits or updates. Once someone asks for secondary review, they should move the ticket to the Review status on the tracking board.

If any edits or updates are made, the existing submission in ingest will need to be deleted and the new spreadsheet uploaded in its place.

If any changes may have affected the linking in the spreadsheet, it should also be run through the ingest-graph-validator again.

A detailed guide to performing secondary review can be found here.

Once both the Primary and Secondary wrangler are happy with the submission and it is valid in Ingest, the data files can be moved from the contributor bucket into the ingest upload area.

Completing the submission

Once all the files and experimental graph have been validated and reviewed, the project will be ready for submission via Ingest UI.

If wrangling the project with direct input from a contributor, the primary wrangler should email the contributor to confirm:

  • They are happy with the final spreadsheet and curated ontologies, and
  • The date they have identified for the data and metadata to be released publicly

Currently, you can export either:

  1. only metadata: Submit only metadata to the Human Cell Atlas. This is useful only for updates. Known bug: if only metadata is exported and the data files are not in the staging area, the Import team’s validation will fail. To prevent that, remove all metadata/descriptors/links related to files that are not in the staging area.
  2. full submission: Submit metadata and data to the Human Cell Atlas. This will make the submission public with the next snapshot.

Always untick the Delete the upload area and data files after successful submission checkbox. This is because once data is removed from the ingest upload area, the Upload API does not allow re-assignment of data files.

Once the submission has been submitted, there is a separate export SOP beyond Ingest:

Export

A quite nice Data Portal Release SOP has been implemented by the Broad folks here.

  1. As soon as a dataset is ready for export, the wrangler should hit the submit button in the UI with the Submit to the Human Cell Atlas... checkbox ticked to trigger export and note the project UUID.
    1. Retrieve the project UUID from the URL of the project page in the ingest browser, or by clicking the “ID” value in the top right of the project page
  2. The submitting wrangler checks export is complete.
    1. Check status in the UI, should change from exporting to exported. (This will take ~1-4 hours for most projects)
    2. If export is “stuck” in exporting for more than 4 hours –> Notify the ingest operations developer via slack and provide the project UUID so they can review the logs and work out what has happened. They will work with the wrangler to resolve this and re-export if necessary.
  3. The wrangling team is notified of export
    1. Move the dataset wrangling ticket to Verify pipeline of wrangling board
  4. The Broad data import team are notified of successful export
    1. Fill the HCA Import request for Production Releases (see values below)
      • wait until monthly release
  5. The submitting wrangler is notified that import has been successful or if there are issues for EBI to investigate
    • The Broad data import team will notify via slack in the dcp-ops channel of any import problems (see below) or forward the ticket to the UCSC team
  6. UCSC Browser team will notify submitting wrangler and Broad team when release is indexed and in the browser or if issues are encountered.
    • Via slack in the dcp-ops channel notifying wranglers when a release is in the browser to review or of any issues.
  7. A wrangler will do a final check that everything looks ok and notify UCSC on the data-ops channel.
    1. If a big issue occurs, the snapshot can be removed from the release, fixed and re-exported in the next release.
    2. If a minor issue occurs, the dataset can stay in the release but a fix will be re-exported in the next release.
      • In order to move the submission status, you can edit *any* field and the submission state will go to draft and validate.
      • Contents of the project staging area in the staging bucket might also need to be deleted
      • The wrangler will trigger export by hitting submit and following steps 2-7 until the project is available and looks ok
  8. When the project is available in the browser, the wrangler can email the contributor or contacts from the publication to inform them of the URL where the project is available

What to review in browser?

When a wrangler is reviewing data in the browser, we want to make sure that all the changes have been applied and nothing unexpected is shown in the Portal. Check that:

  • all faceted fields are as expected (right-hand side of the page)
  • contributor names are presented as expected
  • matrix files are shown in the tab as expected
  • the publication name and URL are valid
  • file formats and the number of files are as expected
  • the biomaterial and atlas icons are correct

Import Form Details for DCP data releases

Import form
Import form result sheet

| Field | Explanation |
| --- | --- |
| Email address | So you can be contacted if there are any issues with import |
| Release # | The integer number of the ~monthly release cutoff |
| Your Institution | The institution of the submitting wrangler |
| Project Name | Project name of the exported project |
| Is this new data, updated data or analysis results? | Choose the appropriate responses. |
| Project UUID | Exported project UUID |
| Project DUOS ID | If the project is MA, specify the assigned DUOS-ID |
| Does this project contain Managed Access data? | Yes/No |
| Additional info | Any other notes you want to communicate to the import team. |
| Optional Release Note for HCA Releases | Public notes to be included in the release notes |
| Additional Info | Internal information for other components |
| Project Contact | Corresponding author of the project, to be contacted upon release |
| Highlight in an upcoming DCP users announcement? | Helps the engagement team select projects for the release notes |

Notes

  • EBI will export on demand, and notify the Broad via import form
  • Broad will batch import once prior to the monthly release (check the exact date on Import form release description)
  • Contents of the staging area should be deleted for a clean export, unless data already in staging can be used
  • See this sheet for a rolling list of projects where an import request form has been filled out.

Warning: Wranglers should be aware of when prod releases are occurring and should not upload/submit until after the release to that environment is completed. Releases do not currently follow a set schedule, so stay tuned to updates posted in the #hca slack channel in the AIT workspace. See the Ingest release SOP for more details.

Additionally, move all the corresponding documents to Google Drive/Brokering/PROJECTS-FINISHED

TDR import errors

If errors occur during import into TDR, we have an SOP that provides some insights on how to handle them: https://ebi-ait.github.io/hca-ebi-wrangler-central/SOPs/After_export/tdr_issues.html

Brokering to SCEA

See the hca_to_scea_tools SOP documentation on the hca-to-scea repo.