Wrangling Datasets SOP
Background
This SOP describes the process of wrangling datasets into Ingest for submission to the Data Portal.
Keep track of the wrangling process documentation using the wrangling process diagram.
We’ll focus on submitting from a publication, but alternatives are described here.
Table of contents
- Consent type
- Project management
- Gathering data
- Data upload
- Gather metadata
- Curating metadata
- Submission process in Ingest
- Completing the submission
- Export
- Brokering to SCEA
Once a dataset that needs wrangling is assigned to us, the first thing to do is to decide on the consent type of the donors.
Consent type
Before any metadata and data can be received from contributors they are required to sign the HCA Data contributor agreement (DCA). Two versions are available,
- a light DCA to be used for open access datasets, requiring no institutional signer - DCA is being finalised, todo: add link to pdf
- a full DCA to be used for managed access datasets.
After confirming what type of consent the contributor has obtained for their dataset, send them the appropriate DCA.
For more information on managed access SOPs, including how to sign the DCA, read here:
For open access template emails:
OA DCA SOPs might have to change soon.
Project management
When working with new projects we need to make sure we keep track of everything in the same way, so that all team members can follow a project’s status. Two main tracking systems exist:
- Ingest project
- Github project issue
Read more in the relevant SOP
Gathering data
Collecting data files is important: submissions without data are not allowed in the Data Portal, and a submission with no data will fail the validation step when ingested.
For published datasets, data for the submission might already be deposited as open access in other archives or repositories; for a direct submission to the Data Portal, it might come directly from a contributor. When working on a direct submission for a contributor, wranglers need to have a signed DCA in place before collecting data.
The EBI team has created a very handy AWS-wrapper CLI tool, called hca-util, to upload data into the appropriate bucket. You can read more about how to use hca-util to create and manage upload areas for data here.
Once we have made sure that files are available for the project we need to:
- Create an `hca-util` upload area
- Upload files to the upload area
Note that this step does not need to be completed now, and can wait until after the metadata spreadsheet has been gathered.
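For reference, below is a minimal sketch of these two steps with hca-util; the area name, UUID placeholder and file paths are hypothetical, and the exact sub-commands and options should be checked against the hca-util documentation linked above.

```shell
# Create a new upload area for the project (requires wrangler AWS credentials
# already configured with `hca-util config`); the area name is a placeholder.
hca-util create my-project-upload-area

# Select the upload area by the UUID returned by the create command.
hca-util select <upload-area-uuid>

# Upload local data files to the currently selected area (paths are placeholders).
hca-util upload path/to/fastqs/*.fastq.gz
```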
Data download
If the data that is going to be deposited is in a public archive, the wrangler is in charge of transferring the data to the hca-util upload area.
Raw (fastq) data files: Once the upload area has been created, there are several ways to upload the files from ENA/SRA. We have multiple SOPs depending on where files are deposited:
- Node SOP
- Globus SOP
- Dryad SOP
- aspera SOP
- ENA script to upload to an s3 bucket
- This script may sometimes fail to retrieve the files if ENA’s servers are overloaded.
- SRA/ NCBI
GEO Matrix data files: For most publications with a GEO accession, gene matrix files are available for download. The matrix files can be downloaded either directly to your desktop by clicking the link, or via wget in the terminal or on an EC2 instance.
There is an SOP to help with programmatic downloading of GEO supplementary files.
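As an illustration, supplementary files for a series can be pulled with wget from GEO’s FTP layout, where the series directory masks the last three digits of the accession with “nnn”; the accession below is hypothetical.

```shell
# Download all gzipped supplementary files for a hypothetical series GSE123456.
# -r: recursive, -np: don't ascend to the parent dir, -nd: don't recreate dirs, -A: accept pattern
wget -r -np -nd -A "*.gz" \
  "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/"
```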
Matrix and Cell types: For each project, wranglers should endeavour to find an expression matrix and, if not embedded within that matrix, a listing of cell type annotations. These are generally linked from the publication, or present as a supplementary file on the publication, GEO or ArrayExpress submission.
The preferred formats for matrix files are:
- `loom`
- `h5ad`
- `rds` or `Robj`
- cell ranger format (`csv` & `mtx`) - split into three files: `_features.csv`, `_matrix.mtx` and `cell_barcodes.csv`
Other repos: It is very common for contributors to upload their processed data to a custom website that is not an archive. In this case, if you are sure there is consent to download and re-distribute the data, you should download the files locally.
Data upload
Once all files are downloaded locally (or to EC2 / S3), the data files need to be uploaded to an hca-util upload area. This is an intermediate bucket that allows us to quickly upload to the final upload area in case the submission envelope changes.
You can follow the upload instructions.
Data upload - contributor
If the contributor is to upload their data, this should also be done via hca-util. In order to do that, an AWS account needs to be created using this SOP.
Then share this email template for the contributor to fill in the metadata. # TODO add link to email template for contributor how to uploading files
Gather metadata
First, start by getting some initial information about the dataset from reading the published paper, and checking if there is data that is publicly available.
Once you have an understanding of which biomaterials, protocols, and processes are used, it is time to generate a metadata spreadsheet.
For published datasets only
Since there is no contributor involved, do make the spreadsheet as comprehensive as you think is necessary. Instead of the iterative process of the contributor filling in what they can, the wrangler reviewing, curating, and asking questions, there is only you (the wrangler) working with the spreadsheet. It is easy to get stuck, so don’t forget that you’re working as a team and get a second opinion if necessary!
Then move onto the ‘curating metadata’ section.
Spreadsheet template tailoring
Once the DCA has been signed, and you have some initial information about the dataset from a questionnaire, the publication or a meeting, it is time to tailor the spreadsheet for the contributor to fill in. You can use the schema-template-generator, but another option is to get the most recent full HCA metadata template from the geo-to-hca repo (download link).
The spreadsheet should be as simple as possible for the given experiment so that it is easier for the contributor to understand; only select the schemas and properties that are relevant to their experiment.
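If the dataset has a GEO accession, the geo-to-hca tool mentioned above can also generate a pre-filled starting spreadsheet rather than a blank template. The sketch below is only illustrative: the flag names are assumptions, so check the geo-to-hca README for the current interface.

```shell
# Hypothetical invocation: generate a pre-filled HCA metadata spreadsheet from
# a GEO accession. Flag names (--accession, --output_dir) are assumptions.
geo-to-hca --accession GSE123456 --output_dir template_spreadsheets/
```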
After the spreadsheet is generated, some manual steps can help contributors understand the spreadsheet, including:
- Ordering of tabs and ordering protocols between biomaterials can help contributors understand our graphical experiment model
- Ordering of columns: Move linking columns, e.g. `INPUT DONOR ORGANISM`, to the first few columns of the sheet to be more visible
- Ensure every tab that requires a link has a linking column (these are easy to miss)
- Delete or hide columns that you know aren’t relevant
- Pre-fill any metadata you already know (optional): if the dataset has a publication it is normally possible to gain information from the publication and prefill it into the spreadsheet
Once you have a customised and potentially pre-filled spreadsheet it can be sent to the contributor along with the contributor spreadsheet guide. It is generally an iterative process of the contributor filling in what they can, the wrangler reviewing, curating and asking questions before further curation until the metadata is complete.
Curating metadata
CAUTION!
When wrangling from a publication, wranglers need to take care to only include donor metadata fields that are mapped to Tier 1. Without a DCA we are unable to confirm that the dataset is consented for open access publication in compliance with local laws, so to protect potentially sensitive metadata we only collect metadata that can be released publicly.
- `donor_organism.biomaterial_core.biomaterial_id`
- `donor_organism.biomaterial_core.ncbi_taxon_id`
- `donor_organism.sex`
- `donor_organism.organism_age` - in 10 year bins (i.e. 20-29 years old, 70-79 years old)
- `donor_organism.development_stage` - in 10 year bins (i.e. 20-29 years old, 70-79 years old)
- `specimen_from_organism.biomaterial_core.biomaterial_id`
- `specimen_from_organism.organ`
General best practices
For best practices on dataset wrangling, please refer to the document Wrangling best practices
Ontology curation
For ontologised fields, wranglers need to assign an ontology term that best suits the text provided by the contributor. You can use EBI’s Ontology Lookup Service (OLS). The chosen terms must be present within the latest version of the HCAO or OLS4 and meet the graph restriction present in the schema. The ontology filler script can help with this process, but its output should be reviewed once complete.
If a wrangler cannot find an accurate ontology term and believes a term should be added to the relevant ontology, they should follow the Request ontology terms SOP.
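For a quick term lookup outside the OLS web interface, the OLS search API can be queried directly; the query text and ontology below are only examples.

```shell
# Search OLS for candidate terms matching some free text (example query and ontology).
# The JSON response lists matching term IRIs, labels and obo_ids to review manually.
curl "https://www.ebi.ac.uk/ols4/api/search?q=cortex%20of%20kidney&ontology=uberon&rows=5"
```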
General metadata curation tips
- The `organ` or `model_organ` should be the broad major organ type, not a system
- The `organ_part` or `model_organ_part` should be the most specific organ part available
- `biomaterial_id`s should be consistent with either the publication or the tissue provider, to allow identification of biomaterials with shared donors
  - Try to encourage the contributor to provide a meaningful identifier, not just a single digit
Library preparation protocol
You can refer to the assay cheat sheet for existing, standard assays so that we stay consistent across Lattice, EBI and UCSC wrangling teams.
Filling in metadata about the files
For datasets with a large number of files, the ENA filename extractor tool can be of use. It requires the ‘INSDC Experiment Accession’ to have already been filled in on the ‘Cell suspension’ and ‘Sequence file’ tabs. The wrangler has to manually download a JSON report from the corresponding project’s page at ENA. This script will fill in the ‘File name’ column in the ‘Sequence file’ tab.
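If clicking through the ENA project page is awkward, a similar report can be pulled from the ENA Portal API; the project accession below is hypothetical, and the exact fields the extractor expects should be checked against its README.

```shell
# Fetch a JSON read_run report for a hypothetical project accession.
# The fields listed are examples; adjust them to whatever the extractor tool expects.
curl "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB12345&result=read_run&fields=run_accession,experiment_accession,fastq_ftp&format=json" \
  -o PRJEB12345_report.json
```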
For each expression matrix or cell type annotation file that is found, a row needs to be filled in the metadata spreadsheet, in the ‘Analysis file’ tab. Analysis files can be linked to biomaterial entities via processes. Information related to the analysis protocol is captured in the Analysis_protocol entity (See the Analysis protocol tab) linked to the process.
Ideally, analysis files should be linked to sequence file entities, if possible.
In practice, link the analysis files to cell suspension entities. This is currently done by adding the ‘Input Cell Suspension ID’ column to the ‘Analysis file’ tab and adding the linked cell suspension IDs to that column.
The gene expression matrix and cell annotation files should be added to the S3 bucket in the ingest area together with the raw data files, using the hca-util tool.

Submission process in Ingest
Once the spreadsheet is considered complete by the primary wrangler, there are two phases of metadata validation that can be completed.
Create submission
The primary wrangler should upload the spreadsheet to the ingest production ui to check the validity of the spreadsheet.
In order to upload a spreadsheet, you need to attach the spreadsheet as a submission to a project that already exists in the ingest UI:
- Click register project on the homepage
- Fill in as much project metadata as possible
- Register the project
- In the project page, go to the `3. Data upload` tab and click the `Submit to Project` button to upload the spreadsheet
  - If you added project metadata in your spreadsheet (preferable), select `Update the project metadata.` This will overwrite the existing project metadata.
  - If the spreadsheet does not have project metadata, select `Do not update the project metadata.` This will ignore the project worksheet in the spreadsheet.
Metadata validation
If any metadata is invalid, you will need to resolve any errors or problems. This can be done directly in the UI or by uploading a new fixed spreadsheet to the existing project.
- To upload a fixed spreadsheet to the project:
  - Fix the errors in the spreadsheet and return to the existing project in the UI
  - Click the `3. Upload` tab to view the submissions
  - Delete the submission with errors by clicking the trash icon next to the submission
  - Go to the `3. Upload` tab and click the `Submit to Project` button to upload the fixed spreadsheet. (If the button doesn’t appear, try refreshing the page.)
  - Repeat these steps until you have a project with valid metadata
- To edit directly in the UI, click the pencil symbol on a specific entity to open the editing form
  - Change any fields that need to be edited
  - Click save
Please note:
- Once a project has been created in the UI, it is best practice to retain the project’s unique identifier throughout the submission and validation process, so please only delete the project if there are serious issues with project level metadata that cannot be fixed easily in the UI.
- There should never be duplicate projects in the production UI; if you do need to re-upload an entire project, please delete the existing project before re-uploading a spreadsheet.
Sync data
Transferring data from hca-util upload area to ingest upload area
Once the contributor has uploaded all the data that is specified for the project or you have transferred the raw data files from the archive into an hca-util upload area and you have a valid metadata submission in the ingest UI, follow the hca-util guide to sync the data to the Upload Area Location that is specified on the submission at the bottom of the Data tab.
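A minimal sketch of the sync step is below, assuming the relevant UUIDs are to hand; the exact argument form of the sync command is an assumption, so copy the Upload Area Location exactly as shown in ingest and follow the hca-util guide for the precise syntax.

```shell
# Select the hca-util area that holds the project's data files.
hca-util select <hca-util-area-uuid>

# Sync its contents to the ingest Upload Area Location shown at the bottom of the
# submission's Data tab (argument form is an assumption; see the hca-util guide).
hca-util sync <ingest-upload-area-uuid>
```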
Experimental graph validation
The graph validator allows wranglers to run tests that check the submission against our current graph assumptions.
Validation currently happens automatically in ingest. Once the metadata is validated against the schema, a button will appear in the “Validate” tab in ingest. Click on it and your experimental graph will be validated against the tests currently deployed to master.
Any test that fails to pass will show a useful error message, alongside the entity that is related to the error. An example of this:

Here you can see the list of rules. If you want to run the tests locally, or suggest a new test/report a bug, please follow the documentation in the ingest graph validator repository. Local deployment allows wranglers to visualise the experimental graph of the experiment.
Secondary Review
Once the spreadsheet has passed both phases of validation, the primary wrangler should ask another wrangler in the team to review the spreadsheet and suggest any required edits or updates. Once someone asks for secondary review, they should move the ticket to the Review status on the tracking board.
If any edits or updates are made, the existing submission in ingest will need to be deleted and the new spreadsheet uploaded in its place.
If any changes may have affected the linking in the spreadsheet, it should also be run through the ingest-graph-validator again.
A detailed guide to performing secondary review can be found here.
Once both the Primary and Secondary wrangler are happy with the submission and it is valid in ingest, the data files can now be moved from the contributor bucket into the ingest upload area.
Completing the submission
Once all the files and experimental graph have been validated and reviewed, the project will be ready for submission via Ingest UI.
If wrangling the project with direct input from a contributor, the primary wrangler should email the contributor to confirm:
- They are happy with the final spreadsheet and curated ontologies, and
- The date they have identified for the data and metadata to be released publicly
Currently, you can export either:
- only metadata: `Submit only metadata to the Human Cell Atlas.` This is useful only for updates.
  - Known bug: If only metadata is exported and the data files are not in the staging area, the Import team’s validation will fail. To prevent that, remove all metadata/descriptors/links related to files not in staging.
- full submission: `Submit metadata and data to the Human Cell Atlas.` This will make the submission public with the next snapshot.
Always untick the `Delete the upload area and data files after successful submission.` checkbox. This is because once data is removed from the ingest upload area, the Upload API does not allow re-assignment of data files.
Once the submission has been submitted, there is a separate export SOP beyond ingest:
Export
A quite nice Data Portal Release SOP has been implemented by the Broad folks here.
1. As soon as a dataset is ready for export, the wrangler should hit the submit button in the UI with the `Submit to the Human Cell Atlas...` checkbox ticked to trigger export, and note the project UUID.
   - Retrieve the project UUID from the URL of the project page in the ingest browser, or by clicking the “ID” value in the top right of the project page
2. The submitting wrangler checks export is complete.
   - Check the status in the UI; it should change from exporting to exported. (This will take ~1-4 hours for most projects)
   - If export is “stuck” in exporting for more than 4 hours, notify the ingest operations developer via slack and provide the project UUID so they can review the logs and work out what has happened. They will work with the wrangler to resolve this and re-export if necessary.
3. The wrangling team is notified of export
   - Move the dataset wrangling ticket to the `Verify` pipeline of the wrangling board
4. The Broad data import team are notified of successful export
   - Fill in the HCA Import request for Production Releases (see values below)
   - Wait until the monthly release
5. The submitting wrangler is notified that import has been successful, or if there are issues for EBI to investigate
   - The Broad data import team will notify wranglers via slack in the dcp-ops channel of any import problems (see below) or forward the ticket to the UCSC team
6. The UCSC Browser team will notify the submitting wrangler and the Broad team when the release is indexed and in the browser, or if issues are encountered.
   - They will notify wranglers via slack in the dcp-ops channel when a release is in the browser to review, or of any issues.
7. A wrangler will do a final check that everything looks ok and notify UCSC on the data-ops channel.
   - If a big issue occurs, the snapshot can be removed from the release, fixed and re-exported in the next release.
   - If a minor issue occurs, the dataset can stay in the release but a fix will be re-exported in the next release.
   - In order to move the submission status, you can edit *any* field and the submission state will go to draft and validate.
   - Contents of the project staging area in the staging bucket might need to be deleted as well
   - The wrangler will trigger export by hitting submit and following steps 2-7 until the project is available and looks ok
8. When the project is available in the browser, the wrangler can email the contributor or contacts from the publication to inform them of the URL where the project is available
What to review in browser?
When a wrangler is reviewing data in the browser, we want to make sure that all the changes have been applied and nothing unexpected is shown in the Portal. Things to check include:
- all faceted fields are as expected (right-hand side of the page)
- contributors’ names are presented as expected
- matrix files are shown in tab as expected
- publication name and url is valid
- file formats and number of files are as expected
- biomaterial and atlas icon is correct
Import Form Details for DCP data releases
Import form
Import form result sheet
| Field | Explanation |
|---|---|
| Email address | So you can be contacted if any issues with import |
| Release # | The integer number of the ~monthly release cutoff |
| Your Institution | The institution of submitting wrangler |
| Project Name | Project name of exported project |
| Is this new data, updated data or analysis results? | Choose the appropriate responses. |
| Project UUID | Exported project uuid |
| Project DUOS ID | If project is MA, specify the assigned DUOS-ID |
| Does this project contain Managed Access data? | Yes/ No |
| Additional info | Any other notes you want to communicate to the import team. |
| Optional Release Note for HCA Releases | Public notes to be included in Release notes |
| Additional Info | Internal information for other components |
| Project Contact | Corresponding author of the project to be contacted upon release |
| Highlight in an upcoming DCP users announcement? | Help engagement team to select projects for Release notes |
Notes
- EBI will export on demand, and notify the Broad via import form
- Broad will batch import once prior to the monthly release (check the exact date on Import form release description)
- Contents of the staging area should be deleted for a clean export, unless data already in staging can be used
- See this sheet for a rolling list of projects where an import request form has been filled out.
Warning: Wranglers should be aware of when prod releases are occurring and not upload/submit until after the release to that environment is completed. Releases do not currently follow a set schedule so stay tuned to updates posted in the #hca slack channel in the AIT workspace. See the Ingest release SOP for more details.
Additionally, move all the corresponding documents to Google Drive/Brokering/PROJECTS-FINISHED
TDR import errors
If errors occur when importing into TDR, we have an SOP that provides some insights on how to handle them: https://ebi-ait.github.io/hca-ebi-wrangler-central/SOPs/After_export/tdr_issues.html
Brokering to SCEA
hca_to_scea_tools_SOP: See documentation on the hca-to-scea repo