Guidelines

Purpose of the document

As wranglers come and go, we have built up over the years a set of best practices that need to be maintained to keep our data FAIR.

This document is therefore meant to be a living document, kept up to date with the best practices, historical and new, that arise from wrangling datasets.

General Best Practices

VERY IMPORTANT: Do not use “||” in the project spreadsheet except to link entities. It should not end up in the metadata of any entity.
Always have someone review your dataset before submission.
Look for examples to model yours after if it’s a cell line dataset or one with a unique experimental design.
Use the assay cheat sheet for existing, standard assays so that we stay consistent across the Lattice, EBI and UCSC wrangling teams.
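
One rough way to catch stray “||” separators before submission is to scan the spreadsheet rows programmatically. The sketch below assumes the rows have already been loaded as dictionaries and that you supply the set of linking columns yourself. The function name and column names are illustrative only and are not part of ingest.

```python
def find_stray_separators(rows, linking_columns):
    """Return (row_index, column) pairs where '||' appears outside linking columns."""
    problems = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if column in linking_columns:
                continue  # '||' is allowed when linking entities
            if isinstance(value, str) and "||" in value:
                problems.append((index, column))
    return problems


# Hypothetical rows and column names, for illustration only.
rows = [
    {
        "cell_suspension_id": "cs_1",
        "input_specimen_ids": "sp_1||sp_2",   # linking column: allowed
        "description": "PBMCs||sorted",       # metadata field: should be flagged
    },
    {
        "cell_suspension_id": "cs_2",
        "input_specimen_ids": "sp_3",
        "description": "PBMCs, unsorted",
    },
]
linking_columns = {"input_specimen_ids"}

for row_index, column in find_stray_separators(rows, linking_columns):
    print(f"row {row_index}: '||' found in {column}")
```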

Project

General

Fill the following from the main project page in ingest UI:
- Cell count
- Technology
- Species
- Organ
- Accessions
- Estimated cell count

(Some guidelines can be found here)

Fill out the admin area; this helps other people who use ingest as a back office.
Project shortname: should be more descriptive than an ID (for example, CoolOrganProject).

Project - Contributors

Don’t forget to add yourself as a contributor!
- EBI-Specific: Your institution is EMBL-EBI
- Your role should be data curator (This ensures that you appear as a data curator in the browser)
Try to add the email address of the corresponding contributor(s); it is usually available. This helps scientists who want to re-use the data to contact the authors or someone who can point them in the right direction.

Project - Funders

If the ID of the grant is unknown, use the term unspecified, lowercase. This helps maintain consistency.
Seed Network funded datasets should be wrangled by the Lattice team; when in doubt, please contact them.

Biomaterials

All biomaterials

Use a human-readable, descriptive name and description when possible.
Use the term normal for disease if the donor does not have any disease stated (as opposed to leaving it blank)
When assessing the ID of the biomaterials:
- If the contributor listed an ID that looks like a patient number, remove it and assign a unique ID
- Scan for information that may have been accidentally shared with us and could jeopardise patient privacy

Donor organism

Fetal samples are their own donors
For a fetal donor, age should be the gestational age.
Use HsapDv for human-specific development stages. Avoid over-specific ontology terms for non-fetal donors, since the “age” field already captures that; favour broad terms such as “child stage” or “human adult stage”.
- Use EFO for humans only when age is not publicly available for GDPR-affected donors (living individuals)
If donors are alive at the time of biomaterial collection, you shouldn’t ask the authors for extra metadata. This keeps us consistent with our GDPR guidelines, which state that for living individuals we wrangle only data that is publicly available (diagram here: https://ebi-ait.github.io/hca-ebi-wrangler-central/SOPs/GDPR_Guidelines.html)
Donor organism name can match the description
If the dataset is curated from publicly available sources, chances are the donor organism does not have an accession. Sample accessions are usually generated automatically by ENA/SRA when archiving, and usually scientists take samples as the input to the sequencing processes.
Genus species must be filled in even though the schema does not require it.
Donors who received an organ via transplant are still the primary donors; the transplant can be captured in the donor description:
- “56YO Latin female with a heart transplant”
- Example dataset: KidneybiopsyscRNA-seq
In xenograft experiments:
- The donor is the organism the sample comes from, not the one the sample is grafted into. Think of the grafted individual as a “glorified petri dish”
- Xenograft would be mentioned in the cell suspension section, under growth environment or free text description
- Example dataset: transplantedHumanIsletsNuclei

Specimen from organism

For the organ, be as general as possible; for the organ part, be as specific as possible. When we integrate ontology expansion, this information can help create a very detailed map of the organ sampled.
Disease at the specimen level is arguably more important than donor-level disease. Make sure to include this information.
- A donor could have HepC while the specimen they donate is still normal
Some specimens are better listed as systems instead of organs.
- Some datasets in the DCP use organ = bone marrow, while others use organ = hematopoietic system / organ_part = bone marrow

Cell Suspension

For SS2 datasets, set the estimated cell count to 1 for each cell suspension
- Well and plate number are nice to have for QC purposes
The input biomaterial can be any of the following depending on the experiment:
- Specimen (single cell/nuclei OR bulk)
- Cell line
- Organoid
- Cell suspension (WIP; please see this ticket)

Organoids

Some are made via differentiation protocols; others are made by growing multipotent cells together, which signals differentiation without an applied protocol.
- Example datasets
  - Differentiation included: HumanCerebralOrganoidsFetalNeocortex
Input biomaterial can be:
- Specimen
- Cell line

Cell line

Cell lines should always have information about the donor in the Donor organism tab.
- Example dataset: pyleSkeletalMuscle
A cell line can be derived from another cell line, in which case the parent cell line is also its input biomaterial
- Example dataset: iPSCderivedTenocyte
If the cell line was purchased, please use the name listed here: Cellosaurus. This helps with consistency within the database.
Organ model - for embryonic and pluripotent stem cells this is not ideal, but you can list embryo or whole body
If it’s an embryonic cell line, the donor would be the female that donates the embryonic tissue.

Imaging

Visium datasets are modelled with the following subgraphs:
- donor_organism → specimen_from_organism → imaged_specimen → …
- imaged_specimen → image_file
- imaged_specimen → sequence_file
It’s helpful to use the Visium Spatial Gene Expression ontology term for the library preparation protocol that generates the Visium FASTQ files
FFPE vs Fresh-Frozen information can be stored in the preservation method term
Add the permeabilization time to the imaging_preparation_protocol
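
To make the Visium subgraphs above easier to reason about, they can be written down as a small adjacency map. This is only a sketch: it covers the entity types named in this section, leaves out protocols and whatever follows the “…”, and the helper name `downstream` is made up for illustration.

```python
# Entity-type links for a Visium dataset, as listed in this section.
VISIUM_LINKS = {
    "donor_organism": ["specimen_from_organism"],
    "specimen_from_organism": ["imaged_specimen"],
    "imaged_specimen": ["image_file", "sequence_file"],
}


def downstream(entity, links):
    """Return every entity type reachable from `entity`, in breadth-first order."""
    found = []
    queue = list(links.get(entity, []))
    while queue:
        current = queue.pop(0)
        if current in found:
            continue  # already visited
        found.append(current)
        queue.extend(links.get(current, []))
    return found


print(downstream("imaged_specimen", VISIUM_LINKS))
```

Both file types hang off the imaged specimen, which is why the image and sequence files of one tissue section end up linked to the same biomaterial.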

Files

All files

Always fill in the file source. This helps downstream components identify the CGMs.
- Do not use DCP/2 Ingest. That term is reserved for the spreadsheet generated by ingest.
Always include ontologies for content description
- If there is a zip file, list multiple ontologies as an array

Image File

All image files, and files related to analyzing, understanding or processing the image files, should be in this tab.
This includes .csv files containing spatial barcodes, and files that link the annotations of the image file to image coordinates.
The JSON files containing the spot diameter and scale factors for the image acquisition are important for the reusability of the data. These should be included, and the recommended ontology term for their content description is data:3546: Image metadata.
Obtain HIGH RESOLUTION images if possible. Low resolution images are not useful for analysis.

Sequence File

Remember to always fill out the process IDs; one per run (Multiple FASTQs may be grouped together by this method).
- If there are multiple sets (R1, R2, I1…) of FASTQs per run (e.g. technical replicates), please give each set a different lane index.
Fill out the library IDs as well, following the same approach as above.
Input biomaterial is always a cell suspension or an imaged specimen.
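
The process ID and lane index rules above can be sketched as follows. This assumes you already know which FASTQs belong to which set within a run; the function name, record keys and file names are hypothetical and do not mirror the actual spreadsheet columns.

```python
def assign_lane_indices(process_id, fastq_sets):
    """Give all files of a run the same process ID and each set its own lane index."""
    records = []
    for lane_index, fastq_set in enumerate(fastq_sets, start=1):
        for file_name in fastq_set:
            records.append(
                {
                    "file_name": file_name,
                    "process_id": process_id,  # one per run
                    "lane_index": lane_index,  # one per set within the run
                }
            )
    return records


# Hypothetical run with two technical replicate sets.
run_sets = [
    ["rep1_R1.fastq.gz", "rep1_R2.fastq.gz", "rep1_I1.fastq.gz"],
    ["rep2_R1.fastq.gz", "rep2_R2.fastq.gz", "rep2_I1.fastq.gz"],
]
for record in assign_lane_indices("run_1", run_sets):
    print(record)
```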

Analysis files

The analysis files content description cheat sheet should contain the values needed for the content description field. Open for discussion!
- There’s also some discussion here
Inputs to analysis_files should always be cell_suspensions rather than sequence_files. Reasons for this modelling decision:
- Tricky to model file → file relationships.
- Fewer links are needed from cell_suspension to analysis_file, as there are generally many sequence_files generated from one cell_suspension.
Differential gene expression files can be included in the DCP if the following information is known:
- Required: The groups compared can be clearly identified. This information should be included in the description field
- Optional but very appreciated:
  - Software
  - Covariates
  This information should be included in the analysis protocol. The ontology term for this protocol should be analysis of matrices (EFO:0030024)

Protocols

Dissociation

No dissociation protocol is needed for blood and other fluids.
Dissociation is applied to a solid specimen that needs to be broken down into smaller pieces before being split into cell suspensions.

Enrichment

Not ideal, but you can use this schema to capture nuclei isolation from cells for snRNA-seq

Differentiation

Applicable to cell lines or organoids
Example: an H1 cell line was differentiated into a cardiomyocyte cell line, which was then dissociated and processed into a cell suspension
- Input biomaterial for H1 cell line would be an embryo specimen
- Input biomaterial for cardiomyocyte would be an H1 cell line
- Input biomaterial for cell suspension would be a cardiomyocyte cell line
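
The H1 example above amounts to a chain of input links that can be followed back to the specimen. A minimal sketch, assuming each entity has a single input; the dictionary and the function name `trace_inputs` are illustrative only:

```python
# Each key is a biomaterial; each value is its input biomaterial (from the H1 example).
INPUT_OF = {
    "cell suspension": "cardiomyocyte cell line",
    "cardiomyocyte cell line": "H1 cell line",
    "H1 cell line": "embryo specimen",
}


def trace_inputs(entity, input_of):
    """Follow input links upstream and return the chain, starting at `entity`."""
    chain = [entity]
    while chain[-1] in input_of:
        parent = input_of[chain[-1]]
        if parent in chain:
            raise ValueError(f"circular input link at {parent!r}")
        chain.append(parent)
    return chain


print(" <- ".join(trace_inputs("cell suspension", INPUT_OF)))
```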

iPSC induction

Applied only to a cell line

Library Prep

Source of truth for existing assays: https://docs.google.com/spreadsheets/d/1H9i1BK-VOXtMgGVv8LJZZZ9rbTG4XCQTBRxErdqMvWk/edit#gid=0
- Modified versions of assays need to be double checked

Field	Ontology	URL
species	NCBITaxon	NCBITaxon
ethnicity	HANCESTRO	HANCESTRO
developmental stage	HsapDv (for human), EFO (for mouse)	HsapDv
disease	MONDO, PATO (if normal)	MONDO
units of measurement	UO	UO
enrichment method	EFO	EFO
dissociation method	EFO	EFO
collection method	EFO	EFO
library preparation method	EFO	EFO
sequencing approach	EFO	EFO
organ & organ_part	UBERON	UBERON
cell type	CL	CL
biological macromolecule	OBI, CHEBI	OBI, CHEBI
library pre/amplification	OBI, EFO	OBI, EFO
sequencing instrument	EFO	EFO
file content description	data (EDAM)	EDAM
project role	EFO	EFO
mouse strain	EFO	EFO
cell cycle	GO	GO
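
The table above can double as a quick sanity check on ontology terms in a spreadsheet. The sketch below copies a few rows of the table into a dictionary and checks only the prefix of a CURIE-style ID; it does not confirm that the term exists in the ontology. The function name and the example IDs are illustrative assumptions.

```python
# Subset of the field -> ontology table above; extend as needed.
ALLOWED_ONTOLOGIES = {
    "species": {"NCBITaxon"},
    "ethnicity": {"HANCESTRO"},
    "developmental stage": {"HsapDv", "EFO"},
    "disease": {"MONDO", "PATO"},
    "organ": {"UBERON"},
    "organ_part": {"UBERON"},
    "cell type": {"CL"},
    "file content description": {"data"},  # EDAM
}


def uses_expected_ontology(field, curie):
    """Return True if the ID's prefix is one of the ontologies listed for `field`."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or not local_id:
        return False  # not of the form PREFIX:ID
    allowed = {name.lower() for name in ALLOWED_ONTOLOGIES.get(field, set())}
    return prefix.lower() in allowed


print(uses_expected_ontology("disease", "PATO:0000461"))
print(uses_expected_ontology("organ", "CL:0000000"))
```

The comparison is case-insensitive on purpose, since prefixes such as HsapDv are often typed in different cases.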

Archiving

Once the submission has been successfully archived, accessions should be communicated back to the contributor. If there is a risk that the deadline the contributor gave will not be met, the contributor should be contacted to inform them of the risk and offer alternatives or workarounds. The project level accessions should be provided within the main body of the email.

By default, the release date is set to 2 years after the submission is archived. This date can be changed to an earlier one (provided by the contributor), but we won’t hold the data for more than 2 years. After 2 years the data will be made public.
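
The release date rule can be sketched in a few lines. This is an illustration under stated assumptions, not how the archives compute it: the function names are made up, a requested date later than the limit is simply capped at the 2-year mark, and 29 February is mapped to 28 February.

```python
from datetime import date

MAX_HOLD_YEARS = 2


def latest_release_date(archived_on):
    """Return the date 2 years after archiving, the latest the data can stay private."""
    try:
        return archived_on.replace(year=archived_on.year + MAX_HOLD_YEARS)
    except ValueError:
        # 29 February has no counterpart two years later; fall back to 28 February.
        return archived_on.replace(year=archived_on.year + MAX_HOLD_YEARS, day=28)


def release_date(archived_on, requested=None):
    """Use the contributor's earlier date if given, never later than the 2-year limit."""
    latest = latest_release_date(archived_on)
    if requested is None:
        return latest
    return min(requested, latest)


print(release_date(date(2022, 3, 15)))
print(release_date(date(2022, 3, 15), requested=date(2023, 1, 1)))
```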