Guidelines

Table of contents

  1. TOC {:toc}

Purpose of the document

As wranglers come and go, we have generated over the years a set of best practices to be maintained for the sake of the FAIRness of our data.

As such, this document should be an updated living document containing the best practices, historical and new, that arise from wrangling datasets.

General Best Practices

  • VERY IMPORTANT: Do not use “||” in the project spreadsheet apart from when used for linking entities. It should not end up in the metadata of any of the entities.
  • Always have someone review your dataset before submission
  • Look for examples to model after if it’s a cell line dataset or one with a unique experimental design
  • Use the assay cheat sheet for existing, standard assays so that we stay consistent across Lattice, EBI and UCSCwrangling teams

Project

General

  • Fill the following from the main project page in ingest UI:
    • Cell count
    • Technology
    • Species
    • Organ
    • Accessions
    • Estimated cell count

(Some guidelines can be found here)

  • Fill out the admin area, this helps other people when using ingest as a backoffice
  • Project shortname → should be more descriptive than IDs (For example: CoolOrganProject.)

Project - Contributors

  • Don’t forget to add yourself as a contributor!
    • EBI-Specific: Your institution is EMBL-EBI
    • Your role should be data curator (This ensures that you appear as a data curator in the browser)
  • Try to add the email address for the corresponding contributor(s) (Usually available). This will help the scientists trying to re-use the data contact the authors or someone that can point them in the right direction

Project - Funders

  • If the ID of the grant is unknown, use the term unspecified, lowercase. This helps maintain consistency.
  • Seed Network funded datasets should be wrangled by the lattice team; when in doubt, please contact them.

Biomaterials

All biomaterials

  • Use a human-readable and nice description/name when possible
  • Use the term normal for disease if the donor does not have any disease stated (as opposed to leaving it blank)
  • When assessing the ID of the biomaterials:
    • If the contributor listed an ID that looks like a patient number, remove it and give unique ID
    • Scan for information that may have been accidentally shared with us and could jeopardise patient privacy

Donor organism

  • Fetal samples are their own donors
  • If it is a fetal donor age should be gestational age
  • Use HSAPDV for human-specific development stages. Avoid using over-specific ontologies for non-fetal donors, as we have the “age” field for that, favouring ontologies such as “child stage” or “human adult stage”.
    • Use EFO for humans only when age is not publicly available for GDPR-affected donors (Living individuals)
  • If donors are alive at the time of biomaterial collection, you shouldn’t ask for extra metadata to the authors. This is to maintain consistency with our GDPR guidelines, that state that for living organisms, we wrangle data that is publicly available (Diagram here https://ebi-ait.github.io/hca-ebi-wrangler-central/SOPs/GDPR_Guidelines.html)
  • Donor organism name can match the description
  • If the dataset is curated from publicly available sources, chances are the donor organism does not have an accession. Sample accessions are usually generated automatically by ENA/SRA when archiving, and usually scientists take samples as the input to the sequencing processes.
  • Genus species is required even though it’s not schema required
  • Donors of organs that were given to them via transplant are still the primary donors, the transplant part could be captured in the donor description:
  • In xenograft experiments:
    • The donor is the one that the sample comes from, not the one that the sample is grafted inside. Think of the grafted individual as a “glorified petri dish”
    • Xenograft would be mentioned in the cell suspension section, under growth environment or free text description
    • Example dataset: transplantedHumanIsletsNuclei

Specimen from organism

  • For the organ, be as general as possible; for the organ part, be as specific as possible. When we integrate ontology expansion, this information can help create a very detailed map of the organ sampled.
  • Disease here is almost more important than the donor level disease. Make sure to include information.
    • Donor could have HepC, but the specimen they donate is still normal
  • Some specimens are better listed as systems, instead of organs
    • Some datasets in the dcp label organ = bone marrow and others use organ=hematopoietic system/organ_part = bone marrow

Cell Suspension

  • For SS2 datasets, set the estimated cell count to 1 for each cell suspension
    • Well and plate number are nice to have for QC purposes
  • The input biomaterial can be any of the following depending on the experiment:
    • Specimen (single cell/nuclei OR bulk)
    • Cell line
    • Organoid
    • Cell suspension (WIP; please see this ticket)

Organoids

  • Some are made via differentiation protocols, others are made by growing multipotent cells together which signal differentiation without an applied protocol
  • Input biomaterial can be:
    • Specimen
    • Cell line

Cell line

  • Cell lines should always have information about the donor in the Donor organism tab.
  • A cell line can make another cell line (also the input)
  • If the cell line was purchased, please use the name listed here: Cellosaurus. This helps with consistency within the database.
  • Organ model - for embryonic and pluripotent stem cells this is not ideal, but you can list embryo or whole body
  • If it’s an embryonic cell line, the donor would be the female that donates the embryonic tissue.

Imaging

  • Visium Datasets are modelled with the following subgraphs:
    • Donor_organism → specimen_from_organism → imaged_specimen → …
    • Imaged_specimen → image_file
    • Imaged_specimen → sequence_file
  • It’s helpful to use the Visium Spatial Gene Expression ontology term as the library preparation protocol to generate the Visium fastq files
  • FFPE vs Fresh-Frozen information can be stored in the preservation method term
  • Add the permeabilization time to the imaging_preparation_protocol

Files

All files

  • Always fill up the file source. This helps the downstream components to identify the CGMs.
    • Do not use DCP/2 Ingest. That term is reserved for the spreadsheet generated by ingest.
  • Always include ontologies for content description
    • If there is a zip file, list multiple ontologies as an array

Image File

  • All image files and files related to analyzing, understanding, processing the image files should be in this tab
  • This includes .csv files containing spatial barcodes, and files which link the annotations of the image file image coordinates
  • The JSON files containing the spot diameter and scale factors for the image acquisition are important for the reusability of the data. These should be included, and the recommended ontology term for their content description is data:3546: Image metadata.
  • Obtain HIGH RESOLUTION images if possible. Low resolution images are not useful for analysis.

Sequence File

  • Remember to always fill out the process IDs; one per run (Multiple FASTQs may be grouped together by this method).
    • If there are multiple sets (R1, R2, I1…) of FASTQs per run (e.g. technical replicas), please give each of them a different lane index.
  • Library IDs, with the same information as above.
  • Input biomaterial is always a cell suspension or an imaged specimen.

Analysis files

  • Analysis files content description cheat sheet should contain the needed values for the content description field. Open for discussion!
    • There’s also some discussion here
  • Inputs to analysis_files should always be cell_suspensions rather than sequence_files. Reasons for this modelling decision:
    • Tricky to model file → file relationships.
    • Fewer linkings for cell_suspension to analysis_file, as generally there are many sequence_files generated from one cell_suspension.
  • Differential gene expression files can be included in the dcp if the following information is known:
    • Required: The groups compared can be clearly identified. This information should be included in the description field
    • Optional but very appreciated:
      • Software
      • Covariates

      This information should be included in the analysis protocol. The ontology term for this protocol should be analysis of matrices (EFO:0030024)

Protocols

Dissociation

  • No dissociation protocol for blood and other fluids
  • Dissociation is applied to a solid specimen that need to be broken down into smaller pieces before split into cell suspensions.

Enrichment

  • Not ideal, but you can use this schema to capture nuclei isolation from cells for snRNA-seq

Differentiation

  • Applicable to cell lines or organoids
  • Example: H1 cell line was differentiated into a cardiomyocyte cell line which was then dissociated and processed into a cell suspension
    • Input biomaterial for H1 cell line would be an embryo specimen
    • Input biomaterial for cardiomyocyte would be a H1 cell line
    • Input biomaterial for cell suspension would be a cardiomyocyte cell line

IPSC induction

  • Applied only to a cell line

Library Prep

Field Ontology url
species NCBITaxon NCBITaxon
ethnicity HANCESTRO HANCESTRO
developmental stage HsapDv (for human), EFO (for mouse) HSAPDV
disease MONDO, PATO (if normal) MONDO
units of measurement UO UO
enrichment method EFO EFO
dissociation method EFO EFO
collection method EFO EFO
library preparation method EFO EFO
sequencing approach EFO EFO
organ & organ_part UBERON UBERON
cell type CL CL
biological macromolecule OBI, CHEBI OBI CHEBI
library pre/amplification OBI, EFO OBI, EFO
sequencing instrument EFO EFO
file content description data (EDAM) EDAM
project role EFO EFO
mouse strain EFO EFO
cell cycle GO GO

Archiving

Once the submission has been successfully archived, accessions should be communicated back to the contributor. If there is a risk that the deadline the contributor gave will not be met, the contributor should be contacted to inform them of the risk and offer alternatives or workarounds. The project level accessions should be provided within the main body of the email.

By default, the release date will be set up to 2 years from the moment the submission is archived. This date can be changed to an earlier date (Provided by the contributor) but we won’t hold the data for more than 2 years. After 2 years the data will be made public.