Spreadsheet Quick Guide

Before you start
Spreadsheet tab organization
How to link entities
1. Linking biomaterials and files
2. Linking biomaterials and protocols
How to indicate a library preparation was sequenced more than once
How to link more than one entity in a cell
Example spreadsheets
Additional guidelines on how to curate specific technologies
Additional resources about HCA metadata

Google Sheets and Excel spreadsheets are used to gather metadata about your project. This document is a brief walkthrough to help you get started filling out an HCA metadata spreadsheet.

If you have any questions at any stage while filling out the spreadsheet, please contact the data wrangling team at:
wrangler-team@data.humancellatlas.org.

Before you start

Metadata is represented in a spreadsheet with tabs that relate to the experimental design. For example, an Organoid tab is included if the experiment includes organoid samples, and a Donor organism tab is included if the samples derive from a donor. If you think your spreadsheet is missing fields or tabs needed to properly describe your experiment, please contact a data wrangler at wrangler-team@data.humancellatlas.org.

Spreadsheet tab organization

Metadata is collected for each component of your experiment, for example: project, donor, sequence file, and dissociation protocol. We refer to each component as an entity. Each of these entities has a separate tab in the spreadsheet.

Each row in each tab corresponds to one instance of the entity described by the tab name. For example, in the Donor organism tab, each row describes one donor. In the Sequence file tab, each row describes one sequence file. The same applies for the Analysis file tab. Only one row in the Project tab should be filled in to describe the whole project.

Tab layout

All tabs share some common properties:

The first row is the metadata field name
The second row is the field description
The third row contains one or two example values and, where appropriate, guidelines; multiple examples are separated with ;
The fourth row is used for programmatic spreadsheet processing and is hidden
The fifth row is intentionally left blank
You should enter your values from the sixth row down
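
The row layout above can be sketched in a few lines of Python. This is only an illustration, assuming a tab has already been loaded as a list of rows of strings; the function name `tab_records`, the SEX column, and all cell contents are made up for the example and are not part of any HCA tool.

```python
HEADER_ROW = 0      # first row: metadata field names
FIRST_DATA_ROW = 5  # sixth row: where your own values begin


def tab_records(rows):
    """Return one dict per filled-in data row, keyed by field name."""
    header = rows[HEADER_ROW]
    records = []
    for row in rows[FIRST_DATA_ROW:]:
        if not any(cell.strip() for cell in row):
            continue  # ignore completely empty rows
        records.append(dict(zip(header, row)))
    return records


# Hypothetical contents of a Donor organism tab.
donor_tab = [
    ["DONOR ORGANISM ID", "SEX"],                       # row 1: field names
    ["A unique ID for the donor.", "Sex of the donor."],  # row 2: descriptions
    ["donor1; donor2", "female; male"],                 # row 3: examples, separated with ;
    ["(hidden)", "(hidden)"],                           # row 4: used programmatically
    ["", ""],                                           # row 5: intentionally blank
    ["donor1", "female"],                               # row 6 onwards: your values
    ["donor2", "male"],
]

records = tab_records(donor_tab)
print(records)
```

Only rows six and seven end up in `records`; the description, example, hidden, and blank rows are skipped.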

Project metadata tabs

The following tabs should be in your spreadsheet as they apply to most projects:

Project (fill out one row)
Project - Contributors
Project - Funders
Project - Publications (only fill out if your data has been published or has a pre-print)

Biomaterial metadata tabs

We use the term ‘Biomaterial’ to refer to all parts of your experiment that derive from a biological source.

The following tabs may be in your sheet depending on your experimental design:

Donor organism
Specimen from organism
Cell suspension
Cell line
Organoid
Imaged specimen

File metadata tabs

Every data or supplementary file associated with your project requires an entry in one of the following tabs:

Sequence file
Analysis file
Image file
Supplementary file

Protocol metadata tabs

Each protocol used in your experiment needs to be described once in the relevant protocol tab (listed below). Once described, it can be referred to by its ID to link the different biomaterial and file entities.

Aggregate generation protocol - Used in organoid experiments to describe how cultured cells are developed into cell aggregates
Collection protocol - Commonly used between Donor organism and Specimen from organism to describe the process of collecting an organ or tissue sample from a donor.
Differentiation protocol - Used in cell line and organoid experiments to describe how a cell is differentiated into a particular cell type or organoid
Dissociation protocol - Used to describe how a Specimen from organism was separated into individual cells or nuclei. This is not required if the collected specimen is a fluid
Enrichment protocol - Used to describe how a biomaterial was enriched for a particular feature or characteristic of interest, such as a particular cell type. For example, it can be used to describe FACS, MACS, or size filtering
iPSC induction protocol - Used to describe how a biomaterial was treated to become an induced pluripotent stem cell line
Imaging preparation protocol (imaging transcriptomics only)
Imaging protocol (imaging transcriptomics only)
- Probe (imaging only)
- Channel (imaging only)
Library preparation protocol - Used in sequencing experiments to describe how the sequencing library was prepared
Sequencing protocol - Used in sequencing experiments to describe how the library was sequenced
Analysis protocol - Used in sequencing experiments to describe how the count matrices/expression analysis were generated

How to link entities

Linking biomaterials and files

The first ID columns in the Biomaterial and Protocol tabs are unique identifiers for each row in that tab, for example DONOR ORGANISM ID in the Donor organism tab and COLLECTION PROTOCOL ID in the Collection protocol tab. Similarly, the FILE NAME columns in the File tabs are unique identifiers for each row in that tab. Values in these columns must be unique within the spreadsheet. These unique identifiers are used in multiple tabs to link entities together. Please note that files can only be linked to biomaterials, so Analysis files need to be linked with the biomaterial closest to the generation of the file (usually a Cell suspension).
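
Because every identifier must be unique within the spreadsheet, a quick duplicate check before submitting can be helpful. The sketch below assumes the identifier columns have already been read into a dictionary of lists; the function name `duplicate_ids` and the sample values are illustrative only.

```python
from collections import Counter


def duplicate_ids(id_columns):
    """Return identifiers that occur more than once across all tabs.

    id_columns maps a tab name to the values found in that tab's
    first ID column (or FILE NAME column for File tabs).
    """
    counts = Counter(value for ids in id_columns.values() for value in ids)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical identifier columns; the repeated file name is deliberate.
id_columns = {
    "Donor organism": ["donor1"],
    "Specimen from organism": ["donor1_kidney"],
    "Cell suspension": ["donor1_kidney_cells"],
    "Sequence file": ["donor1_kidney_R1.fastq.gz", "donor1_kidney_R1.fastq.gz"],
}

duplicates = duplicate_ids(id_columns)
print(duplicates)
```

An empty result means no identifier is repeated anywhere in the listed tabs.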

For example, consider an experiment with entities that are linked as follows:

Biomaterial flow

In each of the Donor organism, Specimen from organism, Cell suspension, and Sequence file tabs, one entity (row) is assigned an ID. For example: ‘donor1’, ‘donor1_kidney’, ‘donor1_kidney_cells’, and ‘donor1_kidney_(I|R)(1|2).fastq.gz’, respectively.

The first few fields of the Donor organism tab would look like this:

Donor organism tab

Link ‘donor1_kidney’ to ‘donor1’ by entering ‘donor1’ in the INPUT DONOR ORGANISM ID column in the Specimen from organism tab.

Specimen from organism tab

Link ‘donor1_kidney_cells’ to ‘donor1_kidney’ by entering ‘donor1_kidney’ in the INPUT SPECIMEN FROM ORGANISM ID column in the Cell suspension tab.

Cell suspension tab

Link the set of sequencing files:
- ‘donor1_kidney_I1.fastq.gz’
- ‘donor1_kidney_R1.fastq.gz’
- ‘donor1_kidney_R2.fastq.gz’
to ‘donor1_kidney_cells’ by entering ‘donor1_kidney_cells’ in the INPUT CELL SUSPENSION ID column in the Sequence file tab.

Sequence file tab
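
To make the chain of links concrete, the following sketch follows a sequence file back to its donor using the INPUT columns described above. It assumes each tab has been loaded into a dictionary keyed by the row's unique identifier; that in-memory layout and the function name `trace_to_donor` are assumptions made for illustration.

```python
# Each dictionary stands in for one tab, keyed by the row's unique identifier.
donors = {"donor1": {}}
specimens = {"donor1_kidney": {"INPUT DONOR ORGANISM ID": "donor1"}}
cell_suspensions = {
    "donor1_kidney_cells": {"INPUT SPECIMEN FROM ORGANISM ID": "donor1_kidney"}
}
sequence_files = {
    name: {"INPUT CELL SUSPENSION ID": "donor1_kidney_cells"}
    for name in (
        "donor1_kidney_I1.fastq.gz",
        "donor1_kidney_R1.fastq.gz",
        "donor1_kidney_R2.fastq.gz",
    )
}


def trace_to_donor(file_name):
    """Follow the INPUT ... ID links from a sequence file back to its donor."""
    suspension_id = sequence_files[file_name]["INPUT CELL SUSPENSION ID"]
    specimen_id = cell_suspensions[suspension_id]["INPUT SPECIMEN FROM ORGANISM ID"]
    donor_id = specimens[specimen_id]["INPUT DONOR ORGANISM ID"]
    if donor_id not in donors:
        raise KeyError(f"{donor_id} is not described in the Donor organism tab")
    return [file_name, suspension_id, specimen_id, donor_id]


print(trace_to_donor("donor1_kidney_R1.fastq.gz"))
```

A `KeyError` at any step means a link points to an ID that has no row in the corresponding tab.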

Linking biomaterials and protocols

Protocols are used to describe the process of deriving a biomaterial or file from an input biomaterial.

Biomaterials are linked to the protocols used to generate them by entering the protocols’ IDs (from the corresponding Protocol tab) in the respective ID columns in the biomaterial tabs.

For example, consider a sequencing experiment that followed a dissociation protocol to produce a cell suspension from a kidney tissue sample.

Protocol linking

In each of the Specimen from organism, Dissociation protocol, and Cell suspension tabs, one entity is assigned an ID. For example: ‘donor1_kidney’, ‘enzymatic_dissociation’, and ‘donor1_kidney_cells’, respectively.

In this case the first few fields of the Dissociation protocol tab would look like this:

Dissociation protocol tab

We can link ‘donor1_kidney_cells’ to ‘enzymatic_dissociation’ (the protocol that derived the cell suspension) by entering ‘enzymatic_dissociation’ in the DISSOCIATION PROTOCOL ID column in the Cell suspension tab.

Cell suspension tab
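
A protocol ID entered in a biomaterial tab only forms a link if the same ID is described in the matching Protocol tab. The sketch below shows one way to spot IDs that are referenced but not described. The function name `undescribed_protocols`, the CELL SUSPENSION ID column name, and the second row (including ‘mechanical_dissociation’) are assumptions added for the example.

```python
def undescribed_protocols(biomaterial_rows, id_column, protocol_ids):
    """Return protocol IDs used in a biomaterial tab but missing from the Protocol tab."""
    referenced = {row[id_column] for row in biomaterial_rows if row.get(id_column)}
    return sorted(referenced - set(protocol_ids))


# IDs described in the Dissociation protocol tab.
dissociation_protocol_ids = ["enzymatic_dissociation"]

# Rows of the Cell suspension tab; the second row is invented to show a broken link.
cell_suspension_rows = [
    {
        "CELL SUSPENSION ID": "donor1_kidney_cells",
        "INPUT SPECIMEN FROM ORGANISM ID": "donor1_kidney",
        "DISSOCIATION PROTOCOL ID": "enzymatic_dissociation",
    },
    {
        "CELL SUSPENSION ID": "donor2_kidney_cells",
        "INPUT SPECIMEN FROM ORGANISM ID": "donor2_kidney",
        "DISSOCIATION PROTOCOL ID": "mechanical_dissociation",
    },
]

missing = undescribed_protocols(
    cell_suspension_rows, "DISSOCIATION PROTOCOL ID", dissociation_protocol_ids
)
print(missing)
```

Here the check reports ‘mechanical_dissociation’, which would need its own row in the Dissociation protocol tab.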

You can then follow the same process to link other protocols, such as entering the LIBRARY PREPARATION PROTOCOL ID and SEQUENCING PROTOCOL ID in the Sequence file tab to describe the process between Cell suspension and Sequence file.

In the Library preparation protocol tab, fill in details about the library preparation protocol. You may find the parameters to fill for some well known technologies here.

Library preparation tab

In the Sequencing protocol tab, fill in details about how the library was sequenced. You may find the parameters to fill for some well known technologies here.

Sequencing protocol tab

In the Sequence file tab, link each FILE NAME to its INPUT CELL SUSPENSION ID, together with the relevant LIBRARY PREPARATION PROTOCOL ID and SEQUENCING PROTOCOL ID.

Sequencing file tab

If you have more than one read or index file per run, you need to indicate which files come from the same run by filling in the process id. Files with the same process id will be grouped together. See below for an example with read1, read2 and index1.

Sequencing file tab - process id 1
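
The grouping by process id can be pictured as follows. This is a minimal sketch, assuming the Sequence file rows are available as dictionaries; the PROCESS ID key spelling, the values ‘run1’ and ‘run2’, and the second set of file names are made up for illustration.

```python
from collections import defaultdict


def group_by_process_id(sequence_file_rows):
    """Group file names that share the same process id."""
    groups = defaultdict(list)
    for row in sequence_file_rows:
        groups[row["PROCESS ID"]].append(row["FILE NAME"])
    return dict(groups)


# Hypothetical Sequence file rows: two runs with read1, read2 and index1 each.
sequence_file_rows = [
    {"FILE NAME": "donor1_kidney_R1.fastq.gz", "PROCESS ID": "run1"},
    {"FILE NAME": "donor1_kidney_R2.fastq.gz", "PROCESS ID": "run1"},
    {"FILE NAME": "donor1_kidney_I1.fastq.gz", "PROCESS ID": "run1"},
    {"FILE NAME": "donor1_kidney_run2_R1.fastq.gz", "PROCESS ID": "run2"},
    {"FILE NAME": "donor1_kidney_run2_R2.fastq.gz", "PROCESS ID": "run2"},
    {"FILE NAME": "donor1_kidney_run2_I1.fastq.gz", "PROCESS ID": "run2"},
]

groups = group_by_process_id(sequence_file_rows)
for process_id, files in groups.items():
    print(process_id, files)
```

Each process id ends up with exactly the files that were generated together in one run.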

If you have reads from multiple lanes, you need to fill in both the process id, as above, and the lane index. See below for an example sheet.

Sequencing file tab - process id 2

How to indicate a library preparation was sequenced more than once

If the same library preparation has been sequenced more than once, indicate this by giving each library preparation a unique identifier in the LIBRARY PREPARATION ID field in the Sequence file tab. Files that share an identifier come from the same library preparation.

For example:

File Name	LIBRARY PREPARATION ID
file100_R1.fastq.gz	lib_prep_1
file101_R1.fastq.gz	lib_prep_1
file102_R1.fastq.gz	lib_prep_2
file103_R1.fastq.gz	lib_prep_2

This indicates that file100_R1.fastq.gz and file101_R1.fastq.gz were both generated from the same library preparation, and that file102_R1.fastq.gz and file103_R1.fastq.gz were both generated from a second, different library preparation.
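
The same table can be read programmatically by grouping file names on the LIBRARY PREPARATION ID value. The sketch below uses the four rows from the example; the function name `files_by_library_prep` is hypothetical.

```python
from collections import defaultdict

# (FILE NAME, LIBRARY PREPARATION ID) pairs taken from the example table.
rows = [
    ("file100_R1.fastq.gz", "lib_prep_1"),
    ("file101_R1.fastq.gz", "lib_prep_1"),
    ("file102_R1.fastq.gz", "lib_prep_2"),
    ("file103_R1.fastq.gz", "lib_prep_2"),
]


def files_by_library_prep(rows):
    """Map each library preparation ID to the files generated from it."""
    grouped = defaultdict(list)
    for file_name, library_prep_id in rows:
        grouped[library_prep_id].append(file_name)
    return dict(grouped)


by_library_prep = files_by_library_prep(rows)
for library_prep_id, files in by_library_prep.items():
    print(library_prep_id, files)
```

Any library preparation ID that maps to more than one file marks a library that was sequenced more than once.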

How to link more than one entity in a cell

If one entity (e.g. an analysis file) derives from multiple entities of the same type (e.g. multiple cell suspensions), indicate this in the spreadsheet by separating the IDs with the || separator within a single cell.

Multiple links

In this example, we have linked several cell suspensions by their IDs (GSM3841608, GSM3841609, GSM3841610 and GSM3841611) to the analysis file with ID GSE132065_P2568.tsv.gz.
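
For the IDs in this example, the cell value and its individual links can be sketched as below. The function name `split_links` is hypothetical, and stripping stray spaces around each ID is an assumption rather than documented behaviour.

```python
SEPARATOR = "||"


def split_links(cell):
    """Split a cell holding several linked IDs into a list of IDs."""
    return [part.strip() for part in cell.split(SEPARATOR) if part.strip()]


cell_suspension_ids = ["GSM3841608", "GSM3841609", "GSM3841610", "GSM3841611"]

# Value to enter in the INPUT CELL SUSPENSION ID cell of the analysis file row.
cell_value = SEPARATOR.join(cell_suspension_ids)
print(cell_value)

# Reading the cell back gives the individual IDs again.
print(split_links(cell_value))
```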

Example spreadsheets

To see fully worked examples of filled-in spreadsheets for 10X and SmartSeq2 experiments, please use the download links below:

Download 10X example spreadsheet

Download SS2 example spreadsheet

Additional guidelines on how to curate specific technologies

For technologies such as Visium, scATAC-seq and CITE-seq, we provide overviews and curation guidelines.

Visium: overview and guide
scATAC-seq: overview and guide
CITE-seq: overview and guide

Additional resources about HCA metadata

More information about HCA metadata can be found in the HCA Metadata Schema GitHub repo: HumanCellAtlas/metadata-schema