DEPRECATED?

Contributor matrix and cell type annotations

DEPRECATED?
Contributor matrix and cell type annotations

Getting contributor matrix files

For each project, wranglers should endeavour to find an expression matrix and if not embedded within that matrix, a listing of cell type annotations. These are generally linked from a publication, present as a supplementary file on the publication, GEO or ArrayExpress submission.

The preferred formats for matrix files are:

loom
h5ad
RData
RDS

Where either the expression matrix or cell type annotations cannot be found, the primary wrangler should write an email to the contributor/author asking for them to provide the appropriate files in the preferred format. If the contributors cannot provide in the preferred format, we will take whatever is available. It is important to be able to link the cell type annotations to the cell suspensions and/or cell barcodes provided in the metadata.

Filling in metadata about the files

For each expression matrix or cell type annotation file that is found, a row needs to be filled in the metadata spreadsheet, in the ‘Analysis file’ tab. Analysis files can be linked to sequence files or biomaterial entities via processes; This is done in the spreadsheet in the same way that other entities are linked. Information related to the analysis protocol is captured in the Analysis_protocol entity (See the Analysis protocol tab) linked to the process

The best practice is to link the analysis files to sequence file entities, if possible. Alternatively, you can also link the analysis files to cell suspension entities. This is currently done by adding the ‘Input Cell Suspension ID’ column to the ‘Analysis File’ tab and adding the linked cell suspensions to the cell.

The gene expression matrix and cell annotations files should be added to the S3 bucket in the ingest-area together with raw data files, for instructions on how to use hca-util to do this, see ‘here’

The following process here is outdated, but kept for recording: For each file that is found, a row needs to be filled in the contributor_matrices_metadata found in the Contributor Matrices folder in the Brokering folder.

Field	Definition
date_added	the date the row was added to the sheet in YYYY-MM-DD format.
project_uuid	the uuid of the project
project_shortname	the shortname of the project
gex matrix	Y/N whether the file is a count matrix
cell type	Y/N whether the file contains cell type annotations
other	Y/N whether the file is some other kind of file
file_name	name of the file, unchanged from where it was sourced
file_source	Where the file was sourced from: contributor/geo/arrayexpress [any other categories needed here?]
genusSpecies	the species in that file, usually Homo sapiens or Mus musculus
developmentStage	the developmental stage present in the matrix, if more than one species, need to specify which stage goes with which species
organ	the organ present in the matrix, if multiple other fields, need to deconvolute
libraryConstructionApproach	the ontology label of the `library_preparation_method` used to generate the matrix/file, if multiple, need to unambiguously deconvolute
uploaded	Whether it has been uploaded into the google bucket for matrices
date_imported	the date the file was imported by UCSC in YYYY-MM-DD format (filled in by UCSC when import is performed)

If there are multiple values in one cell, they need to be delimited with comma

Uploading the files to the google bucket

Files need to be uploaded to the google bucket

Files from the same project are put into a folder [project_uuid]-[project_shortname]

If you can’t access the bucket in the console link above, you need to request access from the UCSC browser team. Also check whether you have the option to upload/download/delete.

Uploading through the console

Go to the console website https://console.cloud.google.com/storage/browser/hca-prod-ebi-matrices
Select either UPLOAD FILES or UPLOAD FOLDER and choose the files/folder to upload
Don’t navigate away from the page until the upload is complete
Once files are confirmed as uploaded, mark in the spreadsheet that they are uploaded

Uploading through `gsutil` cli tool

Install Google Cloud SDK https://cloud.google.com/sdk/docs/install
Authorise with your google account https://cloud.google.com/sdk/gcloud/reference/auth/login
Use gsutil cp to copy files to the bucket, something like:

# to copy a file
gsutil cp <name of file> gs://hca-prod-ebi-matrices/<name of folder>
# to copy a directory of files
gsutil cp -r <name of directory> gs://hca-prod-ebi-matrices/

for more gsutil actions see: https://cloud.google.com/storage/docs/gsutil