GEO to HCA Guide

Pre-requirements
Brief description
Usage
1. SOP: Curating data from GEO
Arguments
1. Required
2. Optional
Use cases
Known issues and how to fix them
1. Warnings
2. Errors
File issues workarounds
1. If only single fastqs available when multiple are expected
2. If only bam files available

Pre-requirements

Clone the Geo to HCA Repo to your computer

git clone https://github.com/ebi-ait/geo_to_hca.git

Go to the root folder and install the requirements
```
cd geo_to_hca
pip install -r requirements.txt --upgrade
```
Please note that you might need to install with pip3 instead of pip depending on your configuration

Brief description

This tool was written to assist in the automatic conversion of geo metadata to abide to the HCA metadata standard.

Usage

The script is stored under the apps folder, under the name geo_to_hca.py. It takes as input a single GEO accession or a comma-delimited list of GEO accessions and a template HCA metadata excel spreadsheet (Included in the repository under the docs folder). It returns a pre-filled HCA metadata spreadsheet for each accession given. Each spreadsheet can then be used as an intermediate file for completion by manual curation. Optionally an output log file can also be generated which lists the availability of an SRA study accession and fastq file names for each GEO accession given as input.

usage: geo_to_hca.py [-h] [--accession ACCESSION]
                     [--accession_list ACCESSION_LIST]
                     [--input_file INPUT_FILE] [--template TEMPLATE]
                     [--header_row HEADER_ROW] [--input_row1 INPUT_ROW1]
                     [--output_dir OUTPUT_DIR] [--output_log OUTPUT_LOG]

SOP: Curating data from GEO

Producing the metadata spreadsheet

Go to the root of the repository and run the script with the accession wanted
```
 cd geo_to_hca
 python3 apps/geo_to_hca.py --accession <GEO_accession>
```
While running, it will output a log. Please refer to the most common warnings section if you don’t know what they mean.
If it can’t find the article information in the GEO metadata, it will perform a quick search in europePMC. Just state “y” or “n” when prompted.

It will output a small log like this:

       All fastq files are available SRA Study available
GSE149689                            no                 yes

Once it’s done, the spreadsheet will be saved under the folder spreadsheets/ with the name: <geo_accession>.xlsx. Another folder can be specified with the --output_dir argument.

Uploading the data to an s3 bucket

Follow the guide provided in the handy snippets documentation

Checking the data uploaded matches the expected

Note: There are many ways to check if data files are the same. These are just guidelines on a quick and easy way to look at it, but feel free to suggest other ways.

Use the following command to extract the filenames of the uploaded files:
```
aws s3 ls s3://hca-util-upload-area/<area_id>/ | awk '{printf("%s\n", $4)}' | sort
```
And copy the resulting output.
Paste the filenames in a new excel book, and paste the filenames from the spreadsheet as well. Sort the ones in the spreadsheet using excel.
Compare them by using the =EXACT(text1,text2) macro in excel (Expected output: TRUE)

Arguments

Required

1 of the following 3 arguments is required:

--accession: A string that matches a GSE series accession
--accession_list: A space-delimited list of accessions
--input_file: A path (relative or absolute) to a file containing geo accessions. The file should include an “accession” header and have one accession per row.

Please note that if more than one required argument is given, the script will exit.

Optional

These arguments do not need to be specified.

--template, default="docs/hca_template.xlsx"

The default template is an empty HCA metadata spreadsheet in excel format, with the relevant HCA metdata headers in rows 1-5. The default header row with programmatic names is row 4; the default start input row is row 6. It is not necessary to specify this argument unless the HCA spreadsheet format changes.
--header_row, type=int, default=4

The default header row with programmatic names is row 4. It is not necessary to specify this argument unless the HCA spreadsheet format changes.
--input_row, type=int, default=6

The default start input row (The row where it will start writing the output of the script) is row 6. It is not necessary to specify this argument unless the HCA spreadsheet format changes.
--output_dir, default="spreadsheets/"

An output directory can be specified by it’s path. If the path does not already exist, it will be created. If this argument is not given, the default output directory is “spreadsheets/’”
--output_log, type=bool, default=True

An optional arugment to retrieve an output log file stating whether an SRA study id and fastq file names were available for each GEO accession given as input.

Use cases

Get the HCA metadata for 1 GEO accession

 python apps/geo_to_hca.py --accession GSE97168

Get the HCA metadata for a comma-separated list of GEO accessions

 python apps/geo_to_hca.py --accession_list GSE97168,GSE124872,GSE126030

Get the HCA metadata for all the accessions listed in a file. The file should have a header accessions and contain 1 accession per row. An example can be found in docs/example_accessions.txt
```
 python apps/geo_to_hca.py --input_file docs/example_accessions.txt
```

Known issues and how to fix them

Warnings

No fastq file name for Run accession: <Run_accession>

This means that for the run selected, there was at least 1 fastq for which its filename could not be found. This could be due to either:

The run does not have this information
The script assumes there are 3 files (R1,R2,I1) but there are only 2 files (R1 and R2)

Errors

When running the script, I get a weird xml.Etree error. All my inputs are valid so I don't understand what is happening

Please make sure you are using the option –upgrade when installing the repo requirements. There is a bug in openpyxl 3.0.2+ where this happens quite often. Reverting back to 3.0.1 should fix this error.

I have an error not addressed here. What should I do?

Please issue a ticket in the issues section of the geo_to_hca repository.

File issues workarounds

If only single fastqs available when multiple are expected

Sometimes there are only single fastq files available in SRA or ENA, even though paired (R1, R2) or three fastqs are expected (I1, R1, R2). In these cases you can download the sra object then convert it into the required fastq files. You will then need to make sure you match up the correct read index to the correct fastq file.

Downloading and converting sra objects

If only bam files available

If they are 10X bam files that were generated by cellranger, check that header of the bam contains the expected cellranger tags:

@CO	10x_bam_to_fastq:I1(BC:QT)
@CO	10x_bam_to_fastq:R1(CR:CY,UR:UY,TR:TQ)
@CO	10x_bam_to_fastq:R2(SEQ:QUAL)

Then follow the guide on how to convert them here: Downloading 10X files from archives