# Uploading files to an S3 bucket from the archives
## Pre-requisites and installation

- virtualenv
- aws
- wget
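You can quickly confirm the pre-requisites are available on the EC2 before starting (version numbers will vary):

```
virtualenv --version
aws --version
wget --version
```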
- Connect to the EC2, create a virtual environment and install the python dependencies:

  ```
  virtualenv -p python3.7 <name_of_env>
  source <name_of_env>/bin/activate
  wget https://raw.githubusercontent.com/ebi-ait/hca-ebi-wrangler-central/master/src/requirements.txt
  pip3 install -r requirements.txt
  ```
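  A quick optional sanity check, assuming the virtual environment is still active:

  ```
  which python3   # should point inside <name_of_env>/bin/
  pip3 freeze     # lists the packages pulled in from requirements.txt
  ```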
- Set up your default aws credentials:
  - Go to `~/.aws/` and open the credentials file (e.g. `vim credentials`)
  - Copy this at the top of the file, replacing the placeholder keys with your wrangler access and secret keys:

    ```
    [default]
    aws_secret_access_key = <AWS_secret_key>
    aws_access_key_id = <AWS_access_key>
    ```
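  To confirm the credentials are picked up, you can run a couple of standard AWS CLI checks (whether listing the upload bucket succeeds depends on the permissions attached to your wrangler keys):

  ```
  aws sts get-caller-identity           # shows the account/user the default profile resolves to
  aws s3 ls s3://hca-util-upload-area/  # optional: list the upload bucket used later in this SOP
  ```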
## Usage
- Connect to the EC2
- Create an upload area using the `hca-util` tool.
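  A minimal sketch, assuming the `hca-util create` subcommand and that your wrangler profile is already configured (the area name is illustrative; check the hca-util documentation for the exact options):

  ```
  hca-util create test_project_upload_area   # prints the <upload_area_id> used in the steps below
  ```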
- `wget` the `move_data_from_insdc.py` script to your root directory on the EC2:

  ```
  wget https://raw.githubusercontent.com/ebi-ait/hca-ebi-wrangler-central/master/src/move_data_from_insdc.py
  ```
- Activate your python>=3.6 virtual environment if not already active:

  ```
  source <name_of_env>/bin/activate
  ```
- Run the script:

  ```
  python3 move_data_from_insdc.py -s <study/project accession> -o s3://hca-util-upload-area/<upload_area_id> -t <number_of_threads>
  ```
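  For example, with the placeholders filled in (the accession and upload area id below are hypothetical):

  ```
  python3 move_data_from_insdc.py -s SRP123456 -o s3://hca-util-upload-area/12345678-abcd-1234-abcd-1234567890ab -t 5
  ```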
- Enjoy while your data gets loaded into the s3 area! If you run into errors, note that the script defaults to SRA database accessions, so try the corresponding SRA accession as the input to the script.
## Notes
- Right now, even with a good number of threads (>= 5), it takes about 5 hours to move 1 TB of data. It is best practice to run the script inside a `screen` session and leave it running; see the sketch after this list.
- The `output_path` (`-o`) argument can also point to a local directory.
- Some GEO datasets do not have all their data available in fastq format. For those datasets, the script will issue a warning indicating which runs it failed to retrieve information for.
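A minimal sketch of running the transfer inside a `screen` session so it survives your SSH connection closing (the session name is illustrative):

```
screen -S insdc_transfer                  # start a named session
source <name_of_env>/bin/activate         # re-activate the environment inside the session
python3 move_data_from_insdc.py -s <study/project accession> -o s3://hca-util-upload-area/<upload_area_id> -t 5
# detach with Ctrl-A then D; reattach later to check progress with:
screen -r insdc_transfer
```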