Consider other option check_fastq.py

Validating files on the ec2: fastq_info

TODO: Check whether we can do the syncing to ec2 currently

This is only needed if you need to validate the files manually for some reason, generally this should occur automatically once files are uploaded to the ingest upload bucket.

Pre-requisites and installation

  • fastq_info

Run the following command after ssh-ing into the EC2 instance:

export PATH=$PATH:/home/ubuntu/fastq_utils/bin

Usage

Sync the files stuck in VALIDATING status from their S3 bucket to their corresponding folder on the EC2.

aws s3 sync <s3 bucket URI> /data/<data-folder>/

Include only certain files using --exclude and --include:

aws s3 sync <s3 bucket URI> /data/<data-folder>/ --exclude "*" --include "SRR43*.fastq.gz"

Run fastq_info for a particular file:

fastq_info -r -s </path/to/fastq-file-name>

Response like this means that the file is valid:

zperova@ip-172-31-3-111:/data/zperova-fetal-heart-10x-staging-0$ fastq_info -r -s 10X109_2_S4_L001_I1_001.fastq.gz
fastq_utils 0.19.2
Skipping check for duplicated read names
CASAVA=1.8
410700000
------------------------------------
Number of reads: 410725632
Quality encoding range: 35 70
Quality encoding: 33
Read length: 9 9 9
OK

If the response contains the word ERROR, the file is unvalid.