Download Data From Sra Through Google Cloud

How to download NCBI SRA (Sequence Read Archive) data through Google Cloud (GCP)?

In some cases, you might need to download the orignal file from SRA, instead of downloading the data in *.sra format.

You need the *.bam or .*h5 format for 3rd generation sequencing data.
You would like to have the orignal read name for the *.fastq format.

NCBI do not provide download link for the orignal files, but already stored the files on both Amazon Cloud (AWS) and Google Cloud. You can find the data link in the Run Browser page.

Take dataset SRR13168868 as an example, you can have more details on page https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR13168868. And Under Data access tab, there are links startswith “gs://” or “s3://” prefix. The “gs://” one is from Google Cloud (GCP).

cloud download links

There is an instruction page on NCBI, but it is to easy to follow, and do not provide command line demo. There are some articles about how to download data from AWS, but very few about GCP. So I will only describe GCP.

A google cloud account is needed, and you might also need to complete the payment method.
On the linux server, download and install the gcloud package, with the gsutil command.
Don’t forget to login the account.

gcloud auth login

If not, you will have an error mesage during the data download.

…serviceusage.services.use access to the Google Cloud Project…

Creat a key follow the instruction below.

https://cloud.google.com/iam/docs/creating-managing-service-account-keys

And download the key in json format

You can download the data link by command:

gsutil -m -u project_name cp gs://sra-pub-src-17/SRR13168867/HeLa_polyA_Input_rep1.R1.fastq.gz.1 ./

replace project_name in the command above into your own project name.

Last modified on 2023-11-01