Batch downloading from OSF
The data are all hosted on OSF: https://osf.io/xv7q9/
If you only want a few files, then browsing the website an manually downloading works fine. This page describes how to download files in bulk. It is not immediately obvious how to do so, because you need to get the URLs of the files you want to download.
All latest files
We provide a tab-delimited file of all the files that are on OSF for the whole AllTheBacteria project: all_atb_files.tsv. This is subject to change, and is updated when files change on OSF.
The columns are:
project
: the name of the project/component, plus its parents, with the names separated by/
. For example the Metadata component sits directly under the main AllTheBacteria project, and is calledAlTheBacteria/Metadata
.project_id
: the project identifier. This is useful if you want to make your own file list (see below)filename
: the name of the fileurl
: the URL of the filemd5
: the MD5 sum of the filesize(MB)
: size of the file in MB
Using this file, it is then relatively easy to batch download. For example, the ENA metadata file for release 0.2 has this information:
project AllTheBacteria/Metadata
project_id h7wzy
filename 0.2/ena_metadata.tsv.gz
url https://osf.io/download/6661ba5c65e1de509c893bb6/
md5 47d03a4892ae6ec5f337c9854e67b7af
You could use wget
to download:
wget -O ena_metadata.tsv.gz https://osf.io/download/6661ba5c65e1de509c893bb6/
and then check the MD5 sum is correct.
Making file lists
Ignore this section if you just want to use the file list provided by us. Otherwise, keep reading.
A file list can be made with this script: https://github.com/martinghunt/bioinf-scripts/blob/master/python/osf_get_files_for_project.py
It gathers all the files for one component on OSF, taking the identifier you see in the component’s URL as input. It recursively finds all descendants of component that you give it, in other words, all of the sub-projects/components underneath the main component.
The default output format is TSV, which is what we use to make the
lists that go on OSF, with the columns as described above.
You can instead get JSON output, which returns
all the metadata for each file, with --format json
. This contains
a large amount of data!
Get all the files
The project identifier of AllTheBacteria is xv7q9
. The file
all_atb_files.tsv
is made with:
osf_get_files_for_project.py xv7q9 > all_atb_files.tsv
Files for one project
You need to find the project identifier to input to
osf_get_files_for_project.py
. This will be in the file
all_atb_files.tsv
(the project_id
column),
or at the end of the URL in your browser.
For example, the URL of the metadata component is https://osf.io/h7wzy/ and you
need to input h7wzy
to the script. The usage is:
osf_get_files_for_project.py h7wzy > file_list.tsv
(replace h7wzy
with the appropriate string depending on the
component you want).