Migration from EBI FTP to OSF
This page has details of moving the data originally hosted at the EBI FTP site to OSF. You probably only want to read this if you used data that was on the FTP site, and are now looking at the AllTheBacteria project on OSF and want to know why some file names have changed. The short explanation is: file names were changed so that they did not have species names in them.
Spare me the details, I just want old -> new names
Assembly tarballs and Phylign index files were renamed. Here is a TSV file that has all the old and new file names, md5 sums, and OSF URLs: atb_ebi_r0.2_ftp_to_osf_rename.tsv. No other filenames were changed.
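If you have scripts that still refer to the old names, the TSV can be turned into a lookup table. The following is a minimal sketch only: the column names old_name and new_name are assumptions for illustration (the real header is not described on this page), so check the first line of the downloaded file and pass the actual column names. The two example rows are the rename pairs used as examples further down this page.

```python
import csv
import io


def load_rename_map(tsv_handle, old_col="old_name", new_col="new_name"):
    """Return a dict mapping old FTP file names to new OSF file names.

    old_col and new_col are ASSUMED column names, not taken from the real
    TSV. Check its header line and pass the actual names.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {row[old_col]: row[new_col] for row in reader}


# Tiny in-memory stand-in for the real TSV, using the assumed header.
example_tsv = io.StringIO(
    "old_name\tnew_name\n"
    "achromobacter_xylosoxidans__01.asm.tar.xz\tatb.assembly.r0.2.batch.1.tar.xz\n"
    "achromobacter_xylosoxidans__01.cobs_classic.xz\tatb.assembly.r0.2.batch.1.cobs_classic.xz\n"
)
rename_map = load_rename_map(example_tsv)
print(rename_map["achromobacter_xylosoxidans__01.asm.tar.xz"])
```

For the real file, open it with open(path, newline="") and pass the handle in place of example_tsv.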
What was/is on the EBI FTP site?
There was:
release 0.1. This was the first release of the data, and corresponds to the first version of the preprint. The data have since been withdrawn and replaced by version 0.2.
There is:
release 0.2. This is the second release of the data, which is the same as 0.1 but with some human contamination contigs removed.
See the release history for more details on releases 0.1 and 0.2. Everything below describes release 0.2 and what happened when it was copied from the FTP site to OSF.
What is on OSF?
All data from release 0.2 are also on OSF. However, some files were renamed before being uploaded to OSF, so that no file name contains a species name. We wanted to separate out species calls from the assemblies themselves. The assemblies will not (or are extremely unlikely to) change. Species calling is difficult, and we do expect that to change.
Why were species names in the file names? Because the compression/indexing processes needed species calls. It doesn’t matter if those calls are wrong; they only serve to group similar genomes together for compression efficiency. However, leaving those species names in the file names could be misleading, especially as species calls will change over time.
Metadata files
These files are unchanged. All files in the FTP metadata directory were copied to the AllTheBacteria metadata component on OSF. More files may be added to OSF later, but every file that was on FTP is also on OSF.
Assembly files
No actual assemblies were changed. All FASTA files are identical between the EBI FTP site and OSF. However, the assembly tarball names, and the directory name that each tarball extracts to, were changed.
This should make sense using an example. This tarball is on the FTP site: achromobacter_xylosoxidans__01.asm.tar.xz. It extracts to a directory achromobacter_xylosoxidans__01/ containing FASTA files. In other words, running tar xf achromobacter_xylosoxidans__01.asm.tar.xz would make these files:
achromobacter_xylosoxidans__01/SAMN12335635.fa
achromobacter_xylosoxidans__01/SAMN12335634.fa
achromobacter_xylosoxidans__01/SAMN12335574.fa
...etc
The renamed tarball on OSF is called atb.assembly.r0.2.batch.1.tar.xz and it extracts to the directory atb.assembly.r0.2.batch.1/. In other words, running tar xf atb.assembly.r0.2.batch.1.tar.xz would make these files:
atb.assembly.r0.2.batch.1/SAMN12335635.fa
atb.assembly.r0.2.batch.1/SAMN12335634.fa
atb.assembly.r0.2.batch.1/SAMN12335574.fa
...etc
The extracted files SAMN12335635.fa, SAMN12335634.fa, SAMN12335574.fa, … are identical between the original and renamed tarballs. The only difference is the tarball name and the directory to which it extracts. The order of the files inside each tarball was preserved.
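If you still have an old tarball from the FTP site and want to confirm this for yourself, one way is to compare the files inside both tarballs while ignoring the top-level directory name. The following is a rough sketch, not the procedure used when the files were renamed; the function names are made up for illustration, and it assumes every file sits inside a single top-level directory as in the listings above.

```python
import hashlib
import tarfile


def member_md5s(tarball_path):
    """Return [(file name without top-level directory, md5), ...] in archive order."""
    results = []
    with tarfile.open(tarball_path, mode="r:xz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip the directory entry itself
            # Drop the leading directory, which is the part that was renamed.
            base_name = member.name.split("/", 1)[-1]
            digest = hashlib.md5(tar.extractfile(member).read()).hexdigest()
            results.append((base_name, digest))
    return results


def same_contents(old_tarball, new_tarball):
    """True if both tarballs hold the same files, same bytes, in the same order."""
    return member_md5s(old_tarball) == member_md5s(new_tarball)
```

For example, same_contents("achromobacter_xylosoxidans__01.asm.tar.xz", "atb.assembly.r0.2.batch.1.tar.xz") should return True if both downloads are intact.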
Index files
Sketchlib
The sketchlib files on the FTP site were copied with no changes to the OSF AllTheBacteria sketchlib component.
Phylign
The Phylign files on the FTP site were renamed on the OSF AllTheBacteria Phylign component. The renaming was done to match the assembly renaming. For example, the file achromobacter_xylosoxidans__01.cobs_classic.xz on the FTP site was renamed to atb.assembly.r0.2.batch.1.cobs_classic.xz on OSF.
Are 15 Phylign files missing?
No.
You may have noticed that the numbering in the Phylign files jumps from file atb.assembly.r0.2.batch.637.cobs_classic.xz to atb.assembly.r0.2.batch.653.cobs_classic.xz. There are no Phylign files numbered 638-652. Why is this?
When renaming the assembly and Phylign files, the old names were just enumerated, so the first file achromobacter_xylosoxidans__01.asm.tar.xz was renamed atb.assembly.r0.2.batch.1.tar.xz. And similarly, the corresponding old Phylign file was achromobacter_xylosoxidans__01.cobs_classic.xz, and was renamed to atb.assembly.r0.2.batch.1.cobs_classic.xz.
The samples with no species call are spread across 15 assembly tarballs (old names unknown__01.asm.tar.xz … unknown__15.asm.tar.xz), which got the new names atb.assembly.r0.2.batch.638.tar.xz … atb.assembly.r0.2.batch.652.tar.xz. These samples were not included in the Phylign index. To keep the numbering consistent when translating between old assembly tarball <-> old Phylign file <-> new assembly tarball <-> new Phylign file, we left out new Phylign file numbers 638-652. This means that the filename numbering for assemblies and Phylign files is consistent: assembly batch number N (atb.assembly.r0.2.batch.N.tar.xz) corresponds to Phylign index file number N (atb.assembly.r0.2.batch.N.cobs_classic.xz).
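The numbering rule can be written down as a small helper. This is only an illustrative sketch: the function and constant names are made up, and it does not check that a batch number actually exists, because the total number of batches is not stated on this page.

```python
# Batches holding samples with no species call; they have no Phylign index file.
NO_PHYLIGN_BATCHES = range(638, 653)  # 638 to 652 inclusive


def assembly_tarball_name(batch):
    """Return the OSF assembly tarball name for a batch number."""
    return f"atb.assembly.r0.2.batch.{batch}.tar.xz"


def phylign_index_name(batch):
    """Return the OSF Phylign file name for a batch, or None if it has none."""
    if batch in NO_PHYLIGN_BATCHES:
        return None
    return f"atb.assembly.r0.2.batch.{batch}.cobs_classic.xz"


for batch in (1, 637, 638, 652, 653):
    print(batch, assembly_tarball_name(batch), phylign_index_name(batch))
```

Returning None for batches 638-652 makes the gap explicit, so a script looping over batch numbers can skip those Phylign downloads instead of failing on a missing file.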