Metadata and QC
Metadata files are all stored on OSF in the AllTheBacteria Assembly component.
These files all relate to INSDC metadata, tracking which samples have been processed, and the results of running the assembly pipeline and related tools. They include:
ENA metadata (this is a snapshot at the time AllTheBacteria was updated to add more samples)
Sample status at a high level: included in AllTheBacteria, or rejected for some reason when running the assembly pipeline
Sylph results on the reads, and species calls made from the Sylph results
Assembly statistics and checkm2 output
Nucmer matches from aligning the assembly contigs to the human genome
“High quality” samples (defined below)
Latest data for all samples
The latest complete set of data is release 0.2 plus incremental release 2024-08. The latest metadata files for this set are in the Aggregated/Latest_2024-08/ folder of the Assembly component.
The latest status of all processed samples is in the file status.202408.tsv.gz. It tracks the result of trying to download the reads, run sylph, assemble, and then remove human contamination. The columns are:
Sample: the sample accession (SAM…)
Status: the status of the sample. This is either “PASS”, meaning that the pipeline finished successfully and we have an assembly, or “FAIL:…”, where the text after “FAIL:” gives the reason for the failure
Dataset: the dataset the sample belongs to
Comments: any other comments
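The status file can be read with the Python standard library. The following is a minimal sketch, assuming the file is tab-separated and has a header row with the column names listed above. The function names read_status and passed_samples are illustrative only and are not part of any AllTheBacteria tooling.

```python
import csv
import gzip


def read_status(path):
    """Return a dict mapping each sample accession to its Status string.

    Assumes a gzipped, tab-separated file whose header row contains
    the columns "Sample" and "Status".
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Sample"]: row["Status"] for row in reader}


def passed_samples(statuses):
    """Return the set of samples whose status is exactly "PASS"."""
    return {sample for sample, status in statuses.items() if status == "PASS"}
```

For example, passed_samples(read_status("status.202408.tsv.gz")) would give the accessions of all samples that have an assembly, provided the file has been downloaded to the working directory.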
Older data
We recommend you use the complete data for all samples, since it has everything in one place. However, older metadata files are also available, in folders named by release. At the time of writing these are 0.1 (which was replaced by 0.2), 0.2, and incremental release 2024-08.
Metadata files
Each folder (per dataset, or the latest complete dataset) has the metadata files described below.
Sample lists
The file sample_list.txt.gz lists all samples that have an assembly. For aggregated data, it is the samples that have “PASS” in the “Status” column of the status file (described above).
All of the samples in sample_list.txt.gz will be in the files described later (sylph, checkm2 etc). Those files will contain more samples because not every sample results in an assembly. For example, the reads for a given sample could be downloaded and sylph run successfully, and then the assembly fails. That sample would have sylph results, but no assembly, and so does not appear in sample_list.txt.gz.
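To find samples that have results from a tool but no assembly, the sample list can be compared with the samples seen in that tool's output. This is a minimal sketch, assuming sample_list.txt.gz holds one sample accession per line with no header; that layout, and the function names, are assumptions made for illustration.

```python
import gzip


def read_sample_list(path):
    """Read sample accessions from a gzipped text file.

    Assumes one accession per line; blank lines are skipped.
    """
    with gzip.open(path, "rt") as handle:
        return {line.strip() for line in handle if line.strip()}


def without_assembly(tool_samples, assembled_samples):
    """Return samples present in a tool's results that have no assembly."""
    return set(tool_samples) - set(assembled_samples)
```

Here tool_samples would be the accessions collected from the Sample column of, for example, the sylph results.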
ENA metadata
When processing new samples, the first thing we do is download all metadata from the ENA for all bacteria. The results are in ena_metadata.tsv.gz, providing a snapshot at the time of download. These files are only included with each release. We do not make an aggregated file across releases, since each file is a snapshot taken at a different time.
Sylph
After downloading the reads, sylph is run on them to get species abundances. The results are in the file sylph.tsv.gz, which is the original sylph output, except for these differences:
The Sample_file column is replaced with the INSDC accession columns Sample and Run.
An extra column Species is added, which is a species call from the Genome_file column, using GTDB species names.
Some samples have no matches and there is no output - these samples are listed in the file sylph.no_matches.tsv.gz.
We also try to make a species call from the sylph output, which can be found in species_calls.tsv.gz. This is made using a simple method and is likely to contain some errors: if a sample has a sylph match with more than 99% abundance then that is the species call, otherwise the species is called as “unknown”. This call is used for compressing the assemblies with Miniphy (it requires species calls), and so incorrect calls do not matter for this use case.
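The calling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function name call_species is hypothetical, and the input is taken to be (species, abundance) pairs already extracted from the sylph results, with abundance given as a percentage. Which sylph abundance column the pipeline uses is not stated here, so that choice is left to the caller.

```python
def call_species(matches, min_abundance=99.0):
    """Return the species call for one sample.

    matches: iterable of (species_name, abundance_percent) pairs.
    A match is accepted only if its abundance is strictly greater
    than min_abundance; otherwise the call is "unknown".
    """
    for species, abundance in matches:
        if abundance > min_abundance:
            return species
    return "unknown"
```

A sample with no sylph matches at all yields "unknown" under this sketch, because the loop never finds a qualifying match.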
Decontamination
After assembly, we use nucmer to align the contigs to the human genome (plus
HLA sequences). Matching contigs are removed from the assembly.
The complete nucmer output is given in human_nucmer.gz. We do not provide an aggregated nucmer file of the latest data because it is relatively large.
Assembly statistics
The results of running assembly-stats (from https://github.com/sanger-pathogens/assembly-stats) are provided in assembly-stats.tsv.gz.
Checkm2
The results of running checkm2 are provided in checkm2.tsv.gz.
The columns are those of the original checkm2 output, except that the first column, “Name”, is renamed “Sample” and its values are the INSDC sample accession IDs.
High quality dataset
We define a high quality dataset for each release. It consists of the samples that meet all of these criteria:
Have a sylph call with at least 99 percent minimum abundance. If a sample has more than one call (e.g. where it has more than one run), then all species calls are required to be the same
Minimum checkm2 completeness of 90%
Maximum checkm2 contamination of 5%
Total assembly length between 100kbp and 15Mbp
Maximum number of contigs 2,000
Minimum N50 2,000
These samples are listed in hq_set.samples_list.txt.gz. The rejected samples are listed in hq_set.removed_samples.tsv.gz.
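The high quality criteria can be expressed as a single check per sample. The sketch below is illustrative only: the function and parameter names are hypothetical, the values are assumed to have been gathered already from the species call, checkm2 and assembly-stats files, and treating every threshold as inclusive is an assumption.

```python
def is_high_quality(species_calls, completeness, contamination,
                    total_length, n_contigs, n50):
    """Return True if a sample meets all high quality criteria.

    species_calls: species calls for the sample (one per run), where
        "unknown" means no sylph match reached the abundance threshold.
    completeness, contamination: checkm2 values, in percent.
    total_length, n_contigs, n50: assembly statistics, lengths in bp.
    """
    calls = set(species_calls)
    return (
        len(calls) == 1                            # all calls agree
        and "unknown" not in calls                 # and a call was made
        and completeness >= 90.0
        and contamination <= 5.0
        and 100_000 <= total_length <= 15_000_000  # 100kbp to 15Mbp
        and n_contigs <= 2000
        and n50 >= 2000
    )
```

A sample with two runs called as different species is rejected by the first condition, regardless of its assembly statistics.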