Metagenome Read-based Taxonomy Classification Workflow (v1.1.0)
Workflow Overview
This pipeline profiles sequencing files (single- or paired-end, long- or short-read) using modular, selectable taxonomic classification tools. It supports GOTTCHA2, Kraken2, Centrifuge, and SingleM via Cromwell (WDL) and Docker, enabling scalable, reproducible metagenome analysis.
Supported tools
One or more tools can be selected through workflow input variables. Each profiler must be enabled in the input JSON file, and the paths to the reference databases of the enabled tools are required.
Workflow Availability
The workflow is available on GitHub: https://github.com/microbiomedata/ReadbasedAnalysis. The corresponding Docker images are available on DockerHub.
Requirements for Execution:
(recommendations are in bold)
WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)
Hardware Requirements:
Disk space: 152 GB for databases (55 GB, 89 GB, and 8 GB for GOTTCHA2, Kraken2 and Centrifuge databases, respectively)
60 GB RAM
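The disk-space figures above can be checked before downloading anything. The following is a minimal Python sketch, not part of the workflow; the function names are hypothetical, and the sizes are simply the ones listed above. The compressed archives need additional temporary space that is not counted here.

```python
import shutil

# Database sizes in GB, taken from the hardware requirements above.
DB_SIZES_GB = {"gottcha2": 55, "kraken2": 89, "centrifuge": 8}


def required_db_space_gb(tools):
    """Sum the listed database sizes for the given tool names."""
    return sum(DB_SIZES_GB[tool] for tool in tools)


def has_enough_space(path, tools):
    """Return True if the filesystem holding `path` has room for the databases."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_db_space_gb(tools)


if __name__ == "__main__":
    total = required_db_space_gb(DB_SIZES_GB)
    print(f"All three databases need about {total} GB")
    print("Enough free space in the current directory:", has_enough_space(".", DB_SIZES_GB))
```

If only some tools will be enabled, pass just those names, for example `required_db_space_gb(["kraken2"])`.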
Workflow Dependencies
Third party software:
(These are included in the Docker image.)
GOTTCHA2 v2.1.8.5 (License: BSD-3-Clause-LANL)
Kraken2 v2.1.2 (License: MIT)
Centrifuge v1.0.4 (License: GPL-3)
Requisite databases:
The database for each tool must be downloaded and installed. These databases total 152 GB.
GOTTCHA2 database (gottcha2/):
The database RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna contains complete genomes of bacteria, archaea and viruses from RefSeq Release 90. The following commands will download the database:
wget https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
tar -xvf RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
rm RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
Kraken2 database (kraken2/):
This is a standard Kraken2 database, built from NCBI RefSeq genomes. The following commands will download the database:
mkdir kraken2
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz
tar -xzvf k2_standard_20201202.tar.gz -C kraken2
rm k2_standard_20201202.tar.gz
Centrifuge database (centrifuge/):
This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands will download the database:
mkdir centrifuge
wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz
tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
rm p_compressed_2018_4_15.tar.gz
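The three download recipes above follow the same pattern: fetch the archive, unpack it into a target directory, and delete the archive. For scripted setups, that pattern could be sketched in Python as below. This is an illustration only; the helper names `unpack` and `fetch` are hypothetical, and the URLs and target directories are copied from the commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Archive URLs and target sub-directories, copied from the commands above.
# The GOTTCHA2 archive is unpacked in place (the original command has no -C),
# so its target is ".".
DATABASES = {
    "gottcha2": (
        "https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar",
        ".",
    ),
    "kraken2": (
        "https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz",
        "kraken2",
    ),
    "centrifuge": (
        "https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz",
        "centrifuge",
    ),
}


def unpack(archive, dest):
    """Extract a .tar or .tar.gz archive into dest, then delete the archive."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:*") as tar:  # "r:*" auto-detects compression
        tar.extractall(dest)
    Path(archive).unlink()


def fetch(tool, root="."):
    """Download and unpack the database for one tool under root."""
    url, subdir = DATABASES[tool]
    archive = Path(root) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)
    unpack(archive, Path(root) / subdir)
```

Calling `fetch("centrifuge")` would mirror the Centrifuge commands above. The downloads are large, so check the available disk space first.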
Sample dataset(s):
For best results, using datasets that have already gone through ReadsQC is strongly encouraged.
Short Reads
Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_379 metagenome (SRR8553641) with metadata available in the NMDC Data Portal. This dataset has 18.3G bases.
The zipped raw fastq file is available here
Zymobiomics mock-community DNA control (SRR7877884). This dataset has 6.7G bases.
Long Reads
Zymobiomics synthetic metagenome (SRR13128014). For testing, we subsampled the dataset (~57 MB); the original dataset has ~18G bases.
Input:
A JSON file containing the following information:
Selection of profiling tools (optional; by default only SingleM is enabled)
Paths to the required database(s) for the selected tools
Paths to the input fastq file(s) (the example below uses paired-end data; interleaved output from the Reads QC workflow can be treated as single-end)
Paired-end Boolean
The project name
Long-read Boolean
Number of CPUs requested for the run
{
  "ReadbasedAnalysis.enabled_tools": {
    "gottcha2": false,
    "kraken2": false,
    "centrifuge": false,
    "singlem": true
  },
  "ReadbasedAnalysis.db": {
    "gottcha2": "/path/to/database/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
    "kraken2": "/path/to/kraken2",
    "centrifuge": "/path/to/centrifuge/p_compressed"
  },
  "ReadbasedAnalysis.reads": "/path/to/SRR7877884-int.fastq.gz",
  "ReadbasedAnalysis.paired": true,
  "ReadbasedAnalysis.proj": "SRR7877884",
  "ReadbasedAnalysis.long_read": false,
  "ReadbasedAnalysis.cpu": 8
}
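Before submitting a run, it can help to confirm that every enabled tool has a database path in the input JSON. The sketch below is a hypothetical pre-flight check, not part of the workflow: the function names are illustrative, and the file names `cromwell.jar`, `ReadbasedAnalysis.wdl`, and `input.json` are placeholders for your own paths. It assumes the tools that need a path are the three listed under "ReadbasedAnalysis.db" in the example. The `run` subcommand and `--inputs` option are standard Cromwell command-line usage.

```python
import json

# Tools whose database path appears in the "ReadbasedAnalysis.db" block of the example.
TOOLS_WITH_DB = ("gottcha2", "kraken2", "centrifuge")


def missing_databases(inputs):
    """Return the enabled tools that have no database path in the inputs."""
    enabled = inputs.get("ReadbasedAnalysis.enabled_tools", {})
    db = inputs.get("ReadbasedAnalysis.db", {})
    return [tool for tool in TOOLS_WITH_DB if enabled.get(tool) and not db.get(tool)]


def cromwell_command(wdl, inputs_json, jar="cromwell.jar"):
    """Build the argument list for a local Cromwell run (all paths are placeholders)."""
    return ["java", "-jar", jar, "run", wdl, "--inputs", inputs_json]


if __name__ == "__main__":
    inputs = {
        "ReadbasedAnalysis.enabled_tools": {
            "gottcha2": False,
            "kraken2": True,
            "centrifuge": False,
            "singlem": True,
        },
        "ReadbasedAnalysis.db": {"kraken2": "/path/to/kraken2"},
        "ReadbasedAnalysis.reads": "/path/to/SRR7877884-int.fastq.gz",
        "ReadbasedAnalysis.paired": True,
        "ReadbasedAnalysis.proj": "SRR7877884",
        "ReadbasedAnalysis.long_read": False,
        "ReadbasedAnalysis.cpu": 8,
    }
    problems = missing_databases(inputs)
    if problems:
        raise SystemExit(f"Missing database paths for: {', '.join(problems)}")
    print(json.dumps(inputs, indent=2))
    print(" ".join(cromwell_command("ReadbasedAnalysis.wdl", "input.json")))
```

The command is built as a list so it can be passed directly to `subprocess.run` without shell quoting.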
Output:
The workflow creates an output JSON file and a separate output sub-directory for each tool. Each sub-directory contains tabular classification results, a tabular report, and a Krona plot (HTML).
Below is an example of the output directory files with descriptions to the right.
| Directory/File Name | Description |
|---|---|
| SRR7877884_profiler.info | ReadbasedAnalysis profiler info JSON file |
| SRR7877884_centrifuge_classification.tsv | Centrifuge output read classification TSV file |
| SRR7877884_centrifuge_report.tsv | Centrifuge output report TSV file |
| SRR7877884_centrifuge_krona.html | Centrifuge krona plot HTML file |
| SRR7877884_gottcha2_full.tsv | GOTTCHA2 detail output TSV file |
| SRR7877884_gottcha2_report.tsv | GOTTCHA2 output report TSV file |
| SRR7877884_gottcha2_krona.html | GOTTCHA2 krona plot HTML file |
| SRR7877884_kraken2_classification.tsv | Kraken2 output read classification TSV file |
| SRR7877884_kraken2_report.tsv | Kraken2 output report TSV file |
| SRR7877884_kraken2_krona.html | Kraken2 krona plot HTML file |
| SRR7877884_singlem_classification.tsv | SingleM output read classification TSV file |
| SRR7877884_singlem_report.tsv | SingleM output report TSV file |
| SRR7877884_singlem_krona.html | SingleM krona plot HTML file |
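The file names in the table follow a regular pattern: the project name, the tool name, and a suffix, with GOTTCHA2 using `full` where the other tools use `classification`. A small Python sketch can list the files to expect for a given run. This is inferred from the example table only; the function name is hypothetical, and the names of the per-tool sub-directories are not covered because the table does not show them.

```python
def expected_outputs(proj, enabled_tools):
    """List the output file names expected for a project, per the table above."""
    files = [f"{proj}_profiler.info"]
    for tool in sorted(name for name, on in enabled_tools.items() if on):
        # GOTTCHA2 writes a "full" detail table where the other tools
        # write a per-read "classification" table.
        first = "full" if tool == "gottcha2" else "classification"
        files += [
            f"{proj}_{tool}_{first}.tsv",
            f"{proj}_{tool}_report.tsv",
            f"{proj}_{tool}_krona.html",
        ]
    return files


if __name__ == "__main__":
    tools = {"gottcha2": True, "kraken2": False, "centrifuge": False, "singlem": True}
    for name in expected_outputs("SRR7877884", tools):
        print(name)
```

Comparing this list against the actual output directory is a quick way to spot a tool that failed or was not enabled.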
Download the example ReadbasedAnalysis output for the short-reads Illumina run SRR7877884 (10% subset) here.
Download the example ReadbasedAnalysis output for the long-reads PacBio run SRR13128014 here.
Version History
1.1.0 (release date 11/23/2025)
Point of contact
Package maintainers: Chienchi Lo <chienchi@lanl.gov>, Po-E Li <po-e@lanl.gov>, Valerie Li <vli@lanl.gov>