Metagenome Read-based Taxonomy Classification Workflow (v1.1.0)

Workflow Overview

This pipeline profiles sequencing files (single- or paired-end, long- or short-read) using modular, selectable taxonomic classification tools. It supports GOTTCHA2, Kraken2, Centrifuge, and SingleM via Cromwell (WDL) and Docker, enabling scalable, reproducible metagenome analysis.

Supported tools

One or more tools can be selected through workflow input variables. Each profiler must be enabled in the input JSON, and paths to the reference databases of the selected tools are required.

Workflow Availability

The workflow is available on GitHub: https://github.com/microbiomedata/ReadbasedAnalysis; the corresponding Docker images are available on DockerHub:

Requirements for Execution:

(recommendations are in parentheses)

WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)

Hardware Requirements:

Disk space: 152 GB for databases (55 GB, 89 GB, and 8 GB for GOTTCHA2, Kraken2 and Centrifuge databases, respectively)
60 GB RAM
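
As a rough pre-flight check of the disk figures above, the following Python sketch totals the per-tool database sizes and compares them with the free space on the target volume. The function names and the `db_root` argument are illustrative assumptions and are not part of the workflow. The check covers only the installed databases, not the temporary space needed while the archives are downloaded and unpacked.

```python
import shutil

# Database sizes in GB, taken from the hardware requirements above.
DB_SIZES_GB = {"gottcha2": 55, "kraken2": 89, "centrifuge": 8}


def required_gb(tools):
    """Sum the database footprint (GB) for the selected tools."""
    return sum(DB_SIZES_GB[t] for t in tools)


def has_enough_space(db_root, tools):
    """Return True if the volume holding db_root can hold the databases.

    db_root is a hypothetical directory where the databases will be installed.
    """
    free_gb = shutil.disk_usage(db_root).free / 1e9
    return free_gb >= required_gb(tools)


if __name__ == "__main__":
    tools = list(DB_SIZES_GB)
    print(f"Databases need {required_gb(tools)} GB")
    print("Enough free space:", has_enough_space(".", tools))
```

If only some profilers will be enabled, pass just those tool names so the check reflects the databases that are actually needed.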

Workflow Dependencies

Third party software:

(These are included in the Docker image.)

GOTTCHA2 v2.1.8.5 (License: BSD-3-Clause-LANL)
Kraken2 v2.1.2 (License: MIT)
Centrifuge v1.0.4 (License: GPL-3)

Requisite databases:

The database for each tool must be downloaded and installed.

GOTTCHA2 database (gottcha2/):

The database gottcha_db.BAVFPt.species.fna is from RefSeq Release 223. The following commands will download the database:

wget https://ref-db.edgebioinformatics.org/NMDC/GOTTCHA2_fungal/gottcha_db.BAVF.species.fna.tar
tar -xvf gottcha_db.BAVF.species.fna.tar
rm gottcha_db.BAVF.species.fna.tar

Kraken2 database (kraken2/):

This is a standard Kraken 2 database, built from NCBI RefSeq genomes. The following commands will download the database:

mkdir kraken2
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz
tar -xzvf k2_standard_20201202.tar.gz -C kraken2
rm k2_standard_20201202.tar.gz

Centrifuge database (centrifuge/):

This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands will download the database:

mkdir centrifuge
wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz
tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
rm p_compressed_2018_4_15.tar.gz

Sample dataset(s):

For best results, using datasets that have already gone through ReadsQC is strongly encouraged.

Short Reads

Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_379 metagenome (SRR8553641) with metadata available in the NMDC Data Portal. This dataset has 18.3G bases.
- The zipped raw fastq file is available here
Zymobiomics mock-community DNA control (SRR7877884); this dataset has 6.7G bases.
- The non-interleaved raw fastq files are available as R1 and R2
- The interleaved file is here
  - ReadsQC Cleaned File
- A 10% subset of the interleaved file is available as a quick dataset here
  - ReadsQC Cleaned File

Long Reads

Zymobiomics synthetic metagenome (SRR13128014). For testing, we have subsampled the dataset (~57 MB); the original dataset has ~18G bases.

ReadsQC Cleaned File

Input:

A JSON file containing the following information:

Selection of profiling tools (optional; by default only singlem is set to true)
Paths to the required database(s) for the selected tools
Paths to the input fastq file(s) (paired-end data shown; output of the Reads QC workflow in interleaved format can be treated as single-end.)
Paired end Boolean
The project name
Long reads Boolean
CPU number requested for the run

{
  "ReadbasedAnalysis.enabled_tools": {
    "gottcha2": false,
    "kraken2": false,
    "centrifuge": false,
    "singlem": true
  },
  "ReadbasedAnalysis.db": {
    "gottcha2": "/path/to/database/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
    "kraken2": "/path/to/kraken2",
    "centrifuge": "/path/to/centrifuge/p_compressed"
  },
  "ReadbasedAnalysis.reads": "/path/to/SRR7877884-int.fastq.gz",
  "ReadbasedAnalysis.paired": true,
  "ReadbasedAnalysis.proj": "SRR7877884",
  "ReadbasedAnalysis.long_read": false,
  "ReadbasedAnalysis.cpu": 8
}
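
An input file like the one above can also be generated programmatically. The Python sketch below builds the same key structure and refuses to enable a tool whose database path is missing. The helper name `build_inputs` is hypothetical. The sketch also assumes that SingleM needs no entry in the `ReadbasedAnalysis.db` block, based only on the example above, and it writes the `db` mapping as given, so whether the workflow requires all three database keys to be present is not checked.

```python
import json

# Tools that have a database path in the example "ReadbasedAnalysis.db" block.
DB_TOOLS = ("gottcha2", "kraken2", "centrifuge")
# Assumption: singlem has no db entry, as in the example input JSON.
ALL_TOOLS = DB_TOOLS + ("singlem",)


def build_inputs(reads, proj, db, enabled=("singlem",),
                 paired=True, long_read=False, cpu=8):
    """Return a dict that mirrors the example input JSON (hypothetical helper)."""
    unknown = sorted(set(enabled) - set(ALL_TOOLS))
    if unknown:
        raise ValueError(f"Unknown tool(s): {unknown}")
    # Every enabled tool that uses a database must have a non-empty path.
    missing = [t for t in enabled if t in DB_TOOLS and not db.get(t)]
    if missing:
        raise ValueError(f"No database path given for: {missing}")
    return {
        "ReadbasedAnalysis.enabled_tools": {t: t in enabled for t in ALL_TOOLS},
        "ReadbasedAnalysis.db": dict(db),
        "ReadbasedAnalysis.reads": reads,
        "ReadbasedAnalysis.paired": paired,
        "ReadbasedAnalysis.proj": proj,
        "ReadbasedAnalysis.long_read": long_read,
        "ReadbasedAnalysis.cpu": cpu,
    }


if __name__ == "__main__":
    inputs = build_inputs(
        reads="/path/to/SRR7877884-int.fastq.gz",
        proj="SRR7877884",
        db={"kraken2": "/path/to/kraken2"},
        enabled=("kraken2", "singlem"),
    )
    print(json.dumps(inputs, indent=2))
```

Failing early on a missing database path is a deliberate choice here: it surfaces a configuration mistake before the workflow is submitted rather than partway through a run.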

Output:

The workflow creates an output JSON file and an individual output sub-directory for each tool. Each sub-directory includes tabular classification results, a tabular report, and a Krona plot (HTML).

Below is an example of the output directory files with descriptions to the right.

Directory/File Name	Description
SRR7877884_profiler.info	ReadbasedAnalysis profiler info JSON file
SRR7877884_centrifuge_classification.tsv	Centrifuge output read classification TSV file
SRR7877884_centrifuge_report.tsv	Centrifuge output report TSV file
SRR7877884_centrifuge_krona.html	Centrifuge krona plot HTML file
SRR7877884_gottcha2_full.tsv	GOTTCHA2 detail output TSV file
SRR7877884_gottcha2_report.tsv	GOTTCHA2 output report TSV file
SRR7877884_gottcha2_krona.html	GOTTCHA2 krona plot HTML file
SRR7877884_kraken2_classification.tsv	Kraken2 output read classification TSV file
SRR7877884_kraken2_report.tsv	Kraken2 output report TSV file
SRR7877884_kraken2_krona.html	Kraken2 krona plot HTML file
SRR7877884_singlem_classification.tsv	SingleM output read classification TSV file
SRR7877884_singlem_report.tsv	SingleM output report TSV file
SRR7877884_singlem_krona.html	SingleM krona plot HTML file
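
The naming pattern in the table above can be used to check that a run produced every expected file. The Python sketch below derives the expected names from a project name and a list of enabled tools, then reports any that are absent. The function names are hypothetical. Because the exact sub-directory layout is not spelled out here, the sketch searches the output directory recursively instead of assuming fixed paths.

```python
from pathlib import Path

# File-name suffixes per tool, copied from the output table above.
SUFFIXES = {
    "centrifuge": ("classification.tsv", "report.tsv", "krona.html"),
    "gottcha2": ("full.tsv", "report.tsv", "krona.html"),
    "kraken2": ("classification.tsv", "report.tsv", "krona.html"),
    "singlem": ("classification.tsv", "report.tsv", "krona.html"),
}


def expected_outputs(proj, tools):
    """List the file names expected for a project and its enabled tools."""
    names = [f"{proj}_profiler.info"]
    for tool in tools:
        names.extend(f"{proj}_{tool}_{suffix}" for suffix in SUFFIXES[tool])
    return names


def missing_outputs(found_names, proj, tools):
    """Return the expected file names that are not among found_names."""
    found = set(found_names)
    return [n for n in expected_outputs(proj, tools) if n not in found]


def find_names(out_dir):
    """Collect the names of all files under out_dir, searching recursively."""
    return [p.name for p in Path(out_dir).rglob("*") if p.is_file()]


if __name__ == "__main__":
    # "." stands in for the real output directory of a run.
    missing = missing_outputs(find_names("."), "SRR7877884", ["kraken2", "singlem"])
    print("Missing files:", missing or "none")
```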

Download the example ReadbasedAnalysis output for the short-reads Illumina run SRR7877884 (10% subset) here.

Download the example ReadbasedAnalysis output for the long-reads PacBio run SRR13128014 here.

Version History

1.1.0 (release date 11/23/2025)

Point of contact

Package maintainers: Samantha Obermiller, samantha.obermiller@pnnl.gov; Alicia Clum, aclum@lbl.gov