Metagenome Read-based Taxonomy Classification Workflow (v1.1.0)

[Figure: read-based analysis workflow diagram (rba_workflow2025.svg)]

Workflow Overview

This pipeline profiles sequencing files (single- or paired-end, long- or short-read) using modular, selectable taxonomic classification tools. It supports GOTTCHA2, Kraken2, Centrifuge, and SingleM via Cromwell (WDL) and Docker, enabling scalable, reproducible metagenome analysis.

Supported tools

One or more tools can be selected through workflow input variables. Each profiler must be enabled explicitly in the input JSON, and a path to the reference database is required for each enabled tool that uses one.
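The rule above (an enabled tool needs a database path) can be checked before a run is submitted. The following is a minimal sketch, not part of the workflow itself. The key names mirror the example input JSON in the Input section of this document; the helper name `missing_databases` is hypothetical, and the assumption that SingleM needs no entry under `ReadbasedAnalysis.db` is inferred only from that example.

```python
# Hypothetical pre-flight check; not part of the ReadbasedAnalysis workflow.
# Assumption: only these three tools take a path under "ReadbasedAnalysis.db"
# (the example input JSON has no "singlem" entry there).
DB_BACKED_TOOLS = ("gottcha2", "kraken2", "centrifuge")


def missing_databases(inputs):
    """Return the enabled tools that have no database path configured."""
    enabled = inputs.get("ReadbasedAnalysis.enabled_tools", {})
    db = inputs.get("ReadbasedAnalysis.db", {})
    return [tool for tool in DB_BACKED_TOOLS
            if enabled.get(tool) and not db.get(tool)]


example = {
    "ReadbasedAnalysis.enabled_tools": {
        "gottcha2": False, "kraken2": True, "centrifuge": False, "singlem": True,
    },
    "ReadbasedAnalysis.db": {"kraken2": "/path/to/kraken2"},
}

problems = missing_databases(example)
print("missing database paths:", problems)
```

An empty list means every enabled, database-backed tool has a path; the check does not verify that the path exists on disk.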

Workflow Availability

The workflow is available on GitHub at https://github.com/microbiomedata/ReadbasedAnalysis. The corresponding Docker images are available on DockerHub.

Requirements for Execution:

(recommendations are given in parentheses)

  • WDL-capable Workflow Execution Tool (Cromwell)

  • Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)
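With both requirements installed, a run is typically launched through Cromwell's `run` mode. The sketch below only assembles the command line; it does not execute anything. The file names `cromwell.jar`, `ReadbasedAnalysis.wdl`, and `inputs.json` are placeholders chosen for illustration, not names confirmed by this document, so substitute the actual paths from your installation and from the repository.

```python
import shlex


def cromwell_command(cromwell_jar, wdl, inputs_json):
    """Build the argument list for a local Cromwell run (nothing is executed)."""
    return ["java", "-jar", cromwell_jar, "run", wdl, "--inputs", inputs_json]


# Placeholder file names (assumptions); replace them with real paths.
cmd = cromwell_command("cromwell.jar", "ReadbasedAnalysis.wdl", "inputs.json")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the run; building it as a list avoids shell-quoting problems with paths that contain spaces.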

Hardware Requirements:

  • Disk space: 152 GB for databases (55 GB, 89 GB, and 8 GB for GOTTCHA2, Kraken2 and Centrifuge databases, respectively)

  • Memory: 60 GB RAM
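The disk figure is the sum of the three per-tool database sizes, so a partial installation needs less. As a rough sketch, the sizes quoted above can be summed for the tools you plan to enable and compared against the free space on the target volume. The helper names are hypothetical, and the estimate ignores the temporary space taken by the downloaded archives before they are deleted.

```python
import shutil

# Installed database sizes in GB, as quoted in the hardware requirements.
DB_SIZES_GB = {"gottcha2": 55, "kraken2": 89, "centrifuge": 8}


def required_gb(tools):
    """Total database size in GB for the given tool names."""
    return sum(DB_SIZES_GB[tool] for tool in tools)


def has_room(path, tools):
    """True if the volume holding `path` has enough free space for the databases."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(tools)


print(required_gb(DB_SIZES_GB))            # all three databases: 152
print(required_gb(["kraken2"]))            # Kraken2 only: 89
print(has_room(".", ["centrifuge"]))       # depends on the local disk
```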

Workflow Dependencies

Third party software:

(These are included in the Docker image.)

Requisite databases:

The database for each tool must be downloaded and installed. These databases total 152 GB.

  • GOTTCHA2 database (gottcha2/):

The database RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna contains complete genomes of bacteria, archaea, and viruses from RefSeq Release 90. The following commands will download the database:

wget https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
tar -xvf RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
rm RefSeq-r90.cg.BacteriaArchaeaViruses.species.tar
  • Kraken2 database (kraken2/):

This is a standard Kraken 2 database, built from NCBI RefSeq genomes. The following commands will download the database:

mkdir kraken2
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20201202.tar.gz
tar -xzvf k2_standard_20201202.tar.gz -C kraken2
rm k2_standard_20201202.tar.gz
  • Centrifuge database (centrifuge/):

This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands will download the database:

mkdir centrifuge
wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz
tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
rm p_compressed_2018_4_15.tar.gz

Sample dataset(s):

For best results, use datasets that have already been processed by the ReadsQC workflow.

Short-Reads:

Long-Reads:

ZymoBIOMICS synthetic metagenome (SRR13128014). For testing, we subsampled the dataset (~57 MB); the original dataset contains ~18 gigabases.

Input:

A JSON file containing the following information:

  1. Selection of profiling tools (optional; by default only singlem is set to true)

  2. Paths to the required database(s) for the selected tools

  3. Paths to the input FASTQ file(s) (paired-end data shown; interleaved output from the ReadsQC workflow can be treated as single-end)

  4. Paired-end Boolean (true if the input reads are paired-end)

  5. The project name

  6. Long-read Boolean (true if the input reads are long reads)

  7. CPU number requested for the run

{
  "ReadbasedAnalysis.enabled_tools": {
    "gottcha2": false,
    "kraken2": false,
    "centrifuge": false,
    "singlem": true
  },
  "ReadbasedAnalysis.db": {
    "gottcha2": "/path/to/database/RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna",
    "kraken2": "/path/to/kraken2",
    "centrifuge": "/path/to/centrifuge/p_compressed"
  },
  "ReadbasedAnalysis.reads": "/path/to/SRR7877884-int.fastq.gz",
  "ReadbasedAnalysis.paired": true,
  "ReadbasedAnalysis.proj": "SRR7877884",
  "ReadbasedAnalysis.long_read": false,
  "ReadbasedAnalysis.cpu": 8
}
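When many samples are processed, writing this file by hand becomes error-prone. The sketch below generates the same structure programmatically. The key names and the default tool selection are copied from the example above; the function name `build_inputs` and its parameter names are hypothetical, and the defaults for `paired`, `long_read`, and `cpu` simply repeat the example values rather than documented workflow defaults.

```python
import json


def build_inputs(reads, proj, db, enabled=None, paired=True, long_read=False, cpu=8):
    """Return a dict shaped like the example ReadbasedAnalysis input JSON."""
    # Default selection as described in this document: only singlem is enabled.
    tools = {"gottcha2": False, "kraken2": False, "centrifuge": False, "singlem": True}
    tools.update(enabled or {})
    return {
        "ReadbasedAnalysis.enabled_tools": tools,
        "ReadbasedAnalysis.db": db,
        "ReadbasedAnalysis.reads": reads,
        "ReadbasedAnalysis.paired": paired,
        "ReadbasedAnalysis.proj": proj,
        "ReadbasedAnalysis.long_read": long_read,
        "ReadbasedAnalysis.cpu": cpu,
    }


inputs = build_inputs(
    reads="/path/to/SRR7877884-int.fastq.gz",
    proj="SRR7877884",
    db={"kraken2": "/path/to/kraken2"},
    enabled={"kraken2": True},
)
print(json.dumps(inputs, indent=2))
```

To save the result for Cromwell, write it out with `json.dump(inputs, open_file, indent=2)` inside a `with open(...)` block.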

Output:

The workflow creates an output JSON file and a separate output sub-directory for each tool. Each sub-directory contains tabular classification results, a tabular report, and a Krona plot (HTML).

Below is an example of the output directory files with descriptions to the right.

Directory/File Name                         Description

SRR7877884_profiler.info                    ReadbasedAnalysis profiler info JSON file
SRR7877884_centrifuge_classification.tsv    Centrifuge output read classification TSV file
SRR7877884_centrifuge_report.tsv            Centrifuge output report TSV file
SRR7877884_centrifuge_krona.html            Centrifuge krona plot HTML file
SRR7877884_gottcha2_full.tsv                GOTTCHA2 detail output TSV file
SRR7877884_gottcha2_report.tsv              GOTTCHA2 output report TSV file
SRR7877884_gottcha2_krona.html              GOTTCHA2 krona plot HTML file
SRR7877884_kraken2_classification.tsv       Kraken2 output read classification TSV file
SRR7877884_kraken2_report.tsv               Kraken2 output report TSV file
SRR7877884_kraken2_krona.html               Kraken2 krona plot HTML file
SRR7877884_singlem_classification.tsv       SingleM output read classification TSV file
SRR7877884_singlem_report.tsv               SingleM output report TSV file
SRR7877884_singlem_krona.html               SingleM krona plot HTML file
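The file names listed above follow a regular pattern: the project name, the tool name, and a suffix. A small script can therefore report which expected files are absent after a run. This is a sketch under the assumption that the pattern in this example holds for other project names; the helper names are hypothetical, and the search is recursive because the files sit in per-tool sub-directories.

```python
from pathlib import Path

# Suffixes per tool, taken from the example output listing in this document.
SUFFIXES = {
    "centrifuge": ["classification.tsv", "report.tsv", "krona.html"],
    "gottcha2": ["full.tsv", "report.tsv", "krona.html"],
    "kraken2": ["classification.tsv", "report.tsv", "krona.html"],
    "singlem": ["classification.tsv", "report.tsv", "krona.html"],
}


def expected_outputs(proj, tools):
    """List the output file names expected for a project and set of tools."""
    names = [f"{proj}_profiler.info"]
    for tool in tools:
        names += [f"{proj}_{tool}_{suffix}" for suffix in SUFFIXES[tool]]
    return names


def missing_outputs(out_dir, proj, tools):
    """Return expected file names not found anywhere under out_dir."""
    present = {p.name for p in Path(out_dir).rglob("*") if p.is_file()}
    return [name for name in expected_outputs(proj, tools) if name not in present]


for name in expected_outputs("SRR7877884", ["gottcha2", "singlem"]):
    print(name)
```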

Download the example ReadbasedAnalysis output for the short-reads Illumina run SRR7877884 (10% subset) here.

Download the example ReadbasedAnalysis output for the long-reads PacBio run SRR13128014 here.

Version History

  • 1.1.0 (release date 11/23/2025)

Point of contact