Metagenomic Workflow

Workflow Overview

The NMDC standardized metagenome workflow leverages parts of JGI’s production pipeline for short-read and long-read data. It consists of five component workflows (with links to documentation): reads quality control (QC), metagenome assembly, metagenome annotation, read-based taxonomy classification, and binning of population genomes to generate metagenome-assembled genomes (MAGs).

The reads QC workflow uses rqcfilter2 to trim and filter low-quality data from raw Illumina metagenome reads (FASTQ files) and uses pbmarkdup and bbtools to filter PacBio reads. The workflow additionally removes artifacts, linkers, adapters, spike-in reads, and reads mapping to several hosts and common contaminants.

The read-based taxonomy classification workflow uses GOTTCHA2, Kraken2, and Centrifuge to profile quality-controlled reads. Running several classifiers accommodates varied project goals and sequencing approaches, because the results span a spectrum from high sensitivity to high specificity, depending on the algorithm and cut-off levels chosen for each tool.

The short-read metagenome assembly workflow uses bbcms, metaSPAdes, and BBMap to run error correction, assembly, and assembly validation, respectively. The long-read metagenome assembly workflow uses Flye, pbmm2, Racon, and minimap2 for assembly, alignment, polishing, and mapping, respectively. The metagenome annotation workflow takes in assembled metagenomes and generates structural and functional annotations. The MAGs workflow uses MetaBAT 2 to generate metagenome bins and applies the MiMAG standards, using annotated tRNAs, rRNAs, and marker genes together with CheckM to estimate completeness and contamination, followed by taxonomic lineage assignment.
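As an illustration of how MiMAG-style quality tiers follow from completeness, contamination, and annotated RNA genes, here is a minimal Python sketch. The thresholds are those of the published MiMAG standard (high quality: over 90% completeness, under 5% contamination, 5S/16S/23S rRNA genes present, and at least 18 tRNAs; medium quality: at least 50% completeness and under 10% contamination). The class name, field names, and the "unclassified" label are hypothetical and are not taken from the NMDC workflow, whose exact cut-offs and output schema may differ.

```python
from dataclasses import dataclass


@dataclass
class BinStats:
    """Per-bin metrics. Field names are hypothetical, not the workflow's schema."""
    completeness: float   # percent, e.g. as estimated by CheckM
    contamination: float  # percent, e.g. as estimated by CheckM
    trna_count: int       # number of distinct tRNAs annotated in the bin
    has_5s: bool
    has_16s: bool
    has_23s: bool


def mimag_quality(b: BinStats) -> str:
    """Assign a quality tier using the published MiMAG thresholds (assumed here)."""
    has_all_rrnas = b.has_5s and b.has_16s and b.has_23s
    if (b.completeness > 90 and b.contamination < 5
            and has_all_rrnas and b.trna_count >= 18):
        return "high"
    if b.completeness >= 50 and b.contamination < 10:
        return "medium"
    if b.contamination < 10:
        return "low"
    # MiMAG defines no tier for bins with 10% or more contamination.
    return "unclassified"


if __name__ == "__main__":
    example = BinStats(95.0, 2.0, 20, True, True, True)
    print(mimag_quality(example))
```

Note that a nearly complete, clean bin still falls to "medium" in this sketch if it lacks the full rRNA set or has fewer than 18 tRNAs, which is why the annotation results feed into the binning step.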

Users can run a single workflow within the metagenome pipeline by supplying the appropriate input files. The entire metagenome workflow can also be run from start to finish on NMDC EDGE from a single raw Illumina or PacBio input file.

Workflow Availability

The workflow is available in its individual components on GitHub (repositories linked) and as a whole to run on NMDC EDGE. Documentation links are available in Workflow Overview.

Requirements for Execution outside NMDC EDGE:

(recommendations are in parentheses)

WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)

Hardware Requirements:

Disk space: 258 GB for databases
Memory: 86 GB RAM
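Before running outside NMDC EDGE, it can be useful to check a machine against the figures above. The following is a rough Python sketch, not part of the workflow: the function names are hypothetical, gigabytes are assumed to be decimal (10^9 bytes), and the RAM lookup relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os
import shutil
from typing import Optional

REQUIRED_DISK_GB = 258  # database disk space listed above
REQUIRED_RAM_GB = 86    # RAM listed above
BYTES_PER_GB = 10**9    # assumption: decimal gigabytes


def meets_requirement(available_bytes: int, required_gb: float) -> bool:
    """Return True if the available byte count covers the requirement."""
    return available_bytes >= required_gb * BYTES_PER_GB


def free_disk_bytes(path: str = ".") -> int:
    """Free space on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def total_ram_bytes() -> Optional[int]:
    """Total physical memory, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    disk_ok = meets_requirement(free_disk_bytes("."), REQUIRED_DISK_GB)
    print("disk space sufficient:", disk_ok)
    ram = total_ram_bytes()
    if ram is None:
        print("RAM could not be determined on this platform")
    else:
        print("RAM sufficient:", meets_requirement(ram, REQUIRED_RAM_GB))
```

Point `free_disk_bytes` at the directory where the databases will be stored rather than the current directory if they live on a different volume.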

Workflow Dependencies

Third-party software:

Docker images containing third-party software are available in the respective repositories. More information is available in the workflow repositories listed above.

Sample dataset(s):

Soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_DNA_379 metagenome (SRR8553641), with metadata available in the NMDC Data Portal. This dataset has 18.3G bases.
- The zipped raw fastq file is available here
ZymoBIOMICS mock-community DNA control (SRR7877884). This dataset has 6.7G bases.
- The non-interleaved raw fastq files are available as R1 and R2
- The interleaved file is here
- A 10% subset of the interleaved file is available as a quick dataset here
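For readers who want a similar quick dataset from their own data, the sketch below subsamples an interleaved FASTQ while keeping read pairs together. It is an assumption-laden illustration, not necessarily how the published 10% subset was produced: it assumes strict 4-line FASTQ records (so each pair spans 8 lines), and the function name and seed parameter are hypothetical.

```python
import itertools
import random
from typing import Iterable, Iterator


def subsample_interleaved(lines: Iterable[str], fraction: float,
                          seed: int = 0) -> Iterator[str]:
    """Yield roughly `fraction` of the read pairs from an interleaved FASTQ.

    Assumes 4-line records, so one pair (R1 then R2) occupies 8 lines.
    """
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    it = iter(lines)
    while True:
        pair = list(itertools.islice(it, 8))
        if not pair:
            return
        if len(pair) != 8:
            raise ValueError("input does not contain complete read pairs")
        # Keep or drop both mates together so pairing is preserved.
        if rng.random() < fraction:
            yield from pair


if __name__ == "__main__":
    demo = ["@p1/1\n", "ACGT\n", "+\n", "IIII\n",
            "@p1/2\n", "TTTT\n", "+\n", "IIII\n"] * 20
    kept = list(subsample_interleaved(demo, 0.1))
    print(len(kept) // 8, "of 20 pairs kept")
```

Because pairs are sampled independently, the retained fraction is only approximately the requested one, which matters little for large files.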

Input:

To run the full workflow via NMDC EDGE web UI, the following inputs are needed:

Project/Run Name
Is interleaved (boolean)
If interleaved, the interleaved FASTQ file(s) (FASTQ #1; FASTQ #2…)
If non-interleaved paired-end reads, Pair(FASTQ R1, FASTQ R2)…
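The difference between the two input layouts can be made concrete with a short Python sketch that merges separate R1 and R2 files into one interleaved stream, alternating one record from each. This is only an illustration of the format: it assumes strict 4-line FASTQ records, the function names are hypothetical, and NMDC EDGE accepts non-interleaved pairs directly, so this conversion is not required.

```python
import itertools
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into 4-line FASTQ records (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = list(itertools.islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> Iterator[str]:
    """Yield lines of an interleaved FASTQ: R1 record, then its R2 mate."""
    pairs = itertools.zip_longest(fastq_records(r1_lines), fastq_records(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        yield from rec1
        yield from rec2


if __name__ == "__main__":
    r1 = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
    r2 = ["@read1/2\n", "TTGA\n", "+\n", "IIII\n"]
    print("".join(interleave(r1, r2)), end="")
```

With real files, pass open file handles (for example from `gzip.open(path, "rt")` for compressed input) in place of the lists.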

To run individual workflows, see the website or the individual GitHub repositories (see the links under Workflow Availability).

Output:

Upon completion of the run, the NMDC EDGE interface provides results grouped by individual workflow for viewing.

In addition to the workflow outputs, summary tables are provided for each portion:

ReadsQC: statistics and metrics, including the number of reads and bases before and after QC filtering
Read-based taxonomy: summary tables and interactive Krona plots as visual outputs
Assembly: summary statistics table
Annotation: statistics for processed sequences, predicted genes, and general quality information
MAGs: summary section with information on binned and unbinned contigs, genome completeness, estimated contamination, and the number of genes present on all bins determined to be high quality or medium quality
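The ReadsQC figures listed above (reads and bases before and after filtering) can be reproduced independently as a sanity check. The Python sketch below is a simplified stand-in, not the workflow's own statistics code: it assumes 4-line FASTQ records, and the function names and dictionary keys are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def count_reads_and_bases(lines: Iterable[str]) -> Tuple[int, int]:
    """Count reads and bases in a FASTQ given as an iterable of lines."""
    reads = 0
    bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each 4-line record is the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases


def qc_summary(raw: Tuple[int, int], filtered: Tuple[int, int]) -> Dict[str, float]:
    """Combine before/after counts into a small summary table."""
    raw_reads, raw_bases = raw
    kept_reads, kept_bases = filtered
    return {
        "reads_before": raw_reads,
        "reads_after": kept_reads,
        "bases_before": raw_bases,
        "bases_after": kept_bases,
        "reads_retained_pct": 100.0 * kept_reads / raw_reads if raw_reads else 0.0,
        "bases_retained_pct": 100.0 * kept_bases / raw_bases if raw_bases else 0.0,
    }


if __name__ == "__main__":
    raw = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "AC\n", "+\n", "II\n"]
    filtered = raw[:4]
    summary = qc_summary(count_reads_and_bases(raw), count_reads_and_bases(filtered))
    for key, value in summary.items():
        print(key, value)
```

A large gap between the retained-reads and retained-bases percentages usually indicates that trimming, rather than whole-read removal, accounts for most of the loss.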

Point of contact

Workflow maintainers: Chienchi Lo <chienchi@lanl.gov>, Mark Flynn <mflynn@lanl.gov>