Metagenome Assembled Genomes Workflow (v1.3.9)

Workflow Overview

The workflow is based on the IMG metagenome binning pipeline and has been modified specifically for the NMDC project. For all processed metagenomes, it classifies contigs into bins using MetaBat2. Next, the bins are refined using the functional annotation file (GFF) from the Metagenome Annotation workflow and optional contig lineage information. The completeness and contamination of the bins are evaluated by CheckM, and each bin is assigned a quality level (High Quality (HQ), Medium Quality (MQ), or Low Quality (LQ)) based on MIMAG standards. Finally, GTDB-Tk is used to assign lineage to HQ and MQ bins, and EukCC is used to evaluate LQ bins.
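
The quality assignment can be illustrated with a minimal sketch. It assumes only the MIMAG completeness and contamination thresholds (HQ: > 90% complete and < 5% contaminated; MQ: >= 50% complete and < 10% contaminated). MIMAG additionally requires rRNA and tRNA genes for HQ, and the workflow's exact criteria may differ, so the function name and the sample values below are illustrative only.

```python
def mimag_quality(completeness: float, contamination: float) -> str:
    """Return 'HQ', 'MQ', or 'LQ' from CheckM-style percentages.

    Simplified assumption: only completeness and contamination are used;
    the rRNA/tRNA requirements for HQ in MIMAG are not checked here.
    """
    if completeness > 90 and contamination < 5:
        return "HQ"
    if completeness >= 50 and contamination < 10:
        return "MQ"
    # Everything that is neither HQ nor MQ is labeled LQ in this sketch.
    return "LQ"


# Made-up (completeness, contamination) values for three bins.
bins = {"bin.1": (97.3, 1.2), "bin.2": (72.0, 6.5), "bin.3": (31.4, 2.0)}
for name, (comp, cont) in bins.items():
    print(name, mimag_quality(comp, cont))
```

In this sketch a bin at exactly 90% completeness falls into MQ, because the HQ threshold is strict.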

The visualization component calls the third-party tools ko_mapper.py and KronaTools. It maps protein KO information to the corresponding KEGG modules, calculates the completeness percentage of each module using the custom MicrobeAnnotator database, and generates bar plot, heatmap, and Krona plot visualizations that summarize the KO annotation. KEGG module completeness is calculated based on the total steps in a module, the proteins (KOs) required for each step, and the KOs present in each MAG. KEGG modules are defined as functional gene units that are linked to higher metabolic capabilities (pathways), structural complexes, and phenotypic characteristics.
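
The idea behind module completeness can be sketched as follows. This is a simplified model, not the ko_mapper.py implementation: it assumes a module is a list of steps, that each step lists alternative KO sets, and that a step counts as present when any one alternative is fully covered by the MAG. The module and KO identifiers below are placeholders made up for illustration.

```python
from typing import List, Set

# A step is a list of alternative KO sets; any one fully covered
# alternative satisfies the step (assumption of this sketch).
Step = List[Set[str]]


def module_completeness(steps: List[Step], mag_kos: Set[str]) -> float:
    """Percentage of module steps satisfied by the KOs found in one MAG."""
    if not steps:
        return 0.0
    satisfied = sum(
        1 for step in steps
        if any(alternative <= mag_kos for alternative in step)
    )
    return 100.0 * satisfied / len(steps)


# Hypothetical four-step module with placeholder KO identifiers.
demo_module: List[Step] = [
    [{"K00001"}],               # step 1: a single KO is required
    [{"K00002"}, {"K00003"}],   # step 2: either KO suffices
    [{"K00004", "K00005"}],     # step 3: both KOs are required together
    [{"K00006"}],               # step 4: a single KO is required
]
mag_kos = {"K00001", "K00003", "K00004"}
print(module_completeness(demo_module, mag_kos))  # 2 of 4 steps -> 50.0
```

Steps 1 and 2 are satisfied, step 3 is missing one of its two KOs, and step 4 is absent, so the module is 50% complete for this MAG.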

Workflow Availability

The workflow is available on GitHub:
- https://github.com/microbiomedata/metaMAGs
The workflow runs all third-party tools from the following Docker images, which are available on DockerHub:
- https://hub.docker.com/r/microbiomedata/nmdc_mbin
- https://hub.docker.com/r/microbiomedata/nmdc_mbin_vis

Requirements for Execution

(recommendations are in parentheses):

WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)

Hardware Requirements

Disk space: > 100 GB for the CheckM, GTDB-Tk, and EukCC databases
Memory: ~150 GB for GTDB-Tk
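
Before downloading the databases, the free disk space can be checked with a few lines of Python. This is a convenience sketch, not part of the workflow; the function names and the default of 100 GB (taken from the requirement above) are assumptions.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_databases(path: str = ".", required_gb: float = 100.0) -> bool:
    """True if more than `required_gb` GB are free at `path`."""
    return free_gb(path) > required_gb


# Check the current directory; pass the planned refdata location instead.
print(f"{free_gb('.'):.1f} GB free; enough for databases: "
      f"{has_room_for_databases('.')}")
```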

Workflow Dependencies

Third-party software (included in the Docker images)

Metabat2 v2.15 (License: BSD-3-Clause)
CheckM v1.2.1 (License: GPLv3)
GTDB-Tk v2.1.1 (License: GPLv3)
hmmer v3.3.2 (License: BSD-3-Clause)
prodigal v2.6.3 (License: GPLv3)
pplacer v1.1.alpha19 (License: GPLv3)
FastTree v2.1.11 (License: GPLv2)
FastANI v1.33 (License: Apache 2.0)
mash v2.3 (License: Open-source)
Sqlite 3.39.2 (License: Public Domain)
samtools > v1.6 (License: MIT License)
EukCC v2.1.2 (License: GPLv3)
metaeuk 4.a0f584d (License: GPLv3)
Biopython v1.74 (License: BSD-3-Clause)
epa-ng v0.3.8 (License: GPLv3)
Pymysql (License: MIT License)
requests (License: Apache 2.0)
MicrobeAnnotator v2.0.5 (License: Artistic 2.0)
KronaTools2 v2.8.1 (License: Open-source)

Requisite databases

The CheckM database (275 MB) contains the reference data used for quality assessment of the binned contigs (requires 40+ GB of memory; included in the image):

wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
mkdir -p refdata/CheckM_DB && tar -xvzf checkm_data_2015_01_16.tar.gz -C refdata/CheckM_DB
rm checkm_data_2015_01_16.tar.gz

GTDB-Tk requires ~78 GB of external data that needs to be downloaded and unarchived (requires ~150 GB of memory):

wget https://data.gtdb.ecogenomic.org/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz
mkdir -p refdata/GTDBTK_DB && tar -xvzf gtdbtk_r214_data.tar.gz
mv release214 refdata/GTDBTK_DB
rm gtdbtk_r214_data.tar.gz

EukCC requires ~12 GB of external data that needs to be downloaded and unarchived:

wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.2.tar.gz
tar -xvzf eukcc2_db_ver_1.2.tar.gz
mv eukcc2_db_ver_1.2 EUKCC2_DB
rm eukcc2_db_ver_1.2.tar.gz

Sample dataset(s)

The following test datasets include an assembled contigs file, a SAM.gz file, and functional annotation files:

Dataset with HQ, MQ, and LQ bins (38 GB). The input and output files are in the downloaded tar.gz file.

Input

A JSON file containing the following:

Project Name
Metagenome assembled contig FASTA file
SAM/BAM file from mapping reads back to contigs
Contig functional annotation results in GFF format
Protein FASTA file from contig functional annotation
Tab-delimited file for COG annotation
Tab-delimited file for EC annotation
Tab-delimited file for KO annotation
Tab-delimited file for PFAM annotation
Tab-delimited file for TIGRFAM annotation
Tab-delimited file for CRISPR annotation
Tab-delimited file for gene product name assignment
Tab-delimited file for gene phylogeny assignment
Tab-delimited file for contig/scaffold lineage
GTDBTK Database
CheckM Database
(optional) nmdc_mags.threads: The number of threads used by metabat/samtools/checkm/gtdbtk. Default: 64
(optional) nmdc_mags.pthreads: The number of threads used by pplacer (use a lower number to reduce memory usage). Default: 1
(optional) nmdc_mags.map_file: MAP file containing mapping of contig headers to annotation IDs

An example JSON file is shown below:

{
    "nmdc_mags.proj_name": "nmdc_wfmgan-xx-xxxxxxxx",
    "nmdc_mags.contig_file": "/path/to/Assembly/nmdc_wfmgas-xx-xxxxxxx_contigs.fna",
    "nmdc_mags.sam_file": "/path/to/Assembly/nmdc_wfmgas-xx-xxxxxxx_pairedMapped_sorted.bam",
    "nmdc_mags.gff_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_functional_annotation.gff",
    "nmdc_mags.proteins_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_proteins.faa",
    "nmdc_mags.cog_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_cog.gff",
    "nmdc_mags.ec_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_ec.tsv",
    "nmdc_mags.ko_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_ko.tsv",
    "nmdc_mags.pfam_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_pfam.gff",
    "nmdc_mags.tigrfam_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_tigrfam.gff",
    "nmdc_mags.crispr_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_crt.crisprs",
    "nmdc_mags.product_names_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_product_names.tsv",
    "nmdc_mags.gene_phylogeny_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_gene_phylogeny.tsv",
    "nmdc_mags.lineage_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_scaffold_lineage.tsv",
    "nmdc_mags.gtdbtk_db": "refdata/GTDBTK_DB",
    "nmdc_mags.checkm_db": "refdata/CheckM_DB"
}
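
A quick pre-flight check can catch a malformed or incomplete inputs file before the workflow is launched. The sketch below is not part of the workflow. It assumes that every key in the example above is required (only threads, pthreads, and map_file are documented as optional), and the helper name missing_inputs is made up for illustration.

```python
import json

# Keys taken from the example inputs file; treating all of them as
# required is an assumption of this sketch.
REQUIRED_KEYS = [
    "nmdc_mags.proj_name",
    "nmdc_mags.contig_file",
    "nmdc_mags.sam_file",
    "nmdc_mags.gff_file",
    "nmdc_mags.proteins_file",
    "nmdc_mags.cog_file",
    "nmdc_mags.ec_file",
    "nmdc_mags.ko_file",
    "nmdc_mags.pfam_file",
    "nmdc_mags.tigrfam_file",
    "nmdc_mags.crispr_file",
    "nmdc_mags.product_names_file",
    "nmdc_mags.gene_phylogeny_file",
    "nmdc_mags.lineage_file",
    "nmdc_mags.gtdbtk_db",
    "nmdc_mags.checkm_db",
]


def missing_inputs(inputs: dict) -> list:
    """Return the required keys that are absent from the inputs mapping."""
    return [key for key in REQUIRED_KEYS if key not in inputs]


# json.loads raises json.JSONDecodeError on malformed input,
# for example an unclosed string quote.
example = json.loads(
    '{"nmdc_mags.proj_name": "demo",'
    ' "nmdc_mags.contig_file": "/path/to/contigs.fna"}'
)
print(missing_inputs(example))
```

For a real run, the same check can be applied to the result of json.load on the inputs file; an empty list means all expected keys are present.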

Output

The output includes multiple files, such as summary statistics, a status log, and zipped bin files.

Below is an example of all the output files with descriptions to the right.

FileName/DirectoryName	Description
project_name_mags_stats.json	MAGs statistics in JSON format
project_name_hqmq_bin.zip	HQ and MQ bins. Each bin tar.gz file*, sqlite db file, ko_matrix** text file.
project_name_lq_bin.zip	LQ bins. Each bin tar.gz file*, sqlite db file, EukCC result csv file, ko_matrix** text file.
project_name_bin.info	Information on the third-party software used in the workflow
project_name_bins.lowDepth.fa	FASTA file of contigs filtered out by MetaBat2 as low depth (mean coverage < 1)
project_name_bins.tooShort.fa	FASTA file of contigs filtered out by MetaBat2 as too short (< 3 kb)
project_name_bins.unbinned.fa	FASTA file of unbinned contigs
project_name_checkm_qa.out	CheckM statistics report
project_name_gtdbtk.ar122.summary.tsv	Summary tsv file for gtdbtk archaeal genomes (bins) classification
project_name_gtdbtk.bac122.summary.tsv	Summary tsv file for gtdbtk bacterial genomes (bins) classification
project_name_heatmap.pdf	Heatmap (PDF) of the KO analysis results for metagenome bins
project_name_barplot.pdf	Bar plot (PDF) of the KO analysis results for metagenome bins
project_name_kronaplot.html	Krona plot (HTML) of the KO analysis results for metagenome bins

* Each bin tar.gz file contains the bin’s contig fasta (.fna), protein fasta (.faa), and corresponding ko, cog, phylodist, ec, gene_product, gff, tigr, crisprs, and pfam annotation text files.

** ko_matrix file in bin.zip: Each row of the matrix is a KO module with its name and pathway group. Each MAG has its own column, and the values are the module completeness for that MAG. This file can be used to generate customized plots with other graphic tools/libraries.
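
As a starting point for custom analysis, the matrix can be read with the Python standard library. The sketch below assumes a tab-delimited file with a header row, three leading metadata columns (module, name, pathway group), and one column per MAG; the real column names and layout should be checked against an actual ko_matrix file. The sample values are made up.

```python
import csv
import io

# Made-up sample in the assumed layout: three metadata columns,
# then one completeness column per MAG.
SAMPLE = (
    "module\tname\tpathway group\tbin.1\tbin.2\n"
    "M00001\tGlycolysis\tCentral carbohydrate metabolism\t100.0\t55.6\n"
    "M00009\tCitrate cycle\tCentral carbohydrate metabolism\t87.5\t0.0\n"
)


def complete_modules(handle, mag: str, threshold: float = 80.0,
                     n_meta: int = 3) -> list:
    """Module IDs whose completeness for `mag` is at least `threshold`."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Look for the MAG column only after the metadata columns.
    col = header.index(mag, n_meta)
    return [row[0] for row in reader if float(row[col]) >= threshold]


print(complete_modules(io.StringIO(SAMPLE), "bin.1"))
print(complete_modules(io.StringIO(SAMPLE), "bin.2", threshold=50.0))
```

With a real file, an open file handle can be passed in place of the io.StringIO object.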

Version History

1.3.9 (release date 08/23/2024; previous versions: 1.3.8)

Point of contact

Original author: Neha Varghese <njvarghese@lbl.gov>
Package maintainer: Chienchi Lo <chienchi@lanl.gov>