Metagenome Assembled Genomes Workflow (v1.3.9)

Metagenome assembled genomes generation

Workflow Overview

The workflow is based on the IMG metagenome binning pipeline and has been modified specifically for the NMDC project. For all processed metagenomes, it classifies contigs into bins using MetaBat2. Next, the bins are refined using the functional annotation file (GFF) from the Metagenome Annotation workflow and optional contig lineage information. CheckM evaluates the completeness and contamination of each bin, and bins are assigned a quality level (High Quality (HQ), Medium Quality (MQ), or Low Quality (LQ)) based on MIMAG standards. Finally, GTDB-Tk assigns lineage to HQ and MQ bins, and EukCC is used to evaluate LQ bins.
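The quality tiers can be illustrated with a short Python sketch. It applies the published MIMAG completeness and contamination cutoffs to CheckM-style percentages. The function name, the rRNA/tRNA arguments, and the choice to label every remaining bin as LQ are assumptions made for illustration; this is not the workflow's own code.

```python
def mimag_quality(completeness, contamination, has_all_rrnas=False, trna_count=0):
    """Return 'HQ', 'MQ', or 'LQ' for a bin.

    completeness and contamination are percentages (0-100), as reported by CheckM.
    has_all_rrnas and trna_count are hypothetical inputs standing in for the MIMAG
    requirement that HQ bins carry 5S, 16S, and 23S rRNA genes and at least 18 tRNAs.
    """
    if completeness > 90 and contamination < 5 and has_all_rrnas and trna_count >= 18:
        return "HQ"
    if completeness >= 50 and contamination < 10:
        return "MQ"
    # Assumption: everything else is treated as low quality in this sketch.
    return "LQ"


print(mimag_quality(95.2, 1.3, has_all_rrnas=True, trna_count=20))  # HQ
print(mimag_quality(95.2, 1.3))                                     # MQ
print(mimag_quality(42.0, 3.0))                                     # LQ
```

The second call shows why a nearly complete bin can still land in MQ: without the rRNA and tRNA evidence it does not meet the HQ definition.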

The visualization component calls the third-party tools ko_mapper.py and KronaTools. It maps protein KO annotations to their respective KEGG modules, calculates the completeness percentage of each module using the custom MicrobeAnnotator [1] database, and generates bar plot, heatmap, and Krona plots that summarize the KO annotation. KEGG module completeness is calculated based on the total steps in a module, the proteins (KOs) required for each step, and the KOs present in each MAG. KEGG modules are defined as functional gene units that are linked to higher metabolic capabilities (pathways), structural complexes, and phenotypic characteristics.
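As a rough illustration of the completeness calculation, the Python sketch below scores a module as the percentage of its steps that a MAG can perform. The data layout (each step is a list of alternative KO sets) and the KO identifiers are placeholders chosen for this example; the actual module definitions and scoring rules in ko_mapper.py and MicrobeAnnotator may differ.

```python
def module_completeness(module_steps, mag_kos):
    """Percentage of steps in a module that are present in a MAG.

    module_steps: list of steps; each step is a list of alternative KO sets,
                  and a step counts as present if any one alternative is fully covered.
    mag_kos:      set of KO identifiers annotated in the MAG.
    """
    if not module_steps:
        return 0.0
    present = sum(
        1 for step in module_steps
        if any(set(option) <= mag_kos for option in step)
    )
    return 100.0 * present / len(module_steps)


# Hypothetical three-step module with placeholder KO identifiers.
toy_module = [
    [{"K00001"}],                        # step 1 needs K00001
    [{"K00002"}, {"K00003", "K00004"}],  # step 2 needs K00002, or K00003 plus K00004
    [{"K00005"}],                        # step 3 needs K00005
]
mag = {"K00001", "K00003", "K00004"}
print(round(module_completeness(toy_module, mag), 1))  # 66.7
```

Here the MAG covers steps 1 and 2 but lacks the KO for step 3, so two of three steps are present.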

Workflow Availability

Requirements for Execution

(recommendations are in parentheses):

  • WDL-capable Workflow Execution Tool (Cromwell)

  • Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)

Hardware Requirements

  • Disk space: > 100 GB for the CheckM, GTDB-Tk and EukCC databases

  • Memory: ~150 GB for GTDB-Tk

Workflow Dependencies

Third party software (These are included in the Docker image.)

Requisite databases

  • The CheckM database (275 MB) contains the reference data used for quality assessment of the binned contigs (requires 40 GB+ of memory; included in the image):

    wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
    mkdir -p refdata/CheckM_DB && tar -xvzf checkm_data_2015_01_16.tar.gz -C refdata/CheckM_DB
    rm checkm_data_2015_01_16.tar.gz
    
  • GTDB-Tk requires ~78 GB of external data that needs to be downloaded and unarchived (requires ~150 GB of memory):

    wget https://data.gtdb.ecogenomic.org/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz
    mkdir -p refdata/GTDBTK_DB && tar -xvzf gtdbtk_r214_data.tar.gz
    mv release214 refdata/GTDBTK_DB
    rm gtdbtk_r214_data.tar.gz
    
  • EukCC requires ~12 GB of external data that needs to be downloaded and unarchived:

    wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.2.tar.gz
    tar -xvzf eukcc2_db_ver_1.2.tar.gz
    mv eukcc2_db_ver_1.2 EUKCC2_DB
    rm eukcc2_db_ver_1.2.tar.gz
    

Sample dataset(s)

The following test datasets include an assembled contigs file, a SAM.gz file, and functional annotation files:

Input

A JSON file containing the following:

  1. Project Name

  2. Metagenome Assembled Contig fasta file

  3. SAM/BAM file from mapping the reads back to the contigs

  4. Contigs functional annotation result in GFF format

  5. Contigs functional annotated protein FASTA file

  6. Tab delimited file for COG annotation.

  7. Tab delimited file for EC annotation.

  8. Tab delimited file for KO annotation.

  9. Tab delimited file for PFAM annotation.

  10. Tab delimited file for TIGRFAM annotation.

  11. Tab delimited file for CRISPR annotation.

  12. Tab delimited file for Gene Product name assignment.

  13. Tab delimited file for Gene Phylogeny assignment.

  14. Tab delimited file for Contig/Scaffold lineage.

  15. GTDBTK Database

  16. CheckM Database

  17. (optional) nmdc_mags.threads: The number of threads used by metabat/samtools/checkm/gtdbtk. default: 64

  18. (optional) nmdc_mags.pthreads: The number of threads used by pplacer (use a lower number to reduce memory usage). default: 1

  19. (optional) nmdc_mags.map_file: MAP file containing mapping of contig headers to annotation IDs

An example JSON file is shown below:

{
    "nmdc_mags.proj_name": "nmdc_wfmgan-xx-xxxxxxxx",
    "nmdc_mags.contig_file": "/path/to/Assembly/nmdc_wfmgas-xx-xxxxxxx_contigs.fna",
    "nmdc_mags.sam_file": "/path/to/Assembly/nmdc_wfmgas-xx-xxxxxxx_pairedMapped_sorted.bam",
    "nmdc_mags.gff_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_functional_annotation.gff",
    "nmdc_mags.proteins_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_proteins.faa",
    "nmdc_mags.cog_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_cog.gff",
    "nmdc_mags.ec_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_ec.tsv",
    "nmdc_mags.ko_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_ko.tsv",
    "nmdc_mags.pfam_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_pfam.gff",
    "nmdc_mags.tigrfam_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_tigrfam.gff",
    "nmdc_mags.crispr_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_crt.crisprs",
    "nmdc_mags.product_names_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_product_names.tsv",
    "nmdc_mags.gene_phylogeny_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_gene_phylogeny.tsv",
    "nmdc_mags.lineage_file": "/path/to/Annotation/nmdc_wfmgas-xx-xxxxxxx_scaffold_lineage.tsv",
    "nmdc_mags.gtdbtk_db": "refdata/GTDBTK_DB",
    "nmdc_mags.checkm_db": "refdata/CheckM_DB"
}
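Because a missing quote or key in this file only surfaces once the workflow is launched, it can help to check the inputs first. The Python sketch below parses the JSON and reports keys that are absent or empty. The key names are copied from the example above; the function name is hypothetical, and treating every key in the example as required is an assumption.

```python
import json

# Keys taken from the example inputs file; the optional keys (threads, pthreads,
# map_file) are left out. Treating all of these as required is an assumption.
REQUIRED_KEYS = [
    "nmdc_mags.proj_name", "nmdc_mags.contig_file", "nmdc_mags.sam_file",
    "nmdc_mags.gff_file", "nmdc_mags.proteins_file", "nmdc_mags.cog_file",
    "nmdc_mags.ec_file", "nmdc_mags.ko_file", "nmdc_mags.pfam_file",
    "nmdc_mags.tigrfam_file", "nmdc_mags.crispr_file",
    "nmdc_mags.product_names_file", "nmdc_mags.gene_phylogeny_file",
    "nmdc_mags.lineage_file", "nmdc_mags.gtdbtk_db", "nmdc_mags.checkm_db",
]


def missing_inputs(json_text):
    """Parse the inputs JSON and return the required keys that are absent or empty."""
    inputs = json.loads(json_text)  # raises json.JSONDecodeError on malformed JSON
    return [key for key in REQUIRED_KEYS if not inputs.get(key)]


example = '{"nmdc_mags.proj_name": "nmdc_wfmgan-xx-xxxxxxxx"}'
print(len(missing_inputs(example)))  # 15
```

In practice the text would come from the inputs file on disk, and an empty result means every expected key is present.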

Output

The output includes multiple files, such as statistics, a status log, and zipped bin files.

Below is an example of all the output files with descriptions to the right.

FileName/DirectoryName                  Description
project_name_mags_stats.json            MAGs statistics in JSON format
project_name_hqmq_bin.zip               HQ and MQ bins. Each bin tar.gz file*, SQLite db file, ko_matrix** text file.
project_name_lq_bin.zip                 LQ bins. Each bin tar.gz file*, SQLite db file, EukCC result CSV file, ko_matrix** text file.
project_name_bin.info                   Information on the third-party software used in the workflow
project_name_bins.lowDepth.fa           FASTA file of contigs filtered out by MetaBat2 as low depth (mean coverage < 1)
project_name_bins.tooShort.fa           FASTA file of contigs filtered out by MetaBat2 as too short (< 3 kb)
project_name_bins.unbinned.fa           FASTA file of unbinned contigs
project_name_checkm_qa.out              CheckM statistics report
project_name_gtdbtk.ar122.summary.tsv   GTDB-Tk classification summary (TSV) for archaeal genomes (bins)
project_name_gtdbtk.bac122.summary.tsv  GTDB-Tk classification summary (TSV) for bacterial genomes (bins)
project_name_heatmap.pdf                Heatmap (PDF) of the KO analysis results for metagenome bins
project_name_barplot.pdf                Bar chart (PDF) of the KO analysis results for metagenome bins
project_name_kronaplot.html             Krona plot (HTML) of the KO analysis results for metagenome bins

* Each bin tar.gz file contains the bin’s contig FASTA (.fna), protein FASTA (.faa), and corresponding ko, cog, phylodist, ec, gene_product, gff, tigr, crisprs, and pfam annotation text files.

** ko_matrix file in bin.zip: Each row of the matrix is a KO module with its name and pathway group. Each column is a MAG, and each value is the completeness of that module in that MAG. This file can be used to generate customized plots with other graphics tools or libraries.
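As a starting point for such custom summaries, the Python sketch below counts how many modules are fully complete in each MAG. The exact layout of the ko_matrix file is not specified here, so the sketch assumes a tab-delimited table with a header row, three leading descriptive columns, and one numeric column per MAG; the column names and sample values are made up.

```python
import csv
import io


def complete_modules_per_mag(matrix_text, n_label_cols=3, threshold=100.0):
    """Count modules at or above `threshold` percent completeness for each MAG.

    Assumes a tab-delimited table with a header row, `n_label_cols` leading
    descriptive columns, and one numeric column per MAG.
    """
    reader = csv.reader(io.StringIO(matrix_text), delimiter="\t")
    header = next(reader)
    mags = header[n_label_cols:]
    counts = {mag: 0 for mag in mags}
    for row in reader:
        for mag, value in zip(mags, row[n_label_cols:]):
            if float(value) >= threshold:
                counts[mag] += 1
    return counts


# Made-up table: two modules (rows) and two MAGs (last two columns).
sample = (
    "module\tname\tpathway group\tbins.1\tbins.2\n"
    "M00001\tmodule one\tgroup X\t100.0\t55.6\n"
    "M00002\tmodule two\tgroup X\t100.0\t100.0\n"
)
print(complete_modules_per_mag(sample))  # {'bins.1': 2, 'bins.2': 1}
```

If the real file uses a different number of descriptive columns, adjusting `n_label_cols` is the only change this sketch needs.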

Version History

  • 1.3.9 (release date 08/23/2024; previous versions: 1.3.8)

Point of contact