Metagenome Assembly Workflow (v1.0.7)
Workflow Overview
This workflow takes in paired-end Illumina short reads or PacBio long reads.
Short Reads:
For short reads, the workflow performs error correction with bbcms (BBTools) and reformats the interleaved file into two FASTQ files for downstream tasks. The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to the contigs with bbmap (BBTools) to obtain coverage information. The .wdl (Workflow Description Language) file includes five tasks: bbcms, assy, create_agp, read_mapping_pairs, and make_output.
The bbcms task takes in interleaved FASTQ inputs, performs error correction, and reformats the interleaved FASTQ into two output FASTQ files for paired-end reads for the next tasks.
The assy task performs metaSPAdes assembly.
Contigs and scaffolds (the output of metaSPAdes) are processed by the create_agp task, which renames the FASTA headers and generates an AGP file that describes the assembly.
The read_mapping_pairs task maps reads back to the final assembly to generate coverage information.
The final make_output task collects all output files into the specified directory.
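As an illustration of the reformatting step described for the bbcms task, the sketch below splits an interleaved, gzip-compressed FASTQ file into two paired files. It is not the BBTools implementation, and it performs no error correction. The function name `split_interleaved_fastq` is hypothetical, and the code assumes strict four-line FASTQ records with mates stored back to back.

```python
import gzip
from itertools import islice


def split_interleaved_fastq(interleaved, out_r1, out_r2):
    """Write alternating FASTQ records to two files and return the pair count.

    Assumes four lines per record and R1/R2 mates stored consecutively.
    """
    pairs = 0
    with gzip.open(interleaved, "rt") as src, \
            gzip.open(out_r1, "wt") as r1, \
            gzip.open(out_r2, "wt") as r2:
        while True:
            # Read one pair: 4 lines for the first mate, 4 for the second.
            record = list(islice(src, 8))
            if not record:
                break
            if len(record) != 8:
                raise ValueError("interleaved FASTQ ends with an unpaired read")
            r1.writelines(record[:4])
            r2.writelines(record[4:])
            pairs += 1
    return pairs

# Example call (file names are placeholders):
# split_interleaved_fastq("reads.int.fastq.gz", "reads_1.fastq.gz", "reads_2.fastq.gz")
```

Raising an error on a trailing unpaired read is a design choice of this sketch; it surfaces truncated inputs early rather than silently producing files of unequal length.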
Long Reads:
For long reads, the workflow uses Flye for assembly, pbmm2 for alignment, Racon for polishing, and minimap2 for read mapping and coverage analysis. The .wdl (Workflow Description Language) file includes six tasks: combine_fastq, assy, racon, format_assembly, map, and make_info_file.
The combine_fastq task combines the input FASTQ files into a single FASTQ file, which is used as input for polishing and mapping tasks.
The assy task takes in the input FASTQ files and performs assembly using Flye.
The racon task cleans up the assembled contigs through two rounds of error correction using pbmm2 and Racon.
The format_assembly task formats the polished assembly using BBTools’ fungalrelease.sh, creating release-ready scaffolds and contigs, along with an AGP format file and a legend file that describes the assembly.
The map task maps the input reads back to the final assembly using minimap2 to generate coverage data.
The final make_info_file task produces a summary file documenting tool versions, parameters, memory usage, and Docker containers used throughout the workflow.
Workflow Availability
The workflow from GitHub uses all the listed Docker images to run all third-party tools.
The workflow is available on GitHub: https://github.com/microbiomedata/metaAssembly
The corresponding Docker images are available on DockerHub:
Requirements for Execution
(Recommendations are in parentheses)
WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)
Hardware Requirements
Memory: >40 GB RAM
The memory requirement depends on the input complexity. Here is a simple estimation equation for the memory required based on kmers in the input file:
predicted_mem = (kmers * 2.962e-08 + 1.630e+01) * 1.1 (in GB)
Note
The kmers variable for the equation above can be obtained using the kmercountmulti.sh script from BBTools.
Example command:
kmercountmulti.sh -k=31 in=your.read.fq.gz
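The estimation equation above can be wrapped in a small helper. This is a minimal sketch: the function names `predicted_mem_gb` and `memory_string` are hypothetical, and rounding up to a whole number of gigabytes is an assumption made here so the result can be used as a value such as "105G" for the optional memory input.

```python
import math


def predicted_mem_gb(kmers):
    """Memory estimate in GB from the k-mer count, per the equation above."""
    return (kmers * 2.962e-08 + 1.630e+01) * 1.1


def memory_string(kmers):
    """Format the estimate as a whole-GB string (rounding up is an assumption)."""
    return f"{math.ceil(predicted_mem_gb(kmers))}G"


# Example with an illustrative k-mer count of 3 billion.
kmers = 3_000_000_000
print(round(predicted_mem_gb(kmers), 1), memory_string(kmers))
```

For 3 billion k-mers, the equation gives (88.86 + 16.30) * 1.1, or about 115.7 GB, which this sketch rounds up to "116G".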
Workflow Dependencies
Third-party software: (This is included in the Docker image.)
metaSPAdes v4.0.0 (License: GPLv2)
BBTools v39.03 (License: BSD-3-Clause-LBNL)
Sample dataset(s)
Short Reads:
Small dataset: Ecoli 10x (287M) (Input/output included in tar.gz file)
Large dataset: Zymobiomics mock-community DNA control (22G) (Input/output included in tar.gz file)
Zymobiomics mock-community DNA control (SRR7877884); this dataset has 6.7G bases.
Long Reads:
Zymobiomics synthetic metagenome (SRR13128014). For testing, we subsampled the dataset; the original dataset is ~18 GB.
Input
A JSON file containing the following information:
The path to the input FASTQ file (Illumina paired-end interleaved FASTQ or PacBio FASTQ) (recommended: output of the Reads QC workflow).
Project name, e.g., nmdc:XXXXXX
Memory (optional), e.g., "jgi_metaAssembly.memory": "105G"
Threads (optional), e.g., "jgi_metaAssembly.threads": "16"
Whether the input is short reads (boolean)
Example input JSON for short reads:
{
"jgi_metaAssembly.input_files": ["https://portal.nersc.gov/project/m3408/test_data/smalltest.int.fastq.gz"],
"jgi_metaAssembly.proj": "nmdc:XXXXXX",
"jgi_metaAssembly.memory": "105G",
"jgi_metaAssembly.threads": "16",
"jgi_metaAssembly.shortRead": true
}
Example input JSON for long reads:
{
"jgi_metaAssembly.input_files": ["/global/cfs/cdirs/m3408/www/test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz"],
"jgi_metaAssembly.proj": "nmdc:XXXXXX",
"jgi_metaAssembly.memory": "105G",
"jgi_metaAssembly.threads": "16",
"jgi_metaAssembly.shortRead": false
}
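The input files above can also be generated programmatically. The sketch below builds the same key/value structure shown in the two examples; the helper name `build_inputs` and the output file name `inputs.json` are hypothetical. Threads are written as a string because the examples above quote the value.

```python
import json


def build_inputs(input_files, proj, short_read, memory=None, threads=None):
    """Return a dict matching the example input JSON; optional keys are omitted when unset."""
    inputs = {
        "jgi_metaAssembly.input_files": list(input_files),
        "jgi_metaAssembly.proj": proj,
        "jgi_metaAssembly.shortRead": bool(short_read),
    }
    if memory is not None:
        inputs["jgi_metaAssembly.memory"] = memory
    if threads is not None:
        # The examples above quote the thread count, so keep it a string.
        inputs["jgi_metaAssembly.threads"] = str(threads)
    return inputs


short_read_inputs = build_inputs(
    ["https://portal.nersc.gov/project/m3408/test_data/smalltest.int.fastq.gz"],
    proj="nmdc:XXXXXX",
    short_read=True,
    memory="105G",
    threads=16,
)
print(json.dumps(short_read_inputs, indent=2))

# To save the file for the workflow engine (file name is a placeholder):
# with open("inputs.json", "w") as fh:
#     json.dump(short_read_inputs, fh, indent=2)
```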
Output
The output directory will contain the following files for short reads:
output/
├── nmdc_XXXXXX_metaAsm.info
├── nmdc_XXXXXX_covstats.txt
├── nmdc_XXXXXX_contigs.fna
├── nmdc_XXXXXX_bbcms.fastq.gz
├── nmdc_XXXXXX_scaffolds.fna
├── nmdc_XXXXXX_assembly.agp
├── stats.json
├── nmdc_XXXXXX_pairedMapped.sam.gz
└── nmdc_XXXXXX_pairedMapped_sorted.bam
The output directory will contain the following files for long reads:
output/
├── nmdc_XXXXXX_assembly.legend
├── nmdc_XXXXXX_contigs.fna
├── nmdc_XXXXXX_pairedMapped_sorted.bam
├── nmdc_XXXXXX_read_count_report.txt
├── nmdc_XXXXXX_metaAsm.info
├── nmdc_XXXXXX_summary.stats
├── nmdc_XXXXXX_scaffolds.fna
├── nmdc_XXXXXX_pairedMapped.sam.gz
├── stats.json
├── nmdc_XXXXXX_contigs.sam.stats
├── nmdc_XXXXXX_contigs.sorted.bam.pileup.basecov
├── nmdc_XXXXXX_assembly.agp
└── nmdc_XXXXXX_contigs.sorted.bam.pileup.out
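After a run, the two listings above can serve as a checklist. The sketch below reports which expected files are missing from an output directory. The helper names are hypothetical, and the file prefix is assumed to be the project name with the colon replaced by an underscore, as the listings suggest (nmdc:XXXXXX becomes nmdc_XXXXXX).

```python
from pathlib import Path

# File-name suffixes taken from the short-read and long-read listings above.
SHORT_READ_SUFFIXES = [
    "metaAsm.info", "covstats.txt", "contigs.fna", "bbcms.fastq.gz",
    "scaffolds.fna", "assembly.agp", "pairedMapped.sam.gz",
    "pairedMapped_sorted.bam",
]
LONG_READ_SUFFIXES = [
    "assembly.legend", "contigs.fna", "pairedMapped_sorted.bam",
    "read_count_report.txt", "metaAsm.info", "summary.stats",
    "scaffolds.fna", "pairedMapped.sam.gz", "contigs.sam.stats",
    "contigs.sorted.bam.pileup.basecov", "assembly.agp",
    "contigs.sorted.bam.pileup.out",
]


def expected_outputs(proj, short_read):
    """List the expected file names for a project (prefix rule is an assumption)."""
    prefix = proj.replace(":", "_")
    suffixes = SHORT_READ_SUFFIXES if short_read else LONG_READ_SUFFIXES
    return [f"{prefix}_{suffix}" for suffix in suffixes] + ["stats.json"]


def missing_outputs(output_dir, proj, short_read):
    """Return the expected files that are not present in output_dir."""
    out = Path(output_dir)
    return [name for name in expected_outputs(proj, short_read)
            if not (out / name).is_file()]


print(expected_outputs("nmdc:XXXXXX", short_read=True))
```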
Example output stats JSON file:
{
"scaffolds": 58,
"contigs": 58,
"scaf_bp": 28406,
"contig_bp": 28406,
"gap_pct": 0.00000,
"scaf_N50": 21,
"scaf_L50": 536,
"ctg_N50": 21,
"ctg_L50": 536,
"scaf_N90": 49,
"scaf_L90": 317,
"ctg_N90": 49,
"ctg_L90": 317,
"scaf_logsum": 22.158,
"scaf_powsum": 2.245,
"ctg_logsum": 22.158,
"ctg_powsum": 2.245,
"asm_score": 0.000,
"scaf_max": 1117,
"ctg_max": 1117,
"scaf_n_gt50K": 0,
"scaf_l_gt50K": 0,
"scaf_pct_gt50K": 0.0,
"gc_avg": 0.39129,
"gc_std": 0.03033
}
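A stats file like the one above can be read with any JSON parser. The sketch below derives a few summary values from it; the function name `summarize_stats` is hypothetical, and treating `gc_avg` as a fraction between 0 and 1 is an assumption based on the example value. Judging by the example values (58 scaffolds, longest 1117 bp), `scaf_N50` appears to hold a scaffold count and `scaf_L50` a length in bp, so check the convention before comparing with other tools.

```python
import json


def summarize_stats(stats):
    """Derive a few convenience values from a parsed stats.json dict."""
    contigs = stats["contigs"]
    return {
        "mean_contig_bp": stats["contig_bp"] / contigs if contigs else 0.0,
        "has_gaps": stats["gap_pct"] > 0,
        # Assumes gc_avg is a fraction (0.39129 in the example above).
        "gc_percent": stats["gc_avg"] * 100,
    }


# Subset of the example above, inlined so the sketch runs on its own.
example = json.loads(
    '{"scaffolds": 58, "contigs": 58, "scaf_bp": 28406, "contig_bp": 28406,'
    ' "gap_pct": 0.00000, "gc_avg": 0.39129, "gc_std": 0.03033}'
)
print(summarize_stats(example))

# To read a real file instead (path is a placeholder):
# with open("output/stats.json") as fh:
#     print(summarize_stats(json.load(fh)))
```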
The table below lists all output directories, files, and their descriptions.
Directory | File Name | Description |
---|---|---|
Short Reads | Short reads assembly output directory | |
/make_info_file | nmdc_XXXXXX_metaAsm.info | Summary information about the short reads assembly process |
/finish_asm | nmdc_XXXXXX_covstats.txt | Coverage statistics for assembled contigs |
/finish_asm | nmdc_XXXXXX_contigs.fna | Final contig sequences in FASTA format |
/finish_asm | nmdc_XXXXXX_bbcms.fastq.gz | Error-corrected FASTQ file from bbcms |
/finish_asm | nmdc_XXXXXX_scaffolds.fna | Final scaffold sequences in FASTA format |
/finish_asm | nmdc_XXXXXX_assembly.agp | Assembly information in AGP format |
/finish_asm | stats.json | Assembly statistics in JSON format |
/finish_asm | nmdc_XXXXXX_pairedMapped.sam.gz | SAM file with reads mapped back to assembly |
/finish_asm | nmdc_XXXXXX_pairedMapped_sorted.bam | Sorted BAM file with reads mapped back to assembly |
Long Reads | Long reads assembly output directory | |
/finish_lrasm | nmdc_XXXXXX_assembly.legend | Mapping file from contig to scaffold names |
/finish_lrasm | nmdc_XXXXXX_contigs.fna | Final contig sequences in FASTA format |
/finish_lrasm | nmdc_XXXXXX_pairedMapped_sorted.bam | Sorted BAM file with reads mapped back to assembly |
/finish_lrasm | nmdc_XXXXXX_read_count_report.txt | Read count report for validation |
/make_info_file | nmdc_XXXXXX_metaAsm.info | Summary information about the long reads assembly process |
/finish_lrasm | nmdc_XXXXXX_summary.stats | Summary statistics for assembly |
/finish_lrasm | nmdc_XXXXXX_scaffolds.fna | Final scaffold sequences in FASTA format |
/finish_lrasm | nmdc_XXXXXX_pairedMapped.sam.gz | SAM file with reads mapped back to assembly |
/finish_lrasm | stats.json | Assembly statistics in JSON format |
/finish_lrasm | nmdc_XXXXXX_contigs.sam.stats | SAM file statistics for contigs |
/finish_lrasm | nmdc_XXXXXX_contigs.sorted.bam.pileup.basecov | Base coverage information for contigs |
/finish_lrasm | nmdc_XXXXXX_assembly.agp | Assembly information in AGP format |
/finish_lrasm | nmdc_XXXXXX_contigs.sorted.bam.pileup.out | BAM file pileup output for contigs |
Version History
1.0.7 (release date 11/14/24; previous versions: 1.0.6)
Point of contact
Original author: Brian Foster <bfoster@lbl.gov>
Package maintainer: Chienchi Lo <chienchi@lanl.gov>