Metagenome Annotation Workflow (v1.1.4)
Workflow Overview
This workflow takes assembled metagenomes and generates structural and functional annotations. It is based on the JGI/IMG annotation pipeline (more details) and uses a number of open-source tools and databases to generate the structural and functional annotations.
The input assembly is first split into 10 MB splits. Depending on the workflow engine configuration, the splits can be processed in parallel. Each split is first structurally annotated, and those results are then used for the functional annotation. The structural annotation uses tRNAscan-SE, Rfam, CRT, Prodigal and GeneMarkS. These results are merged to create a consensus structural annotation. The resulting GFF is the input for functional annotation, which uses multiple protein family databases (SMART, COG, TIGRFAM, SUPERFAMILY, Pfam and Cath-FunFam) along with custom HMM models. The functional predictions are created using LAST and HMMER. These annotations are also merged into a consensus GFF file. Finally, the respective split annotations are merged together to generate a single structural annotation file and a single functional annotation file. In addition, several summary files are generated in TSV format.
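As a rough illustration of the splitting step, the sketch below groups FASTA records into batches of about 10 MB of sequence. It is not the workflow's own splitter: the function names `read_fasta` and `split_fasta_records`, and the choice never to cut a contig across two batches, are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta_records(
    records: Iterable[Record], max_bases: int = 10_000_000
) -> Iterator[List[Record]]:
    """Group records into batches holding at most max_bases of sequence.

    Assumption: a contig is never cut in the middle, so a contig longer
    than max_bases ends up in a batch of its own.
    """
    batch: List[Record] = []
    size = 0
    for header, seq in records:
        if batch and size + len(seq) > max_bases:
            yield batch
            batch, size = [], 0
        batch.append((header, seq))
        size += len(seq)
    if batch:
        yield batch
```

Each batch could then be written to its own FASTA file and annotated independently before the per-split results are merged.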
Workflow Availability
The workflow is available in GitHub: https://github.com/microbiomedata/mg_annotation/ and the corresponding Docker image is available in DockerHub:
Requirements for Execution (recommendations are in bold):
WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)
Hardware Requirements:
Disk space: 106 GB for the reference databases
Memory: >100 GB RAM
Workflow Dependencies
Third party software (This is included in the Docker image.)
Conda (3-clause BSD)
tRNAscan-SE >= 2.0.12 (GNU GPL v3)
Infernal 1.1.4 (BSD)
CRT-CLI 1.8.4 (Public domain software, last official version is 1.2)
Prodigal 2.6.3_patched (GNU GPL v3)
GeneMarkS-2 >= 1.25 (Academic license for GeneMark family software)
Last >= 1584 (GNU GPL v3)
HMMER 3.3.2 (3-clause BSD, thread optimized)
GeNomad 1.8.1 (GNU GPL v3, pulled from IMG Annotation Pipeline repo)
Requisite databases:
Rfam 13.0 (public domain/CC0 1.0; more info)
KEGG (paid subscription, getting KOs/ECs indirectly via IMG-NR 20240916; more info)
SMART 01_06_2016 (restrictive license/custom; more info)
COG 2003 (copyright/unlicensed; more info)
TIGRFAM v15.0 (copyleft/LGPL 2.0 or later; more info)
SUPERFAMILY v1.75 (permissive/custom; more info)
Pfam v37.0 (public domain/CC0 1.0; more info)
Cath-FunFam v4.2.0 (permissive/CC BY 4.0; more info)
GeNomad DB v1.7 (permissive/CC BY 4.0; more info)
Sample datasets
Processed Metatranscriptome of soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_RNA_119 (SRR11678315) with metadata available in the NMDC Data Portal.
Inputs
A JSON file containing the following:
The path to the assembled contigs FASTA file
output file prefix
(optional) parameters for memory
(optional) number of threads requested
An example JSON file is shown below:
{
  "annotation.input_file": "https://portal.nersc.gov/cfs/m3408/test_data/metaT/SRR11678315/assembly_output/SRR11678315-int-0.1_contigs.fna",
  "annotation.proj": "SRR11678315-int-0.1",
  "annotation.imgap_project_id": "SRR11678315-int-0.1"
}
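The inputs file can also be generated programmatically. The sketch below builds the same three keys as the example above and composes a Cromwell `run` command line. The WDL file name and the `cromwell.jar` path are placeholders rather than values taken from the repository, and the command is only assembled as a string, not executed.

```python
import json
import shlex
from typing import Dict


def build_inputs(contigs: str, project_id: str) -> Dict[str, str]:
    """Return the three workflow inputs used in the example JSON."""
    return {
        "annotation.input_file": contigs,
        "annotation.proj": project_id,
        "annotation.imgap_project_id": project_id,
    }


def cromwell_command(wdl: str, inputs_json: str, jar: str = "cromwell.jar") -> str:
    """Compose (but do not run) a Cromwell command line.

    The jar location and the WDL file name are supplied by the caller;
    the defaults here are hypothetical.
    """
    parts = ["java", "-jar", jar, "run", wdl, "--inputs", inputs_json]
    return " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    inputs = build_inputs(
        "https://portal.nersc.gov/cfs/m3408/test_data/metaT/SRR11678315/"
        "assembly_output/SRR11678315-int-0.1_contigs.fna",
        "SRR11678315-int-0.1",
    )
    print(json.dumps(inputs, indent=2))
    # "workflow.wdl" is a placeholder for the WDL file in the repository.
    print(cromwell_command("workflow.wdl", "inputs.json"))
```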
Output
The final structural and functional annotation files are in GFF format and the summary files are in TSV format.
Directory/File Name | Description |
---|---|
prefix_cath_funfam.gff | GFF / tab-delimited functional annotation generated from Cath-FunFam (Functional Families) database |
prefix_cog.gff | GFF / tab-delimited functional annotation generated from COG (Clusters of Orthologous Groups) database |
prefix_contig_names_mapping.tsv | Tab-delimited file with mapping of original contig/read IDs (headers of submitted fasta file) to specified contig names |
prefix_contigs.fna | FASTA nucleic acid file for taxon |
prefix_crt.crisprs | Tab-delimited file for CRISPR array annotation details |
prefix_crt.gff | GFF / tab-delimited structural annotation generated with CRT |
prefix_ec.tsv | Tab-delimited file for EC annotation |
prefix_functional_annotation.gff | GFF / tab-delimited with functional annotations |
prefix_genemark.gff | GFF / tab-delimited with structural annotation by GeneMark |
prefix_gene_phylogeny.tsv | Tab-delimited file of gene phylogeny |
prefix_imgap.info | Workflow information |
prefix_ko_ec.gff | GFF / tab-delimited annotation with KO and EC terms |
prefix_ko.tsv | Tab-delimited file of only KO terms |
prefix_pfam.gff | GFF / tab-delimited functional annotation from Pfam database |
prefix_prodigal.gff | GFF3 structural annotation by Prodigal |
prefix_product_names.tsv | Tab-delimited file of annotation products |
prefix_proteins.faa | FASTA amino acid file for taxon |
prefix_rfam.gff | GFF / tab-delimited structural annotation for non-coding RNA and regulatory RNA motif and binding site annotation by Rfam |
prefix_scaffold_lineage.tsv | Tab-delimited file of phylogeny at scaffold level |
prefix_smart.gff | GFF / tab-delimited functional annotation from SMART database |
prefix_stats.json | JSON of annotation statistics report |
prefix_stats.tsv | Tab-delimited file of annotation statistics report |
prefix_structural_annotation.gff | GFF / tab-delimited structural annotation |
prefix_supfam.gff | GFF / tab-delimited functional annotation from SUPERFAMILY database |
prefix_tigrfam.gff | GFF / tab-delimited functional annotation from TIGRFAM database |
prefix_trna.gff | GFF / tab-delimited structural annotation by tRNAscan-SE |
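After a run, it can be useful to confirm that every file in the table above was produced. The following is a minimal sketch of such a check. It assumes all outputs sit in a single directory and are named `<prefix>_<suffix>` exactly as listed; the function names are made up for this example.

```python
from pathlib import Path
from typing import List

# Suffixes copied from the output table; the prefix is joined with "_".
OUTPUT_SUFFIXES = [
    "cath_funfam.gff", "cog.gff", "contig_names_mapping.tsv", "contigs.fna",
    "crt.crisprs", "crt.gff", "ec.tsv", "functional_annotation.gff",
    "genemark.gff", "gene_phylogeny.tsv", "imgap.info", "ko_ec.gff",
    "ko.tsv", "pfam.gff", "prodigal.gff", "product_names.tsv",
    "proteins.faa", "rfam.gff", "scaffold_lineage.tsv", "smart.gff",
    "stats.json", "stats.tsv", "structural_annotation.gff", "supfam.gff",
    "tigrfam.gff", "trna.gff",
]


def expected_outputs(prefix: str) -> List[str]:
    """Return the file names the table lists for a given prefix."""
    return [f"{prefix}_{suffix}" for suffix in OUTPUT_SUFFIXES]


def missing_outputs(outdir: str, prefix: str) -> List[str]:
    """Return the expected file names that are not present in outdir."""
    base = Path(outdir)
    return [name for name in expected_outputs(prefix) if not (base / name).is_file()]
```

An empty list from `missing_outputs` means every listed file exists; it says nothing about whether the contents are valid.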
Structure of GFF and tab-delimited text files
General GFFs
Column | Header | Description |
---|---|---|
1 | seqid | Sequence ID |
2 | source | Version of IMG database |
3 | type | Feature type |
4 | start_coord | Starting coordinate |
5 | end_coord | Ending coordinate |
6 | score | NA |
7 | strand | Strand orientation |
8 | phase | NA |
9 | attributes | ID=<feature_id>;locus_tag=<gene_id>;product=<initial product> |
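The nine-column layout above can be read with a few lines of code. This is a minimal sketch rather than a full GFF3 parser: it assumes tab-separated fields and `key=value` attributes separated by semicolons, and it does not handle comment lines or URL-escaped characters. The class and function names are illustrative.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> GffFeature:
    """Parse one tab-delimited GFF line into a GffFeature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes: Dict[str, str] = {}
    for field in cols[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(
        cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], cols[7], attributes,
    )
```

For example, `parse_gff_line(line).attributes.get("product")` retrieves the product name of a feature when that attribute is present.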
prefix_cog.gff
(From NCBI RPSBLAST or hmmsearch with COG HMMs)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene object identifier of query gene |
2 | cog_id | COG identifier |
3 | percent_identity | Percent identity of aligned amino acid residues (Not valid for HMMs, retained for compatibility with legacy data) |
4 | align_length | Alignment length |
5 | query_start | Start coordinate of alignment on query gene |
6 | query_end | End coordinate of alignment on query gene |
7 | subj_start | Start coordinate of alignment on subject sequence |
8 | subj_end | End coordinate of alignment on subject sequence |
9 | evalue | Expectation value |
10 | bit_score | Bit score of alignment |
prefix_pfam.gff
(From hmmsearch with Pfam HMMs)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene identifier of query gene |
2 | pfam_id | Pfam identifier |
3 | percent_identity | (Always “100%”. Not valid for HMMs, retained for compatibility with legacy data) |
4 | query_start | Start coordinate of alignment on query gene |
5 | query_end | End coordinate of alignment on query gene |
6 | subj_start | Start coordinate of alignment on subject sequence |
7 | subj_end | End coordinate of alignment on subject sequence |
8 | evalue | Expectation value |
9 | bit_score | Bit score of alignment |
10 | align_length | Alignment length |
prefix_tigrfam.gff
(TIGRFAM annotation)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene identifier of query gene |
2 | tfam_id | TIGRFAM identifier |
3 | percent_identity | (Always “100%”. Not valid for HMMs, retained for compatibility with legacy data) |
4 | query_start | Start coordinate of alignment on query gene |
5 | query_end | End coordinate of alignment on query gene |
6 | subj_start | Start coordinate of alignment on subject sequence |
7 | subj_end | End coordinate of alignment on subject sequence |
8 | evalue | Expectation value |
9 | bit_score | Bit score of alignment |
10 | align_length | Alignment length |
prefix_cath_funfam.gff
(CATH FUNFAM annotation)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene identifier of query gene |
2 | cathfunfam_id | CATH FUNFAM identifier |
3 | percent_identity | Percent identity match in alignment (Not valid for HMMs, retained for compatibility with legacy data) |
4 | query_start | Start coordinate of alignment on query gene |
5 | query_end | End coordinate of alignment on query gene |
6 | subj_start | Start coordinate of alignment on subject sequence |
7 | subj_end | End coordinate of alignment on subject sequence |
8 | evalue | Expectation value |
9 | bit_score | Bit score of alignment |
10 | align_length | Alignment length |
prefix_supfam.gff
(SUPERFAMILY annotation)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene identifier of query gene |
2 | superfam_id | SUPERFAMILY identifier |
3 | percent_identity | Percent identity match in alignment (Not valid for HMMs, retained for compatibility with legacy data) |
4 | query_start | Start coordinate of alignment on query gene |
5 | query_end | End coordinate of alignment on query gene |
6 | subj_start | Start coordinate of alignment on subject sequence |
7 | subj_end | End coordinate of alignment on subject sequence |
8 | evalue | Expectation value |
9 | bit_score | Bit score of alignment |
10 | align_length | Alignment length |
prefix_smart.gff
(SMART annotation)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene identifier of query gene |
2 | smart_id | SMART identifier |
3 | percent_identity | Percent identity match in alignment (Not valid for HMMs, retained for compatibility with legacy data) |
4 | query_start | Start coordinate of alignment on query gene |
5 | query_end | End coordinate of alignment on query gene |
6 | subj_start | Start coordinate of alignment on subject sequence |
7 | subj_end | End coordinate of alignment on subject sequence |
8 | evalue | Expectation value |
9 | bit_score | Bit score of alignment |
10 | align_length | Alignment length |
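The six protein-family files described above (COG, Pfam, TIGRFAM, Cath-FunFam, SUPERFAMILY, SMART) share a ten-column layout, with one difference: in the COG file the alignment length is column 4, while in the others it is column 10. The sketch below reads either layout into dictionaries and keeps the highest-scoring hit per gene. The dictionary keys, including the generic `family_id` and the name `evalue` for the expectation column, are naming choices made for this example, and selecting by bit score is one possible criterion, not necessarily what the pipeline does.

```python
import csv
from typing import Dict, Iterable, Iterator, List

# Column order of prefix_cog.gff, as described in its table.
COG_COLUMNS = [
    "gene_id", "cog_id", "percent_identity", "align_length", "query_start",
    "query_end", "subj_start", "subj_end", "evalue", "bit_score",
]
# Column order shared by the Pfam, TIGRFAM, Cath-FunFam, SUPERFAMILY and SMART files.
HMM_COLUMNS = [
    "gene_id", "family_id", "percent_identity", "query_start", "query_end",
    "subj_start", "subj_end", "evalue", "bit_score", "align_length",
]


def read_hits(lines: Iterable[str], columns: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-delimited row, keyed by the given column names."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, found {len(row)}")
        yield dict(zip(columns, row))


def best_hit_per_gene(hits: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Keep, for each gene, the hit with the highest bit score."""
    best: Dict[str, Dict[str, str]] = {}
    for hit in hits:
        gene = hit["gene_id"]
        if gene not in best or float(hit["bit_score"]) > float(best[gene]["bit_score"]):
            best[gene] = hit
    return best
```

Passing an open file handle as `lines` works as well, since `csv.reader` accepts any iterable of strings.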
prefix_gene_phylogeny.tsv
(from LAST on non-redundant database of IMG proteins extracted from high-quality genomes)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene identifier of query gene |
2 | homolog_gene_oid | IMG gene object identifier of LAST hit (subject sequence) |
3 | homolog_taxon_oid | IMG taxon object identifier of LAST hit protein (subject sequence) |
4 | percent_identity | Percent identity match in alignment |
5 | lineage | Domain;phylum;class;order;family;genus;species;taxon_name of the genome in which LAST hit was found |
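The lineage column packs eight ranks into one semicolon-separated string. A short sketch of how it might be unpacked and summarized follows. It assumes the ranks always appear in the order given in the table, and the taxon names used in any example rows are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable

RANKS = [
    "domain", "phylum", "class", "order",
    "family", "genus", "species", "taxon_name",
]


def parse_lineage(lineage: str) -> Dict[str, str]:
    """Map each rank name to its value; missing trailing ranks are omitted."""
    parts = [part.strip() for part in lineage.split(";")]
    return dict(zip(RANKS, parts))


def phylum_counts(lines: Iterable[str]) -> Counter:
    """Count genes per phylum from rows of prefix_gene_phylogeny.tsv."""
    counts: Counter = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip malformed or empty rows
        counts[parse_lineage(cols[4]).get("phylum", "unknown")] += 1
    return counts
```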
prefix_ko.tsv
(from LAST on IMG genes)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene object identifier of query gene |
2 | img_ko_flag | IMG generated KO assignment. Always ‘Yes’. |
3 | ko_term | KEGG Orthology (KO) identifier of LAST hit (subject sequence) |
4 | percent_identity | Percent identity of aligned amino acid residues |
5 | query_start | Start coordinate of alignment on query gene |
6 | query_end | End coordinate of alignment on query gene |
7 | subj_start | Start coordinate of alignment on subject sequence |
8 | subj_end | End coordinate of alignment on subject sequence |
9 | evalue | Expectation value |
10 | bit_score | Bit score of alignment |
11 | align_length | Alignment length |
prefix_ec.tsv
(from LAST on IMG genes)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene object identifier of query gene |
2 | img_ko_flag | IMG generated KO assignment. Always ‘Yes’. |
3 | EC | EC derived from KEGG Orthology (KO) identifier of LAST hit (subject sequence) |
4 | percent_identity | Percent identity of aligned amino acid residues |
5 | query_start | Start coordinate of alignment on query gene |
6 | query_end | End coordinate of alignment on query gene |
7 | subj_start | Start coordinate of alignment on subject sequence |
8 | subj_end | End coordinate of alignment on subject sequence |
9 | evalue | Expectation value |
10 | bit_score | Bit score of alignment |
11 | align_length | Alignment length |
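Because prefix_ko.tsv and prefix_ec.tsv share the same eleven-column layout, one helper can collect the terms assigned to each gene from either file. The sketch below is illustrative only: the optional identity cutoff is an assumption added for this example and is not a filter the workflow is stated to apply.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def terms_by_gene(lines: Iterable[str], min_identity: float = 0.0) -> Dict[str, Set[str]]:
    """Collect the KO or EC terms (column 3) assigned to each gene (column 1).

    Rows whose percent identity (column 4) is below min_identity are skipped,
    as are rows that do not have exactly 11 columns.
    """
    result: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 11:
            continue
        gene_id, term, percent_identity = cols[0], cols[2], float(cols[3])
        if percent_identity >= min_identity:
            result[gene_id].add(term)
    return dict(result)
```

With an open file, `terms_by_gene(handle, min_identity=50.0)` would return only assignments supported by at least 50% identity.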
prefix_product_names.tsv
(from COG, Pfam, TIGRFAM)
Column | Header | Description |
---|---|---|
1 | gene_id | Gene identifier |
2 | product_name | Product name |
3 | source | Source of assignment |
prefix_contig_names_mapping.tsv
Column | Header | Description |
---|---|---|
1 | orig_id | Original sequence ID (derived from the headers of the fasta file submitted to IMG) |
2 | new_id | New sequence ID assigned by IMG annotation pipeline |
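One practical use of this mapping is translating the pipeline-assigned sequence IDs in an annotation file back to the original contig names. The following sketch shows one way to do that for the first column of a GFF line; the function names are hypothetical and IDs absent from the mapping are left unchanged.

```python
from typing import Dict, Iterable


def load_contig_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Return {new_id: orig_id} from rows of prefix_contig_names_mapping.tsv."""
    mapping: Dict[str, str] = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            orig_id, new_id = cols[0], cols[1]
            mapping[new_id] = orig_id
    return mapping


def restore_seqid(gff_line: str, mapping: Dict[str, str]) -> str:
    """Replace the seqid (column 1) of a GFF line with the original contig ID."""
    cols = gff_line.split("\t")
    cols[0] = mapping.get(cols[0], cols[0])
    return "\t".join(cols)
```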
prefix_crt.crisprs
Column | Header | Description |
---|---|---|
1 | contig_id | Contig/Scaffold ID |
2 | crispr_no | CRISPR number |
3 | pos | Starting position of array element |
4 | repeat_seq | Repeat sequence |
5 | spacer_seq | Spacer sequence |
6 | tool_code | Single letter code for tool used |
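Each row of prefix_crt.crisprs describes one array element, so rows have to be grouped by contig and CRISPR number to reconstruct whole arrays. The sketch below does that and extracts the spacers in positional order. It assumes an element without a following spacer has an empty spacer field, which is a guess about the file rather than a documented behavior.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ArrayKey = Tuple[str, int]  # (contig_id, crispr_no)


def group_crispr_arrays(lines: Iterable[str]) -> Dict[ArrayKey, List[dict]]:
    """Group array elements by (contig_id, crispr_no)."""
    arrays: Dict[ArrayKey, List[dict]] = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 6:
            continue  # skip rows that do not match the six-column layout
        contig, number, pos, repeat, spacer, tool = cols
        arrays[(contig, int(number))].append(
            {"pos": int(pos), "repeat": repeat, "spacer": spacer, "tool": tool}
        )
    return dict(arrays)


def spacers_of(elements: List[dict]) -> List[str]:
    """Return the non-empty spacer sequences of one array, ordered by position."""
    ordered = sorted(elements, key=lambda element: element["pos"])
    return [element["spacer"] for element in ordered if element["spacer"]]
```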
Version History
1.1.4 (08/09/2024)
1.0.0 (release date)
Point of contact
Author: Shane Canon <scanon@lbl.gov>
Maintainer: Kaitlyn Li <kli@lanl.gov>