MetaT Reads QC Workflow (v0.0.7)
Workflow Overview
This workflow uses the program “rqcfilter2” from BBTools to perform quality control on raw Illumina reads. It performs quality trimming, artifact removal, linker trimming, adapter trimming, and spike-in removal (using BBDuk), and removes human, cat, dog, mouse, and microbial contaminant reads (using BBMap). It replicates the QA protocol implemented at JGI.
The following parameters are used for “rqcfilter2” in this workflow:
| Parameter | Description |
|---|---|
| | Disable improper barcodes filter |
| | Remove Illumina reads failing chastity filter |
| | Run clumpify; all deduplication flags require this |
| | Extend reads during merging to allow insert size estimation of non-overlapping reads |
| | Enable C code for higher speed and identical results |
| | Do alignments in C code, which is faster, if an edit distance is allowed. This will require compiling the C code |
| | Generate a kmer-frequency histogram of the output data |
| | Reads with average quality (before trimming) below this will be discarded |
| | Reads with more Ns than this will be discarded |
| | Reads shorter than this after trimming will be discarded. Pairs will be discarded only if both are shorter |
| | Reads shorter than this fraction of original length after trimming will be discarded |
| | Spike-in BBDuk removal mtst parameter |
| | Remove reads containing phiX kmers |
| | Use pigz for compression |
| | Quality-trim from right ends before mapping |
| | Remove cat reads via mapping |
| | Remove dog reads via mapping |
| | Remove human reads via mapping |
| | Remove common contaminant microbial reads via mapping, and place them in a separate file |
| | Remove mouse reads via mapping |
| | Remove ribosomal reads via kmer-matching, and place them in a separate file |
| | Parameter for RNA-seq analysis (this is the main difference between ReadsQC and MetaT ReadsQC) |
| | Run SendSketch on 2M read pairs |
| | Trim all known Illumina adapter sequences, including TruSeq and Nextera |
| | Trim quality threshold |
| | Trim reads that start or end with a G polymer at least this long |
| | Use pigz for decompression |
Workflow Availability
The workflow is available on GitHub: https://github.com/microbiomedata/metaT_ReadsQC. It runs all third-party tools from Docker images, which are available on DockerHub.
Requirements for Execution
(recommendations are in parentheses)
WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)
Hardware Requirements
Disk space: 106 GB for the RQCFilterData database
Memory: >40 GB RAM
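The two hardware minimums above can be checked before launching a run. The following is a minimal sketch, not part of the workflow itself: the function names are hypothetical, 1 GB is assumed to mean 10^9 bytes, and the RAM probe relies on `os.sysconf` names that are available on Linux but not on every platform.

```python
import os
import shutil

GB = 10**9               # assumption: decimal gigabytes
REQUIRED_DISK_GB = 106   # size of the RQCFilterData database (from this document)
REQUIRED_RAM_GB = 40     # documented minimum is "more than 40 GB"


def meets_requirements(free_disk_bytes: int, total_ram_bytes: int) -> bool:
    """Return True if both documented hardware minimums are satisfied."""
    enough_disk = free_disk_bytes >= REQUIRED_DISK_GB * GB
    enough_ram = total_ram_bytes > REQUIRED_RAM_GB * GB
    return enough_disk and enough_ram


def probe_host(path: str = "."):
    """Measure free disk space at `path` and total physical RAM (Linux only)."""
    free_disk = shutil.disk_usage(path).free
    total_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return free_disk, total_ram
```

Calling `meets_requirements(*probe_host("/refdata"))` on the target host gives a quick yes/no answer; `/refdata` here is only the database path used in the example input JSON.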
Workflow Dependencies
Third party software (This is included in the Docker image.)
BBTools v38.96 (License: BSD-3-Clause-LBNL)
Requisite database
The RQCFilterData Database must be downloaded and installed. This is a 106 GB tar file which includes reference datasets of artifacts, adapters, contaminants, the phiX genome, and some host genomes.
The following commands will download the database
Sample datasets
Processed Metatranscriptome of soil microbial communities from the East River watershed near Crested Butte, Colorado, United States - ER_RNA_119 (SRR11678315) with metadata available in the NMDC Data Portal.
Inputs
A JSON file containing the following information:
the path to the database directory
the path to the fastq file(s) ([R1, R2] if not interleaved)
input_interleaved (boolean)
output file prefix
(optional) parameters for memory
(optional) number of threads requested
An example input JSON file is shown below:
{
"metaTReadsQC.input_files": ["https://portal.nersc.gov/project/m3408//test_data/metaT/SRR11678315.fastq.gz"],
"metaTReadsQC.proj":"SRR11678315-int-0.1",
"metaTReadsQC.rqc_mem": 180,
"metaTReadsQC.rqc_thr": 64,
"metaTReadsQC.database": "/refdata/"
}
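An input file like the one above can also be generated programmatically. This is a sketch under stated assumptions: the helper `build_inputs` is hypothetical, and the key `metaTReadsQC.input_interleaved` is inferred from the input list above (the example JSON does not show it), so check it against the workflow's WDL before relying on it.

```python
import json


def build_inputs(database, input_files, proj,
                 interleaved=None, rqc_mem=None, rqc_thr=None):
    """Assemble workflow inputs using the key names from the example JSON."""
    files = list(input_files)
    if not files:
        raise ValueError("at least one FASTQ path or URL is required")
    if interleaved is False and len(files) % 2 != 0:
        raise ValueError("non-interleaved input needs R1/R2 pairs")

    inputs = {
        "metaTReadsQC.input_files": files,
        "metaTReadsQC.proj": proj,
        "metaTReadsQC.database": database,
    }
    # Assumed key name: listed as `input_interleaved` in the text,
    # but absent from the example JSON.
    if interleaved is not None:
        inputs["metaTReadsQC.input_interleaved"] = bool(interleaved)
    # Optional resource settings are written only when provided.
    if rqc_mem is not None:
        inputs["metaTReadsQC.rqc_mem"] = int(rqc_mem)
    if rqc_thr is not None:
        inputs["metaTReadsQC.rqc_thr"] = int(rqc_thr)
    return inputs


example = build_inputs(
    database="/refdata/",
    input_files=["SRR11678315.fastq.gz"],  # placeholder local path
    proj="SRR11678315-int-0.1",
    rqc_mem=180,
    rqc_thr=64,
)
print(json.dumps(example, indent=2))
```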
Output
In the workflow execution directories, there will be a folder called filtered containing all of the output files listed below. The bolded outputs below are copied to the primary output folder for the full workflow; these are the files shown through the NMDC-EDGE website. The rqcfilter2.sh output is named raw.anqdpht.fastq.gz. Using the dataset above as an example, the main output would be renamed SRR11678315-int-0.1.filtered.fastq.gz. Other files include statistics on the quality of the data; what was trimmed, detected, and filtered in the data; a status log; and a shell script documenting the steps implemented so the workflow can be reproduced.
An example output JSON file (filterStats.json) is shown below:
{
"inputReads": 16809276,
"kfilteredBases": 4500,
"qfilteredReads": 3978,
"ktrimmedReads": 467761,
"outputBases": 1473400259,
"ktrimmedBases": 60463632,
"kfilteredReads": 15,
"qtrimmedBases": 2345,
"outputReads": 4974016,
"gcPolymerRatio": 112.898477,
"inputBases": 5076401352,
"qtrimmedReads": 292,
"qfilteredBases": 1185765
}
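A quick way to read these statistics is to compute how much of the input survived filtering. The sketch below uses only field names that appear in the example above; the helper name `retention` is hypothetical, and the values are copied from the example rather than read from disk.

```python
def retention(stats: dict) -> dict:
    """Fraction of reads and bases remaining after QC filtering."""
    return {
        "reads_retained": stats["outputReads"] / stats["inputReads"],
        "bases_retained": stats["outputBases"] / stats["inputBases"],
    }


# Values taken from the example filterStats.json shown above.
example_stats = {
    "inputReads": 16809276,
    "outputReads": 4974016,
    "inputBases": 5076401352,
    "outputBases": 1473400259,
}

for name, fraction in retention(example_stats).items():
    print(f"{name}: {fraction:.1%}")
```

For a real run, the dictionary would typically be loaded with `json.load` from the filterStats.json file in the filtered folder.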
Below is an example of all the files in the filtered output directory from rqcfilter2.sh, with descriptions to the right. The italicized files are selected for output through NMDC-EDGE.
| Directory/File Name | Description |
|---|---|
| raw.anqrpht.fastq.gz | main output (clean data) |
| rRNA.fastq.gz | filtered ribosomal reads |
| adaptersDetected.fa | adapters detected and removed |
| bhist.txt | base composition histogram by position |
| cardinality.txt | estimation of the number of unique kmers |
| commonMicrobes.txt | detected common microbes |
| file-list.txt | output file list for rqcfilter2.sh |
| filterStats.txt | summary statistics |
| filterStats.json | summary statistics in JSON format |
| filterStats2.txt | more detailed summary statistics |
| gchist.txt | GC content histogram |
| human.fq.gz | detected human sequence reads |
| ihist_merge.txt | insert size histogram |
| khist.txt | kmer-frequency histogram |
| kmerStats1.txt | synthetic molecule (phiX, linker, lambda, pJET) filter run log |
| kmerStats2.txt | synthetic molecule (short contamination) filter run log |
| ktrim_kmerStats1.txt | detected adapters filter run log |
| ktrim_scaffoldStats1.txt | detected adapters filter statistics |
| microbes.fq.gz | detected common microbes sequence reads |
| microbesUsed.txt | common microbes list for detection |
| peaks.txt | number of unique kmers in each peak on the histogram |
| phist.txt | polymer length histogram |
| refStats.txt | human reads filter statistics |
| reproduce.sh | the shell script to reproduce the run |
| scaffoldStats1.txt | detected synthetic molecule (phiX, linker, lambda, pJET) statistics |
| scaffoldStats2.txt | detected synthetic molecule (short contamination) statistics |
| scaffoldStatsSpikein.txt | detected spike-in kapa tag statistics |
| sketch.txt | mash type sketch scanned result against nt, refseq, silva database sketches |
| spikein.fq.gz | detected spike-in kapa tag sequence reads |
| status.log | rqcfilter2.sh running log |
| synth1.fq.gz | detected synthetic molecule (phiX, linker, lambda, pJET) sequence reads |
| synth2.fq.gz | detected synthetic molecule (short contamination) sequence reads |
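After a run, the listing above can serve as a checklist. The following sketch reports which of a few report files are absent from the filtered folder. The helper names and the choice of which files to check are illustrative assumptions; the file names themselves come from the table.

```python
from pathlib import Path

# A small, illustrative subset of the files listed in the table above.
EXPECTED = (
    "filterStats.txt",
    "filterStats.json",
    "filterStats2.txt",
    "status.log",
    "reproduce.sh",
    "file-list.txt",
)


def missing_outputs(present_names, expected=EXPECTED):
    """Return the expected file names that are not in `present_names`."""
    present = set(present_names)
    return [name for name in expected if name not in present]


def list_filtered_dir(filtered_dir):
    """List the names of regular files directly inside the filtered folder."""
    return [p.name for p in Path(filtered_dir).iterdir() if p.is_file()]
```

For example, `missing_outputs(list_filtered_dir("filtered"))` returns an empty list when every file in `EXPECTED` is present.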
Version History
0.0.7 (release date 08/23/2024; previous versions: 0.0.6)
Point of contact
Original author: Brian Bushnell <bbushnell@lbl.gov>
Package maintainers: Chienchi Lo <chienchi@lanl.gov>