GeneLab_Data_Processing BulkRNASeq workflow should determine adaptor type automatically

Currently, the workflow user is expected to replace this value manually in the workflow module file. Instead, the adaptor type should be determined automatically, perhaps from the raw FastQC/MultiQC reports, and supplied to the trimming process.

DPPD Reference

https://github.com/nasa/GeneLab_Data_Processing/blob/0fe1dfd46ee662a333ac49e6013dbd82f86cb987/RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md?plain=1#L207

Workflow Reference

https://github.com/nasa/GeneLab_Data_Processing/blob/0fe1dfd46ee662a333ac49e6013dbd82f86cb987/RNAseq/Workflow_Documentation/NF_RCP-F/workflow_code/modules/quality.nf#L73-L76

Apr 07 '23 16:04 J-81

A potential route is Trim Galore's built-in adaptor auto-detection: https://github.com/FelixKrueger/TrimGalore/blob/0.6.7/Docs/Trim_Galore_User_Guide.md#adapter-auto-detection
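
For context, here is a rough sketch of the idea behind that auto-detection as I read the user guide: count how many of the leading reads contain each standard adapter prefix and pick the most frequent one, falling back to Illumina when nothing is found or there is a tie. This is an illustration only, not Trim Galore's actual code; the function name `detect_adapter` and the toy reads are made up for the example, and the real tool does more (for example, it prints the runner-up hit).

```python
# Adapter prefixes as listed in the Trim Galore user guide.
ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera": "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}


def detect_adapter(reads, max_reads=1_000_000):
    """Return (best_adapter_name, counts) for an iterable of read sequences.

    Hypothetical helper: counts reads that contain each adapter prefix,
    looking at no more than `max_reads` reads.
    """
    counts = {name: 0 for name in ADAPTERS}
    for i, seq in enumerate(reads):
        if i >= max_reads:
            break
        for name, adapter in ADAPTERS.items():
            if adapter in seq:
                counts[name] += 1

    top = max(counts.values())
    winners = [name for name, n in counts.items() if n == top]
    # Assumed fallback: no hits or a tie means default to Illumina.
    if top == 0 or len(winners) > 1:
        return "Illumina", counts
    return winners[0], counts


# Toy reads, invented for this example (not taken from GLDS-426).
toy_reads = [
    "ACGTCTGTCTCTTATACACATCT",
    "TTTTCTGTCTCTTATAGG",
    "GGGGAGATCGGAAGAGCAC",
    "ACGTACGT",
]
best, counts = detect_adapter(toy_reads)
print(best, counts)
```

If the workflow relies on this behaviour, the detected adapter and its count should still be surfaced in the logs so a wrong guess on a small or unusual dataset is easy to spot.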

Apr 07 '23 16:04 J-81

I'll try auto-detection by omitting the flag, and will of course validate that the auto-detected adapter is consistent with supplying the parameter directly.

Apr 12 '23 21:04 J-81

Testing Results using GLDS-426_Truncated (Known to have Nextera adapters):

CURRENT (With --illumina)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (301 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       118 (39.3%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         42,715 bp (94.9%)

With --nextera instead of --illumina

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (297 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)

With neither --nextera nor --illumina (i.e. autodetect mode)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Using Nextera adapter for trimming (count: 113). Second best hit was smallRNA (count: 16)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (310 µs/read; 0.19 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)
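
The comparison above was done by eye. A small script along these lines could automate the same check, assuming the report text keeps the format shown in these logs. `parse_trim_report` is a hypothetical helper, not part of the workflow, and the two report strings are short excerpts copied from the `--nextera` and autodetect runs above.

```python
import re


def parse_trim_report(text):
    """Pull the adapter sequence, its origin, and the adapter hit count
    out of a Trim Galore trimming report (format as in the logs above)."""
    adapter = re.search(
        r"Adapter sequence: '([ACGTN]+)' \((.*?); (user defined|auto-detected)\)",
        text,
    )
    hits = re.search(r"Reads with adapters:\s+([\d,]+) \(([\d.]+)%\)", text)
    if adapter is None or hits is None:
        raise ValueError("report does not match the expected format")
    return {
        "sequence": adapter.group(1),
        "kit": adapter.group(2),
        "source": adapter.group(3),
        "reads_with_adapters": int(hits.group(1).replace(",", "")),
    }


# Excerpts from the --nextera run and the autodetect run.
nextera_report = """\
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)
Reads with adapters:                       233 (77.7%)
"""
auto_report = """\
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Reads with adapters:                       233 (77.7%)
"""

user = parse_trim_report(nextera_report)
auto = parse_trim_report(auto_report)

# Autodetect is consistent if it chose the same adapter and trimmed the same reads.
consistent = (
    user["sequence"] == auto["sequence"]
    and user["reads_with_adapters"] == auto["reads_with_adapters"]
)
print(consistent)
```

For this dataset the two runs agree on both fields, which matches the identical summaries above.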

May 08 '23 17:05 J-81

[x] Implemented in 3b7e0bab4017e90481359c48f9cf7c8837ed54d2
[x] DPPD Updated in 2a56552

May 25 '23 21:05 J-81