demultiplexing fast5
Hi. From #14 it seems that MotifSeq could demultiplex a bundle of fast5 files to retrieve the fast5 files for each strain. Is this right? Or are there other options in SquiggleKit to do the same? Thanks
It depends on how you are demultiplexing. And what do you mean by strains?
More information is required for me to answer your question.
Hi, basically I have a bundle of fast5 files from a MinION run that includes sequencing from different bacterial strains (i.e. samples). Normally, I basecall and then demultiplex the fastq using guppy. Now I'd like to demultiplex directly on the fast5 files, i.e. divide them per barcode without going through basecalling. I hope that is clear.
oooh right. fast5_fetcher_multi paired with ont_fast5_api would be the tool for that.
for each barcodeXX.fastq file do something like this
mkdir dmux_barcode01_single
# extract the individual fast5 files
python3 fast5_fetcher_multi.py -q barcode01.fastq -s sequencing_summary.txt -m /path/to/fast5s/ -o ./dmux_barcode01_single/
# package the individual files up again (I really should just do this in fast5_fetcher...one day)
single_to_multi_fast5 -i dmux_barcode01_single/ -s dmux_barcode01_multi --filename_base barcode01
# remove intermediate fast5 files
rm -r dmux_barcode01_single/
This will use the readIDs in the demultiplexed fastq file, match them with the fast5 filenames in the sequencing summary, find those files in the path given with -m, and save them to the directory given with -o. The output directory should be created before running fast5_fetcher_multi.
ont_fast5_api then has a script called single_to_multi_fast5, which packs the extracted fast5 files into multi-fast5 files again.
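To make the matching step more concrete, here is a minimal sketch of the readID-to-filename lookup described above. It is not the code fast5_fetcher_multi actually runs, just an illustration. It assumes 4-line fastq records and a tab-separated sequencing summary with columns named `read_id` and `filename` (column names can differ between software versions, so check your own file). The function names and the example data are made up.

```python
import csv
import io


def read_ids_from_fastq(handle):
    """Yield the read ID from each fastq header (assumes 4-line records)."""
    for index, line in enumerate(handle):
        if index % 4 == 0 and line.startswith("@"):
            # The ID is the first whitespace-delimited token after '@'.
            yield line[1:].split()[0]


def fast5_for_reads(summary_handle, wanted_ids):
    """Return {read_id: fast5 filename} for the wanted reads.

    Assumes a tab-separated summary with 'read_id' and 'filename' columns.
    """
    wanted = set(wanted_ids)
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return {
        row["read_id"]: row["filename"]
        for row in reader
        if row["read_id"] in wanted
    }


# Made-up example data, for illustration only.
fastq = io.StringIO(
    "@read-a runid=x ch=1\nACGT\n+\n!!!!\n"
    "@read-c runid=x ch=2\nTTGA\n+\n!!!!\n"
)
summary = io.StringIO(
    "filename\tread_id\n"
    "batch_0.fast5\tread-a\n"
    "batch_0.fast5\tread-b\n"
    "batch_1.fast5\tread-c\n"
)

ids = list(read_ids_from_fastq(fastq))
lookup = fast5_for_reads(summary, ids)
for read_id in ids:
    print(read_id, lookup[read_id])
```

With real data you would open barcodeXX.fastq and the sequencing summary instead of the in-memory strings.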
Note that if you are on a system with hard limits on the number of files, like an HPC, you should check how many reads are in each barcodeXX.fastq file, as each read will produce one fast5 file, so you could hit those limits. If that is an issue, you can split the fastq file up and run it in parts, or extract the readIDs manually and use only ont_fast5_api to extract the reads.
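If you go the "run in parts" route, splitting the readIDs into fixed-size batches could look something like the sketch below. The helper name `batches` and the batch size are made up for illustration. Each batch could then be written out as its own ID list (one readID per line) and extracted separately, for example with the `fast5_subset` script from ont_fast5_api if your version provides it.

```python
from itertools import islice


def batches(read_ids, batch_size):
    """Yield successive lists of at most batch_size read IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(read_ids)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Made-up read IDs, for illustration only.
read_ids = [f"read-{n}" for n in range(10)]
parts = list(batches(read_ids, 4))
for number, part in enumerate(parts):
    print(f"part {number}: {len(part)} reads")
```

Pick the batch size so that the number of single fast5 files on disk at any one time stays below your file limit.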
I hope this helps.
Right. So there is no way to skip basecalling/demultiplexing first, i.e. to work without fastq files. Actually, the MinION writes a sequencing_summary.txt during the run even when generating fast5 only. Could I use that file to assign each read to a barcode?
Ahh, well the only DNA signal-level barcode demultiplexer out there that I know of is Deepbinner. But it's deprecated now, if I remember correctly.
MotifSeq isn't sensitive enough to do it as well as base-level demultiplexing.
So no, there isn't really an easy way to avoid basecalling.
So the approach described here https://psy-fer.github.io/SquiggleKitDocs/MotifSeq/#background in the Nanopore adapter identification section is not useful for this?
It would work, yes, but not as effectively as a base-level demultiplexer. Only a system using some form of machine learning, as used in Deepbinner or in what we have done with deeplexicon, would get similar or better results.
Is there a particular reason to do this? Perhaps there is another solution.
Well, my purpose is to bypass basecalling in order to remove one source of error, and then use the UNCALLED pipeline (https://github.com/skovaka/UNCALLED) to map the fast5 signal to a reference genome, i.e. amplicon analysis. That's why I need to split a MinION run into the different barcodes before mapping it.
UNCALLED uses the Read Until API. Are you planning to do the demultiplexing in real time, or are you looking to run UNCALLED after a run?
The accuracy of uncalled is not as good as basecalling and aligning, as the base sequence it uses is only an approximation.
The idea is to run it after the run, on amplicons at very high depth (>4000), and then compare it with the standard procedure to check whether the approach is feasible. Thanks for your comments.
If you want to benchmark to see how well it does, you can use the regular demultiplexing data to split the uncalled data output and assess that way. Then if it is better, look into the demultiplexing with signal.
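As a rough sketch of that benchmark split: assuming the mapping output is in PAF format (read ID in the first tab-separated column) and you have a per-read barcode table from the regular demultiplexing with columns named `read_id` and `barcode_arrangement` (those column names are an assumption, so adjust them to your files), you could group the mappings per barcode like this. The function names and example records are made up.

```python
import csv
import io
from collections import defaultdict


def load_barcodes(handle, id_col="read_id", barcode_col="barcode_arrangement"):
    """Return {read_id: barcode} from a tab-separated per-read table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[barcode_col] for row in reader}


def split_paf_by_barcode(paf_handle, barcode_of, default="unclassified"):
    """Group PAF lines by the barcode assigned to their read ID."""
    groups = defaultdict(list)
    for line in paf_handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # PAF column 1 is the query (read) name.
        read_id = line.split("\t", 1)[0]
        groups[barcode_of.get(read_id, default)].append(line)
    return dict(groups)


# Made-up example data, for illustration only.
barcode_table = io.StringIO(
    "read_id\tbarcode_arrangement\n"
    "read-a\tbarcode01\n"
    "read-b\tbarcode02\n"
)
paf = io.StringIO(
    "read-a\t400\t10\t390\t+\tref\t5000\t100\t480\t60\t380\t60\n"
    "read-b\t420\t5\t410\t-\tref\t5000\t900\t1305\t70\t405\t60\n"
    "read-z\t350\t12\t340\t+\tref\t5000\t50\t378\t40\t328\t30\n"
)

groups = split_paf_by_barcode(paf, load_barcodes(barcode_table))
for barcode in sorted(groups):
    print(barcode, len(groups[barcode]))
```

Reads that have no barcode assignment end up under `unclassified`, which is worth keeping an eye on when comparing against the standard pipeline.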
There is a possibility for me to extend deeplexicon algorithms to DNA, rather than just RNA.
Let me see if I understood correctly. Basecall/demultiplex the fast5 files using e.g. guppy. Then, as you suggested in https://github.com/Psy-Fer/SquiggleKit/issues/46#issuecomment-797118955, get the fast5 files per barcode using the sequencing summary, and then use the UNCALLED pipeline to get the fasta. Finally, compare the results. Right?
Sounds about right yea. Plus the fiddly bits in between. Good luck!
Yep! I will let you know how it goes. If it is better, we will then need to think about how to avoid the basecalling step ... but that is another story. Thanks a lot for your help and comments.
You are welcome.
If it is the case, I'll build a demultiplexer
Wow, that sounds really great. Keep in touch then!
I see that you indeed have something similar, but for RNA: https://github.com/Psy-Fer/deeplexicon. Good to know.
Yes.
I'm going to extend that to DNA. Planning to have something in a few months.
That's great! I will wait for your tool. If you need someone to test it before you release it, I will be happy to do that.