SquiggleKit demultiplexing fast5

Hi. From #14, it seems that MotifSeq could demultiplex a bundle of fast5 files to retrieve the fast5 files for each strain. Is this right? Or are there options in SquiggleKit to do the same? Thanks

Mar 09 '21 22:03 aspitaleri

It depends how you are demultiplexing, and what you mean by strains?

More information is required for me to answer your question.

Mar 11 '21 22:03 Psy-Fer

Hi, basically I have a bundle of fast5 files from a MinION run which includes sequencing from different bacterial strains (i.e. samples). Normally, I basecall and then demultiplex the fastq using guppy. Now I'd like to demultiplex directly on the fast5 files, i.e. divide them per barcode without going through basecalling. Hope it is clear.

Mar 11 '21 23:03 aspitaleri

oooh right. fast5_fetcher_multi paired with ont_fast5_api would be the tool for that.

for each barcodeXX.fastq file do something like this

mkdir dmux_barcode01_single

# extract the individual fast5 files
python3 fast5_fetcher_multi.py -q barcode01.fastq -s sequencing_summary.txt -m /path/to/fast5s/ -o ./dmux_barcode01_single/

# package the individual files up again (I really should just do this in fast5_fetcher...one day)
single_to_multi_fast5 -i dmux_barcode01_single/ -o dmux_barcode01_multi --filename_base barcode01

# remove intermediate fast5 files
rm -r dmux_barcode01_single/

This will use the readIDs in the demultiplexed fastq file, match them with the fast5 filenames in the sequencing summary, find those files in the path given with -m, and save the reads to the directory given with -o. The output directory should be made before running fast5_fetcher_multi.

Then the ont_fast5_api has a script called single_to_multi_fast5 which will pack the fast5s extracted into multi files again.
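
As a rough illustration of the lookup described above (not the actual fast5_fetcher_multi code), the matching step can be sketched in Python. It assumes a plain, uncompressed 4-line FASTQ and a tab-separated sequencing summary with `filename` and `read_id` columns; the column names differ between MinKNOW/guppy versions, so they are parameters here, and both function names are hypothetical.

```python
import csv

def read_ids_from_fastq(fastq_path):
    """Collect read IDs from the header line of each 4-line FASTQ record."""
    ids = set()
    with open(fastq_path) as fq:
        for i, line in enumerate(fq):
            # Every 4th line (0, 4, 8, ...) is a header: "@<read_id> other=fields"
            if i % 4 == 0 and line.startswith("@"):
                ids.add(line[1:].split()[0])
    return ids

def fast5_files_for_reads(summary_path, read_ids,
                          file_col="filename", id_col="read_id"):
    """Map each fast5 filename to the wanted read IDs it contains.

    file_col and id_col are assumptions about the summary header;
    check the first line of your own sequencing summary.
    """
    wanted = {}
    with open(summary_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row[id_col] in read_ids:
                wanted.setdefault(row[file_col], []).append(row[id_col])
    return wanted
```

The returned dictionary tells you which multi-fast5 files need to be opened for a given barcode, and which reads to pull out of each one.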

Note that if you are on a system with hard file number limits, like an HPC, check how many reads are in each barcodeXX.fastq file, as each read will make 1 fast5 file, so you could hit limits. If that is an issue, you can split the file up and run in parts, or extract the readIDs manually and use only ont_fast5_api to extract the reads.
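
A minimal sketch of that check, assuming an uncompressed 4-line FASTQ: count the reads first, then write the readIDs out in batches so each run stays under the file limit. The function names, the output file naming, and the default batch size are all hypothetical choices, not part of SquiggleKit or ont_fast5_api.

```python
def count_fastq_reads(fastq_path):
    """Number of reads in a 4-line-per-record FASTQ file."""
    with open(fastq_path) as fq:
        return sum(1 for _ in fq) // 4

def write_read_id_batches(read_ids, out_prefix, batch_size=4000):
    """Write read IDs to numbered text files, batch_size IDs per file.

    Returns the list of files written, e.g. <out_prefix>_000.txt.
    """
    ids = sorted(read_ids)
    paths = []
    for n, start in enumerate(range(0, len(ids), batch_size)):
        path = f"{out_prefix}_{n:03d}.txt"
        with open(path, "w") as out:
            out.write("\n".join(ids[start:start + batch_size]) + "\n")
        paths.append(path)
    return paths
```

Each batch file is a plain list of readIDs, one per line. ont_fast5_api ships a fast5_subset command that works from a read ID list of this kind; check its --help for the exact options in your installed version.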

I hope this helps.

Mar 11 '21 23:03 Psy-Fer

Right. So there is no way to avoid basecalling/demultiplexing first, i.e. to do this without fastq files. Actually, the MinION writes a sequencing_summary.txt during the run even when generating fast5 only. Could I use that file to assign each read to a barcode?

Mar 11 '21 23:03 aspitaleri

Ahh, well the only DNA signal-level barcode demultiplexer out there that I know of is Deepbinner. But it's deprecated now, if I remember correctly.

MotifSeq isn't sensitive enough to do it as well as base-level demultiplexing.

So no, there isn't really an easy way to avoid basecalling.

Mar 11 '21 23:03 Psy-Fer

So the approach described here https://psy-fer.github.io/SquiggleKitDocs/MotifSeq/#background for Nanopore adapter identification is not useful for this?

Mar 12 '21 00:03 aspitaleri

It would work, yes, but not as effectively as a base-level demultiplexer. Only a system using some form of machine learning, like that used in Deepbinner or what we have done with deeplexicon, would get similar or better results.

Is there a particular reason to do this? Perhaps there is another solution.

Mar 12 '21 00:03 Psy-Fer

Well, my purpose is to bypass basecalling in order to remove one source of error, and then use the UNCALLED pipeline (https://github.com/skovaka/UNCALLED) to map the fast5 reads to a reference genome, i.e. amplicon analysis. That's why I need to demultiplex a MinION run into the different barcodes before mapping.

Mar 12 '21 00:03 aspitaleri

UNCALLED uses the ReadUntil API. Are you planning to do the demultiplexing in real time? Or are you looking to run UNCALLED after a run?

The accuracy of UNCALLED is not as good as basecalling and aligning, as the base sequence it uses is only an approximation.

Mar 12 '21 00:03 Psy-Fer

The idea is to run it after the run, on amplicons and so at huge depth (>4000), and then compare with the standard procedure to check whether the approach is feasible. Thanks for your comments

Mar 12 '21 00:03 aspitaleri

If you want to benchmark to see how well it does, you can use the regular demultiplexing data to split the UNCALLED output and assess it that way. Then, if it is better, look into demultiplexing with signal.
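
One possible way to do that split, sketched in Python under a few assumptions: the mapping output is PAF with the readID in the first column, and the basecalled demultiplexing left a tab-separated summary with `read_id` and `barcode_arrangement` columns (as guppy's barcoding summary typically has; check your own header). The function names and the `not_in_summary` label are made up for this example.

```python
import csv
from collections import defaultdict

def load_barcodes(summary_path, id_col="read_id", bc_col="barcode_arrangement"):
    """Read a tab-separated barcoding summary into {read_id: barcode}.

    id_col and bc_col are assumed column names.
    """
    with open(summary_path, newline="") as fh:
        return {row[id_col]: row[bc_col]
                for row in csv.DictReader(fh, delimiter="\t")}

def split_paf_by_barcode(paf_path, read_to_barcode):
    """Group PAF lines by the barcode assigned to each read ID (column 1)."""
    groups = defaultdict(list)
    with open(paf_path) as paf:
        for line in paf:
            if not line.strip():
                continue  # skip blank lines
            read_id = line.split("\t", 1)[0]
            # Reads missing from the summary get a placeholder label
            barcode = read_to_barcode.get(read_id, "not_in_summary")
            groups[barcode].append(line.rstrip("\n"))
    return dict(groups)
```

With the mappings grouped per barcode, each group can be compared against the basecall-and-align result for the same barcode.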

There is a possibility for me to extend deeplexicon algorithms to DNA, rather than just RNA.

Mar 12 '21 01:03 Psy-Fer

Let me see if I understood correctly. Basecall/demultiplex the fast5 using e.g. guppy. Then, as you suggested in https://github.com/Psy-Fer/SquiggleKit/issues/46#issuecomment-797118955, get the fast5 per barcode using the sequencing_summary, and then use the UNCALLED pipeline to get the fasta. Finally, compare the results. Right?

Mar 12 '21 08:03 aspitaleri

Sounds about right yea. Plus the fiddly bits in between. Good luck!

Mar 12 '21 09:03 Psy-Fer

Yep! I will update you on how it goes. In case it is better, we then need to think about how to avoid the basecalling step ... but this is another story. Thanks a lot for your help and comments

Mar 12 '21 09:03 aspitaleri

You are welcome.

If it is the case, I'll build a demultiplexer

Mar 12 '21 09:03 Psy-Fer

Wow, that sounds really great. Keep in touch then!

Mar 12 '21 09:03 aspitaleri

I see indeed that you have something similar, but for RNA: https://github.com/Psy-Fer/deeplexicon. Good to know

Mar 12 '21 09:03 aspitaleri

Yes.

I'm going to extend that to DNA. Planning to have something in a few months.

Mar 30 '21 06:03 Psy-Fer

That's great! I will wait for your tool. If you need someone to test it before releasing it, I will be happy to do it.

Mar 30 '21 07:03 aspitaleri