Positional arguments (especially seqkit_stats_nosecondary) in duplex_tools assess_split_on_adapter

Open rocpv1977 opened this issue 2 years ago • 1 comments

Hi!

I am trying to assess how well duplex_tools split_on_adapter is doing its job, and duplex_tools assess_split_on_adapter asks for the following positional arguments: seqkit_stats_nosecondary edited_reads unedited_reads split_multiple_times

I imagine the last three are the .pkl files that are created in the folder for split files, but I am not sure what "seqkit_stats_nosecondary" is. I have tried passing the output of

seqkit stats path/to/file --all

and

seqkit stats path/to/file --all

but I get this error:

/media/seq-ur/65225E7076CF2AF3/basecalling_bacterias/K_oxytoca/K_oxytoca_29_03_2023/pass/split/seqkit_stats contains 1 reads
Traceback (most recent call last):
  File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'read'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/seq-ur/venv/bin/duplex_tools", line 33, in <module>
    sys.exit(load_entry_point('duplex-tools==0.3.2', 'console_scripts', 'duplex_tools')())
  File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/__init__.py", line 39, in main
    args.func(args)
  File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/assess_split_on_adapter.py", line 129, in main
    assess(
  File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/assess_split_on_adapter.py", line 32, in assess
    txt = txt[txt['read'].isin(expected_read_ids)]
  File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 3760, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
    raise KeyError(key) from err
KeyError: 'read'
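
The last frames show `assess` selecting a column named `read` from a table, presumably the one loaded from the first positional argument, so the `KeyError` suggests the file I supplied has no such column. Here is a minimal sketch for inspecting a file's header before handing it to the tool. It assumes a tab-separated file with a header row, and the helper names (`header_columns`, `find_read_column`) are hypothetical, not part of duplex_tools. I do not know whether the tool normalizes column-name case, so the check returns both an exact and a case-insensitive result.

```python
import csv
from typing import Iterable, List, Tuple


def header_columns(lines: Iterable[str], delimiter: str = "\t") -> List[str]:
    """Return the column names from the first row of a delimited text table.

    `lines` can be an open text file or any iterable of strings.
    An empty input yields an empty list.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    return next(reader, [])


def find_read_column(lines: Iterable[str], wanted: str = "read") -> Tuple[bool, bool]:
    """Return (exact_match, case_insensitive_match) for the wanted column name."""
    columns = [name.strip() for name in header_columns(lines)]
    exact = wanted in columns
    loose = wanted.lower() in [name.lower() for name in columns]
    return exact, loose
```

Usage would look like `with open("path/to/file") as handle: print(find_read_column(handle))`. If both values come back `False`, the file is probably not the per-read table the tool expects.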

Could you help me understand what "seqkit_stats_nosecondary" is?

Thanks!

rocpv1977, Apr 03 '23 17:04

Hi @rocpv1977!

Thanks for the question. You're definitely on the right track. The first argument is expected to be the output of seqkit bam run on a BAM file that does not contain secondary alignments. If your alignment was done in a way that includes secondary alignments, you would need to filter them out first, for example with samtools view:

samtools view -b -F 256 input.bam > nosecondary.bam

seqkit bam nosecondary.bam 2> nosecondary.txt
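
For reference, `-F 256` excludes every record whose SAM FLAG has bit 0x100 (secondary alignment) set. The sketch below only illustrates that bit test on plain `(read_id, flag)` pairs; it does not read BAM files, and the function names are made up for illustration. Note that supplementary alignments (bit 0x800) are not affected by this filter.

```python
from typing import Iterable, List, Tuple

SECONDARY_FLAG = 0x100  # decimal 256: "secondary alignment" bit of the SAM FLAG field


def is_secondary(flag: int) -> bool:
    """True if the secondary-alignment bit is set in a SAM FLAG value."""
    return bool(flag & SECONDARY_FLAG)


def drop_secondary(records: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep only records whose flag lacks the secondary bit (same test as -F 256)."""
    return [(read_id, flag) for read_id, flag in records if not is_secondary(flag)]


if __name__ == "__main__":
    # Hypothetical records: flag 272 = 256 (secondary) + 16 (reverse strand).
    example = [("read_a", 0), ("read_a", 272), ("read_b", 16)]
    print(drop_secondary(example))
```

A flag can combine several bits, which is why the test uses a bitwise AND rather than comparing the flag to 256 directly.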

Apologies for the confusing naming and the lack of documentation on this. It's worth tidying up.

Best regards

ollenordesjo, Apr 04 '23 08:04