smudgeplot icon indicating copy to clipboard operation
smudgeplot copied to clipboard

Non deterministic behaviour of middle_one_away function

Open RNieuwenhuis opened this issue 5 years ago • 8 comments

What did you do

I ran smudgeplot.py hetkmers -o Test_middle --middle K_mer_kmc_db_k_21_L15_U220.dump 4 times in a row on this 2.5gb dump file.

wc -l Test_middle_?_coverages.tsv

gives

  659425 Test_middle_1_coverages.tsv
  659628 Test_middle_2_coverages.tsv
  659258 Test_middle_3_coverages.tsv
  659483 Test_middle_4_coverages.tsv

So largely the same quantity of output and, most importantly, it will probably have 0 effect on the outcome of the smudgeplot.

I was not able to find out where this non-deterministic behavior is originated in the code and would love to know.

python --version
Python 3.6.10

RNieuwenhuis avatar Aug 23 '20 22:08 RNieuwenhuis

Thanks Roland, that is indeed worrying. I wonder what that might be.

The only part I can think of is the step of building a suffix array, I did not go that deep in that rabbit hole, but if I remember well the package I used for SA takes some heuristic shortcuts to be faster. I wonder how to test it. Also, now I get why I the program will be deterministic for a really small input file.

KamilSJaron avatar Aug 24 '20 09:08 KamilSJaron

I would doubt that - the suffix array must be in completely sorted order for it to support variable lengths queries correctly.

Mike

On Mon, Aug 24, 2020 at 5:49 AM Kamil S. Jaron [email protected] wrote:

Thanks Roland, that is indeed worrying. I wonder what that might be.

The only part I can think of is the step of building a suffix array, I did not go that deep in that rabbit hope, but if I remember well the package I used for SA takes some heuristic shortcuts to be faster. I wonder how to test it. Also, now I get why I the program will be deterministic for a really small input file.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/KamilSJaron/smudgeplot/issues/70#issuecomment-679027842, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP343VQCR3NOPRB2WG65TSCIZR5ANCNFSM4QI4UR5A .

mschatz avatar Aug 24 '20 12:08 mschatz

Hmm, interesting. That means I am left without any suspects.

I will add an option to generate more verbose logging to help to debug this one. Stay tuned for updates, probably late this week.

KamilSJaron avatar Aug 24 '20 13:08 KamilSJaron

Were you able to reproduce the issue? I had another look at the code but I don't see the cause of the behavior. A nice example of how a test dataset can influence testing outcomes.

RNieuwenhuis avatar Aug 24 '20 20:08 RNieuwenhuis

Had another look and it had to be the use of set() and the resizing of it, due to the iterative nature of the code, and hash randomization.

I replaced the sets by using dictionaries with values of None, since they are insertion ordered in python3.6+ and the function is now deterministic.

Rounding up the work on the "parallelization" and will then do a pull request that also implements this fix.

RNieuwenhuis avatar Aug 26 '20 21:08 RNieuwenhuis

I find the reason. After the for loop, the value of kmer_R and i1 are changed. for kmer_R in filtered: (i1, coverage1), (i2, coverage2) = kmer_R_to_index_family[kmer_R] Then,kmer_R_to_index_family[kmer_R].append((i1,coverage1)) will store wrong i1 with wrong kmer_R. After fix the bug, the results are fixed

sunzhig avatar Sep 24 '20 05:09 sunzhig

@sunzhig Though I think the non-deterministic behaviour was solved with the described solution in my earlier comment, you are definitely correct that the i1 and coverage1 are overwritten and then referenced again, resulting in faulty behaviour.

I hope you don't mind. but for sake of keeping it simple, I added the fix to the pull request that I had already submitted. Because that one has a lot of merge conflicts it is easier for @KamilSJaron I'd say.

RNieuwenhuis avatar Sep 29 '20 15:09 RNieuwenhuis

Excellent, thanks @RNieuwenhuis and @sunzhig I really appreciate your contributions. I also just returned from my vacation, once I will tame my backlog of things a bit more, I can finally work out this commit.

KamilSJaron avatar Sep 29 '20 16:09 KamilSJaron

one away will be retired

KamilSJaron avatar Aug 17 '23 14:08 KamilSJaron