
top() lacks the "unique" functionality

Open decenwang opened this issue 5 years ago • 2 comments

Hi Dr. Nazarov

I found another possible bug, or maybe it is not a bug. Could you please help check? When I use the top() function with ".n" = 50, I get more than 50 rows for some samples. I also see this in your own data: please see https://immunarch.com/reference/top.html, where for sample A2-i133 the top 11 were chosen.

Anyway, that is not the key point. I found that my top 50 clonotypes contain some repeats, so the true number of distinct clonotypes might be 47 or 46, fewer than 50. This probably comes from mixcr. I do not have much time right now to check the problem in mixcr, but I think it happens because sequences with the same CDR3 DNA/amino acid sequence have different overhangs at both ends (maybe they have a different V start and so on), so mixcr treats them as unique clonotypes.

I don't mind that in itself, but I have to merge these repeats into one clonotype and sum their proportions again. Sometimes the same CDR3 amino acid sequence appears more than once even within my top 10 clonotypes, so this may skew my statistics. Could you help check it again? If I "unique" these clonotypes and "sum" the proportions, the order will be rearranged. Looking forward to your ideas!

Best,

Decen

decenwang avatar Aug 07 '20 02:08 decenwang

Hi @decenwang

Thank you so much for this question! This is a very important note to have in the documentation. top takes the top-N most abundant clonotypes by their counts from Clones. But if several clonotypes have exactly the same number of clones, which ones should you choose? In this case top keeps all of the tied clonotypes. Example:

Clones = 5,4,3,2,2,1

top(.n = 4) returns clonotypes with Clones 5,4,3,2,2 even though .n is 4, because there are two clonotypes with the count 2.

Does it make sense? If so, would you still like us to add an argument to force top to cut those additional clonotypes to make sure the exact .n number of clonotypes is always returned?
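The tie rule described above can be sketched in a few lines. This is only an illustration in plain Python, not immunarch's actual R implementation; the function name top_with_ties is hypothetical and the input is just the list of counts from the example.

```python
def top_with_ties(counts, n):
    """Return the n largest counts plus any further counts tied with the n-th.

    Hypothetical helper that mimics the behavior described in this thread;
    it is not part of immunarch.
    """
    ranked = sorted(counts, reverse=True)
    if n <= 0 or not ranked:
        return []
    if n >= len(ranked):
        return ranked
    cutoff = ranked[n - 1]  # count of the n-th most abundant clonotype
    # Keep everything at or above the cutoff, so ties at the boundary survive.
    return [c for c in ranked if c >= cutoff]


clones = [5, 4, 3, 2, 2, 1]
print(top_with_ties(clones, 4))  # [5, 4, 3, 2, 2]
```

With n = 3 there is no tie at the boundary, so exactly three counts come back.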

vadimnazarov avatar Aug 11 '20 12:08 vadimnazarov

Hi Dr. Nazarov,

Thanks for the quick response! That's right! You don't have to force the cutoff; that would not be fair to the sequences with the same number. But I am considering a case like this:

Clones  CDR3.aa
100     CADRFGHEF
80      CDDAGTMF
50      CDDAGTMF

The second and third rows have the same amino acid sequence. Because they may have different overhang sequences, they are assigned as different clones. But when we do Bayesian inference or use the Levenshtein distance, we have to treat them as the same one. It is unfair, but I have to do this. If you are willing, please add a function like "unique" to merge them into one. Similar sequences might have similar binding sites. Many thanks!
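For reference, the requested merge could look roughly like the sketch below. It is plain Python rather than R, the function name merge_by_cdr3 is hypothetical, and it assumes each row is just a (Clones, CDR3.aa) pair as in the small example above. It sums the counts of rows sharing a CDR3.aa sequence, recomputes the proportions, and re-sorts.

```python
from collections import defaultdict


def merge_by_cdr3(rows):
    """Collapse rows sharing a CDR3.aa sequence and recompute proportions.

    rows: iterable of (clones, cdr3_aa) pairs.
    Returns (cdr3_aa, clones, proportion) tuples, most abundant first.
    Hypothetical helper for illustration; not an immunarch function.
    """
    totals = defaultdict(int)
    for clones, cdr3_aa in rows:
        totals[cdr3_aa] += clones  # sum counts of identical CDR3.aa
    grand_total = sum(totals.values())
    merged = [(seq, count, count / grand_total) for seq, count in totals.items()]
    merged.sort(key=lambda row: row[1], reverse=True)
    return merged


rows = [(100, "CADRFGHEF"), (80, "CDDAGTMF"), (50, "CDDAGTMF")]
for seq, count, prop in merge_by_cdr3(rows):
    print(seq, count, round(prop, 3))
```

After merging, CDDAGTMF has 130 clones and moves ahead of CADRFGHEF with 100, which is the reordering mentioned in the original report. Taking the top N after this merge, rather than before, gives N distinct CDR3.aa sequences (plus any ties).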

decenwang avatar Aug 12 '20 12:08 decenwang