seqkit icon indicating copy to clipboard operation
seqkit copied to clipboard

Sampling sequences by number is not truly random

Open alvanuffelen opened this issue 2 years ago • 3 comments

Prerequisites

  • [x] make sure you're are using the latest version by seqkit version
  • [x] read the usage

Describe your issue

  • [x] describe the problem When using seqkit sample -n, the first number of sequences in the file will a higher chance of being picked than later sequences.
  • [x] provide a reproducible example Porting your logic: https://github.com/shenwei356/seqkit/blob/22f71ff449d597f4958553ad10c4fcd4cb0e22f6/seqkit/cmd/sample.go#L155-L163 to Python:
import random
import matplotlib.pyplot as plt

# Toy data to sample from
sample = range(1, 101)

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}

# Iterate a million times
for _ in range(1_000_000):
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.show()

image

A solution could be to shuffle the records before iterating over them:

import random
import matplotlib.pyplot as plt

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
for _ in range(1_000_000):
    # Toy data to sample from
    sample = list(range(1, 101))
    # Shuffle the list each iteration
    random.shuffle(sample)
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.xlabel('Element')
plt.ylabel('Number of times element was picked.')
plt.tight_layout()
plt.show()

image

alvanuffelen avatar May 16 '23 09:05 alvanuffelen

Thre are no truly random numbers.

Seeds also affect the chosen positions: https://bioinf.shenwei.me/seqkit/note/#location-distribution-of-sampled-records

shenwei356 avatar May 18 '23 16:05 shenwei356

Agreed that there are no truly random numbers and that the seed affects the selected sequences. But aren't you actively selecting against the later sequences in your implementation?

alvanuffelen avatar May 18 '23 21:05 alvanuffelen

Splitting sequences into N bins of equal size and randomly choosing one record in each bin? This could be the best way.

But most of the time, we (I) don't care if they are uniformly located in the sequences file, or if each record has an equal chance been chosen. What I need is just N sequences down-sampled from the original file.

If the sequence file is not too big, shuffling before down-sampling could be a more rigorous way.

shenwei356 avatar May 19 '23 15:05 shenwei356