Sampling sequences by number is not truly random
Prerequisites
- [x] make sure you're are using the latest version by
seqkit version - [x] read the usage
Describe your issue
- [x] describe the problem
When using
seqkit sample -n, the first number of sequences in the file will a higher chance of being picked than later sequences. - [x] provide a reproducible example Porting your logic: https://github.com/shenwei356/seqkit/blob/22f71ff449d597f4958553ad10c4fcd4cb0e22f6/seqkit/cmd/sample.go#L155-L163 to Python:
import random
import matplotlib.pyplot as plt
# Toy data to sample from
sample = range(1, 101)
# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
# Iterate a million times
for _ in range(1_000_000):
# Counter for the number of selected elements in each iteration
added_element_counter = 0
for element in sample:
if random.random() <= proportion:
element_count[element] += 1
added_element_counter += 1
if number == added_element_counter:
break
plt.bar(element_count.keys(), height=element_count.values())
plt.show()
A solution could be to shuffle the records before iterating over them:
import random
import matplotlib.pyplot as plt
# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
for _ in range(1_000_000):
# Toy data to sample from
sample = list(range(1, 101))
# Shuffle the list each iteration
random.shuffle(sample)
# Counter for the number of selected elements in each iteration
added_element_counter = 0
for element in sample:
if random.random() <= proportion:
element_count[element] += 1
added_element_counter += 1
if number == added_element_counter:
break
plt.bar(element_count.keys(), height=element_count.values())
plt.xlabel('Element')
plt.ylabel('Number of times element was picked.')
plt.tight_layout()
plt.show()
Thre are no truly random numbers.
Seeds also affect the chosen positions: https://bioinf.shenwei.me/seqkit/note/#location-distribution-of-sampled-records
Agreed that there are no truly random numbers and that the seed affects the selected sequences. But aren't you actively selecting against the later sequences in your implementation?
Splitting sequences into N bins of equal size and randomly choosing one record in each bin? This could be the best way.
But most of the time, we (I) don't care if they are uniformly located in the sequences file, or if each record has an equal chance been chosen. What I need is just N sequences down-sampled from the original file.
If the sequence file is not too big, shuffling before down-sampling could be a more rigorous way.