
feat: add argument seed to topaz normalize

Open swan343 opened this issue 8 months ago • 7 comments

Hi!

I noticed that the documentation here mentions the seed argument, which was present in v0.1.0 but is no longer present in the latest version. This PR adds the seed argument back. With a non-zero seed specified, I can now get exactly the same output across multiple preprocess/normalize runs; without it, the outputs are not equivalent.

swan343 avatar Jun 26 '25 19:06 swan343

Hi @swan343, it looks like that was included in a big update @tbepler made in https://github.com/tbepler/topaz/commit/afa606fce1a5fbf6fa03b20b18d99564d219e18a, so let's see if that was intentional or not.

DarnellGranberry avatar Jun 27 '25 15:06 DarnellGranberry

I removed the seed argument because of multiprocessing. If the random number generator is initialized within each process, you will not get the same result when processing a list of micrographs if you change the number of processes being used or run everything in the main thread. Given this potential lack of consistency I felt it best to simply not expose this option.

tbepler avatar Jun 30 '25 02:06 tbepler

hi @tbepler thanks for the response! From my limited testing, the results are repeatable with multiprocessing as long as the seed argument is given (in theory, np.random.RandomState should always return the same pseudo-random number generator regardless of when and where it is constructed). Here's a sample test script. I do notice that for at least some mrcs the GPU yields different results than the CPU; I'm not sure whether that variation also applies across different computers (perhaps that's what you meant?), but on the same computer/device the results are the same.

import hashlib
import glob
import shutil
import os

def get_md5(file_path: str) -> str:
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(8192)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()


# Baseline run: record the md5 of every output PNG.
result = {}
os.system(' '.join([
    'topaz', 'preprocess', '--device=-1', '--seed=5424', '--num-workers=12',
    '--format=png',
    '-s', '4', '-o=/tmp/lu3435571etv8n.tmp/res', '/tmp/lu3435571etv8n.tmp/test/*.mrc'
]))
for path in sorted(glob.glob('/tmp/lu3435571etv8n.tmp/res/*.png')):
    result[path] = get_md5(path)

# Re-run with 0-9 workers and verify the outputs are bit-identical.
for i in range(10):
    print('testing iteration', i)
    shutil.rmtree('/tmp/lu3435571etv8n.tmp/res', ignore_errors=True)
    os.mkdir('/tmp/lu3435571etv8n.tmp/res')
    os.system(' '.join([
        'topaz', 'preprocess', '--device=-1', '--seed=5424', f'--num-workers={i}',
        '--format=png',
        '-s', '4', '-o=/tmp/lu3435571etv8n.tmp/res', '/tmp/lu3435571etv8n.tmp/test/*.mrc'
    ]))
    for path in sorted(glob.glob('/tmp/lu3435571etv8n.tmp/res/*.png')):
        if get_md5(path) != result[path]:
            print("md5 doesn't match! @ iter", i, path, get_md5(path), result[path])
            raise Exception("md5 doesn't match!")
    print('iteration', i, 'passed')

swan343 avatar Jun 30 '25 16:06 swan343

@swan343 my main concern is actually about consistency between runs with different numbers of workers (or none). With the same number of workers, the results should be the same, because each process's random number generator is independently seeded with the seed. However, if you change the number of workers, the results will change: starting 2 workers in parallel from the same state is not the same as running 1 worker through all the images from that same starting state.

This is a pretty minor replicability issue, but would cause inconsistency between runs even when the seed is set which would be unexpected.
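The worker-count dependence can be shown with a toy sketch (purely illustrative, not Topaz's actual code): seeding one RNG per "worker" and splitting the images between them yields different per-image draws than one RNG consumed sequentially, even though every RNG starts from the same seed.

```python
import numpy as np

SEED = 5424
N_IMAGES = 4  # stand-ins for a list of micrographs

def noise_single_worker():
    # One process: a single RNG seeded once, consumed sequentially
    # across all images.
    rng = np.random.RandomState(SEED)
    return [rng.standard_normal() for _ in range(N_IMAGES)]

def noise_two_workers():
    # Two processes: each worker's RNG starts from the same seed and
    # the images are split between them.
    rng_a = np.random.RandomState(SEED)
    rng_b = np.random.RandomState(SEED)
    half = N_IMAGES // 2
    first = [rng_a.standard_normal() for _ in range(half)]
    second = [rng_b.standard_normal() for _ in range(N_IMAGES - half)]
    return first + second

# Worker B restarts from the seed instead of continuing worker A's
# stream, so the later images get different values between schemes.
assert noise_single_worker() != noise_two_workers()
```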

tbepler avatar Jul 01 '25 04:07 tbepler

@tbepler thanks for the reply. By workers, are you referring to --num-workers or something else? I can generate exactly the same output with --num-workers ranging from 0 to 9 using the modified script above, on CPU or GPU (which I believe overrides it to 1 anyway).

swan343 avatar Jul 01 '25 15:07 swan343

Ah, I see. You want to reset the random number generator to the same seed for every image. This means every image will start from the same random initialization. Is that what you want?

Do you really need to set the random seed for this function? How much variability are you seeing in the results? I would expect there to be a little bit, but not major qualitative differences. If there are, setting the seed will lock that one solution but I'm not sure it addresses the core problem then, which is that the solution is unstable.

I think this is fine, though. I put a few comments on your PR. We can merge it after those are fixed.
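For reference, the per-image reseeding idea being discussed can be sketched like this (an illustrative approximation, not the PR's actual code): build a fresh np.random.RandomState from the fixed seed for every image, so each image's draws are independent of worker count and processing order.

```python
import numpy as np

SEED = 5424

def image_noise(seed=SEED, size=3):
    # Reset the generator to the same seed for every image, so its
    # random values never depend on a shared, sequentially-consumed stream.
    rng = np.random.RandomState(seed)
    return rng.standard_normal(size)

# Processing in any order (or any split across workers) gives the same
# per-image values.
in_order = [image_noise() for _ in range(4)]
reversed_order = [image_noise() for _ in range(4)][::-1]
assert all(np.array_equal(a, b) for a, b in zip(in_order, reversed_order))
```

Note the tradeoff mentioned above: with this scheme every image starts from the same random initialization.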

tbepler avatar Jul 03 '25 16:07 tbepler

Yes, that's exactly the intent. The variability isn't large per se: with particle masking I see about a 2%-5% difference depending on the detection technique and validation criteria. With a fixed seed the variation is eliminated completely. Appreciate the feedback!

swan343 avatar Jul 03 '25 17:07 swan343