Optimize MesaData.read_log_data() and MesaData.remove_backups()
Hi Bill,
I've been using mesa_reader to handle very large grids of stars (~1000 individual runs adding up to tens of GiB) for my work, and I was tempted to optimize the file loading/parsing procedure.
Using numpy.genfromtxt() (or numpy.loadtxt(), for that matter) hits known pitfalls on large files: it parses each line in Python and accumulates the records in lists before building the ndarray at the end, and leaving the data types unknown until runtime (as genfromtxt does) adds further overhead. I switched it to use pandas.read_csv(), which is substantially faster. I wrote a simple parser for the first data line to determine the data type of each column, so it should handle floats, ints, NaNs, and logicals just fine.
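To make the idea concrete, here is a minimal sketch of the approach (not the PR's actual code): it assumes the usual MESA history layout with column names on the sixth line and data starting on the seventh, sniffs dtypes from the first data row, and hands everything to pandas.read_csv() with explicit dtypes so no inference happens during the bulk parse. The function and parameter names here are illustrative.

```python
import numpy as np
import pandas as pd

def _infer_dtype(token):
    """Guess a column dtype from one whitespace-separated token."""
    if token in ("T", "F"):       # Fortran-style logicals in MESA output
        return bool
    try:
        int(token)
        return np.int64
    except ValueError:
        return np.float64         # plain floats, exponents, and 'NaN'

def read_history(file_name, names_line=5, first_data_line=6):
    """Illustrative reader assuming the standard MESA history layout
    (column names on 0-indexed line 5, data from line 6 onward)."""
    with open(file_name) as f:
        head = [next(f) for _ in range(first_data_line + 1)]
    names = head[names_line].split()
    first_row = head[first_data_line].split()
    dtypes = {n: _infer_dtype(t) for n, t in zip(names, first_row)}
    return pd.read_csv(
        file_name,
        sep=r"\s+",               # whitespace-delimited; keeps the fast C engine
        skiprows=first_data_line, # skip the header block entirely
        names=names,
        dtype=dtypes,
        true_values=["T"],        # map Fortran logicals onto bools
        false_values=["F"],
    )
```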
Similarly, using pandas.DataFrame.drop_duplicates() in the remove_backups method gives a modest speed increase.
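Roughly speaking, the idea looks like this (a sketch, not the PR's exact code, assuming the history data lives in a DataFrame with a model_number column): after a restart, MESA rewrites rows for model numbers it has already logged, so keeping only the last row written for each model_number discards the stale entries.

```python
import pandas as pd

def remove_backups(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the final row logged for each model_number,
    dropping rows superseded by a restart/backup."""
    return df.drop_duplicates(subset="model_number", keep="last").reset_index(drop=True)
```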
As far as my testing shows, the output is exactly the same as before, but I don't know how it would handle, say, an incomplete line (if you managed to open a log file mid-write).
Some simple profiling with a test grid (84 history files, ~600 MiB) shows a pretty good speed increase, especially if you have an SSD and are not limited by storage speed:
genfromtxt method (last commit):

```
         47863545 function calls (47838824 primitive calls) in 28.949 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.975   28.975 /home/duncan_m/Projects/sample_history/mesa_test.py:11(test)
       84    0.000    0.000   28.975    0.345 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:103(__init__)
       84    0.001    0.000   28.975    0.345 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:152(read_data)
--->   84    0.451    0.005   28.974    0.345 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:187(read_log_data)
...
--->   84    0.198    0.002    1.004    0.012 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:673(remove_backups)
```
read_csv method (this PR):

```
         3814731 function calls (3747761 primitive calls) in 6.821 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.821    6.821 /home/duncan_m/Projects/sample_history/mesa_test.py:11(test)
       84    0.000    0.000    6.821    0.081 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:105(__init__)
       84    0.003    0.000    6.821    0.081 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:182(read_data)
--->   84    0.007    0.000    6.817    0.081 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:217(read_log_data)
...
--->   84    0.121    0.001    0.187    0.002 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:714(remove_backups)
```
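For reference, the numbers above came from something along these lines (a rough reconstruction of my harness; the directory layout and glob pattern are specific to my setup):

```python
import cProfile
import pstats
from pathlib import Path

import mesa_reader as mr

def test():
    # Load every history file in the test grid once.
    for path in sorted(Path("sample_history").glob("**/history.data")):
        mr.MesaData(str(path))

cProfile.run("test()", "history.prof")
pstats.Stats("history.prof").sort_stats("cumulative").print_stats(10)
```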
For my particular test and system, the new code is roughly 4x faster (28.9 s down to 6.8 s). :)
However, this approach does add pandas as a dependency, which may not be desirable.