
Memory leak / high usage when reading dfsu many times

Open apwebber opened this issue 1 year ago • 4 comments

Describe the bug I have a series of large dfsu files. To keep memory requirements low, I'm reading them bit by bit using the time slicing functionality like so:

import mikeio
from itertools import batched  # Python 3.12+

def get_data_for_each_timestep(file: str, item: str, time_batch_n: int = 1):
    
    file_info = mikeio.open(file)
    times = file_info.time
    time_batches = list(batched(times, time_batch_n))
    
    for time_batch in time_batches:
        ds = mikeio.read(file, items=[item], time=time_batch)

This is a very stripped down version of the code and nothing is done to the data. However, memory constantly grows and grows, easily reaching 10+ GB in just a few iterations of this code. The problem is worse when reading a lower number of time steps, so there seems to be an overhead involved with mikeio.read(). Calling gc.collect() does help, but doesn't completely solve the problem.
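Since gc.collect() only partially helps, a useful first diagnostic is to check whether the growth is visible to Python at all. The sketch below (stdlib only; `read_batch` is a stand-in for a single mikeio.read call, not mikeio itself) tracks Python-level allocations per iteration with tracemalloc. If tracemalloc's numbers stay flat while the process RSS climbs, the memory is being held by native buffers (e.g. on the mikecore side) rather than by Python objects:

```python
import tracemalloc

def read_batch(i: int) -> list[float]:
    # Stand-in for one mikeio.read() call; just allocates a block of data.
    return [float(x) for x in range(100_000)]

tracemalloc.start()

retained = []  # simulate data that is (unintentionally) kept alive
for i in range(3):
    retained.append(read_batch(i))
    current, peak = tracemalloc.get_traced_memory()
    print(f"iteration {i}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")

tracemalloc.stop()
```

Comparing this output against the process RSS (e.g. from `ps` or Task Manager) separates "Python objects kept alive" from "native memory not returned to the OS".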

However, running the data fetch in a separate process solves the problem. Here, the memory associated with that process is freed up when the process is killed:

import mikeio
import multiprocessing
from itertools import batched  # Python 3.12+
from typing import Any

def get_ds_multiprocess(file: str, item: str, time_batch: list[Any], return_dict: dict):
    ds = mikeio.read(file, items=[item], time=time_batch)
    da = ds[item]
    
    return_dict['time'] = ds.time.copy() # not sure if copy() is necessary
    return_dict['data'] = da.values.copy() # not sure if copy() is necessary

def get_data_for_each_timestep(file: str, item: str, time_batch_n: int = 1):
    
    file_info = mikeio.open(file)
    times = file_info.time
    time_batches = list(batched(times, time_batch_n))
    
    for time_batch in time_batches:

        manager = multiprocessing.Manager()
        return_dict = manager.dict()
    
        p = multiprocessing.Process(target=get_ds_multiprocess, args=(file, item, time_batch, return_dict))
        p.start()
        p.join()
        
        t = return_dict['time']
        d = return_dict['data']

Using this example, memory usage does not grow.

To Reproduce See above

Expected behavior To be able to iteratively read the data from very large .dfsus without running out of memory


System information:

  • Python 3.12
  • MIKE IO version 2.5.0

apwebber avatar Apr 22 '25 17:04 apwebber

Thanks for making us aware of this. MIKE IO is a high-level abstraction on top of the lower-level mikecore library. MIKE IO has been optimized for ease of use, but not for performance in all cases. For optimal performance you might consider using the mikecore library directly.

ecomodeller avatar Apr 22 '25 17:04 ecomodeller

Take a look at this example https://github.com/DHI/MIKECore-Examples/blob/7baed71d032c9c5dd2e2eb69abee6a91521a250b/Examples/CSharp/ExamplesDfsu.cs#L282

The example is in C#, but the same approach can be used from Python in a similar way.

ecomodeller avatar Apr 23 '25 07:04 ecomodeller

Thanks, I will take a look at mikecore. I would add though that I'm not just looking for a performant solution, it's about being able to do anything at all with these files.

apwebber avatar Apr 23 '25 08:04 apwebber

There is indeed overhead involved in calling mikeio.read. You can make a tiny change to avoid reading the header (including the geometry) of the file on every iteration:

import mikeio
from itertools import batched  # Python 3.12+

def get_data_for_each_timestep(file: str, item: str, time_batch_n: int = 1):
    
    file_info = mikeio.open(file)
    times = file_info.time
    time_batches = list(batched(times, time_batch_n))
    
    for time_batch in time_batches:
        ds = file_info.read(items=[item], time=time_batch) # this line is changed

ecomodeller avatar Oct 28 '25 12:10 ecomodeller