
python-rasterstats: zonal_stats causes memory overflow in the loop

Open mouzui opened this issue 2 years ago • 7 comments

This problem has been bothering me for a long time. I start by reading a CSV containing the outlines of 10 million polygons, split it into 1,000 parts, construct the polygons one by one, and then run zonal_stats on each part.
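
Roughly, the loop looks like this (a simplified sketch; the file names, the geometry column, and the chunk size are placeholders, not my real ones):

import pandas as pd
from shapely import wkt
from rasterstats import zonal_stats

CSV_PATH = 'polygon_outlines.csv'   # placeholder; ~10 million polygon outlines (WKT assumed)
RASTER_PATH = 'my_raster.tif'       # placeholder raster
N_PARTS = 1000

df = pd.read_csv(CSV_PATH)
part_size = len(df) // N_PARTS + 1  # roughly 1,000 parts
parts = [df.iloc[i:i + part_size] for i in range(0, len(df), part_size)]

for part in parts:
    # create the polygons one by one from the stored outlines
    polygons = [wkt.loads(s) for s in part['geometry_wkt']]
    stats = zonal_stats(polygons, RASTER_PATH)  # memory grows here and is never released
    # ... store `stats`, then continue with the next part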

As you can see at line 68 of the screenshot, memory grows by a dozen or so MB during each zonal_stats call and is not released by the end of the loop body (it still shows 1694.4 MB when entering the next iteration, the same value as at the end of the current one).

My machine's memory cannot cope with this growth over 10,000 iterations, and the process gets killed (usage climbs past 140 GB). I tried splitting the data into 100 or 10,000 parts instead, but the problem still occurs and the process is still killed.

What is the reason for this?

[screenshot Snipaste_2023-03-02_19-21-40: memory usage at each line of the loop]

mouzui avatar Mar 02 '23 11:03 mouzui

Same problem - memory leak. Currently investigating.

MrChebur avatar Jul 24 '24 06:07 MrChebur

I can confirm the memory leak when zonal_stats is called in a for loop.

OS=Windows-10-10.0.19045-SP0 python=3.12.4 (main, Jun 10 2024, 12:48:35) [MSC v.1938 64 bit (AMD64)] gdal=3.9.1 numpy=1.26.4

I am attaching code for testing together with the necessary raster and vector data (see test.zip).

I would be grateful to the author of the library for any comment regarding this problem! @perrygeo

[screenshot: output of the test script below, showing memory growth across iterations]

import geopandas
import psutil  # not in the standard library; see https://pypi.org/project/psutil/
from rasterstats import zonal_stats


def find_process_by_name(name):
    """Return the first running process whose executable name matches `name`."""
    for pid in psutil.pids():
        try:
            process_ = psutil.Process(pid)
            if process_.name() == name:
                return process_
        except psutil.NoSuchProcess:
            continue  # the process exited between pids() and Process()
    return None


vector = r'.\shp\polygon.shp'
raster_path = r'.\raster\MOD10A1F.A2000058.h22v02.061.2020037194056.hdf'
process_name = 'python.exe'

process = find_process_by_name(process_name)
if process is None:
    raise RuntimeError(f'{process_name} not found!')

columns = ['iteration'] + list(process.memory_info()._fields)
print()
print('\t'.join(columns))

geo_data_frame = geopandas.read_file(vector)
geo_data_frame_geom = geo_data_frame['geometry']

for iteration in range(1, 1001):

    # Print memory information on the first and on every 100th iteration
    if iteration == 1 or iteration % 100 == 0:
        mem_info = process.memory_info()  # https://psutil.readthedocs.io/en/latest/#psutil.Process.memory_info
        mem_info_as_string = [str(value) for value in mem_info]
        mem_values = [str(iteration)] + mem_info_as_string
        print('\t'.join(mem_values))

    zonal_stats(vectors=geo_data_frame_geom,
                raster=fr"""HDF4_EOS:EOS_GRID:"{raster_path}":MOD_Grid_Snow_500m:CGF_NDSI_Snow_Cover""",
                categorical=True,
                all_touched=True,
                )
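
Until the root cause is found, one generic workaround (not a fix) is to run each zonal_stats call in a short-lived worker process, so the operating system reclaims all of its memory when the worker exits. A rough sketch along the lines of the test script above; the helper names here are made up:

from multiprocessing import get_context

from rasterstats import zonal_stats

VECTOR = r'.\shp\polygon.shp'
RASTER = (r'HDF4_EOS:EOS_GRID:".\raster\MOD10A1F.A2000058.h22v02.061.2020037194056.hdf"'
          r':MOD_Grid_Snow_500m:CGF_NDSI_Snow_Cover')


def _run(vector_path):
    # Runs inside a short-lived child process; whatever memory is not released
    # here is reclaimed by the OS when the child exits.
    return zonal_stats(vector_path, RASTER, categorical=True, all_touched=True)


def zonal_stats_isolated(vector_path):
    # Start a fresh interpreter for every call and tear it down afterwards.
    ctx = get_context('spawn')
    with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.apply(_run, (vector_path,))


if __name__ == '__main__':
    for iteration in range(1, 1001):
        stats = zonal_stats_isolated(VECTOR)

Spawning a worker per call adds overhead, but the parent process stays flat regardless of what leaks inside the worker.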

MrChebur avatar Jul 24 '24 10:07 MrChebur

I can confirm the memory leak when zonal_stats is called in a for loop.

The author hasn't updated this package for a long time, which is very unfortunate. I am currently using QGIS as a replacement for this package.

mouzui avatar Jul 24 '24 12:07 mouzui

@mouzui

I am currently using QGIS as a replacement for this package.

And I, on the contrary, moved to this package to replace QGIS, thinking it would be faster and easier to use. =)

MrChebur avatar Jul 24 '24 15:07 MrChebur

And I, on the contrary, moved to this package to replace QGIS, thinking it would be faster and easier to use. =)

My results show that zonal statistics in PyQGIS run more than 30 times faster than in rasterstats. After all the effort I put into installing PyQGIS, this is the most gratifying and surprising part. Of course, the QGIS application itself is not that fast, but its speed is still impressive. The best part is that QGIS does not have memory leaks, which lets me confidently do other things while the zonal statistics are running.
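
For reference, running zonal statistics from a standalone PyQGIS script can look roughly like this (a sketch, not the exact code I use; the prefix path, file names, attribute prefix, and chosen statistics are placeholders, and the details depend on the QGIS install):

from qgis.core import QgsApplication, QgsVectorLayer, QgsRasterLayer
from qgis.analysis import QgsZonalStatistics

# Standalone setup; inside the QGIS Python console this part is not needed.
QgsApplication.setPrefixPath('/usr', True)  # placeholder, depends on the install
qgs = QgsApplication([], False)
qgs.initQgis()

polygons = QgsVectorLayer('polygons.shp', 'polygons', 'ogr')
raster = QgsRasterLayer('raster.tif', 'raster')

# Writes zs_count / zs_mean attributes onto the polygon layer.
zonal = QgsZonalStatistics(
    polygons,
    raster,
    'zs_',  # attribute prefix
    1,      # raster band
    QgsZonalStatistics.Count | QgsZonalStatistics.Mean,
)
zonal.calculateStatistics(None)

qgs.exitQgis()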

mouzui avatar Jul 24 '24 15:07 mouzui

The best part is that QGIS does not have memory leaks

Unfortunately that is not quite true, at least for the QGIS application (Windows 10): https://github.com/qgis/QGIS/issues/37861

MrChebur avatar Jul 24 '24 15:07 MrChebur

Unfortunately that is not quite true, at least for the QGIS application (Windows 10): qgis/QGIS#37861

Alright. My process runs zonal statistics for ten million polygons against a single raster file. It is possible that loading too many raster files is what triggers the memory leak in QGIS.

mouzui avatar Jul 24 '24 15:07 mouzui