Excessive memory consumption during SIA-PA download

Open heber-augusto opened this issue 4 years ago • 10 comments

There are some SIA-PA files which are huge, and it seems that converting them from dbc to parquet consumes too much memory. Although the following example is somewhat related to issue 27 (which occurs when a given month has a lot of data), to reproduce the error I downloaded the file manually from the DATASUS FTP and called only the function that is failing.

```python
from pysus.utilities.readdbc import dbc2dbf, read_dbc

infile = 'PASP2003a.dbc'
outfile = 'PASP2003a.dbf'

dbc2dbf(infile, outfile)

# the exception is raised by the line below
df = read_dbc(infile)
```

I was using Google Colab and even subscribed to the Pro version for a month to increase the memory limit, but even so, it wasn't enough.

I don't know exactly how to solve this for DBF files. With CSVs and pandas I usually read the file in chunks to avoid excessive memory consumption, as in the sketch below.
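
For reference, this is the chunked pattern I mean with pandas and CSV (the file name here is just illustrative):

```python
import pandas as pd

# Reading in chunks keeps memory bounded: each iteration holds only
# `chunksize` rows instead of the whole file.
for i, chunk in enumerate(pd.read_csv('PASP2003a.csv', chunksize=100_000)):
    chunk.to_parquet(f'PASP2003a_part{i}.parquet')
```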

heber-augusto avatar Jan 05 '22 16:01 heber-augusto

Thanks for reporting this bug, @heber-augusto. We already apply a chunked solution to the COVID vaccination data.

The solution shouldn't be hard to implement.
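
Roughly, the idea is to stream the DBF record by record and flush fixed-size chunks to Parquet. A minimal sketch, assuming the dbfread and pyarrow libraries (this is an illustration, not the actual PySUS code):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from dbfread import DBF

def dbf_to_parquet_chunked(dbf_path, parquet_path, chunk_size=30_000):
    """Stream a DBF file into Parquet without loading it all in memory."""
    writer = None
    buffer = []
    for record in DBF(dbf_path, encoding='iso-8859-1'):
        buffer.append(dict(record))
        if len(buffer) >= chunk_size:
            table = pa.Table.from_pandas(pd.DataFrame(buffer))
            if writer is None:
                writer = pq.ParquetWriter(parquet_path, table.schema)
            writer.write_table(table)
            buffer = []
    if buffer:  # flush the final partial chunk
        table = pa.Table.from_pandas(pd.DataFrame(buffer))
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
```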

fccoelho avatar Jan 09 '22 20:01 fccoelho

Datasus has now decided to split large files into <filename>a.dbc, <filename>b.dbc, etc. The download function needs to be adapted to look for these variant names.
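
Something like this lookup should work, assuming an open ftplib connection (find_variants is an illustrative helper, not the current API):

```python
import string
from ftplib import FTP

def find_variants(ftp: FTP, base_name: str) -> list:
    """Return <base>.dbc plus any a/b/c... split parts present on the server."""
    listing = set(ftp.nlst())
    candidates = [f"{base_name}.dbc"]
    candidates += [f"{base_name}{suffix}.dbc" for suffix in string.ascii_lowercase]
    return [name for name in candidates if name in listing]
```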

fccoelho avatar Feb 01 '22 12:02 fccoelho

Hello @fccoelho, I wrote a possible solution for reading split files inside issue 27, but I think the excessive memory consumption is an issue that exists independently of that. I can help with the download function (issue 27).

I tried contacting the Datasus team to check whether they have a solution to avoid the excessive memory consumption, but I have not had an answer yet. :(

heber-augusto avatar Feb 01 '22 23:02 heber-augusto

Hi @heber-augusto, I have just opened a PR to fix this situation.

It also allows PySUS to automatically handle downloading of the a, b, c, … parts of large files when necessary.

As soon as it passes the review and gets merged, I'll make a new release of PySUS. There is also new documentation showing how to handle these large SIA files.
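
Once the a, b, c, … parts are downloaded, they can be concatenated back into a single dataframe; a minimal sketch, assuming each part was saved as Parquet (file names are illustrative):

```python
import glob

import pandas as pd

# Concatenate the split parts (e.g. PASP2003a, PASP2003b, ...) in order.
parts = sorted(glob.glob('PASP2003*.parquet'))
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
```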

fccoelho avatar Feb 02 '22 08:02 fccoelho

Hi @heber-augusto, my PR solves issue #27 as well! Take a look at it.

fccoelho avatar Feb 02 '22 08:02 fccoelho

Thanks a lot @fccoelho!

heber-augusto avatar Feb 02 '22 23:02 heber-augusto

@fccoelho, I did a review and left some comments on the PR.

heber-augusto avatar Feb 03 '22 00:02 heber-augusto

Hello @fccoelho, how are you? I was testing the new version and discovered that the download of a single file (when the file is not split) is broken.

You can reproduce the error with the following code:

```python
from pysus.online_data.SIA import download as download_sia

df_pa = download_sia('ES', 2020, 3, cache=True, group=['PA'])
```

The exception message is "unpack requires a buffer of 32 bytes"

I am creating a new repo that uses PySUS to help with downloading files (forgive me if I am not referencing it correctly yet; I will fix that soon), and I solved the problem with some customization. I found two possible bugs here:

  1. Line 151 of SIA.py tries to create a DBF file, but there is no code to download the DBC file first (ftp.retrbinary() and dbc2dbf() are never called).
  2. I think that line 117 of SIA.py should be changed to if cache and (df is not None). Depending on the Python version, the current code may raise an exception (see the sketch after this list).
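
For what it's worth, a minimal sketch of both points, assuming an open ftplib connection (fetch_single_dbc is a hypothetical helper, not the actual PySUS code):

```python
from ftplib import FTP

import pandas as pd

from pysus.utilities.readdbc import dbc2dbf

def fetch_single_dbc(ftp: FTP, fname: str) -> str:
    """Bug 1: download the .dbc via FTP before trying to convert it."""
    with open(fname, 'wb') as f:
        ftp.retrbinary(f'RETR {fname}', f.write)
    dbf_name = fname.replace('.dbc', '.dbf')
    dbc2dbf(fname, dbf_name)
    return dbf_name

# Bug 2: `if cache and df:` raises "The truth value of a DataFrame is
# ambiguous" when df holds more than one row; the explicit None check
# is always safe.
df = pd.DataFrame({'a': [1, 2]})
cache = True
if cache and (df is not None):
    print('using cached dataframe')
```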

You can see how I solved the single-file download inside this file; just take a look at _fetch_file(), download_single_file(), and download_multiples(). Please note that there are some customizations there.

heber-augusto avatar Mar 13 '22 23:03 heber-augusto

Thanks for reporting this bug, @heber-augusto. I'll take a look when I have some time.

fccoelho avatar Mar 14 '22 07:03 fccoelho