Excessive memory consumption during SIA-PA download
There are some SIA-PA files which are huge, and it seems that converting them from DBC to Parquet consumes too much memory. Although the following example is somewhat related to issue 27 (which occurs when a given month has a lot of data), to simulate the error I downloaded the file manually from the SUS FTP server and called only the function that is failing.
```python
from pysus.utilities.readdbc import dbc2dbf, read_dbc

infile = 'PASP2003a.dbc'
outfile = 'PASP2003a.dbf'

dbc2dbf(infile, outfile)
# the exception is raised by the line below
df = read_dbc(infile)
```
I was using Google Colab and even subscribed to the Pro version for a month to increase the memory limit, but even that wasn't enough.
I don't know exactly how to solve this for the DBF file. With CSVs and pandas I usually read the file in chunks to avoid excessive memory consumption.
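For reference, the chunked pattern I mean for CSVs with pandas looks roughly like this (the file name, chunk size and filter column are only placeholders):

```python
import pandas as pd

# Read the CSV in fixed-size chunks so only one chunk is held in memory at a time.
kept = []
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    # Reduce each chunk (filter/aggregate) before accumulating it.
    kept.append(chunk[chunk['PA_UFMUN'] == '320530'])

df = pd.concat(kept, ignore_index=True)
```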
Thanks for reporting this bug, @heber-augusto. We apply a chunked solution to the COVID vaccination data, so the fix shouldn't be hard to implement here as well.
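The idea is to stream the records and write Parquet in batches instead of materializing the whole table. A minimal sketch of that pattern (not the exact code used for the vaccination data; `dbfread`/`pyarrow`, the encoding and the chunk size are assumptions):

```python
from dbfread import DBF
import pyarrow as pa
import pyarrow.parquet as pq

def _flush(batch, writer, parquet_path):
    """Write one batch of records; the first batch defines the Parquet schema."""
    table = pa.Table.from_pylist(batch)
    if writer is None:
        writer = pq.ParquetWriter(parquet_path, table.schema)
    # Assumes column types stay consistent across batches.
    writer.write_table(table)
    return writer

def dbf_to_parquet_chunked(dbf_path, parquet_path, chunk_size=100_000):
    """Convert a DBF file to Parquet without loading all records into memory."""
    writer, batch = None, []
    for record in DBF(dbf_path, encoding='iso-8859-1'):
        batch.append(dict(record))
        if len(batch) >= chunk_size:
            writer = _flush(batch, writer, parquet_path)
            batch = []
    if batch:
        writer = _flush(batch, writer, parquet_path)
    if writer is not None:
        writer.close()
```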
DATASUS has now decided to split large files into `<filename>a.dbc`, `<filename>b.dbc`, etc. The download function needs to be adapted to look for these variant names.
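Something like this could be used to discover which parts exist before downloading (a sketch only; the host is the public DATASUS FTP, but the directory handling and naming here are assumptions):

```python
import string
from ftplib import FTP

def find_split_parts(directory, stem):
    """Return the <stem>a.dbc, <stem>b.dbc, ... files that actually exist on the FTP server."""
    ftp = FTP('ftp.datasus.gov.br')
    ftp.login()
    ftp.cwd(directory)
    available = {name.upper() for name in ftp.nlst()}
    ftp.quit()
    return [f'{stem}{letter}.dbc' for letter in string.ascii_lowercase
            if f'{stem}{letter}.dbc'.upper() in available]
```

For example, `find_split_parts('/dissemin/publicos/SIASUS/200801_/Dados', 'PASP2003')` would return only the parts that are present (the directory path here is just an example).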
Hello @fccoelho, I wrote a possible solution to read the split files in issue 27, but I think the excessive memory consumption is an issue that exists regardless of that. I can help with the download function (issue 27).
I tried to contact the DATASUS team to check whether they have a solution to avoid the excessive memory consumption, but I have not received an answer yet. :(
Hi @heber-augusto, I have just opened a PR to fix this situation.
It also allows PySUS to automatically handle downloading of the a, b, c, … parts of large files when necessary.
As soon as it passes the review and gets merged, I'll make a new release of PySUS. There is also new documentation showing how to handle these large SIA files.
Hi @heber-augusto, my PR solves issue #27 as well! Take a look at it.
Thanks a lot @fccoelho!
@fccoelho, I did a review and left some comments on the PR.
Hello @fccoelho, how are you? I was testing the new version and discovered that the download of a single file (when the file is not split) is broken.
You can reproduce the error with the following code:
```python
from pysus.online_data.SIA import download as download_sia

df_pa = download_sia('ES', 2020, 3, cache=True, group=['PA'])
```
The exception message is "unpack requires a buffer of 32 bytes"
I am creating a new repo that uses PySUS to help with downloading files (forgive me if I am not using the correct reference yet; I will fix that soon), and I solved the problem with some customization. I found two possible bugs here:
- Line 151 of SIA.py tries to create a DBF file, but there is no code to download the DBC file first (no calls to `ftp.retrbinary()` and `dbc2dbf()`); see the sketch after this list.
- I think line 117 of SIA.py should be changed to `if cache and (df is not None)`. Depending on the environment, the current check may raise an exception (a short demonstration is below).
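For the first point, the missing step is roughly the following (a sketch with assumed paths and names, not the actual code from my repo):

```python
from ftplib import FTP
from pysus.utilities.readdbc import dbc2dbf

def fetch_and_convert(directory, fname, local_dir='/tmp'):
    """Download a .dbc file from the DATASUS FTP and convert it to .dbf."""
    dbc_path = f'{local_dir}/{fname}'
    ftp = FTP('ftp.datasus.gov.br')
    ftp.login()
    ftp.cwd(directory)
    with open(dbc_path, 'wb') as fh:
        ftp.retrbinary(f'RETR {fname}', fh.write)
    ftp.quit()
    dbf_path = dbc_path.replace('.dbc', '.dbf')
    dbc2dbf(dbc_path, dbf_path)
    return dbf_path
```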
You can see how I solved the single-file download in this file; just take a look at `_fetch_file()`, `download_single_file()`, and `download_multiples()`. Please note that there are some customizations there.
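On the second point, the reason for the explicit `is not None` check is that a pandas DataFrame has no unambiguous truth value, so using it directly in a boolean context raises an error:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

try:
    if df:  # raises "The truth value of a DataFrame is ambiguous..."
        pass
except ValueError as err:
    print(err)

# The explicit comparison never hits that ambiguity:
if df is not None:
    print('safe check')
```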
Thanks for reporting this bug, @heber-augusto. I'll take a look when I have some time.