
`Zlib::GzipReader` doesn't read some large files

Open inkstak opened this issue 3 years ago • 0 comments

Hi. I have to inflate a .csv.gz file which should produce a 4 GB CSV with 25 million rows.

When I use an app or the gzip command line, I get the full file without issue. When I use Zlib::GzipReader, only the first row is returned.

> Zlib::GzipReader.open("adresses-france.csv.gz") { |gz|  print gz.read }
id;id_fantoir;numero;rep;nom_voie;code_postal;code_insee;nom_commune;code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;type_position;alias;nom_ld;libelle_acheminement;nom_afnor;source_position;source_nom_voie;certification_commune;cad_parcelles
 => nil

The file is provided by the French government:

  • the directory: https://adresse.data.gouv.fr/data/ban/adresses/latest/csv
  • the file: https://adresse.data.gouv.fr/data/ban/adresses/latest/csv/adresses-france.csv.gz

There are many other files in the directory (one for each region), but I cannot reproduce the issue with any of them.

The service also provides a similar file in Addok format (https://adresse.data.gouv.fr/data/ban/adresses/latest/addok/adresses-addok-france.ndjson.gz), which should inflate to a 3 GB file with 2 million rows, but Zlib::GzipReader returns only the first 25k rows.

Is there a limit to what Zlib can support (size, rows, ...)? Or does the problem come from the compressed file itself?
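One possible explanation, which is an assumption and not confirmed for these particular files, is that the archive is a concatenation of several gzip members. The `gzip` command line decompresses all members, whereas a single `Zlib::GzipReader` stops at the end of the first one. The sketch below builds a small two-member stream in memory (the sample data and the helper names `gzip_member` and `read_all_members` are made up for illustration) and reads it both ways, using `GzipReader#unused` to find where the next member starts.

```ruby
require "zlib"
require "stringio"

# Compress one string into a standalone gzip member (hypothetical helper).
def gzip_member(text)
  io = StringIO.new
  gz = Zlib::GzipWriter.new(io)
  gz.write(text)
  gz.finish # write the gzip footer without closing the StringIO
  io.string
end

# Two members back to back, as `cat a.gz b.gz > multi.gz` would produce.
multi = gzip_member("header\n") + gzip_member("row1\nrow2\n")

# A single reader returns only the first member's data.
reader = Zlib::GzipReader.new(StringIO.new(multi))
first_only = reader.read
reader.finish

# Read every member by starting a new reader at the leftover bytes
# (hypothetical helper; loads everything into memory, fine for a demo only).
def read_all_members(io)
  out = String.new
  until io.eof?
    gz = Zlib::GzipReader.new(io)
    out << gz.read
    unused = gz.unused # bytes consumed past the end of this member, or nil
    gz.finish          # release the reader but keep io open
    io.pos -= unused.bytesize if unused
  end
  out
end

all_members = read_all_members(StringIO.new(multi))

puts first_only.inspect
puts all_members.inspect
```

For a multi-gigabyte file, the same loop would need to process each member in chunks or line by line instead of accumulating one string. Recent versions of the zlib gem also document `Zlib::GzipReader.zcat` as handling multiple gzip streams, which may be worth checking against the installed version.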

inkstak, Nov 17 '22 09:11