suppdata (or unzip?) error: zip file is corrupt

Open AlbanSagouis opened this issue 5 years ago • 8 comments

Hi, while using MADtraits::MADtraits to download datasets, I ran into a suppdata::suppdata error:

> unzip(suppdata::suppdata("10.1002/ece3.1456", 1))
Warning message:
In unzip(suppdata::suppdata("10.1002/ece3.1456", 1)) : zip file is corrupt
  • I tried opening the archive from suppdata cache but 7zip confirms it can't be opened.
  • I tried opening other supplementary files in the same format that suppdata just downloaded and could open them.
  • I went to the journal website (https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.1456), downloaded the supplementary file and opened it without problem.
  • If the corrupt archive is replaced by the good one manually downloaded from the site, unzip() does not throw an error.

I don't know what causes the error.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] drake_7.12.5    MADtraits_1.0-0

loaded via a namespace (and not attached):
  [1] nlme_3.1-149      fs_1.5.0          bold_1.1.0        usethis_1.6.3    
  [5] lubridate_1.7.9   devtools_2.3.2    progress_1.2.2    filelock_1.0.2   
  [9] httr_1.4.2        rprojroot_1.3-2   tools_4.0.2       backports_1.1.10 
 [13] R6_2.4.1          DT_0.15           withr_2.3.0       tidyselect_1.1.0 
 [17] prettyunits_1.1.1 processx_3.4.4    curl_4.3          compiler_4.0.2   
 [21] cli_2.0.2         xml2_1.3.2        desc_1.2.0        triebeard_0.3.0  
 [25] mvtnorm_1.1-1     callr_3.4.4       handlr_0.2.0      convertr_0.1     
 [29] stringr_1.4.0     digest_0.6.25     txtq_0.2.3        rmarkdown_2.4    
 [33] pkgconfig_2.0.3   htmltools_0.5.0   bibtex_0.4.2.3    sessioninfo_1.1.1
 [37] fastmap_1.0.1     htmlwidgets_1.5.2 rlang_0.4.7       readxl_1.3.1     
 [41] rstudioapi_0.11   httpcode_0.3.0    shiny_1.5.0       generics_0.0.2   
 [45] zoo_1.8-8         jsonlite_1.7.1    gtools_3.8.2      dplyr_1.0.2      
 [49] magrittr_1.5      Rcpp_1.0.5        fansi_0.4.1       ape_5.4-1        
 [53] RefManageR_1.2.12 lifecycle_0.2.0   stringi_1.5.3     yaml_2.2.1       
 [57] storr_1.2.1       MASS_7.3-53       pkgbuild_1.1.0    plyr_1.8.6       
 [61] grid_4.0.2        parallel_4.0.2    gdata_2.18.0      promises_1.1.1   
 [65] crayon_1.3.4      miniUI_0.1.1.1    lattice_0.20-41   conditionz_0.1.0 
 [69] hms_0.5.3         knitr_1.30        ps_1.3.4          pillar_1.4.6     
 [73] uuid_0.1-4        taxize_0.9.98     igraph_1.2.5      caper_1.0.1      
 [77] base64url_1.4     codetools_0.2-16  reshape2_1.4.4    pkgload_1.1.0    
 [81] crul_1.0.0        glue_1.4.2        rcrossref_1.1.0   evaluate_0.14    
 [85] data.table_1.13.0 remotes_2.2.0     renv_0.12.0       foreach_1.5.0    
 [89] vctrs_0.3.4       httpuv_1.5.4      urltools_1.7.3    testthat_2.3.2   
 [93] cellranger_1.1.0  purrr_0.3.4       tidyr_1.1.2       reshape_0.8.8    
 [97] assertthat_0.2.1  xfun_0.18         mime_0.9          xtable_1.8-4     
[101] later_1.1.0.1     tibble_3.0.3      iterators_1.0.12  suppdata_1.1-4   
[105] tinytex_0.26      memoise_1.1.0     ellipsis_0.3.1

AlbanSagouis avatar Oct 07 '20 06:10 AlbanSagouis

Thanks for posting this, Alban, and also for transferring this issue over to suppdata. I can't reproduce this behaviour on my machine, and so I think the problem is with your unzip call on your machine:

> list.files()
character(0)
> unzip(suppdata("10.1002/ece3.1456", 1, dir="~/Desktop/demo/"))
> list.files()
[1] "10.1002_ece3.1456_1"              "ece31456-sup-0001-suppl_data.zip"

...I know that unzip can function a bit differently on Windows machines (like yours) and Linux machines (like mine), so I wonder if this is perhaps what is going on.

Hopefully the above helps; if not let me know.

Will

willpearse avatar Oct 07 '20 08:10 willpearse

Thanks for the answer Will.

So I tried unzip(suppdata("10.1002/ece3.1456", 1)) on a Linux virtual machine and it works as expected, so the operating system does indeed seem to be a critical aspect of the issue.

But if I download the supplementary from my Windows machine

suppdata("10.1002/ece3.1456", 1, dir = '~/VirtualBox Shared folders/temp')

and try to unzip it from the Linux machine

> unzip('/media/sf_VirtualBox_Shared_folders/temp/10.1002_ece3.1456_1')
Warning message:
In unzip("/media/sf_VirtualBox_Shared_folders/temp/10.1002_ece3.1456_1") :
  zip file is corrupt

it fails again.

So I would say the issue arises before unzip() is called: either Windows is doing something odd to that specific file, perhaps because of its extension (or lack of one), or suppdata's behavior is affected by Windows.

To check the name.extension idea, I changed the name of the downloaded archive

suppdata::suppdata("10.1002/ece3.1456", 1, save.name = 'test.zip', dir = '~/VirtualBox Shared folders/temp')

but unzip still fails under Windows and under Linux.
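
One way to check the idea that the file is already damaged before unzip() is called is to compare the copy saved by suppdata with the copy downloaded manually from the journal website. The following is a minimal sketch only: `same_file()` is a hypothetical helper (not part of suppdata), and the two temporary files merely stand in for the two downloads.

```r
# Hypothetical helper: TRUE when two files have the same size and MD5 checksum.
same_file <- function(a, b) {
  stopifnot(file.exists(a), file.exists(b))
  file.size(a) == file.size(b) &&
    unname(tools::md5sum(a)) == unname(tools::md5sum(b))
}

# Stand-ins for the manually downloaded archive and the suppdata copy.
good <- tempfile()
bad <- tempfile()
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x00)), good)
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x00, 0x00)), bad)  # one extra byte

same_file(good, good)  # identical content
same_file(good, bad)   # content differs
```

If the real files differ in size or checksum, the archive was changed during download rather than by unzip().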

Alban

AlbanSagouis avatar Oct 07 '20 15:10 AlbanSagouis

Thanks for this. It sounds like we're in agreement this is a problem related to (potentially your) setup of Windows and unzipping, because the code works fine on Linux (on the same computer) and we agree that the file is being downloaded.

Would you mind humouring me and trying one last thing? The URL for the file you're downloading is https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fece3.1456&file=ECE31456-sup-0001-suppl_data.zip. Would you mind running:

# Download the archive straight from the publisher, without suppdata
temp.file <- tempfile()
download.file("https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fece3.1456&file=ECE31456-sup-0001-suppl_data.zip", temp.file)
# Extract into the current working directory
unzip(temp.file)

...and seeing if that works? This bypasses suppdata entirely.

willpearse avatar Oct 07 '20 20:10 willpearse

Well, it works with download.file().

> temp.file <- tempfile()
> download.file("https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fece3.1456&file=ECE31456-sup-0001-suppl_data.zip",
+               temp.file)
trying URL 'https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fece3.1456&file=ECE31456-sup-0001-suppl_data.zip'
Content type 'application/zip; charset=UTF-8' length 391767 bytes (382 KB)
downloaded 382 KB

> is.data.frame(readxl::read_xls(unzip(temp.file)[2], 2, skip = 6))
[1] TRUE
> 
> 
> suppdata::suppdata("10.1002/ece3.1456", 1, dir = tempdir(), save.name = 'tst')
[1] "C:\\Users\\as80fywe\\AppData\\Local\\Temp\\Rtmp00xEbx/tst"
attr(,"suffix")
[1] "suppl"
> unzip(paste0(tempdir(), '/tst'))
Warning message:
In unzip(paste0(tempdir(), "/tst")) : zip file is corrupt
> is.data.frame(readxl::read_xls(unzip(paste0(tempdir(), '/tst'))[2], 2, skip = 6))
Error: `path` does not exist: ‘NA’
In addition: Warning message:
In unzip(paste0(tempdir(), "/tst")) : zip file is corrupt

I tried in R, outside of any R project or renv library, with both CRAN and GitHub versions of suppdata and the error remains. I'll look into suppdata.
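
The `path` does not exist: ‘NA’ error above presumably arises because unzip() only warns on a bad archive and returns no file names, so indexing the result with [2] gives NA. A minimal sketch of a stricter wrapper follows; `unzip_strict()` is a hypothetical name and is not part of suppdata or MADtraits.

```r
# Hypothetical wrapper: turn an unzip() warning or an empty result into an error.
unzip_strict <- function(zipfile, exdir = tempfile("unzipped_")) {
  files <- tryCatch(
    utils::unzip(zipfile, exdir = exdir),
    warning = function(w) {
      stop("unzip failed for ", zipfile, ": ", conditionMessage(w), call. = FALSE)
    }
  )
  if (length(files) == 0L) {
    stop("no files extracted from ", zipfile, call. = FALSE)
  }
  files
}

# Offline demo with a file that is not a zip archive.
not_zip <- tempfile()
writeLines("plain text", not_zip)
failed <- inherits(try(unzip_strict(not_zip), silent = TRUE), "try-error")
failed
```

With a wrapper like this the failure is reported at the unzip step instead of surfacing later inside read_xls().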

Alban

On a side note, the .paquette.2015 function in MADtraits has 2 unzip() calls.

    data <- as.data.frame(read_xls(unzip(unzip(suppdata("10.1002/ece3.1456", 1)))[2], sheet=2, na=c("","NA")))

AlbanSagouis avatar Oct 08 '20 08:10 AlbanSagouis

Thanks for this; it has really helped me. I think this is an edge case where the publisher hasn't named the file with a .zip extension, which means that we don't detect it as a zip file and so don't switch to binary mode when downloading on Windows.
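
For context, the download.file() documentation notes that on Windows a text-mode transfer rewrites \n line endings as \r\n, which damages binary files such as zip archives, and that binary mode is selected automatically only when the URL ends in a known extension such as .zip. The sketch below illustrates the general idea and is not the code in the winzip branch: `download_binary()` and `looks_like_zip()` are hypothetical helpers, and checking the leading signature bytes is just one assumed way to recognise an archive without relying on its name.

```r
# Hypothetical helper (not part of suppdata): always download in binary mode.
download_binary <- function(url, destfile) {
  utils::download.file(url, destfile, mode = "wb", quiet = TRUE)
  invisible(destfile)
}

# Hypothetical helper: does the file start with the local file header
# signature ("PK\003\004") that begins a typical non-empty zip archive?
looks_like_zip <- function(path) {
  sig <- readBin(path, what = "raw", n = 4L)
  identical(sig, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Offline demo: a file carrying the zip signature, and a plain text file.
demo <- tempfile()
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)), demo)
looks_like_zip(demo)

txt <- tempfile()
writeLines("not an archive", txt)
looks_like_zip(txt)
```

Passing mode = "wb" explicitly avoids depending on the file name or URL at all, which is why it is a common choice for any non-text download on Windows.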

I don't have a Windows box to test this on right now, but I hope I have pushed something up now that can force this through. Would you mind trying the following in a fresh session:

library(devtools)
install_github("ropensci/suppdata", ref="winzip")
library(suppdata)
unzip(suppdata("10.1002/ece3.1456", 1))

...if that works for you then I'll clean up and merge it into the master branch. If it doesn't then I'll leave you alone, figure out a better fix, and then merge it anyway.

Thanks for flagging this.

willpearse avatar Oct 08 '20 09:10 willpearse

The fix did not solve the issue for me, but I'll keep it in mind for future uses of suppdata.

And thanks for putting in the effort to solve it.

Alban

AlbanSagouis avatar Oct 08 '20 12:10 AlbanSagouis

Oh wait...

> unzip(suppdata::suppdata("10.1002/ece3.1456", 1, zip = TRUE))

works!

This is on a Windows machine, using the fix from your winzip branch.

AlbanSagouis avatar Oct 08 '20 13:10 AlbanSagouis

Ah, that's wonderful news, thank you very much! Thanks for bearing with me while we got this fixed.

willpearse avatar Oct 08 '20 13:10 willpearse