`meta_cache_list()` - empty `published` column for recent packages

Open pawelru opened this issue 1 year ago • 3 comments

Sys.Date()
#> [1] "2024-04-19"

library(pkgcache)
meta_cache_update()
#> 
#> ℹ Updating metadata database
#> ✔ Updating metadata database ... done
#> 
max(meta_cache_list()$published, na.rm = TRUE) # note the diff from today!
#> [1] "2024-04-09 16:50:05 GMT"

# an example - package `tensorflow` released on 15th of Apr
library(rvest)
read_html("https://cran.r-project.org/web/packages/tensorflow/index.html") |> 
    html_element("table") |> 
    html_table() |> 
    head(x = _, 5)
#> # A tibble: 5 × 2
#>   X1         X2                                                                 
#>   <chr>      <chr>                                                              
#> 1 Version:   "2.16.0"                                                           
#> 2 Depends:   "R (≥ 3.6)"                                                        
#> 3 Imports:   "config, processx, reticulate (≥ 1.32), tfruns (≥ 1.0), utils, yam…
#> 4 Suggests:  "testthat (≥ 2.1.0), keras3, pillar, withr, callr"                 
#> 5 Published: "2024-04-15"

meta_cache_list(packages = "tensorflow")[, c("package", "version", "published")]
#> # A data frame: 2 × 3
#>   package    version published
#> * <chr>      <chr>   <dttm>   
#> 1 tensorflow 2.16.0  NA       
#> 2 tensorflow 2.16.0  NA

Created on 2024-04-19 with reprex v2.1.0

Is this a bug? What can I do to force an update of the cache? I am analysing CRAN data, and the release / publish date is one of my inputs.

pawelru, Apr 19 '24 10:04

That column is from metadata that is not on CRAN and we need to collect it separately. Unfortunately I had to shut down the infrastructure that collects it, so it hasn't been updated for a couple of days.

The metadata itself is now here: https://github.com/r-hub/cran-metadata/tree/gh-pages but until I write the code that updates it, it won't be updated. The old update code used a local CRAN mirror, which I don't have any more, so we need a completely new way of updating.

The published field is actually easy, so maybe I'll do that first. The hard ones are the hashes: for those I need to download the package files, and Windows binaries are rebuilt all the time, so that's potentially a lot of downloads.

Anyway, I wasn't aware of any use for that metadata, apart from pak printing the file sizes, so opening this issue was a good idea.

gaborcsardi, Apr 19 '24 10:04

Thanks @gaborcsardi for the prompt reply. I'll have a look at what you linked and consider it as an alternative to rvest-ing this from the CRAN web page. I'm definitely looking forward to having this back; the pkgcache API is so convenient for my use case. As for me, I don't use the hashes at all, so if they are the biggest piece of work, they can definitely be postponed.

pawelru, Apr 19 '24 10:04

No need to scrape this field, you can also do something like

db <- tools::CRAN_package_db()
db$`Date/Publication`
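
As a minimal sketch of how that column could be turned into something comparable to the `published` column of `meta_cache_list()`: the helper name `cran_published()` is hypothetical, and the sketch assumes that `Date/Publication` values look like `YYYY-MM-DD HH:MM:SS UTC` and that the data frame has `Package` and `Version` columns, as the result of `tools::CRAN_package_db()` normally does.

```r
# Hypothetical helper (not part of pkgcache or tools): builds a small
# data frame with a parsed publication time for each package row.
cran_published <- function(db) {
  data.frame(
    package = db[["Package"]],
    version = db[["Version"]],
    # Assumption: timestamps are in UTC. Anything after the seconds
    # (such as a trailing " UTC") is ignored by the format string.
    published = as.POSIXct(
      db[["Date/Publication"]],
      format = "%Y-%m-%d %H:%M:%S",
      tz = "UTC"
    ),
    stringsAsFactors = FALSE
  )
}

# Usage (needs network access, so it is left commented out here):
# db <- tools::CRAN_package_db()
# pub <- cran_published(db)
# pub[pub$package == "tensorflow", ]
```

Rows whose `Date/Publication` is missing or does not match the assumed format come back as `NA` rather than raising an error, so they can be filtered with `is.na()`.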

gaborcsardi, Apr 19 '24 12:04

This is now finally fixed in pkgcache as well; the metadata is updated daily. We could probably update it more often if we wanted to.

The only caveat is that updates are now based on package name and R version, so rebuilds of binaries are not picked up. CRAN should not rebuild source packages, so those should not be affected.

I think it would be possible to update the binaries without keeping a CRAN mirror. I'll open an issue for that: #119.

gaborcsardi, Sep 26 '24 10:09