unexpected behaviour for metadata with NULL or NA content

Open ElsLommelen opened this issue 1 year ago • 3 comments

When I add a metadata item without content (e.g. because all other tables or columns have content for this metadata item), writing and then reading the package alters this content: NA becomes NULL, and NULL becomes Named list(). The written datapackage.json also differs between the first and the second write. I don't need a distinction between NA and NULL, so I don't mind if they are saved and reloaded as the same value (or not at all), but I find it annoying that the written file changes when the metadata value is initially NA (which is more or less the default that R functions return when no data are available).

The reprex demonstrates the issue for metadata at the table level; metadata at the column level behaves similarly.

library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.3

# creating a package with metadata title = NULL and description = NA
my_package <-
  create_package() |>
  add_resource(
    resource_name = "iris",
    data = iris,
    title = NULL,
    description = NA
  )
str(my_package)
#> List of 2
#>  $ resources:List of 1
#>   ..$ :List of 10
#>   .. ..$ name       : chr "iris"
#>   .. ..$ data       :'data.frame':   150 obs. of  5 variables:
#>   .. .. ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>   .. .. ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>   .. .. ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>   .. .. ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>   .. .. ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..$ profile    : chr "tabular-data-resource"
#>   .. ..$ format     : NULL
#>   .. ..$ mediatype  : NULL
#>   .. ..$ encoding   : NULL
#>   .. ..$ dialect    : NULL
#>   .. ..$ title      : NULL
#>   .. ..$ description: logi NA
#>   .. ..$ schema     :List of 1
#>   .. .. ..$ fields:List of 5
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 3
#>   .. .. .. .. ..$ name       : chr "Species"
#>   .. .. .. .. ..$ type       : chr "string"
#>   .. .. .. .. ..$ constraints:List of 1
#>   .. .. .. .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"
#>  $ directory: chr "."
#>  - attr(*, "class")= chr [1:2] "datapackage" "list"

# writing the package
write_package(my_package, "irisdir")

# in datapackage.json, title = {} and description = null

# when reading the package again, title = Named list() and description = NULL
my_loaded_package <- read_package("irisdir/datapackage.json")
str(my_loaded_package)
#> List of 2
#>  $ resources:List of 1
#>   ..$ :List of 9
#>   .. ..$ name       : chr "iris"
#>   .. ..$ path       : chr "iris.csv"
#>   .. ..$ profile    : chr "tabular-data-resource"
#>   .. ..$ format     : chr "csv"
#>   .. ..$ mediatype  : chr "text/csv"
#>   .. ..$ encoding   : chr "utf-8"
#>   .. ..$ title      : Named list()
#>   .. ..$ description: NULL
#>   .. ..$ schema     :List of 1
#>   .. .. ..$ fields:List of 5
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 3
#>   .. .. .. .. ..$ name       : chr "Species"
#>   .. .. .. .. ..$ type       : chr "string"
#>   .. .. .. .. ..$ constraints:List of 1
#>   .. .. .. .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"
#>  $ directory: chr "irisdir"
#>  - attr(*, "class")= chr [1:2] "datapackage" "list"

write_package(my_loaded_package, "irisdir2")

# and in this datapackage.json, title = {} and description = {}

Created on 2024-04-16 with reprex v2.0.2

ElsLommelen avatar Apr 16 '24 14:04 ElsLommelen

Thanks for reporting. We have a helper function clean_list() that can sanitize NULL, list() etc. We could run it on the resource or the datapackage before writing, but I'm afraid it might have unintended side effects.
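As a rough user-side illustration of the "sanitize before writing" idea, the sketch below drops top-level properties that are NULL or a single NA from a plain list. The function drop_empty_properties() is hypothetical and written for this example only; it is not clean_list() and makes no claim about how that helper behaves. It is deliberately non-recursive, so nested values such as a data frame or a schema are left untouched.

```r
# Hypothetical helper (not part of frictionless): remove top-level
# properties whose value is NULL or a single NA.
drop_empty_properties <- function(x) {
  is_empty <- vapply(
    x,
    function(value) {
      is.null(value) ||
        (is.atomic(value) && length(value) == 1 && is.na(value))
    },
    logical(1)
  )
  x[!is_empty]
}

# Stand-in for the properties of a resource (not a real resource object).
properties <- list(
  name = "iris",
  title = NULL,
  description = NA,
  profile = "tabular-data-resource"
)

cleaned <- drop_empty_properties(properties)
print(names(cleaned))
```

Because the empty properties are removed rather than converted, they would not appear in the written file at all, which is one of the outcomes the report says would be acceptable.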

It's probably better to extend this line:

https://github.com/frictionlessdata/frictionless-r/blob/5024c909e67591a6eae9347e31ac99d6fa795749/R/write_package.R#L77C29-L77C35

Adding the properties null = "null" and na = "null" there would ensure that values are always exported the same way (as NULL, which is the default for lists).
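The effect of these two arguments can be tried in isolation with jsonlite::toJSON(). The sketch below assumes the linked line is a toJSON() call and uses a small stand-in list instead of a real package object; auto_unbox = TRUE is also an assumption made here so that single values are written as scalars.

```r
library(jsonlite)

# Stand-in for the resource properties from the reprex (not a real resource).
properties <- list(title = NULL, description = NA)

# Default settings: NULL is written as {} and NA as null.
json_default <- toJSON(properties, auto_unbox = TRUE)
print(json_default)

# Reading that JSON back gives an empty named list for {} and NULL for null,
# so a second write with the defaults produces {} for both properties.
reloaded_default <- fromJSON(as.character(json_default))
json_default_again <- toJSON(reloaded_default, auto_unbox = TRUE)
print(json_default_again)

# With null = "null" and na = "null", both properties are written as null ...
json_null <- toJSON(properties, auto_unbox = TRUE, null = "null", na = "null")
print(json_null)

# ... and the output is unchanged after a read/write round trip.
reloaded_null <- fromJSON(as.character(json_null))
json_null_again <- toJSON(
  reloaded_null,
  auto_unbox = TRUE, null = "null", na = "null"
)
print(identical(as.character(json_null), as.character(json_null_again)))
```

The first two writes reproduce the drift described in the report, while the last two show that the proposed settings give a stable file.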

peterdesmet avatar Apr 17 '24 15:04 peterdesmet

Using na = "string" would cause all NA values to be exported as "NA" (and thus different than NULL values). This is probably not desirable, since reading the package would not interpret those automatically as NA. At least NULL has an inherent meaning in lists.
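A small sketch of that concern, again using a stand-in list rather than a real package and assuming auto_unbox = TRUE: with na = "string" the missing value is written as the string "NA", and reading it back yields a character value instead of a missing one.

```r
library(jsonlite)

# Stand-in for a single resource property (not a real resource).
properties <- list(description = NA)

# na = "string" writes the missing value as the JSON string "NA".
json_string_na <- toJSON(properties, auto_unbox = TRUE, na = "string")
print(json_string_na)

# On reading, the value comes back as text and is no longer missing.
reloaded <- fromJSON(as.character(json_string_na))
print(is.na(reloaded$description))
print(class(reloaded$description))
```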

peterdesmet avatar Apr 17 '24 15:04 peterdesmet

> With the properties null = "null" and na = "null", so values are always exported the same way (as NULL, which is the default for lists).

This indeed seems like a good solution. I suppose the current settings are null = "list" and na = "null", so replacing the former with null = "null" would make NULL and NA behave the same after writing.

ElsLommelen avatar Apr 17 '24 15:04 ElsLommelen