Package grooming and performance improvements
Update
- [x] Documentation migrated to roxygen2 markdown
- [x] First clean up of unneeded dependencies
- [ ] Refactor tests
Hi @antagomir @pitkant:
Now that the package moved to a new version, let me share with you some potential improvements I would like to suggest:
Documentation
- Currently the documentation is still using the old syntax
\code{}, \link{}, etc. I would like to propose moving to markdown (see https://roxygen2.r-lib.org/articles/rd-formatting.html#turning-on-markdown-support-1). It is also quite easy to convert the current docs to markdown with {roxygen2md} package. - Also, the code styling can me improved using the
stylerpackage, adopted by the Tidyverse. - The DESCRIPTION file can also be normalized with the
usethispackage. - The
@familytag (https://roxygen2.r-lib.org/articles/rd.html#cross-references) is quite useful for cross-referencing families of functions. It also can be used for presenting a more clear index on the pkgdown page. Forexample, {giscoR} is structured on categories:
reference:
- title: Political
desc: These functions return `sf` objects with political boundaries
contents: has_concept("political")
- title: Infrastructures
desc: These functions return `sf` objects with man-made features
contents: has_concept("infrastructure")
...
- Accordingly, the reference page could be reestructured to facilitate the navigation of users.
Dependencies and tests
This is a sensitive issue. Currently the package has 19 Imports and 13 Suggests. Also, the package has more than 100 tests for get_eurostat_geospatial() and still the coverage is below 55%.
These two issues cause that the package is heavy when installing on a fresh system and also really slow to tests.
I think it is possible to reduce dependencies to 17+5 (get rid of 10 packages) with almost no costs. On the test side, more work is needed, but probably there are room for improvement.
As a suggestion, {giscoR} could help with the spatial part but {sf} would be still needed.
I see the Documentation part quite easy (I can even set an action for performing these tasks). Also I can work on reducing dependencies.
Let mw know your thoughs, regards
Regarding documentation:
- I agree with everything you've mentioned, all these seem like good quality of life improvements for the package.
Regarding package dependencies and tests:
- I agree that the package is rather heavy in its current state. Especially sf dependency seems to cause problems with some users when installing (for example #213) but also adds up in user support work (see #197).
- Reducing dependencies (when possible) is always a good thing in my books. What packages do you think are unnecessary?
- The way I see it it would be good focus eurostat-package's scope to retrieving and outputting data and leave any additional functionalities to other packages. Right now sf package is actually in suggests and geospatial functions work only if it's installed so that could be one way to do it.
I think we could start a project board aiming for eurostat 4.0.0 release. The specifics on what that release would contain could be discussed under the project page.
I like the idea of the project board π
It can start with the Documentation part, and regarding the packages I have testes:
Imports:
broom,
classInt,
countrycode,
curl,
dplyr,
httr,
magrittr,
jsonlite,
lubridate,
[del] RColorBrewer,
rappdirs,
readr,
RefManageR,
stringi,
stringr,
tibble,
tidyr,
regions,
rlang
Suggests:
[del] covr,
[del] Cairo,
[del] ggplot2,
knitr,
[del] markdown,
rmarkdown,
[del] roxygen2,
[del] rvest,
testthat (>= 3.0.0),
[del] tmap,
[del] usethis,
sf,
[del] here
[new] sp
Most of this packages are not even used on the vignettes, but on the articles. We can install that in the GitHub Action, as the articles are not part of the package.
Regarding Cairo, this is a bit tricky. I think it is used for plots. Now the default (at least on pkgdown) is ragg, that provides high-quality plots. This one makes me doubt a bit, but It could be removed amending also the vignette.
I agree with all, I very much like to idea of moving to md docs.
The suggested ways to remove dependencies seem good as well as work towards version 4.0.0 (what does R Extension say about version numbers, should this be 4.1.1 instead?)
One way to reduce dependencies is to move all spatial visualization stuff elsewhere, this has been under discussion for a longer time.
It also seems that there is now an emerging ecosystem of packages around eurostat, including also those from @antaldaniel - we could think if it would make sense to create an more comprehensive online resource on using and enrichning eurostat (and related?) data sources based on this set of interoperable packages. But this latter one would require a bit more planning and time.
Good ideas.
I think there are also performance issues in the get_eurostat() and particulary in the tidy_eurostat(). I tried to speed the code in the branch speed and managed to improve the performance significantly. Unfortunately, I never got time to finish the job, and now the branch is way behind the master. But there is potential... Unless someone else have already improved code meanwhile.
Hi @antagomir Regarding versioninf, on a quick search I see on https://cran.r-project.org/doc/manuals/r-release/R-exts.html:
The mandatory βVersionβ field gives the version of the package. This is a sequence of at least two (and usually three) non-negative integers separated by single β.β or β-β characters. The canonical form is as shown in the example, and a version such as β0.01β or β0.01.0β will be handled as if it were β0.1-0β. It is not a decimal number, so for example 0.9 < 0.75 since 9 < 75.
On https://r-pkgs.org/description.html#version SemVer and X.Org are referred. On SemVer:
Given a version number MAJOR.MINOR.PATCH, increment the:
MAJOR version when you make incompatible API changes, MINOR version when you add functionality in a backwards compatible manner, and PATCH version when you make backwards compatible bug fixes.
Common practice when these changes are applied are just to increase major with trailing 0.
@dieghernan I think this issue is again timely and closely related to #263 since several geospatial packages are going to be removed from CRAN or changed. I think now the time would be ripe to "move" get_eurostat_geospatial functionalities out of eurostat package and rely on giscoR instead.
I can prepare a PR switching to giscoR and including it on the Suggest field.
I think it should be pretty straightforward since the parameters of get_eurostat_geospatial are almost the same than those of gisco_get_nuts. Is that ok?
And if so, over which branch? I see some development coming recently on v4-dev
I think v4-dev (or, if you like, another branch that would be ultimately merged into v4) would be good. These changes fall easily under the MAJOR changes label so it's justifiable to put it there instead of in a minor release.
I'm not certain what the schedule for the release of v4.0 should be, or how long we have to go through the open issues and solve them. Before October 2023 anyway, because of this geospatial package schedule and #243, which also happens to have its deadline in October.