`sf::st_as_text()` is slow
It seems that sf::st_as_text() is definitely slower than its alternatives available in {lwgeom} and {terra}.
library("sf"); library("lwgeom"); library("terra")
pts = matrix(1:100000, , 2)
p = st_multipoint(pts)
p = st_cast(st_sfc(p), "POINT")
p2 = vect(p)
bench::mark(
check = FALSE,
sf::st_as_text(p),
lwgeom::st_astext(p),
terra::geom(p2, wkt = TRUE)
)
# expression min median `itr/sec`
# 1 sf::st_as_text(p) 3.79s 3.79s 0.264
# 2 lwgeom::st_astext(p) 111.51ms 112.72ms 8.64
# 3 terra::geom(p2, wkt = TRUE) 55.96ms 56.45ms 17.7
## second example -----------------------------------------
nc = read_sf(system.file("shape/nc.shp", package = "sf"))
nc = st_as_sfc(nc)
nc2 = vect(nc)
bench::mark(
check = FALSE,
sf::st_as_text(nc),
lwgeom::st_astext(nc),
terra::geom(nc2, wkt = TRUE)
)
# expression min median `itr/sec`
# 1 sf::st_as_text(nc) 146.22ms 155.15ms 6.30
# 2 lwgeom::st_astext(nc) 1.81ms 1.86ms 528.
# 3 terra::geom(nc2, wkt = TRUE) 2.57ms 2.62ms 380.
Yes, maybe we could add pointers to alternatives (lwgeom) in the docs; st_as_text from sf uses native R code by default. The assumption is that nobody will use it for large jobs, but even for printing ansf object, where it is called on the first 10 features, passing a very complex polygon will take time.
Use
Sys.setenv("LWGEOM_WKT"="true")
to have sf use that path (also not documented). Making this the default would make lwgeom effectively a hard dependency. An alternative would be to use lwgeom by default only if it is present.
See also #1602 #957 #800 and #747
I also think it would be good idea to mention {lwgeom} in the doc.
https://github.com/r-spatial/sf/blob/8bf890eb0b153d44d410fd84428311228bda90ae/R/wkt.R#L20
Maybe if we add fixed = TRUE argument in sub() it will be a little faster? And maybe rewriting *apply() into loop with preallocation would also bring some benefits?
fixed = TRUE is not what is meant, as `"^[ ]+" is a regular expression (one or more white spaces).
The change to default to using lwgeom is a lot of work, as the WKT produced changes and many tests in sf itself and probably in many reverse dependencies need to be adjusted.
Yeah, adding {lwgeom} as a hard dependency is too demanding change, so mention this package in the docs will be ok if someone is looking for a faster alternative. I'm still wondering if using a loop with preallocation would provide any benefits, as in the example below:
x = seq_len(1e6)
fun = function(x) {
output = double(length(x))
for (i in seq_along(x)) output[i] = sqrt(x[i])
return(output)
}
bench::mark(
sapply(x, sqrt),
vapply(x, sqrt, double(1)),
fun(x)
)
# expression min median `itr/sec` mem_alloc
# 1 sapply(x, s… 418.4ms 489ms 2.04 30.89MB
# 2 vapply(x, s… 292.5ms 300.6ms 3.33 7.63MB
# 3 fun(x) 46.7ms 47.2ms 20.7 7.63MB
Feel free to try out, I would be surprised if that would reduce time meaningfully.
Should I prepare PR with changes in documentation or would you rather do it yourself?
"The assumption is that nobody will use it for large jobs, ..."
FWIW, in a previous job we took sf objects and casted the geometry column to WKT to upload into snowflake. We ran into this problem and the solution was to use wk