sf icon indicating copy to clipboard operation
sf copied to clipboard

`sf::st_as_text()` is slow

Open kadyb opened this issue 2 years ago • 7 comments

It seems that sf::st_as_text() is definitely slower than its alternatives available in {lwgeom} and {terra}.

library("sf"); library("lwgeom"); library("terra")

pts = matrix(1:100000, , 2)
p = st_multipoint(pts)
p = st_cast(st_sfc(p), "POINT")
p2 = vect(p)

bench::mark(
  check = FALSE,
  sf::st_as_text(p),
  lwgeom::st_astext(p),
  terra::geom(p2, wkt = TRUE)
)
#   expression                       min   median `itr/sec`
# 1 sf::st_as_text(p)              3.79s    3.79s     0.264
# 2 lwgeom::st_astext(p)        111.51ms 112.72ms     8.64 
# 3 terra::geom(p2, wkt = TRUE)  55.96ms  56.45ms    17.7  

## second example -----------------------------------------
nc = read_sf(system.file("shape/nc.shp", package = "sf"))
nc = st_as_sfc(nc)
nc2 = vect(nc)

bench::mark(
  check = FALSE,
  sf::st_as_text(nc),
  lwgeom::st_astext(nc),
  terra::geom(nc2, wkt = TRUE)
)
#   expression                        min   median `itr/sec`
# 1 sf::st_as_text(nc)           146.22ms 155.15ms      6.30
# 2 lwgeom::st_astext(nc)          1.81ms   1.86ms    528.  
# 3 terra::geom(nc2, wkt = TRUE)   2.57ms   2.62ms    380.  

kadyb avatar Mar 17 '24 11:03 kadyb

Yes, maybe we could add pointers to alternatives (lwgeom) in the docs; st_as_text from sf uses native R code by default. The assumption is that nobody will use it for large jobs, but even for printing ansf object, where it is called on the first 10 features, passing a very complex polygon will take time.

Use

Sys.setenv("LWGEOM_WKT"="true")

to have sf use that path (also not documented). Making this the default would make lwgeom effectively a hard dependency. An alternative would be to use lwgeom by default only if it is present.

See also #1602 #957 #800 and #747

edzer avatar Mar 27 '24 21:03 edzer

I also think it would be good idea to mention {lwgeom} in the doc.

https://github.com/r-spatial/sf/blob/8bf890eb0b153d44d410fd84428311228bda90ae/R/wkt.R#L20

Maybe if we add fixed = TRUE argument in sub() it will be a little faster? And maybe rewriting *apply() into loop with preallocation would also bring some benefits?

kadyb avatar Mar 27 '24 23:03 kadyb

fixed = TRUE is not what is meant, as `"^[ ]+" is a regular expression (one or more white spaces).

The change to default to using lwgeom is a lot of work, as the WKT produced changes and many tests in sf itself and probably in many reverse dependencies need to be adjusted.

edzer avatar Mar 28 '24 11:03 edzer

Yeah, adding {lwgeom} as a hard dependency is too demanding change, so mention this package in the docs will be ok if someone is looking for a faster alternative. I'm still wondering if using a loop with preallocation would provide any benefits, as in the example below:

x = seq_len(1e6)
fun = function(x) {
  output = double(length(x))
  for (i in seq_along(x)) output[i] = sqrt(x[i])
  return(output)
}

bench::mark(
  sapply(x, sqrt),
  vapply(x, sqrt, double(1)),
  fun(x)
)
#   expression       min  median `itr/sec` mem_alloc
# 1 sapply(x, s… 418.4ms   489ms      2.04   30.89MB
# 2 vapply(x, s… 292.5ms 300.6ms      3.33    7.63MB
# 3 fun(x)        46.7ms  47.2ms     20.7     7.63MB

kadyb avatar Mar 28 '24 13:03 kadyb

Feel free to try out, I would be surprised if that would reduce time meaningfully.

edzer avatar Mar 28 '24 16:03 edzer

Should I prepare PR with changes in documentation or would you rather do it yourself?

kadyb avatar Mar 29 '24 12:03 kadyb

"The assumption is that nobody will use it for large jobs, ..."

FWIW, in a previous job we took sf objects and casted the geometry column to WKT to upload into snowflake. We ran into this problem and the solution was to use wk

JosiahParry avatar Apr 16 '24 15:04 JosiahParry