cookiejar and cookiefile do not work
cookiejar and cookiefile are necessary when cookies need to be reused in a new R process. However, they do not work when set through config().
GET("http://httpbin.org/cookies/set?a=1",
config(cookiefile = "a.txt", cookiejar = "a.txt"))
dir("a.txt")
> character(0)
I was interested in digging into this to understand how it should work. Here is my investigation.
Let’s try to understand how it works with curl. First, the documentation gives us some hints: https://curl.haxx.se/libcurl/c/CURLOPT_COOKIEJAR.html
This will make libcurl write all internally known cookies to the specified file when curl_easy_cleanup is called
So the file will be written only after cleanup. Reading the curl source code, I think this happens in the R finalizer, when the curl handle is garbage collected.
Based on that, this is how you would use cookies with curl.
cookie_file <- tempfile("cookie", fileext = ".txt")
h <- curl::new_handle()
curl::handle_setopt(h, cookiejar = cookie_file)
res <- curl::curl_fetch_memory("https://httpbin.org/cookies/set?a=1", h)
# cookie file is not created just after the request
file.exists(cookie_file)
#> [1] FALSE
# but cookies are stored in the handle for reuse
curl::handle_cookies(h)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
# you can have another one, the previous is still here
res <- curl::curl_fetch_memory("https://httpbin.org/cookies/set?b=2", h)
curl::handle_cookies(h)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
#> 2 httpbin.org FALSE / FALSE <NA> b 2
# cookie file is not here yet
file.exists(cookie_file)
#> [1] FALSE
# this is because the cookie file is created when the curl handle is cleaned up
# in R, this happens on garbage collection
rm(h)
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 544708 29.1 1235332 66 640917 34.3
#> Vcells 1018081 7.8 8388608 64 1627019 12.5
# now the file exists and both cookies are here
file.exists(cookie_file)
#> [1] TRUE
readLines(cookie_file)
#> [1] "# Netscape HTTP Cookie File"
#> [2] "# https://curl.haxx.se/docs/http-cookies.html"
#> [3] "# This file was generated by libcurl! Edit at your own risk."
#> [4] ""
#> [5] "httpbin.org\tFALSE\t/\tFALSE\t0\tb\t2"
#> [6] "httpbin.org\tFALSE\t/\tFALSE\t0\ta\t1"
We managed to write the cookies to a file. They can now be reused with curl or httr.
h <- curl::new_handle()
# activate cookie engine in reading mode
curl::handle_setopt(h, cookiefile = cookie_file)
# see if cookies are sent
res <- curl::curl_fetch_memory("https://httpbin.org/cookies", h)
# cookies are correctly reused
jsonlite::fromJSON(rawToChar(res$content))
#> $cookies
#> $cookies$a
#> [1] "1"
#>
#> $cookies$b
#> [1] "2"
# this also works with httr
readLines(cookie_file)
#> [1] "# Netscape HTTP Cookie File"
#> [2] "# https://curl.haxx.se/docs/http-cookies.html"
#> [3] "# This file was generated by libcurl! Edit at your own risk."
#> [4] ""
#> [5] "httpbin.org\tFALSE\t/\tFALSE\t0\tb\t2"
#> [6] "httpbin.org\tFALSE\t/\tFALSE\t0\ta\t1"
res <- httr::GET("https://httpbin.org/cookies", httr::config(cookiefile = cookie_file))
httr::content(res)
#> $cookies
#> $cookies$a
#> [1] "1"
#>
#> $cookies$b
#> [1] "2"
This means that cookiefile is working fine at least.
Now, writing cookies does not seem to work as I expected with httr itself. curl handles are used internally, but they are hidden from the user. From the ?handle_pool documentation, I think the handles are not cleaned up until the end of the session, which makes it easier to maintain and reuse cookies within the same session. However, removing the handle from the pool using httr::handle_reset should amount to an rm, and the same thing as with curl should then happen. It does not, so something else is going on.
This is what I have tried so far:
cookie_file <- tempfile("cookie", fileext = ".txt")
url <- "https://httpbin.org/"
# create some cookies and add cookiejar option
r <- httr::GET(url, path = "cookies/set", query = list(a = 1),
httr::config(cookiejar = cookie_file))
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
httr::cookies(r)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
# cookies are in the session
r <- httr::GET(url, path = "cookies")
httr::cookies(r)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
# another can be added; internally the curl handle is the same
r <- httr::GET(url, path = "cookies/set", query = list(b = 2))
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
#>
#> $cookies$b
#> [1] "2"
httr::cookies(r)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
#> 2 httpbin.org FALSE / FALSE <NA> b 2
# cookie file is not yet written
file.exists(cookie_file)
#> [1] FALSE
# Cleaning the httr internal handle should do the trick
httr::handle_reset(url)
file.exists(cookie_file)
#> [1] FALSE
# and after garbage collection ?
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 571395 30.6 1256178 67.1 754667 40.4
#> Vcells 1056294 8.1 8388608 64.0 1627019 12.5
file.exists(cookie_file)
#> [1] FALSE
Trying now by creating the handle myself
cookie_file <- tempfile("cookie", fileext = ".txt")
url <- "https://httpbin.org/"
h <- httr::handle(url)
# adding the option
curl::handle_setopt(h$handle, cookiejar = cookie_file)
# create some cookies
r <- httr::GET(url, path = "cookies/set", query = list(a = 1),
handle = h)
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
httr::cookies(r)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
# cookies are in the handle
r <- httr::GET(url, path = "cookies", handle = h)
httr::cookies(r)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
# Another can be added, in the same handle
r <- httr::GET(url, path = "cookies/set", query = list(b = 2), handle = h)
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
#>
#> $cookies$b
#> [1] "2"
httr::cookies(r)
#> domain flag path secure expiration name value
#> 1 httpbin.org FALSE / FALSE <NA> a 1
#> 2 httpbin.org FALSE / FALSE <NA> b 2
# cookie file is not yet written
file.exists(cookie_file)
#> [1] FALSE
# Cleaning the handle should do the trick
rm(h)
file.exists(cookie_file)
#> [1] FALSE
# and garbage collect ?
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 571744 30.6 1256178 67.1 856886 45.8
#> Vcells 1057498 8.1 8388608 64.0 1627019 12.5
file.exists(cookie_file)
#> [1] FALSE
Created on 2019-11-23 by the reprex package (v0.3.0)
I also tried writing to a non-temporary file and closing the R session to see whether a finalizer would trigger the curl cleanup, but that did not do the trick.
I wonder if this is because curl::handle_reset is called after each request is performed:
https://github.com/r-lib/httr/blob/5b9ebfad4b4d76dfc240ffb3bafda3c6b0a89d20/R/request.R#L146
That would remove the options, including cookiejar, I guess.
I hope this helps in understanding the issue. Using curl directly is a workaround for now, I guess.
httr has been superseded in favour of httr2, so it is no longer under active development. If this problem is still important to you in httr2, I'd suggest filing an issue over there 😄, but req_cookie_preserve() should wrap the functionality that you want. Thanks for using httr!
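For readers landing here, a minimal sketch of what that might look like in httr2. It assumes httr2 and curl are installed and reuses the httpbin.org endpoint from the examples above; `cookie_path` is just an illustrative name. The request is only performed when a network connection is available, and no output is shown because it was not run here.

```r
library(httr2)

# file in which cookies should be kept between requests (illustrative name)
cookie_path <- tempfile("cookie", fileext = ".txt")

# build a request that sets a cookie and preserves cookies in cookie_path
req <- request("https://httpbin.org")
req <- req_url_path(req, "/cookies/set")
req <- req_url_query(req, a = 1)
req <- req_cookie_preserve(req, cookie_path)

# only hit the network when it is available; try() keeps the sketch
# from stopping if httpbin.org happens to be unreachable
if (curl::has_internet()) {
  resp <- try(req_perform(req), silent = TRUE)
  if (file.exists(cookie_path)) {
    print(readLines(cookie_path))
  }
}
```

The same `cookie_path` can then be passed to req_cookie_preserve() on a later request, including one made from a new R process, so that the stored cookies are sent again.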
(@cderv you might be interested to see how I flushed the cookies to disk: https://github.com/r-lib/httr2/pull/282/files#diff-2aaba4364d1a56e7a1019822c6d31ff1fe1c47e6c7e072afe9565c9513fa6570)
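For anyone who needs the cookie file written without waiting for the handle to be cleaned up, libcurl documents a "FLUSH" value for CURLOPT_COOKIELIST that writes all known cookies to the CURLOPT_COOKIEJAR file immediately. The sketch below assumes the curl R package passes `cookielist` through `handle_setopt()` like any other string option; it is not necessarily what the linked PR does. The cookie is seeded by hand (example.com and a=1 are made up) so that no network request is needed.

```r
library(curl)

cookie_file <- tempfile("cookie", fileext = ".txt")
h <- new_handle()

# enable the cookie engine and name the file to write to
handle_setopt(h, cookiejar = cookie_file)

# seed one cookie in Netscape format (tab-separated fields:
# domain, subdomain flag, path, secure, expiration, name, value)
handle_setopt(h, cookielist = "example.com\tFALSE\t/\tFALSE\t0\ta\t1")

# ask libcurl to write all known cookies to the cookiejar file now,
# instead of waiting for the handle to be garbage collected
handle_setopt(h, cookielist = "FLUSH")

file.exists(cookie_file)
readLines(cookie_file)
```

With a real request, the "FLUSH" call would go right after `curl::curl_fetch_memory()`, while the handle still holds the cookies.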