cookiejar and cookiefile do not work

Open kongdd opened this issue 6 years ago • 1 comments

cookiejar and cookiefile are necessary when you need to reuse cookies in a new R process. However, they do not work when set through config().

library(httr)
GET("http://httpbin.org/cookies/set?a=1",
    config(cookiefile = "a.txt", cookiejar = "a.txt"))
# look for a.txt in the working directory
dir(pattern = "^a\\.txt$")
#> character(0)

kongdd, May 11 '19 10:05

I was interested in digging into this to understand how it should work. Here is my investigation.

Let’s try to understand how it works with curl. First, the documentation gives us some hints: https://curl.haxx.se/libcurl/c/CURLOPT_COOKIEJAR.html

This will make libcurl write all internally known cookies to the specified file when curl_easy_cleanup is called

So the file is written only after cleanup. From reading the curl package source code, I think this happens in an R finalizer, when the curl handle is garbage collected.

Based on that, this is how you would use cookies with curl.

cookie_file <- tempfile("cookie", fileext = ".txt")
h <- curl::new_handle()
curl::handle_setopt(h, cookiejar = cookie_file)
res <- curl::curl_fetch_memory("https://httpbin.org/cookies/set?a=1", h)
# cookie file is not created just after the request
file.exists(cookie_file)
#> [1] FALSE
# but cookies are stored in the handle for reuse
curl::handle_cookies(h)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
# you can have another one, the previous is still here
res <- curl::curl_fetch_memory("https://httpbin.org/cookies/set?b=2", h)
curl::handle_cookies(h)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
#> 2 httpbin.org FALSE    /  FALSE       <NA>    b     2
# cookie file is not here yet 
file.exists(cookie_file)
#> [1] FALSE
# this is because the cookie file is created when the curl handle is cleaned up
# in R, this happens on garbage collection
rm(h)
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  544708 29.1    1235332   66   640917 34.3
#> Vcells 1018081  7.8    8388608   64  1627019 12.5
# now the file exists and both cookies are there
file.exists(cookie_file)
#> [1] TRUE
readLines(cookie_file)
#> [1] "# Netscape HTTP Cookie File"                                 
#> [2] "# https://curl.haxx.se/docs/http-cookies.html"               
#> [3] "# This file was generated by libcurl! Edit at your own risk."
#> [4] ""                                                            
#> [5] "httpbin.org\tFALSE\t/\tFALSE\t0\tb\t2"                             
#> [6] "httpbin.org\tFALSE\t/\tFALSE\t0\ta\t1"

We managed to write the cookies to a file. They can now be reused with curl or httr.

h <- curl::new_handle()
# activate cookie engine in reading mode
curl::handle_setopt(h, cookiefile = cookie_file)
# see if cookies are sent
res <- curl::curl_fetch_memory("https://httpbin.org/cookies", h)
# cookies are correctly reused
jsonlite::fromJSON(rawToChar(res$content))
#> $cookies
#> $cookies$a
#> [1] "1"
#> 
#> $cookies$b
#> [1] "2"

# this also works with httr
readLines(cookie_file)
#> [1] "# Netscape HTTP Cookie File"                                 
#> [2] "# https://curl.haxx.se/docs/http-cookies.html"               
#> [3] "# This file was generated by libcurl! Edit at your own risk."
#> [4] ""                                                            
#> [5] "httpbin.org\tFALSE\t/\tFALSE\t0\tb\t2"                             
#> [6] "httpbin.org\tFALSE\t/\tFALSE\t0\ta\t1"

res <- httr::GET("https://httpbin.org/cookies", httr::config(cookiefile = cookie_file))
httr::content(res)
#> $cookies
#> $cookies$a
#> [1] "1"
#> 
#> $cookies$b
#> [1] "2"

This means that cookiefile is working fine at least.

Writing cookies, however, does not work as I expected with httr itself. httr uses curl handles internally but hides them from the user. From the ?handle_pool documentation, I think the handles are not cleaned up until the end of the session, which makes it easier to maintain and reuse cookies within the same session. However, removing the handle from the pool with httr::handle_reset should amount to an rm, so the same thing as with curl should happen. It does not, so something else is going on.

This is what I tried so far

cookie_file <- tempfile("cookie", fileext = ".txt")
url <- "https://httpbin.org/"
# create some cookies and add cookiejar option
r <- httr::GET(url, path = "cookies/set", query = list(a = 1), 
               httr::config(cookiejar = cookie_file)) 
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
httr::cookies(r)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
# cookies are in the session
r <- httr::GET(url, path = "cookies") 
httr::cookies(r)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
# another cookie can be added; internally the curl handle is the same
r <- httr::GET(url, path = "cookies/set", query = list(b = 2)) 
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
#> 
#> $cookies$b
#> [1] "2"
httr::cookies(r)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
#> 2 httpbin.org FALSE    /  FALSE       <NA>    b     2
# cookie file is not yet written
file.exists(cookie_file)
#> [1] FALSE
# Cleaning the httr internal handle should do the trick
httr::handle_reset(url)
file.exists(cookie_file)
#> [1] FALSE
# and after garbage collection ?
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  571395 30.6    1256178 67.1   754667 40.4
#> Vcells 1056294  8.1    8388608 64.0  1627019 12.5
file.exists(cookie_file)
#> [1] FALSE

Trying now by creating the handle myself

cookie_file <- tempfile("cookie", fileext = ".txt")
url <- "https://httpbin.org/"
h <- httr::handle(url)
# adding the option
curl::handle_setopt(h$handle, cookiejar = cookie_file)
# create some cookies
r <- httr::GET(url, path = "cookies/set", query = list(a = 1), 
               handle = h) 
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
httr::cookies(r)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
# cookies are in the handle
r <- httr::GET(url, path = "cookies", handle = h) 
httr::cookies(r)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
# Another can be added, in the same handle
r <- httr::GET(url, path = "cookies/set", query = list(b = 2), handle = h) 
httr::content(r)
#> $cookies
#> $cookies$a
#> [1] "1"
#> 
#> $cookies$b
#> [1] "2"
httr::cookies(r)
#>        domain  flag path secure expiration name value
#> 1 httpbin.org FALSE    /  FALSE       <NA>    a     1
#> 2 httpbin.org FALSE    /  FALSE       <NA>    b     2
# cookie file is not yet written
file.exists(cookie_file)
#> [1] FALSE
# Cleaning the handle should do the trick
rm(h)
file.exists(cookie_file)
#> [1] FALSE
# and garbage collect ?
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  571744 30.6    1256178 67.1   856886 45.8
#> Vcells 1057498  8.1    8388608 64.0  1627019 12.5
file.exists(cookie_file)
#> [1] FALSE

Created on 2019-11-23 by the reprex package (v0.3.0)

I also tried writing to a non-temporary file and closing the R session, to see whether a finalizer would launch the curl cleanup, but that did not do the trick.

I wonder if this is because curl::handle_reset is called after each request is performed: https://github.com/r-lib/httr/blob/5b9ebfad4b4d76dfc240ffb3bafda3c6b0a89d20/R/request.R#L146 I guess that removes the options set on the handle, including cookiejar.
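
A rough way to probe this guess with curl alone, without httr or a network request, might look like the sketch below. The helper name jar_written is made up for illustration. I am assuming that the cookielist option accepts a Netscape-format cookie line, as libcurl documents for CURLOPT_COOKIELIST. I have not recorded any output here, and the result may also depend on when the finalizer actually runs.

```r
library(curl)

# Hypothetical helper: is the cookie jar written on cleanup, with or
# without a curl::handle_reset() before the handle is dropped?
jar_written <- function(reset) {
  cookie_file <- tempfile("cookie", fileext = ".txt")
  h <- curl::new_handle()
  curl::handle_setopt(h, cookiejar = cookie_file)
  # add a cookie without a network request, using the same tab-separated
  # layout that libcurl writes to the cookie file
  curl::handle_setopt(h, cookielist = "httpbin.org\tFALSE\t/\tFALSE\t0\ta\t1")
  if (reset) curl::handle_reset(h)
  # drop the handle and force garbage collection, as in the curl example above
  rm(h)
  invisible(gc())
  file.exists(cookie_file)
}

jar_written(reset = FALSE)
jar_written(reset = TRUE)
```

If the second call returns FALSE while the first returns TRUE, that would be consistent with the reset dropping the cookiejar option before cleanup.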

I hope this helps in understanding the problem. Using curl directly is a workaround for now, I guess.
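
As a sketch of what such a curl workaround could look like without relying on garbage collection: libcurl documents a special "FLUSH" value for CURLOPT_COOKIELIST that writes all known cookies to the file given in cookiejar. The helper name flush_cookies is hypothetical, and the cookie is added by hand (again through cookielist) only so that the example does not need a network request. In real use the cookies would come from requests made with the handle.

```r
library(curl)

# Hypothetical helper: ask libcurl to write the handle's cookies to the
# file configured with `cookiejar` right now, instead of at cleanup.
flush_cookies <- function(h) {
  # "FLUSH" is a special value documented for CURLOPT_COOKIELIST
  curl::handle_setopt(h, cookielist = "FLUSH")
  invisible(h)
}

cookie_file <- tempfile("cookie", fileext = ".txt")
h <- curl::new_handle()
curl::handle_setopt(h, cookiejar = cookie_file)

# Assumption for the demo: add one cookie by hand as a Netscape-format line
curl::handle_setopt(h, cookielist = "httpbin.org\tFALSE\t/\tFALSE\t0\ta\t1")

flush_cookies(h)
file.exists(cookie_file)
readLines(cookie_file)
```

The handle stays alive after the flush, so it can keep being used for further requests in the same session.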

cderv, Nov 23 '19 16:11

httr has been superseded in favour of httr2, so it is no longer under active development. If this problem is still important to you in httr2, I'd suggest filing an issue over there 😄, but req_cookie_preserve() should wrap the functionality that you want. Thanks for using httr!

(@cderv you might be interested to see how I flushed the cookies to disk: https://github.com/r-lib/httr2/pull/282/files#diff-2aaba4364d1a56e7a1019822c6d31ff1fe1c47e6c7e072afe9565c9513fa6570)
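
For reference, a minimal sketch of how req_cookie_preserve() might be used to share cookies through a file. The helper name with_cookie_file and the httpbin.org URL are only for illustration, and the request is built but not performed here because that needs network access.

```r
library(httr2)

cookie_path <- tempfile("cookies", fileext = ".txt")

# Hypothetical helper: build a request that reads cookies from, and
# writes cookies to, the file at `path`
with_cookie_file <- function(url, path) {
  req <- httr2::request(url)
  httr2::req_cookie_preserve(req, path)
}

req <- with_cookie_file("https://httpbin.org/cookies/set?a=1", cookie_path)

# Performing the request needs network access, so it is left commented out:
# resp <- httr2::req_perform(req)
# readLines(cookie_path)
```

Another R process could then pass the same path to req_cookie_preserve() to reuse the cookies, provided the file is stored somewhere non-temporary.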

hadley, Oct 31 '23 20:10