Odd request timeouts when using http client
Hello Zach,
Firstly I wanted to say thanks for the great libraries you've developed.
I have created a proxy that uses an aleph http server to receive requests and forward them on using the aleph http client.
Every so often (about once every 1000 requests) a request will hit the ":request-timeout" limit (set to 60000 millis). However, the odd thing is that I have confirmed the remote server has processed the request and responded well within the timeout (usually <500 milliseconds).
It seems like aleph is getting stuck when receiving the response and never seeing the request as complete. I am currently working on a reproducible test case, however I was hoping you may be able to point out if I'm doing something obviously wrong that would cause this behaviour.
Many thanks, Jon Campbell
(ns my-ns.core
(:require [mount.lite :as mount]
[manifold.deferred :as df]
[aleph.http :as http]
[compojure.core :as compojure])
(:import (io.aleph.dirigiste IPool))
(:gen-class))
(mount/defstate pool
:start (http/connection-pool {:connections-per-host (System/getenv "CONN_PER_HOST")
:total-connections (System/getenv "TOTAL_CONN")})
:stop (.shutdown ^IPool pool))
(defn create-forward-request [req remote-server-name remote-server-port]
(-> req
(select-keys [:headers :request-method :query-string :body :uri])
(merge {:scheme :http
:server-name remote-server-name
:server-port remote-server-port
:throw-exceptions false
:pool-timeout 10000
:connection-timeout 10000
:request-timeout 60000
:read-timeout 59000
:pool pool})))
(defn forward [req-received remote-server-name remote-server-port]
(let [req-to-forward (create-forward-request req-received remote-server-name remote-server-port)]
(-> (df/chain (http/request req-to-forward)
#(select-keys % [:status :headers :body]))
(df/catch (fn [e]
(.printStackTrace e)
(println "error when forwarding request" req-received req-to-forward)
{:status 500 :body "error when forwarding request"})))))
(def app-routes
(compojure/routes (compojure/ANY "/app-name/*" req (forward req
(System/getenv "REMOTE_SERVER_NAME")
(System/getenv "REMOTE_SERVER_PORT")))))
(http/start-server app-routes {:port (System/getenv "PORT")})
@joncampbelldev What version of the library do you use? And what are your :connections-per-host and :total-connections settings? It sounds like requests are actually stuck in the queue instead of being sent to the server.
It also would be helpful to see pool stats (you can set a callback to log them periodically).
I have tried values from 32 to 1024, none of which changed the frequency of the errors. I have also confirmed the remote server is both receiving the apparently timed out request and responding quickly (see my previous post, 2nd paragraph).
Thanks for the suggestion of stats callback, I will collect the values from that and post them here.
I have tested with both 0.4.4 and 0.4.5-alpha5.
It's possible you're exhausting your file descriptors. Can you use the configuration options described under "benchmarking" in this readme and see if that change anything?
I ran echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_reuse on my proxy servers. Viewing netstat -tulnap before and after I ran your command shows a drop in connections with status of TIME_WAIT , unfortunately it has not changed the error rate.
The exception I am getting (in case it helps) is: aleph.utils.RequestTimeoutException: timed out after 60000 milliseconds
This is logged out roughly 1 minute after the remote server logs confirm it has sent the response.
I will hopefully have the :stats-callback logs to post here on monday.
Here are the :stats-callback logs: https://gist.github.com/joncampbelldev/2f19637caf676001d607cbebc6f74afa
There were time out errors in between each of these so hopefully it caught some useful data, these are from several proxy servers.
Well, you only have 2-3 active connections, which explains why tweaking the connection configuration wasn't making any difference. I apologize that this is so labor intensive on your end, but the next thing is to do a packet dump.
Run this:
tcpdump -i "lo0" -c 10000 -w packets.pcap port <SERVER-PORT> or port <PROXY-PORT>
(note: "lo0" is the default loopback interface, this should be the right thing for your machine but maybe not)
And make sure each request and response has unique content or a unique header so we can correlate the traffic on either side of the proxy. I can take a look at the packet dump and see where things are getting lost.
No need to apologise, I'm very glad for the help.
I'm afraid since I haven't managed to reproduce this issue outside of production traffic I won't be able to share a tcpdump. I will debug using tcpdump and try to replicate this with test data too so that I can share the raw output.
Many thanks for your help so far.
Well, if you'd like to try to diagnose this on your end, what I'd be checking for is whether the backend service's response is actually making it to the Aleph proxy. You said you'd verified that the response is happening in a reasonable amount of time, but I assume that was internal to the backend service. It's possible something bad is happening in between the two.
Hi, am facing very similar issue with the intermittent timeout while going through proxy, can you share if you fixed your issue and what was the cause pls.... cheers...
I'm afraid I was unable to find any fix for this or even reproduce it outside of my production traffic. I still use aleph to receive the http requests, but I use a different http client library to forward them on.
Hi, did anyone manage to reproduce this issue? Changing a http client library is not an option in my case....
For now, I don't. Be we know there are some issues on the HTTP client.
We also experienced this "kind" of scenarios on my company and we ditched the aleph client for something else.
This open issue can also be related to this one: https://github.com/clj-commons/aleph/issues/532
Once all the commits from the 1.0.0 have been merged to master, we'll have to tackle the error handling PR (https://github.com/clj-commons/aleph/pull/553) which at least revolves around some issues on the HTTP client.
We have done many improvements on the client and some of them related to timeouts. Feel free to reopen another issue if needed.