
Sometimes Net::OpenTimeout / EOFError

Open · ledermann opened this issue on Dec 31, 2023 · 10 comments

Since yesterday (2023-12-30 20:49:00 UTC+1) I get this in all my apps using RorVsWild:

ERROR -- : /usr/local/lib/ruby/3.3.0/net/protocol.rb:43:in `ssl_socket_connect': Net::OpenTimeout from /usr/local/lib/ruby/3.3.0/net/http.rb:1674:in `connect'
from /usr/local/lib/ruby/3.3.0/net/http.rb:2388:in `begin_transport'
from /usr/local/lib/ruby/3.3.0/net/http.rb:2332:in `transport_request'
from /usr/local/lib/ruby/3.3.0/net/http.rb:2306:in `request'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:24:in `block in request_with_rorvswild'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/agent.rb:65:in `measure_section'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:23:in `request_with_rorvswild'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:17:in `request'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/client.rb:67:in `transmit'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/client.rb:40:in `post'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:71:in `flush'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:62:in `flush_indefinetely'
from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:76:in `block in start_thread'

Since I don't know if you have a status page, I'll report it here.

Happy new year ;-)

ledermann commented on Dec 31, 2023

I have no idea. During this period we received the same volume of requests as usual. Our load average and response time are as usual. I checked the open source app and it received at least a request every minute (https://www.rorvswild.com/applications/136101/requests?range=2023-12-30T19:20:00Z..2023-12-30T20:20:00Z). The throughput is even a little bit higher than the daily average.

The timeout is set to 3 seconds. Maybe that's too short when there are disturbances on the network.

Thanks, happy new year!

alexisbernard commented on Jan 1, 2024

@alexisbernard Strange, I continue to see this in the logs of all my applications. I'm using dockerized apps, running on Hetzner Cloud Shared VMs, located in Nuremberg, Germany. Low traffic, I can't see any bottlenecks on my side.

I just increased the timeout to 10 seconds and will see if that solves it.
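For reference, a minimal sketch of what these timeouts mean at the Net::HTTP level (plain Ruby with a placeholder URL, not RorVsWild's actual configuration code):

require "net/http"
require "uri"

uri = URI("https://example.com/collect") # placeholder URL, not the real API endpoint
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.open_timeout = 10 # time allowed to establish the TCP/TLS connection; raises Net::OpenTimeout when exceeded
http.read_timeout = 10 # time allowed to wait for response data; raises Net::ReadTimeout when exceeded

request = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json")
request.body = "{}"
response = http.request(request)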

ledermann commented on Jan 2, 2024

After the timeout was increased to 10 seconds, the error no longer occurred. Nice, but now I see this EOFError a few times in the logs:

 E, [2024-01-03T08:31:03.278931 #1] ERROR -- : /usr/local/bundle/gems/net-protocol-0.2.2/lib/net/protocol.rb:237:in `rbuf_fill': end of file reached (EOFError) 
 from /usr/local/bundle/gems/net-protocol-0.2.2/lib/net/protocol.rb:199:in `readuntil'
 from /usr/local/bundle/gems/net-protocol-0.2.2/lib/net/protocol.rb:209:in `readline'
 from /usr/local/lib/ruby/3.2.0/net/http/response.rb:158:in `read_status_line'
 from /usr/local/lib/ruby/3.2.0/net/http/response.rb:147:in `read_new'
 from /usr/local/lib/ruby/3.2.0/net/http.rb:1862:in `block in transport_request'
 from /usr/local/lib/ruby/3.2.0/net/http.rb:1853:in `catch'
 from /usr/local/lib/ruby/3.2.0/net/http.rb:1853:in `transport_request'
 from /usr/local/lib/ruby/3.2.0/net/http.rb:1826:in `request'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:24:in `block in request_with_rorvswild'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/agent.rb:65:in `measure_section'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:23:in `request_with_rorvswild'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:17:in `request'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/client.rb:67:in `transmit'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/client.rb:40:in `post'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:71:in `flush'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:62:in `flush_indefinetely'
 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:76:in `block in start_thread'

ledermann commented on Jan 3, 2024

Sorry for the delay. I suspect the timeout was triggered on the server side, but the latency between our servers is very stable:

$ ping solectrus.ledermann.dev
[...]
--- solectrus.ledermann.dev ping statistics ---
337 packets transmitted, 337 received, 0% packet loss, time 336452ms
rtt min/avg/max/mdev = 13.963/14.045/14.293/0.043 ms

Do you have an idea of how frequently the EOFError occurs?

alexisbernard commented on Jan 8, 2024

The EOFError occurs rarely for me, about 1-2 times a day in most (but not all) apps that use RorVsWild. It happens with Ruby 3.2.2 and Ruby 3.3.0; other versions are untested.

Further observation: In a low-traffic app, the error can occur even when there have been no requests for several minutes. I do not understand why the NetHttp plugin was activated at all at this time.

By the way, the exception appears neither in RorVsWild's exception list nor in Honeybadger, which I usually run in parallel for bug tracking.

ledermann commented on Jan 9, 2024

Thanks for those details.

Further observation: In a low-traffic app, the error can occur even when there have been no requests for several minutes. I do not understand why the NetHttp plugin was activated at all at this time.

Because server metrics are sent each minute.

By the way, the exception appears neither in RorVsWild's exception list nor in Honeybadger, which I usually run in parallel for bug tracking.

Because data are sent in a separate thread, and the error is caught directly (https://github.com/BaseSecrete/rorvswild/blob/master/lib/rorvswild/client.rb#L71).
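For illustration, a simplified sketch of that pattern, with a placeholder endpoint standing in for the gem's Queue#flush and Client#post (not the actual implementation):

require "net/http"
require "logger"
require "uri"

logger = Logger.new($stdout)

# Reporting runs in its own thread; network errors are rescued and logged
# there, so they never reach the request cycle or a bug tracker.
Thread.new do
  loop do
    sleep 60 # server metrics are flushed roughly every minute
    begin
      Net::HTTP.post(URI("https://example.com/metrics"), "{}", "Content-Type" => "application/json")
    rescue EOFError, Net::OpenTimeout, Net::ReadTimeout, SystemCallError => error
      logger.error(error.full_message) # only visible in the application log
    end
  end
end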

alexisbernard commented on Jan 9, 2024

Because server metrics are sent each minute.

Hm, I don't think this explains the issue. Please take a look at this snippet from the logs with timestamps:

Jan 09 11:25:46 I, [2024-01-09T10:25:46.106549 #1] INFO -- : [3.120.15.28] [cd9199d2-df50-4aad-ab0c-a2dc39a24edb] method=GET path=/ format=*/* controller=VueController action=index status=200 allocations=1333 duration=5.06 view=4.51 db=0.00 user_agent=Honeybadger Uptime Check
Jan 09 11:30:03 E, [2024-01-09T10:30:03.653642 #1] ERROR -- : /usr/local/lib/ruby/3.3.0/net/protocol.rb:237:in `rbuf_fill': end of file reached (EOFError)
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/protocol.rb:199:in `readuntil'
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/protocol.rb:209:in `readline'
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/http/response.rb:158:in `read_status_line'
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/http/response.rb:147:in `read_new'
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/http.rb:2342:in `block in transport_request'
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/http.rb:2333:in `catch'
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/http.rb:2333:in `transport_request'
Jan 09 11:30:03 from /usr/local/lib/ruby/3.3.0/net/http.rb:2306:in `request'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:24:in `block in request_with_rorvswild'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/agent.rb:65:in `measure_section'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:23:in `request_with_rorvswild'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/plugin/net_http.rb:17:in `request'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/client.rb:67:in `transmit'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/client.rb:40:in `post'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:71:in `flush'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:62:in `flush_indefinetely'
Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:76:in `block in start_thread'

At 11:25 a request comes in. Five minutes (!) later, without further requests, the EOFError is raised. Do you really think this is caused by the server metrics running each minute? As I understand the source, the NetHttp plugin handles requests, not server metrics. This is the line which raises the EOFError: https://github.com/BaseSecrete/rorvswild/blob/1.6.3/lib/rorvswild/plugin/net_http.rb#L24

ledermann commented on Jan 9, 2024

Jan 09 11:30:03 from /usr/local/bundle/gems/rorvswild-1.6.3/lib/rorvswild/queue.rb:71:in `flush'

https://github.com/BaseSecrete/rorvswild/blob/master/lib/rorvswild/queue.rb#L71

Yes, it's the server metrics :-)

You can disable the NetHttp plugin, and you should get the same error and the same stack trace minus the 3 rorvswild/plugin/net_http.rb lines. I have to monkey patch Net::HTTP, so when RorVsWild sends its own data it goes through the plugin, but it does not measure anything because we are not in the context of a request or a job (roughly sketched after the config example below).

# config/rorvswild.yml
production:
  api_key: xxx
  ignore_plugins:
    - NetHttp
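
For context, the interception described above works roughly like this (a prepend-based sketch with a no-op measuring helper, not the gem's actual alias-based code):

require "net/http"

module NetHttpPluginSketch
  # Wraps every Net::HTTP#request call, including the ones RorVsWild itself
  # makes when posting data; outside a request or job there is no current
  # section, so nothing gets measured.
  def request(req, body = nil, &block)
    measure_http_section("#{req.method} #{address}#{req.path}") { super }
  end

  private

  # No-op stand-in for RorVsWild::Agent#measure_section
  def measure_http_section(_name)
    yield
  end
end

Net::HTTP.prepend(NetHttpPluginSketch)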

I should also add an option to disable server metrics.

I increased the open timeout like you did, and I got the EOFError 3 times on January 9th. Compared to the volume it's an extremely low percentage, but it's still embarrassing to miss data. It's the same stack trace as yours, except that it comes from lib/rorvswild/queue.rb:70 instead of lib/rorvswild/queue.rb:71. That means it's not related specifically to either server metrics or request data. It would be interesting if you could do the same so we can compare timestamps:

$ grep EOFError file.log
Jan  9 04:05:25 [...]
Jan  9 04:36:58 [...]
Jan  9 05:54:25 [...]

alexisbernard commented on Jan 10, 2024

Thanks for the clarification, I now understand that the EOFError comes from sending the server metrics.

Yes, disabling server metrics could be useful. In my case, I run multiple dockerized applications on the same host, so the server metrics are all the same.

Here is a list of timestamps (in UTC) where an EOFError was triggered yesterday in my applications:

2024-01-09 04:57:05	App 1
2024-01-09 05:38:02	App 1
2024-01-09 06:45:02	App 1
2024-01-09 10:30:03	App 2
2024-01-09 16:15:03	App 2
2024-01-09 19:41:01	App 3

I'm currently using RorVsWild in 11 applications (with timeout=10), all on the same Docker host. Yesterday, the EOFError was triggered in 3 of them.

ledermann commented on Jan 10, 2024

Thanks for the grep. So it does not seem to happen at the same time. I'm currently running out of ideas for this issue. I will come back if I have something new to try.

I'm currently using RorVsWild in 11 applications (with timeout=10), all on the same Docker host. Yesterday, the EOFError was triggered in 3 of them.

Indeed, in your case, the same server metrics are sent at least 11 times.

alexisbernard commented on Jan 11, 2024