create_socket (session.rb:815) hanging in multithreaded environment
Ruby 2.1.3, httpclient 2.5.3.3, Ubuntu 14.04
I'm running 30 threads in a Sidekiq instance, and consistently, after running for 5-10 hours, every thread hangs on exactly the same line in HTTPClient.
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `initialize'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `new'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `create_socket'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:763:in `block in connect'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `call'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:127:in `timeout'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:34:in `block in do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/cross_app_tracing.rb:48:in `tl_trace_http_request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:33:in `do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1034:in `follow_redirect'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:821:in `request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'
Here's the C-level stack trace:
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f24fb62a5dd in _L_lock_820 () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f24fb62a395 in __check_pf (seen_ipv4=0x7f24fb8d5134 <lock>,
seen_ipv4@entry=0x7f246cb4d010, seen_ipv6=seen_ipv6@entry=0x7f246cb4d020,
in6ai=in6ai@entry=0x7f246cb4d050, in6ailen=0xffffffffffffffff, in6ailen@entry=0x7f246cb4d060)
at ../sysdeps/unix/sysv/linux/check_pf.c:319
#3 0x00007f24fb5e2159 in __GI_getaddrinfo (name=0x7f246cb4d360 "domain.livedoor.com",
service=0x7f246cb4d340 "80", hints=<optimized out>, pai=0x7f246cb4d250)
at ../sysdeps/posix/getaddrinfo.c:2458
#4 0x00007f24f9f06518 in nogvl_getaddrinfo (arg=<optimized out>) at raddrinfo.c:162
#5 0x00007f24fc5f3982 in call_without_gvl (func=func@entry=0x7f24f9f06500 <nogvl_getaddrinfo>,
data1=data1@entry=0x7f246cb4d260, ubf=<optimized out>, ubf@entry=0xffffffffffffffff,
data2=<optimized out>, data2@entry=0x0, fail_if_interrupted=0) at thread.c:1266
#6 0x00007f24fc5f6a65 in rb_thread_call_without_gvl (
func=func@entry=0x7f24f9f06500 <nogvl_getaddrinfo>, data1=data1@entry=0x7f246cb4d260,
ubf=ubf@entry=0xffffffffffffffff, data2=data2@entry=0x0) at thread.c:1376
#7 0x00007f24f9f0750a in rb_getaddrinfo (node=node@entry=0x7f246cb4d360 "domain.livedoor.com",
service=service@entry=0x7f246cb4d340 "80", hints=hints@entry=0x7f246cb4d7b0,
res=res@entry=0x7f246cb4d300) at raddrinfo.c:280
#8 0x00007f24f9f08289 in rsock_getaddrinfo (host=<optimized out>, port=<optimized out>,
hints=hints@entry=0x7f246cb4d7b0, socktype_hack=1) at raddrinfo.c:481
#9 0x00007f24f9f08d5c in rsock_addrinfo (host=<optimized out>, port=<optimized out>,
socktype=socktype@entry=1, flags=flags@entry=0) at raddrinfo.c:501
#10 0x00007f24f9efedaf in init_inetsock_internal (arg=arg@entry=0x7f246cb4d9d0) at ipsocket.c:49
#11 0x00007f24fc483fe0 in rb_ensure (b_proc=b_proc@entry=0x7f24f9efed70 <init_inetsock_internal>,
data1=data1@entry=139794419341776, e_proc=e_proc@entry=0x7f24f9efed20 <inetsock_cleanup>,
data2=data2@entry=139794419341776) at eval.c:853
#12 0x00007f24f9eff144 in rsock_init_inetsock (sock=sock@entry=139794885577960,
remote_host=<optimized out>, remote_serv=<optimized out>, local_host=<optimized out>,
local_serv=<optimized out>, type=type@entry=0) at ipsocket.c:158
#13 0x00007f24f9eff48d in tcp_init (argc=<optimized out>, argv=<optimized out>, sock=139794885577960)
at tcpsocket.c:30
#14 0x00007f24fc5d95ff in vm_call0_cfunc_with_frame (argv=0x7f2490250910, ci=0x7f246cb4da70,
th=0x7f24b0041620) at vm_eval.c:118
#15 vm_call0_cfunc (argv=0x7f2490250910, ci=0x7f246cb4da70, th=0x7f24b0041620) at vm_eval.c:135
#16 vm_call0_body (th=th@entry=0x7f24b0041620, ci=ci@entry=0x7f246cb4db50, argv=0x7f2490250910)
at vm_eval.c:175
#17 0x00007f24fc5da27b in vm_call0 (defined_class=<optimized out>, me=<optimized out>,
argv=<optimized out>, argc=2, id=1504, recv=139794885577960, th=0x7f24b0041620) at vm_eval.c:50
#18 rb_call0 (recv=139794885577960, mid=1504, argc=2, argv=<optimized out>, scope=CALL_FCALL,
self=139796859028520) at vm_eval.c:325
#19 0x00007f24fc4dfae1 in rb_class_new_instance (argc=2, argv=0x7f2490250910, klass=<optimized out>) at object.c:1861
#20 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026edc0,
th=0x7f24b0041620) at vm_insnhelper.c:1489
#21 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026edc0, ci=<optimized out>)
at vm_insnhelper.c:1579
#22 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
at insns.def:1028
#23 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#24 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620,
block=block@entry=0x7f2491026b70, self=self@entry=139796863525680, argc=argc@entry=1,
argv=argv@entry=0x7f2490250870, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0,
defined_class=defined_class@entry=139796863518600) at vm.c:782
#25 0x00007f24fc5dbacb in vm_invoke_proc (th=0x7f24b0041620, proc=proc@entry=0x7f2491026b70,
self=139796863525680, defined_class=139796863518600, argc=argc@entry=1,
argv=argv@entry=0x7f2490250870, blockptr=0x0) at vm.c:846
#26 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=<optimized out>, proc=proc@entry=0x7f2491026b70,
argc=argc@entry=1, argv=argv@entry=0x7f2490250870, blockptr=<optimized out>) at vm.c:865
#27 0x00007f24fc487140 in proc_call (argc=1, argv=0x7f2490250870, procval=139794885578480)
at proc.c:734
#28 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026ef00,
th=0x7f24b0041620) at vm_insnhelper.c:1489
#29 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026ef00, ci=<optimized out>)
at vm_insnhelper.c:1579
#30 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
at insns.def:1028
#31 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#32 0x00007f24fc5db2ea in invoke_block_from_c (th=0x7f24b0041620, block=<optimized out>,
self=<optimized out>, argc=argc@entry=0, argv=argv@entry=0x0, blockptr=blockptr@entry=0x0,
cref=cref@entry=0x0, defined_class=139796924859280) at vm.c:782
#33 0x00007f24fc5dbf54 in vm_yield (argc=0, argv=0x0, th=<optimized out>) at vm.c:821
#34 rb_yield_0 (argv=0x0, argc=0) at vm_eval.c:938
#35 loop_i () at vm_eval.c:1008
#36 0x00007f24fc483b0e in rb_rescue2 (b_proc=b_proc@entry=0x7f24fc5dbf00 <loop_i>,
data1=data1@entry=0, r_proc=r_proc@entry=0x0, data2=data2@entry=0) at eval.c:754
#37 0x00007f24fc5cde3e in rb_f_loop (self=139794889136520) at vm_eval.c:1042
#38 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026f540,
th=0x7f24b0041620) at vm_insnhelper.c:1489
#39 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026f540, ci=<optimized out>)
at vm_insnhelper.c:1579
#40 0x00007f24fc5d4ad5 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
at insns.def:999
#41 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#42 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620,
block=block@entry=0x7f249227be20, self=self@entry=139796925596280, argc=argc@entry=0,
argv=argv@entry=0x7f24902503a8, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0,
defined_class=defined_class@entry=139796870304080) at vm.c:782
#43 0x00007f24fc5dbacb in vm_invoke_proc (th=0x7f24b0041620, proc=proc@entry=0x7f249227be20,
self=139796925596280, defined_class=139796870304080, argc=argc@entry=0,
argv=argv@entry=0x7f24902503a8, blockptr=0x0) at vm.c:846
#44 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=<optimized out>, proc=proc@entry=0x7f249227be20,
argc=argc@entry=0, argv=argv@entry=0x7f24902503a8, blockptr=<optimized out>) at vm.c:865
#45 0x00007f24fc487140 in proc_call (argc=0, argv=0x7f24902503a8, procval=139795083105960)
at proc.c:734
#46 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026f6d0,
th=0x7f24b0041620) at vm_insnhelper.c:1489
#47 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026f6d0, ci=<optimized out>)
at vm_insnhelper.c:1579
#48 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
at insns.def:1028
#49 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#50 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620,
block=block@entry=0x7f24905ef620, self=self@entry=139796903630280, argc=argc@entry=0,
argv=argv@entry=0x7f2490250188, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0,
defined_class=defined_class@entry=139796878690480) at vm.c:782
#51 0x00007f24fc5dbacb in vm_invoke_proc (th=0x7f24b0041620, proc=proc@entry=0x7f24905ef620,
self=139796903630280, defined_class=139796878690480, argc=argc@entry=0,
argv=argv@entry=0x7f2490250188, blockptr=0x0) at vm.c:846
#52 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=<optimized out>, proc=proc@entry=0x7f24905ef620,
argc=argc@entry=0, argv=argv@entry=0x7f2490250188, blockptr=<optimized out>) at vm.c:865
#53 0x00007f24fc487140 in proc_call (argc=0, argv=0x7f2490250188, procval=139795083105240)
at proc.c:734
#54 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026fbd0,
th=0x7f24b0041620) at vm_insnhelper.c:1489
#55 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026fbd0, ci=<optimized out>)
at vm_insnhelper.c:1579
#56 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
at insns.def:1028
#57 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#58 0x00007f24fc5d9514 in vm_call0_body (th=th@entry=0x7f24b0041620, ci=ci@entry=0x7f246cb504c0,
argv=0x7f2490250098) at vm_eval.c:171
#59 0x00007f24fc5da27b in vm_call0 (defined_class=<optimized out>, me=<optimized out>,
argv=<optimized out>, argc=1, id=61160, recv=139796925596280, th=0x7f24b0041620) at vm_eval.c:50
#60 rb_call0 (recv=139796925596280, mid=61160, argc=1, argv=<optimized out>, scope=CALL_PUBLIC,
self=52) at vm_eval.c:325
#61 0x00007f24fc5de4fe in send_internal (argc=1, argv=0x7f2490250098, recv=139796925596280,
scope=CALL_PUBLIC) at vm_eval.c:882
#62 0x00007f24fc5ddbb1 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026fd60,
th=0x7f24b0041620) at vm_insnhelper.c:1489
#63 vm_call_cfunc (ci=<optimized out>, reg_cfp=0x7f249026fd60, th=0x7f24b0041620)
at vm_insnhelper.c:1579
#64 vm_call_method (th=0x7f24b0041620, cfp=0x7f249026fd60, ci=<optimized out>) at vm_insnhelper.c:1767
#65 0x00007f24fc5d4ad5 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
at insns.def:999
#66 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#67 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620,
block=block@entry=0x7f2490193270, self=self@entry=139794887793720, argc=argc@entry=1,
argv=argv@entry=0x7f246cb50ef0, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0,
defined_class=defined_class@entry=139796876889560) at vm.c:782
#68 0x00007f24fc5dbacb in vm_invoke_proc (th=th@entry=0x7f24b0041620, proc=proc@entry=0x7f2490193270,
self=139794887793720, defined_class=139796876889560, argc=argc@entry=1,
argv=argv@entry=0x7f246cb50ef0, blockptr=blockptr@entry=0x0) at vm.c:846
#69 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=th@entry=0x7f24b0041620,
proc=proc@entry=0x7f2490193270, argc=argc@entry=1, argv=argv@entry=0x7f246cb50ef0,
blockptr=blockptr@entry=0x0) at vm.c:865
#70 0x00007f24fc5fd278 in rb_fiber_start () at cont.c:1293
#71 0x00007f24fb55a7a0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#72 0x0000000000000000 in ?? ()
Any ideas? Do I need to manually close connections after I use them, or anything like that?
It looks like there might be a deadlock, given that the threads are all stuck at "lowlevellock.S", but this is fairly far beyond my skill level.
No idea so far.
HTTPClient seems to stall at TCPSocket.new(target), and the underlying CRuby seems to stall in getaddrinfo(3) while resolving the specified target, according to the stack traces. Can you check lsof -p <CRuby process> and netstat -an to see how many sockets the Ruby process is holding after it stalls?
If the number of sockets is the problem, manually closing connections could mitigate it, but the name-resolution (getaddrinfo) issue may still appear after a longer run.
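If it's easier than running external tools, a similar count can be taken from inside the process (for example from a console or a debug hook). This is just an illustrative sketch on MRI, not anything httpclient provides:

require 'socket'

# Count live TCPSocket objects held by this process; debugging aid only.
sockets = ObjectSpace.each_object(TCPSocket).to_a
puts "TCPSockets: #{sockets.size} (#{sockets.count(&:closed?)} closed)"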
Got another hang after running for 18 hours, so I got a chance to grab some more data, and I think we're getting closer to finding something useful.
In netstat I grepped for port 80 to eliminate all the local connections:
netstat -an | grep '80'
tcp 1 0 104.131.147.112:52669 199.101.163.129:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:32888 208.80.154.224:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:32855 208.80.154.224:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:48221 199.101.163.129:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:34669 165.254.58.82:80 CLOSE_WAIT
tcp 0 0 104.131.147.112:40854 64.250.181.184:80 ESTABLISHED
tcp 1 0 104.131.147.112:54493 184.168.221.5:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:37324 54.85.12.13:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:54133 74.125.239.102:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:38529 199.101.163.129:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:53270 165.254.42.104:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:50308 192.241.241.29:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:47834 64.250.181.133:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:55614 54.225.96.182:80 CLOSE_WAIT
tcp 0 0 104.131.147.112:42699 64.250.181.184:80 ESTABLISHED
tcp 1 0 104.131.147.112:49654 64.250.181.133:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:60678 107.20.199.207:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:56585 63.245.217.194:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:32876 208.80.154.224:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:42249 64.94.113.154:80 CLOSE_WAIT
What stands out to my untrained eyes is that the two outliers (the ones in the ESTABLISHED state) have the same foreign address but are being connected to from two different local ports. Maybe that's totally normal though...
I actually had 3 different processes running this time and they all hung, so I ran lsof -p for each process. A truncated version of one of them looks something like this:
ruby 9267 deploy 22u IPv4 54402497 0t0 TCP 104.131.147.112:34669->165.254.58.82:http (CLOSE_WAIT)
ruby 9267 deploy 28u unix 0xffff880210530e00 0t0 49411786 socket
ruby 9267 deploy 29u unix 0xffff8800c4d31f80 0t0 49841213 socket
ruby 9267 deploy 33u sock 0,7 0t0 54434654 can't identify protocol
ruby 9267 deploy 34u unix 0xffff880212303800 0t0 49204676 socket
ruby 9267 deploy 35u unix 0xffff880212301880 0t0 49204682 socket
ruby 9267 deploy 38u unix 0xffff880192bac380 0t0 49413893 socket
ruby 9267 deploy 40u unix 0xffff8801bb06f480 0t0 49226855 socket
ruby 9267 deploy 42u unix 0xffff88011698b100 0t0 49414524 socket
ruby 9267 deploy 44u unix 0xffff880017bd1180 0t0 49242474 socket
ruby 9267 deploy 45u unix 0xffff88021157c700 0t0 49209756 socket
ruby 9267 deploy 48u sock 0,7 0t0 54402490 can't identify protocol
ruby 9267 deploy 51u IPv4 49810697 0t0 TCP localhost:43769->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 52u unix 0xffff880212301c00 0t0 49210337 socket
ruby 9267 deploy 53u IPv4 49264160 0t0 TCP localhost:57171->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 60u IPv4 50429441 0t0 TCP 104.131.147.112:40854->64.250.181.184:http (ESTABLISHED)
ruby 9267 deploy 63u unix 0xffff880212e48a80 0t0 49796668 socket
ruby 9267 deploy 65u IPv4 50431151 0t0 TCP 104.131.147.112:47834->64.250.181.133:http (CLOSE_WAIT)
ruby 9267 deploy 85u IPv4 49365039 0t0 TCP localhost:35560->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 90u IPv4 49480239 0t0 TCP localhost:51574->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 98u unix 0xffff880212e48380 0t0 49838472 socket
ruby 9267 deploy 107u unix 0xffff880212e4b800 0t0 49832752 socket
ruby 9267 deploy 111u unix 0xffff88011698a680 0t0 49412545 socket
ruby 9267 deploy 113u unix 0xffff880210532a00 0t0 49792397 socket
ruby 9267 deploy 117u unix 0xffff88011698aa00 0t0 49412549 socket
ruby 9267 deploy 121u unix 0xffff880210803800 0t0 49834471 socket
ruby 9267 deploy 124u IPv4 49417333 0t0 TCP localhost:55287->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 129u unix 0xffff880210530380 0t0 49792415 socket
ruby 9267 deploy 131u IPv4 49796786 0t0 TCP localhost:39364->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 140u IPv4 54402470 0t0 TCP 104.131.147.112:42249->Lxlb2.peaksystems.com:http (CLOSE_WAIT)
ruby 9267 deploy 141u unix 0xffff880210531f80 0t0 49284607 socket
ruby 9267 deploy 142u IPv4 54402506 0t0 TCP 104.131.147.112:53270->165.254.42.104:http (CLOSE_WAIT)
ruby 9267 deploy 150u sock 0,7 0t0 54429336 can't identify protocol
ruby 9267 deploy 155u IPv4 49823256 0t0 TCP localhost:47545->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 157u IPv4 49823258 0t0 TCP localhost:47546->localhost:6379 (ESTABLISHED)
ruby 9267 deploy 182u unix 0xffff880212e4a680 0t0 49838659 socket
ruby 9267 deploy 238u netlink 0t0 54422488 ROUTE
ruby 9267 deploy 245u IPv4 54425223 0t0 TCP 104.131.147.112:38529->199.101.163.129:http (CLOSE_WAIT)
ruby 9267 deploy 300u IPv4 54428858 0t0 TCP 104.131.147.112:56585->mozcom.zlb.phx.mozilla.net:http (CLOSE_WAIT)
What stands out to me here is that they all have at least one of these lines:
ruby 9267 deploy 150u sock 0,7 0t0 54429336 can't identify protocol
It seems that it could be related to this issue https://github.com/cotag/spider-gazelle/issues/20, where sockets were being accidentally duplicated when passed to another thread.
@kosh-jelly Thanks. I think the root cause is the many CLOSE_WAIT sockets. httpclient should reuse connections and close them properly, but it seems that recycling is not working for you. Can I see the code that actually uses HTTPClient? It would help if you could show me several more stack-trace lines below this line:
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'
Absolutely.
Here's the full stack trace. Looks like all threads are stuck at the same point in my code, on the fetch_robots call:
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `new'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `create_socket'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:763:in `block in connect'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `call'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:127:in `timeout'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:34:in `block in do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/cross_app_tracing.rb:48:in `tl_trace_http_request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:33:in `do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1034:in `follow_redirect'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:821:in `request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:108:in `fetch_robots'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:145:in `get_robots'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:78:in `block in crawl_domain'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:73:in `loop'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:73:in `crawl_domain'
/home/deploy/domain_freek/releases/167/app/workers/crawl_domain.rb:20:in `perform'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:75:in `execute_job'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:52:in `block (2 levels) in process'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:33:in `block in call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:365:in `perform_action_with_newrelic_trace'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:29:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/sidetiq-0.6.3/lib/sidetiq/middleware/history.rb:8:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/active_record.rb:6:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/retry_jobs.rb:74:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:11:in `block in call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/logging.rb:22:in `with_context'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:7:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:51:in `block in process'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:98:in `stats'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:50:in `process'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `dispatch'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:122:in `dispatch'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71:in `block in task'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357:in `block in task'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57:in `block in initialize'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15:in `block in create'
fetch_robots should be the first HTTP connection a thread makes, and it should be making it with a new HTTPClient instance. Here are the relevant parts of my code:
def fetch_robots
  response = agent.get("http://#{domain}/robots.txt", follow_redirect: true) # line 108
end

def agent
  @agent ||= new_agent
end

def new_agent
  agent = HTTPClient.new
  agent.receive_timeout = APP_CONFIG['crawl_page_timeout'] # set to 10
  agent
end
@kosh-jelly I think I've understood the situation. Your code uses a single HTTPClient instance, so connections should be properly recycled, but since you access many different domains to fetch robots.txt (practically one connection per domain), most connections are never recycled. With the current HTTPClient implementation, keep-alive connections are closed only when:
- they are manually closed by HTTPClient#reset or HTTPClient#reset_all
- a connection is recycled but the server has already closed it
- they are garbage-collected
Would you please try the following two workarounds? (A combined sketch follows the list.)
- avoid keep-alive connections by setting agent.protocol_version = 'HTTP/1.0'; you should not need keep-alive connections
- close the connection manually by calling agent.reset("http://#{domain}/robots.txt") after calling agent.get
This HTTPClient behavior has rarely caused trouble, but when it does, it's hard to investigate. I'll try to introduce a connection-reaper thread to HTTPClient. Sorry for the inconvenience this time.
Thanks so much for your help with this nahi! I'm really impressed with your work on HTTPClient.
I'll add those two lines of code and post an update.
Typically, after fetching robots.txt, I go on to fetch other pages on the same domain with the same HTTPClient instance. Should I call reset after each page, or is this specific to the first connection to a domain (which is always going to be robots.txt for me)?
@kosh-jelly Either workaround should work on its own. I'm fairly sure about the cause of your trouble, so you could try just "reset".
Hmm, but then you also need keep-alive connections. That's hard to handle well until I introduce the connection-reaper thread, but the following is the best approach at the moment.
Create an HTTPClient instance per thread and reset all connections before moving on to the next domain (assuming you assign one thread per domain):
def fetch_robots
  agent.reset_all
  response = agent.get("http://#{domain}/robots.txt", follow_redirect: true) # line 108
end

def agent
  Thread.current[:my_agent] ||= new_agent
end
If that's not your situation (e.g. you assign threads to domains from a thread pool), please call agent.reset_all periodically as a workaround; a rough sketch follows. You'd need to synchronize the threads as well. I'm not familiar with Sidekiq, so I might be wrong...
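A rough sketch of the periodic variant, assuming a shared registry of agents guarded by a mutex (AGENTS, the registration step, and the 60-second interval are all illustrative, not part of the code above):

AGENTS = []
AGENTS_LOCK = Mutex.new

# Register each per-thread agent when it is created, e.g.:
#   AGENTS_LOCK.synchronize { AGENTS << agent }

Thread.new do
  loop do
    sleep 60
    AGENTS_LOCK.synchronize { AGENTS.each(&:reset_all) }
  end
end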
Ahh, ok. Yeah, I have a separate thread for each domain, and each thread should already have its own instance of HTTPClient.
When I move on to a new domain, the old HTTPClient should be GC'ed, and I should be creating a new HTTPClient instance to use with the new domain. It's certainly possible that's not working as expected, though.
reset_all before fetching robots sounds like a pretty good workaround for now.
Looks like the reset_all workaround did not solve the problem. Threads hung again on the same line after about 18 hours of running.
The current codebase looks like this:
class LiveDomain
  def crawl_domain
    loop do
      next_page = queue.pop
      break unless next_page && get_robots # line 75
      if get_robots.allowed?(next_page.url) && next_page.should_crawl?
        Kernel.sleep CRAWL_DELAY
        next_page.scrape_linked_pages
      end
    end
  end

  def get_robots
    @robots ||= fetch_robots
  end

  def fetch_robots
    agent.reset_all
    response = agent.get("http://#{domain}/robots.txt", follow_redirect: true) # line 106, where it's hanging
    return RobotsTxt.new response.body
  end

  def agent
    @agent ||= new_agent
  end

  def new_agent
    agent = HTTPClient.new
    agent.receive_timeout = APP_CONFIG['crawl_page_timeout']
    agent
  end
end
Stack trace from the hanging threads:
2014-12-01T08:03:07.925Z 12629 TID-osi57w9io WARN: /home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `initialize'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `new'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `create_socket'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:763:in `block in connect'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `call'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:127:in `timeout'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:34:in `block in do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/cross_app_tracing.rb:48:in `tl_trace_http_request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:33:in `do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1034:in `follow_redirect'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:821:in `request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:106:in `fetch_robots'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:143:in `get_robots'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:75:in `block in crawl_domain'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:70:in `loop'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:70:in `crawl_domain'
/home/deploy/domain_freek/releases/173/app/workers/crawl_domain.rb:20:in `perform'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:75:in `execute_job'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:52:in `block (2 levels) in process'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:33:in `block in call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:365:in `perform_action_with_newrelic_trace'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:29:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/sidetiq-0.6.3/lib/sidetiq/middleware/history.rb:8:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/active_record.rb:6:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/retry_jobs.rb:74:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:11:in `block in call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/logging.rb:22:in `with_context'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:7:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:51:in `block in process'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:98:in `stats'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:50:in `process'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `dispatch'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:122:in `dispatch'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71:in `block in task'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357:in `block in task'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57:in `block in initialize'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15:in `block in create'
And again, it looks like sockets are stuck in CLOSE_WAIT, despite the reset_all call right before fetching robots.txt:
netstat -an | grep ':80'
tcp 1 0 104.131.147.112:33863 192.80.13.224:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:47845 173.252.120.6:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:59259 208.80.154.224:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:58951 157.166.226.25:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:50569 143.252.148.217:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:52243 184.51.0.73:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:33540 184.168.221.20:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:34338 184.72.251.173:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:58374 143.252.148.211:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:45148 91.185.212.99:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:60262 217.160.54.59:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:33366 198.35.26.96:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:42752 184.168.221.5:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:51503 198.35.26.96:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:47741 94.228.135.84:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:59257 208.80.154.224:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:52808 216.104.20.189:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:44762 74.125.239.99:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:51738 54.209.11.100:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:58400 108.60.15.56:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:49161 208.80.154.224:80 CLOSE_WAIT
tcp 1 0 104.131.147.112:49163 208.80.154.224:80 CLOSE_WAIT
tcp 0 0 104.131.147.112:44389 216.35.221.76:80 ESTABLISHED
tcp 1 0 104.131.147.112:52784 193.191.138.53:80 CLOSE_WAIT
To clarify, each thread has its own instance of LiveDomain, which means that each thread will get its own instance of HTTPClient.
I'll go ahead and try the other workaround of avoiding keep-alive connections by setting the protocol version to HTTP/1.0.
Hmm. So reset_all is not working for you... CLOSE_WAIT means the client hasn't called TCPSocket#close, but reset_all blocks until it has called TCPSocket#close on every cached session. No idea why reset_all doesn't work... Is it not being called, or not called often enough?
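One way to answer that question is to log each call, e.g. with an illustrative monkey-patch like the following (a debugging aid only, not a fix):

class HTTPClient
  alias_method :orig_reset_all, :reset_all
  # Log every reset_all call with the calling thread, then delegate.
  def reset_all
    warn "[thread #{Thread.current.object_id}] reset_all at #{Time.now}"
    orig_reset_all
  end
end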
@nahi ... I think we're running into a similar problem (Ruby 2.2.1, Ubuntu 14.04.1, AWS).
$ uname -a
Linux background-job-master 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Only instead of one byte in the pending receive queue, we typically have 32 bytes pending for the CLOSE_WAIT connections. We see similar output in both netstat and lsof. We also noticed child threads getting stuck in a "D" state, which then triggers our load alerts (1.0 load times the number of processes in "D" state).
When inspecting with strace, they appear to be waiting for I/O:
$ sudo strace -s 1000 -tt -p 32581 2>&1
Process 32581 attached
16:27:39.965887 restart_syscall(<... resuming interrupted call ...>
Killing the parent process (Sidekiq in our case) usually clears the TCP connections and stalled threads. Otherwise they will sit there indefinitely (because of the odd state the TCP sockets are in).
When you see this:
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f24fb62a5dd in _L_lock_820 () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f24fb62a395 in __check_pf (seen_ipv4=0x7f24fb8d5134, seen_ipv4@entry=0x7f246cb4d010, seen_ipv6=seen_ipv6@entry=0x7f246cb4d020, in6ai=in6ai@entry=0x7f246cb4d050, in6ailen=0xffffffffffffffff, in6ailen@entry=0x7f246cb4d060) at ../sysdeps/unix/sysv/linux/check_pf.c:319
Is there another thread that's actually waiting in recvmsg() like this?
#0 0x0000003841302edd in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1 0x000000384131f8ee in __check_pf (pid=-52599, fd=20) at ../sysdeps/unix/sysv/linux/check_pf.c:166
#2 0x000000384131f8ee in __check_pf (seen_ipv4=seen_ipv4@entry=0x7f5b137fdc32, seen_ipv6=seen_ipv6@entry=0x7f5b137fdc33, in6ai=in6ai@entry=0x7f5b137fdc40, in6ailen=in6ailen@entry=0x7f5b137fdc48) at ../sysdeps/unix/sysv/linux/check_pf.c:324
That's the interesting one; all the others are waiting for it to drop the lock, which it never does, presumably because of a kernel bug. cf. https://bugzilla.redhat.com/show_bug.cgi?id=1209433
I didn't start seeing this until I was running the 4.0 kernel. Going back to 3.19 seemed to fix it, although it's hard to be 100% sure, since the problem was far from reliably reproducible.
Probably fixed by http://patchwork.ozlabs.org/patch/473041/. I think the bug exists in kernels back to 3.18, but it got a lot easier to trigger in 4.0 due to other locking changes.
I'm having a similar issue that might be related; I opened #256 for it. Examining backtraces, I discovered that my workers get stuck at line 915 of session.rb, which is:
initial_line = @socket.gets("\n")
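For anyone hitting a hang on that read, it may be worth making sure all three client timeouts are set. These accessors do exist on HTTPClient, though whether they bound this particular gets is not certain; the values below are illustrative:

require 'httpclient'

agent = HTTPClient.new
agent.connect_timeout = 10 # seconds to establish a TCP connection
agent.send_timeout    = 10 # seconds to write the request
agent.receive_timeout = 10 # seconds to read the response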