
create_socket (session.rb:815) hanging in multithreaded environment

Open kosh-jelly opened this issue 11 years ago • 14 comments

Ruby 2.1.3, httpclient 2.5.3.3, Ubuntu 14.04

Running 30 threads in a Sidekiq instance, and consistently, after 5-10 hours of running, every thread hangs on exactly the same line in HTTPClient.

/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `initialize'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `new'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `create_socket'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:763:in `block in connect'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `call'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:127:in `timeout'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:34:in `block in do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/cross_app_tracing.rb:48:in `tl_trace_http_request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:33:in `do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1034:in `follow_redirect'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:821:in `request'
/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'

Here's the C-level stack trace:

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f24fb62a5dd in _L_lock_820 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f24fb62a395 in __check_pf (seen_ipv4=0x7f24fb8d5134 <lock>, 
    seen_ipv4@entry=0x7f246cb4d010, seen_ipv6=seen_ipv6@entry=0x7f246cb4d020, 
    in6ai=in6ai@entry=0x7f246cb4d050, in6ailen=0xffffffffffffffff, in6ailen@entry=0x7f246cb4d060)
    at ../sysdeps/unix/sysv/linux/check_pf.c:319
#3  0x00007f24fb5e2159 in __GI_getaddrinfo (name=0x7f246cb4d360 "domain.livedoor.com", 
    service=0x7f246cb4d340 "80", hints=<optimized out>, pai=0x7f246cb4d250)
    at ../sysdeps/posix/getaddrinfo.c:2458
#4  0x00007f24f9f06518 in nogvl_getaddrinfo (arg=<optimized out>) at raddrinfo.c:162
#5  0x00007f24fc5f3982 in call_without_gvl (func=func@entry=0x7f24f9f06500 <nogvl_getaddrinfo>, 
    data1=data1@entry=0x7f246cb4d260, ubf=<optimized out>, ubf@entry=0xffffffffffffffff, 
    data2=<optimized out>, data2@entry=0x0, fail_if_interrupted=0) at thread.c:1266
#6  0x00007f24fc5f6a65 in rb_thread_call_without_gvl (
    func=func@entry=0x7f24f9f06500 <nogvl_getaddrinfo>, data1=data1@entry=0x7f246cb4d260, 
    ubf=ubf@entry=0xffffffffffffffff, data2=data2@entry=0x0) at thread.c:1376
#7  0x00007f24f9f0750a in rb_getaddrinfo (node=node@entry=0x7f246cb4d360 "domain.livedoor.com", 
    service=service@entry=0x7f246cb4d340 "80", hints=hints@entry=0x7f246cb4d7b0, 
    res=res@entry=0x7f246cb4d300) at raddrinfo.c:280
#8  0x00007f24f9f08289 in rsock_getaddrinfo (host=<optimized out>, port=<optimized out>, 
    hints=hints@entry=0x7f246cb4d7b0, socktype_hack=1) at raddrinfo.c:481
#9  0x00007f24f9f08d5c in rsock_addrinfo (host=<optimized out>, port=<optimized out>, 
    socktype=socktype@entry=1, flags=flags@entry=0) at raddrinfo.c:501
#10 0x00007f24f9efedaf in init_inetsock_internal (arg=arg@entry=0x7f246cb4d9d0) at ipsocket.c:49
#11 0x00007f24fc483fe0 in rb_ensure (b_proc=b_proc@entry=0x7f24f9efed70 <init_inetsock_internal>, 
    data1=data1@entry=139794419341776, e_proc=e_proc@entry=0x7f24f9efed20 <inetsock_cleanup>, 
    data2=data2@entry=139794419341776) at eval.c:853
#12 0x00007f24f9eff144 in rsock_init_inetsock (sock=sock@entry=139794885577960, 
    remote_host=<optimized out>, remote_serv=<optimized out>, local_host=<optimized out>, 
    local_serv=<optimized out>, type=type@entry=0) at ipsocket.c:158
#13 0x00007f24f9eff48d in tcp_init (argc=<optimized out>, argv=<optimized out>, sock=139794885577960)
    at tcpsocket.c:30
#14 0x00007f24fc5d95ff in vm_call0_cfunc_with_frame (argv=0x7f2490250910, ci=0x7f246cb4da70, 
    th=0x7f24b0041620) at vm_eval.c:118
#15 vm_call0_cfunc (argv=0x7f2490250910, ci=0x7f246cb4da70, th=0x7f24b0041620) at vm_eval.c:135
#16 vm_call0_body (th=th@entry=0x7f24b0041620, ci=ci@entry=0x7f246cb4db50, argv=0x7f2490250910)
    at vm_eval.c:175
#17 0x00007f24fc5da27b in vm_call0 (defined_class=<optimized out>, me=<optimized out>, 
    argv=<optimized out>, argc=2, id=1504, recv=139794885577960, th=0x7f24b0041620) at vm_eval.c:50
#18 rb_call0 (recv=139794885577960, mid=1504, argc=2, argv=<optimized out>, scope=CALL_FCALL, 
    self=139796859028520) at vm_eval.c:325
#19 0x00007f24fc4dfae1 in rb_class_new_instance (argc=2, argv=0x7f2490250910, klass=<optimized out>) at object.c:1861
#20 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026edc0, 
    th=0x7f24b0041620) at vm_insnhelper.c:1489
#21 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026edc0, ci=<optimized out>)
    at vm_insnhelper.c:1579
#22 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
    at insns.def:1028
#23 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#24 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620, 
    block=block@entry=0x7f2491026b70, self=self@entry=139796863525680, argc=argc@entry=1, 
    argv=argv@entry=0x7f2490250870, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0, 
    defined_class=defined_class@entry=139796863518600) at vm.c:782
#25 0x00007f24fc5dbacb in vm_invoke_proc (th=0x7f24b0041620, proc=proc@entry=0x7f2491026b70, 
    self=139796863525680, defined_class=139796863518600, argc=argc@entry=1, 
    argv=argv@entry=0x7f2490250870, blockptr=0x0) at vm.c:846
#26 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=<optimized out>, proc=proc@entry=0x7f2491026b70, 
    argc=argc@entry=1, argv=argv@entry=0x7f2490250870, blockptr=<optimized out>) at vm.c:865
#27 0x00007f24fc487140 in proc_call (argc=1, argv=0x7f2490250870, procval=139794885578480)
    at proc.c:734
#28 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026ef00, 
    th=0x7f24b0041620) at vm_insnhelper.c:1489
#29 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026ef00, ci=<optimized out>)
    at vm_insnhelper.c:1579
#30 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
    at insns.def:1028
#31 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#32 0x00007f24fc5db2ea in invoke_block_from_c (th=0x7f24b0041620, block=<optimized out>, 
    self=<optimized out>, argc=argc@entry=0, argv=argv@entry=0x0, blockptr=blockptr@entry=0x0, 
    cref=cref@entry=0x0, defined_class=139796924859280) at vm.c:782
#33 0x00007f24fc5dbf54 in vm_yield (argc=0, argv=0x0, th=<optimized out>) at vm.c:821
#34 rb_yield_0 (argv=0x0, argc=0) at vm_eval.c:938
#35 loop_i () at vm_eval.c:1008
#36 0x00007f24fc483b0e in rb_rescue2 (b_proc=b_proc@entry=0x7f24fc5dbf00 <loop_i>, 
    data1=data1@entry=0, r_proc=r_proc@entry=0x0, data2=data2@entry=0) at eval.c:754
#37 0x00007f24fc5cde3e in rb_f_loop (self=139794889136520) at vm_eval.c:1042
#38 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026f540, 
    th=0x7f24b0041620) at vm_insnhelper.c:1489
#39 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026f540, ci=<optimized out>)
    at vm_insnhelper.c:1579
#40 0x00007f24fc5d4ad5 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
    at insns.def:999
#41 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#42 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620, 
    block=block@entry=0x7f249227be20, self=self@entry=139796925596280, argc=argc@entry=0, 
    argv=argv@entry=0x7f24902503a8, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0, 
    defined_class=defined_class@entry=139796870304080) at vm.c:782
#43 0x00007f24fc5dbacb in vm_invoke_proc (th=0x7f24b0041620, proc=proc@entry=0x7f249227be20, 
    self=139796925596280, defined_class=139796870304080, argc=argc@entry=0, 
    argv=argv@entry=0x7f24902503a8, blockptr=0x0) at vm.c:846
#44 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=<optimized out>, proc=proc@entry=0x7f249227be20, 
    argc=argc@entry=0, argv=argv@entry=0x7f24902503a8, blockptr=<optimized out>) at vm.c:865
#45 0x00007f24fc487140 in proc_call (argc=0, argv=0x7f24902503a8, procval=139795083105960)
    at proc.c:734
#46 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026f6d0, 
    th=0x7f24b0041620) at vm_insnhelper.c:1489
#47 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026f6d0, ci=<optimized out>)
    at vm_insnhelper.c:1579
#48 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
    at insns.def:1028
#49 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#50 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620, 
    block=block@entry=0x7f24905ef620, self=self@entry=139796903630280, argc=argc@entry=0, 
    argv=argv@entry=0x7f2490250188, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0, 
    defined_class=defined_class@entry=139796878690480) at vm.c:782
#51 0x00007f24fc5dbacb in vm_invoke_proc (th=0x7f24b0041620, proc=proc@entry=0x7f24905ef620, 
    self=139796903630280, defined_class=139796878690480, argc=argc@entry=0, 
    argv=argv@entry=0x7f2490250188, blockptr=0x0) at vm.c:846
#52 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=<optimized out>, proc=proc@entry=0x7f24905ef620, 
    argc=argc@entry=0, argv=argv@entry=0x7f2490250188, blockptr=<optimized out>) at vm.c:865
#53 0x00007f24fc487140 in proc_call (argc=0, argv=0x7f2490250188, procval=139795083105240)
    at proc.c:734
#54 0x00007f24fc5cf574 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026fbd0, 
    th=0x7f24b0041620) at vm_insnhelper.c:1489
#55 vm_call_cfunc (th=0x7f24b0041620, reg_cfp=0x7f249026fbd0, ci=<optimized out>)
    at vm_insnhelper.c:1579
#56 0x00007f24fc5d4194 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
    at insns.def:1028
#57 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#58 0x00007f24fc5d9514 in vm_call0_body (th=th@entry=0x7f24b0041620, ci=ci@entry=0x7f246cb504c0, 
    argv=0x7f2490250098) at vm_eval.c:171
#59 0x00007f24fc5da27b in vm_call0 (defined_class=<optimized out>, me=<optimized out>, 
    argv=<optimized out>, argc=1, id=61160, recv=139796925596280, th=0x7f24b0041620) at vm_eval.c:50
#60 rb_call0 (recv=139796925596280, mid=61160, argc=1, argv=<optimized out>, scope=CALL_PUBLIC, 
    self=52) at vm_eval.c:325
#61 0x00007f24fc5de4fe in send_internal (argc=1, argv=0x7f2490250098, recv=139796925596280, 
    scope=CALL_PUBLIC) at vm_eval.c:882
#62 0x00007f24fc5ddbb1 in vm_call_cfunc_with_frame (ci=<optimized out>, reg_cfp=0x7f249026fd60, 
    th=0x7f24b0041620) at vm_insnhelper.c:1489
#63 vm_call_cfunc (ci=<optimized out>, reg_cfp=0x7f249026fd60, th=0x7f24b0041620)
    at vm_insnhelper.c:1579
#64 vm_call_method (th=0x7f24b0041620, cfp=0x7f249026fd60, ci=<optimized out>) at vm_insnhelper.c:1767
#65 0x00007f24fc5d4ad5 in vm_exec_core (th=th@entry=0x7f24b0041620, initial=initial@entry=0)
    at insns.def:999
#66 0x00007f24fc5d7ddf in vm_exec (th=th@entry=0x7f24b0041620) at vm.c:1363
#67 0x00007f24fc5db2ea in invoke_block_from_c (th=th@entry=0x7f24b0041620, 
    block=block@entry=0x7f2490193270, self=self@entry=139794887793720, argc=argc@entry=1, 
    argv=argv@entry=0x7f246cb50ef0, blockptr=blockptr@entry=0x0, cref=cref@entry=0x0, 
    defined_class=defined_class@entry=139796876889560) at vm.c:782
#68 0x00007f24fc5dbacb in vm_invoke_proc (th=th@entry=0x7f24b0041620, proc=proc@entry=0x7f2490193270, 
    self=139794887793720, defined_class=139796876889560, argc=argc@entry=1, 
    argv=argv@entry=0x7f246cb50ef0, blockptr=blockptr@entry=0x0) at vm.c:846
#69 0x00007f24fc5dbb7a in rb_vm_invoke_proc (th=th@entry=0x7f24b0041620, 
    proc=proc@entry=0x7f2490193270, argc=argc@entry=1, argv=argv@entry=0x7f246cb50ef0, 
    blockptr=blockptr@entry=0x0) at vm.c:865
#70 0x00007f24fc5fd278 in rb_fiber_start () at cont.c:1293
#71 0x00007f24fb55a7a0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#72 0x0000000000000000 in ?? ()

Any ideas? Do I need to manually close connections after I use them, or anything like that?

It looks like there might be a deadlock, given that all the threads are stuck in "lowlevellock.S", but this is fairly far beyond my skill level.

kosh-jelly avatar Nov 25 '14 07:11 kosh-jelly

No idea so far.

According to the stack traces, HTTPClient seems to stall at TCPSocket.new(target), and the underlying CRuby seems to stall in getaddrinfo(3) while resolving the address of the specified target. Can you check lsof -p <CRuby process> and netstat -an to see how many sockets the Ruby process is holding after it stalls?

If the number of open sockets is the problem, manually closing connections could mitigate it, but a name-resolution (getaddrinfo) issue may still show up after an even longer run.
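
As a quick in-process alternative to lsof (just a sketch; it assumes Linux and its /proc layout), you can count the Ruby process's open descriptors and print the socket ones from a monitoring thread:

# Count this process's open file descriptors and print the socket ones,
# roughly what `lsof -p <pid>` shows for sockets (Linux /proc assumed).
fds = Dir.glob("/proc/#{Process.pid}/fd/*")
puts "open fds: #{fds.size}"
fds.each do |fd|
  target = File.readlink(fd) rescue next
  puts "#{File.basename(fd)} -> #{target}" if target.start_with?("socket:")
end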

nahi avatar Nov 25 '14 14:11 nahi

Got another hang after running for 18 hours, so I got a chance to grab some more data, and I think we're getting closer to finding something useful.

In netstat I grepped for port 80 to eliminate all the local connections:

netstat -an | grep '80'
tcp        1      0 104.131.147.112:52669   199.101.163.129:80      CLOSE_WAIT 
tcp        1      0 104.131.147.112:32888   208.80.154.224:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:32855   208.80.154.224:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:48221   199.101.163.129:80      CLOSE_WAIT 
tcp        1      0 104.131.147.112:34669   165.254.58.82:80        CLOSE_WAIT 
tcp        0      0 104.131.147.112:40854   64.250.181.184:80       ESTABLISHED
tcp        1      0 104.131.147.112:54493   184.168.221.5:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:37324   54.85.12.13:80          CLOSE_WAIT 
tcp        1      0 104.131.147.112:54133   74.125.239.102:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:38529   199.101.163.129:80      CLOSE_WAIT 
tcp        1      0 104.131.147.112:53270   165.254.42.104:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:50308   192.241.241.29:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:47834   64.250.181.133:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:55614   54.225.96.182:80        CLOSE_WAIT 
tcp        0      0 104.131.147.112:42699   64.250.181.184:80       ESTABLISHED
tcp        1      0 104.131.147.112:49654   64.250.181.133:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:60678   107.20.199.207:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:56585   63.245.217.194:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:32876   208.80.154.224:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:42249   64.94.113.154:80        CLOSE_WAIT 

What stands out to my untrained eye is that the two outliers - the ones whose status is ESTABLISHED - have the same foreign address but are being connected to from two different local ports. Maybe that's totally normal though...

I actually had 3 different processes running this time and they all hung, so I did an lsof -p for each process. The truncated output of one of them looks like this:

ruby    9267 deploy   22u     IPv4           54402497       0t0      TCP 104.131.147.112:34669->165.254.58.82:http (CLOSE_WAIT)
ruby    9267 deploy   28u     unix 0xffff880210530e00       0t0 49411786 socket
ruby    9267 deploy   29u     unix 0xffff8800c4d31f80       0t0 49841213 socket
ruby    9267 deploy   33u     sock                0,7       0t0 54434654 can't identify protocol
ruby    9267 deploy   34u     unix 0xffff880212303800       0t0 49204676 socket
ruby    9267 deploy   35u     unix 0xffff880212301880       0t0 49204682 socket
ruby    9267 deploy   38u     unix 0xffff880192bac380       0t0 49413893 socket
ruby    9267 deploy   40u     unix 0xffff8801bb06f480       0t0 49226855 socket
ruby    9267 deploy   42u     unix 0xffff88011698b100       0t0 49414524 socket
ruby    9267 deploy   44u     unix 0xffff880017bd1180       0t0 49242474 socket
ruby    9267 deploy   45u     unix 0xffff88021157c700       0t0 49209756 socket
ruby    9267 deploy   48u     sock                0,7       0t0 54402490 can't identify protocol
ruby    9267 deploy   51u     IPv4           49810697       0t0      TCP localhost:43769->localhost:6379 (ESTABLISHED)
ruby    9267 deploy   52u     unix 0xffff880212301c00       0t0 49210337 socket
ruby    9267 deploy   53u     IPv4           49264160       0t0      TCP localhost:57171->localhost:6379 (ESTABLISHED)
ruby    9267 deploy   60u     IPv4           50429441       0t0      TCP 104.131.147.112:40854->64.250.181.184:http (ESTABLISHED)
ruby    9267 deploy   63u     unix 0xffff880212e48a80       0t0 49796668 socket
ruby    9267 deploy   65u     IPv4           50431151       0t0      TCP 104.131.147.112:47834->64.250.181.133:http (CLOSE_WAIT)
ruby    9267 deploy   85u     IPv4           49365039       0t0      TCP localhost:35560->localhost:6379 (ESTABLISHED)
ruby    9267 deploy   90u     IPv4           49480239       0t0      TCP localhost:51574->localhost:6379 (ESTABLISHED)
ruby    9267 deploy   98u     unix 0xffff880212e48380       0t0 49838472 socket
ruby    9267 deploy  107u     unix 0xffff880212e4b800       0t0 49832752 socket
ruby    9267 deploy  111u     unix 0xffff88011698a680       0t0 49412545 socket
ruby    9267 deploy  113u     unix 0xffff880210532a00       0t0 49792397 socket
ruby    9267 deploy  117u     unix 0xffff88011698aa00       0t0 49412549 socket
ruby    9267 deploy  121u     unix 0xffff880210803800       0t0 49834471 socket
ruby    9267 deploy  124u     IPv4           49417333       0t0      TCP localhost:55287->localhost:6379 (ESTABLISHED)
ruby    9267 deploy  129u     unix 0xffff880210530380       0t0 49792415 socket
ruby    9267 deploy  131u     IPv4           49796786       0t0      TCP localhost:39364->localhost:6379 (ESTABLISHED)
ruby    9267 deploy  140u     IPv4           54402470       0t0      TCP 104.131.147.112:42249->Lxlb2.peaksystems.com:http (CLOSE_WAIT)
ruby    9267 deploy  141u     unix 0xffff880210531f80       0t0 49284607 socket
ruby    9267 deploy  142u     IPv4           54402506       0t0      TCP 104.131.147.112:53270->165.254.42.104:http (CLOSE_WAIT)
ruby    9267 deploy  150u     sock                0,7       0t0 54429336 can't identify protocol
ruby    9267 deploy  155u     IPv4           49823256       0t0      TCP localhost:47545->localhost:6379 (ESTABLISHED)
ruby    9267 deploy  157u     IPv4           49823258       0t0      TCP localhost:47546->localhost:6379 (ESTABLISHED)
ruby    9267 deploy  182u     unix 0xffff880212e4a680       0t0 49838659 socket
ruby    9267 deploy  238u  netlink                          0t0 54422488 ROUTE
ruby    9267 deploy  245u     IPv4           54425223       0t0      TCP 104.131.147.112:38529->199.101.163.129:http (CLOSE_WAIT)
ruby    9267 deploy  300u     IPv4           54428858       0t0      TCP 104.131.147.112:56585->mozcom.zlb.phx.mozilla.net:http (CLOSE_WAIT)

What stands out to me here is that they all have at least one line like this: ruby 9267 deploy 150u sock 0,7 0t0 54429336 can't identify protocol

It seems that it could be related to this issue https://github.com/cotag/spider-gazelle/issues/20, where sockets were being accidentally duplicated when passed to another thread.

kosh-jelly avatar Nov 26 '14 14:11 kosh-jelly

@kosh-jelly Thanks. I think the root cause is the many CLOSE_WAIT sockets. httpclient should reuse connections and close them properly, but it seems that recycling is not working for you. Can I see the code that actually uses HTTPClient? It would help if you could show me several more stack-trace lines below this line.

/home/deploy/domain_freek/releases/165/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'

nahi avatar Nov 27 '14 06:11 nahi

Absolutely.

Here's the full stack trace. It looks like all threads are stuck at the same point in my code, on the fetch_robots call:

/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `new'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `create_socket'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:763:in `block in connect'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `call'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:127:in `timeout'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:34:in `block in do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/cross_app_tracing.rb:48:in `tl_trace_http_request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:33:in `do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1034:in `follow_redirect'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:821:in `request'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:108:in `fetch_robots'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:145:in `get_robots'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:78:in `block in crawl_domain'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:73:in `loop'
/home/deploy/domain_freek/releases/167/app/ohm_models/live_domain.rb:73:in `crawl_domain'
/home/deploy/domain_freek/releases/167/app/workers/crawl_domain.rb:20:in `perform'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:75:in `execute_job'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:52:in `block (2 levels) in process'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:33:in `block in call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:365:in `perform_action_with_newrelic_trace'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:29:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/sidetiq-0.6.3/lib/sidetiq/middleware/history.rb:8:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/active_record.rb:6:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/retry_jobs.rb:74:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:11:in `block in call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/logging.rb:22:in `with_context'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:7:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `call'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:51:in `block in process'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:98:in `stats'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:50:in `process'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `dispatch'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:122:in `dispatch'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60:in `block in invoke'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71:in `block in task'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357:in `block in task'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57:in `block in initialize'
/home/deploy/domain_freek/releases/167/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15:in `block in create'

fetch_robots should be the first HTTP connection a thread makes, and it should be making it with a new HTTPClient instance. Here are the relevant parts of my code:

def fetch_robots
    response = agent.get("http://#{domain}/robots.txt", follow_redirect: true) # line 108
end

def agent
    @agent ||= new_agent
end

def new_agent
    agent = HTTPClient.new
    agent.receive_timeout = APP_CONFIG['crawl_page_timeout'] # set to 10
    agent
end

kosh-jelly avatar Nov 27 '14 08:11 kosh-jelly

@kosh-jelly I think I've understood the situation. Your code uses a single HTTPClient instance, so connections should be properly recycled, but since you access many different domains to fetch robots.txt (practically one connection per domain), those connections never get reused. With the current HTTPClient implementation, Keep-Alive connections are closed only when:

  1. they are manually closed by HTTPClient#reset or HTTPClient#reset_all;
  2. a connection is about to be recycled but the server has already closed it; or
  3. they are GCed.

Would you please try the following two workarounds? (A combined sketch follows the list.)

  1. Avoid Keep-Alive connections by setting agent.protocol_version = 'HTTP/1.0'; you should not need Keep-Alive for these fetches.
  2. Close the connection manually by calling agent.reset("http://#{domain}/robots.txt") after calling agent.get.
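
Putting both together, here is a minimal sketch (it borrows domain from your code above; everything else is plain httpclient API):

agent = HTTPClient.new

# Workaround 1: speak HTTP/1.0 so the connection is not kept alive
# after each response.
agent.protocol_version = 'HTTP/1.0'

url = "http://#{domain}/robots.txt"
response = agent.get(url, follow_redirect: true)

# Workaround 2: explicitly close the cached session for this URL so
# no Keep-Alive socket lingers in CLOSE_WAIT.
agent.reset(url)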

This HTTPClient behavior has rarely caused trouble, but when it does, it's hard to investigate. I'll try to introduce a connection reaper thread to HTTPClient. Sorry for the inconvenience this time.

nahi avatar Nov 27 '14 15:11 nahi

Thanks so much for your help with this nahi! I'm really impressed with your work on HTTPClient.

I'll add those two lines of code and post an update.

Typically, after fetching robots.txt, I go on to fetch other pages on the same domain with the same instance of HTTPClient. Should I call reset after each page, or is this specific to the first connection to a domain (which is always going to be robots.txt for me)?

kosh-jelly avatar Nov 27 '14 16:11 kosh-jelly

@kosh-jelly Either workaround should work on its own. I'm pretty sure about the cause of your trouble, so you could try just the "reset" approach.

Hmm, so you also need Keep-Alive connections. That's hard to support until I introduce the connection reaper thread, but the following is the best option at the moment.

Create an HTTPClient instance per thread and reset all connections before moving on to the next domain (assuming you assign one thread per domain):

def fetch_robots
  agent.reset_all
  response = agent.get("http://#{domain}/robots.txt", follow_redirect: true) # line 108
end

def agent
  Thread.current[:my_agent] ||= new_agent
end

If that is not your situation (you assign threads to domains from a thread pool), please call agent.reset_all periodically as a workaround; see the sketch below. You'd need to synchronize the threads as well. I'm not familiar with Sidekiq, so I might be wrong...
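
A sketch of that periodic variant (the shared agent and the 60-second interval are illustrative, not something prescribed here):

require 'httpclient'

agent = HTTPClient.new  # shared across worker threads in this sketch

# Reaper thread: periodically close every cached Keep-Alive session so
# dead connections cannot pile up in CLOSE_WAIT. Real code would also
# need to coordinate this with in-flight requests, as noted above.
Thread.new do
  loop do
    sleep 60          # illustrative interval
    agent.reset_all   # blocks until all cached sockets are closed
  end
end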

nahi avatar Nov 27 '14 16:11 nahi

Ahh, ok. Yeah, I have a separate thread for each domain, and each thread should already have its own instance of HTTPClient.

When I move on to a new domain, the old HTTPClient should be GC'ed, and I should be creating a new HTTPClient instance to use with the new domain. It's certainly possible that's not working as expected, though.

reset_all before getting robots sounds like a pretty good workaround for now.

kosh-jelly avatar Nov 27 '14 16:11 kosh-jelly

Looks like the reset_all workaround did not solve the problem. The threads hung again on the same line after about 18 hours of running.

Current codebase looks like this:


class LiveDomain

def crawl_domain
    loop do 
        next_page = queue.pop

        break unless next_page && get_robots # line 75

        if get_robots.allowed?(next_page.url) && next_page.should_crawl?
            Kernel.sleep CRAWL_DELAY
            next_page.scrape_linked_pages
        end
    end
end

def get_robots
    @robots ||= fetch_robots
end

def fetch_robots
    agent.reset_all
    response = agent.get("http://#{domain}/robots.txt", follow_redirect: true) # line 106, where it's hanging
    return RobotsTxt.new response.body
end

def agent
    @agent ||= new_agent
end

def new_agent
    agent = HTTPClient.new
    agent.receive_timeout = APP_CONFIG['crawl_page_timeout']
    agent
end
end

Stack trace from the hanging threads:

2014-12-01T08:03:07.925Z 12629 TID-osi57w9io WARN: /home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `initialize'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `new'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:815:in `create_socket'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:763:in `block in connect'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `call'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:101:in `timeout'
/home/deploy/.rbenv/versions/2.1.3/lib/ruby/2.1.0/timeout.rb:127:in `timeout'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:762:in `connect'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:620:in `query'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient/session.rb:164:in `query'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1161:in `do_get_block'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:34:in `block in do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/cross_app_tracing.rb:48:in `tl_trace_http_request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/httpclient.rb:33:in `do_get_block_with_newrelic'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:962:in `block in do_request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1059:in `protect_keep_alive_disconnected'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:961:in `do_request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:1034:in `follow_redirect'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:821:in `request'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/httpclient-2.5.3.3/lib/httpclient.rb:714:in `get'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:106:in `fetch_robots'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:143:in `get_robots'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:75:in `block in crawl_domain'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:70:in `loop'
/home/deploy/domain_freek/releases/173/app/ohm_models/live_domain.rb:70:in `crawl_domain'
/home/deploy/domain_freek/releases/173/app/workers/crawl_domain.rb:20:in `perform'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:75:in `execute_job'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:52:in `block (2 levels) in process'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:127:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:33:in `block in call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:365:in `perform_action_with_newrelic_trace'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.9.7.266/lib/new_relic/agent/instrumentation/sidekiq.rb:29:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/sidetiq-0.6.3/lib/sidetiq/middleware/history.rb:8:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/active_record.rb:6:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/retry_jobs.rb:74:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:11:in `block in call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/logging.rb:22:in `with_context'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/server/logging.rb:7:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `call'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/middleware/chain.rb:132:in `invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:51:in `block in process'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:98:in `stats'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/bundler/gems/sidekiq-5047f9220b82/lib/sidekiq/processor.rb:50:in `process'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `dispatch'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:122:in `dispatch'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60:in `block in invoke'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71:in `block in task'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357:in `block in task'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57:in `block in initialize'
/home/deploy/domain_freek/releases/173/vendor/bundle/ruby/2.1.0/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15:in `block in create'

And again, it looks like sockets are stuck in CLOSE_WAIT, despite the reset_all call right before fetching robots.txt:

netstat -an | grep ':80'
tcp        1      0 104.131.147.112:33863   192.80.13.224:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:47845   173.252.120.6:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:59259   208.80.154.224:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:58951   157.166.226.25:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:50569   143.252.148.217:80      CLOSE_WAIT 
tcp        1      0 104.131.147.112:52243   184.51.0.73:80          CLOSE_WAIT 
tcp        1      0 104.131.147.112:33540   184.168.221.20:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:34338   184.72.251.173:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:58374   143.252.148.211:80      CLOSE_WAIT 
tcp        1      0 104.131.147.112:45148   91.185.212.99:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:60262   217.160.54.59:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:33366   198.35.26.96:80         CLOSE_WAIT 
tcp        1      0 104.131.147.112:42752   184.168.221.5:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:51503   198.35.26.96:80         CLOSE_WAIT 
tcp        1      0 104.131.147.112:47741   94.228.135.84:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:59257   208.80.154.224:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:52808   216.104.20.189:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:44762   74.125.239.99:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:51738   54.209.11.100:80        CLOSE_WAIT 
tcp        1      0 104.131.147.112:58400   108.60.15.56:80         CLOSE_WAIT 
tcp        1      0 104.131.147.112:49161   208.80.154.224:80       CLOSE_WAIT 
tcp        1      0 104.131.147.112:49163   208.80.154.224:80       CLOSE_WAIT 
tcp        0      0 104.131.147.112:44389   216.35.221.76:80        ESTABLISHED
tcp        1      0 104.131.147.112:52784   193.191.138.53:80       CLOSE_WAIT 

To clarify, each thread has its own instance of LiveDomain - which means that each thread gets its own instance of HTTPClient.

I'll go ahead and try the other workaround of avoiding Keep-Alive connections by setting the protocol version to HTTP/1.0.

kosh-jelly avatar Dec 01 '14 08:12 kosh-jelly

Hmm. So reset_all is actually not working for you... CLOSE_WAIT means the client hasn't called TCPSocket#close, but reset_all blocks until it has called TCPSocket#close on every cached session. No idea why reset_all isn't working... Is it not being called, or not called often enough?
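
One way to check that (a hypothetical debugging aid, not part of httpclient): wrap reset_all so every call is logged with its thread, then compare the log against the CLOSE_WAIT sockets that accumulate.

class HTTPClient
  alias_method :reset_all_without_log, :reset_all

  # Log each reset_all call so we can confirm it actually runs
  # (and from which thread) before every robots.txt fetch.
  def reset_all
    warn "[#{Thread.current.object_id}] reset_all at #{Time.now}"
    reset_all_without_log
  end
end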

nahi avatar Dec 28 '14 10:12 nahi

@nahi ... I think we're running into a similar problem (Ruby 2.2.1, Ubuntu 14.04.1, AWS).

$ uname -a
Linux background-job-master 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Only instead of one byte in the pending receive queue, we typically have 32 bytes pending on the CLOSE_WAIT connections. We see similar output in both netstat and lsof. We also noticed child threads getting stuck in the "D" state, which then triggers our load alerts (a load of 1.0 times the number of processes in the "D" state).

When inspecting with strace, they appear to be waiting for I/O:

$ sudo strace -s 1000 -tt -p 32581 2>&1
    Process 32581 attached
16:27:39.965887 restart_syscall(<... resuming interrupted call ...>

Killing the parent process (sidekiq in our case) usually clears the TCP connections and the stalled threads. Otherwise they sit there indefinitely (because of the odd state the TCP sockets are in).

neurogenesis avatar Mar 17 '15 04:03 neurogenesis

When you see this:

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f24fb62a5dd in _L_lock_820 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f24fb62a395 in __check_pf (seen_ipv4=0x7f24fb8d5134 <lock>, 
    seen_ipv4@entry=0x7f246cb4d010, seen_ipv6=seen_ipv6@entry=0x7f246cb4d020, 
    in6ai=in6ai@entry=0x7f246cb4d050, in6ailen=0xffffffffffffffff, in6ailen@entry=0x7f246cb4d060)
    at ../sysdeps/unix/sysv/linux/check_pf.c:319

Is there another thread that's actually waiting in recvmsg() like this?

#0  0x0000003841302edd in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000384131f8ee in __check_pf (pid=-52599, fd=20) at ../sysdeps/unix/sysv/linux/check_pf.c:166
#2  0x000000384131f8ee in __check_pf (seen_ipv4=seen_ipv4@entry=0x7f5b137fdc32, seen_ipv6=seen_ipv6@entry=0x7f5b137fdc33, in6ai=in6ai@entry=0x7f5b137fdc40, in6ailen=in6ailen@entry=0x7f5b137fdc48) at ../sysdeps/unix/sysv/linux/check_pf.c:324

That's the interesting one; all the others are waiting for it to drop the lock, which it never does, presumably because of a kernel bug. cf. https://bugzilla.redhat.com/show_bug.cgi?id=1209433

I didn't start seeing this until I was running the 4.0 kernel. Going back to 3.19 seemed to fix it, although it's hard to be 100% sure, as it was far from reliably reproducible.

dwmw2 avatar Jun 08 '15 21:06 dwmw2

Probably fixed by http://patchwork.ozlabs.org/patch/473041/. I think the bug exists in kernels back to 3.18, but it got a lot easier to trigger in 4.0 due to other locking changes.

dwmw2 avatar Jun 11 '15 14:06 dwmw2

I'm having a similar issue that might be related; I opened #256 for it. Examining backtraces, I discovered that my worker gets stuck at line 915 of session.rb, which is

initial_line = @socket.gets("\n")
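
As a defensive measure (a configuration sketch using standard httpclient accessors, not a fix for the native-level hangs above), tightening the client's timeouts at least turns an ordinary slow read at that gets call into an exception instead of an indefinite block. Note that it would not help the original report, which hangs inside getaddrinfo where Ruby cannot interrupt the thread:

agent = HTTPClient.new
agent.connect_timeout = 10   # seconds to establish the TCP connection
agent.send_timeout    = 10   # seconds allowed for writing the request
agent.receive_timeout = 10   # seconds to wait for response data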

pinpox avatar Jun 22 '15 18:06 pinpox