
Fixing error handling for socket timeouts that occur due to a race-like condition

Open StabbyCutyou opened this issue 10 years ago • 2 comments

@bpot So, bear with me here...

During testing, we found that if the connection idles for long periods of time, you can run into a case where an exception occurs that is not handled correctly. It appears that, in the time between IO.select and @socket.write (and I'm assuming read as well), the socket actually times out. This causes an Errno::ETIMEDOUT to be raised, but not caught.
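Something along these lines illustrates the shape of the problem and the fix (this is a simplified, hypothetical connection class, not poseidon's actual connection.rb; the class and method names here are placeholders):

```ruby
require 'socket'

# Hypothetical, simplified version of a connection's send path.
# The point: IO.select reporting the socket as writable does not guarantee
# the subsequent write succeeds -- the OS can still raise Errno::ETIMEDOUT.
class ExampleConnection
  class ConnectionFailedError < StandardError; end

  def initialize(host, port, timeout_ms)
    @host, @port, @timeout_ms = host, port, timeout_ms
    @socket = TCPSocket.new(host, port)
  end

  def send_request(payload)
    # Wait until the socket reports writable (or give up after the timeout).
    unless IO.select(nil, [@socket], nil, @timeout_ms / 1000.0)
      raise ConnectionFailedError, "write timeout"
    end

    # The socket can still time out at the OS level between the select above
    # and this write, so Errno::ETIMEDOUT needs to be rescued here as well.
    @socket.write(payload)
  rescue Errno::ETIMEDOUT, Errno::ECONNRESET, Errno::EPIPE, IOError => e
    @socket.close rescue nil
    raise ConnectionFailedError, e.message
  end
end
```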

I tried to run your integration specs, but I kept getting a file-not-found error after supplying the directory of my Kafka installation.

StabbyCutyou avatar May 29 '15 04:05 StabbyCutyou

I have a really dumb-looking test script that I'm able to reliably reproduce the issue with.

https://gist.github.com/StabbyCutyou/e0050d3b8b12c7c42736

I seem to be able to trigger it on about every third run of the loop, but your mileage may vary. You'll know it happened when an Errno::ETIMEDOUT stack trace shows up. I have another branch with some extra logging I could link you to; it dumps some info from the connection.rb class during each attempt to publish, and I used it to verify what was happening.

Again - super weird case, but one that I'm able to reproduce.

EDIT

The script switches from sending once every twenty minutes to sending 100 messages at 100ms intervals, to try to reproduce a behavior others had seen where the connection remained in a bad state for several writes, but I couldn't reproduce that. The script is definitely the result of some fairly random testing approaches.
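Roughly, the script does something like this (not the exact gist; the broker address, topic, loop count, and sleep durations below are placeholders, and the calls assume poseidon's Poseidon::Producer / Poseidon::MessageToSend API):

```ruby
require 'poseidon'

# Rough sketch of the reproduction approach: publish, let the connection idle
# past the broker/OS timeout, publish again, and watch for an unhandled
# Errno::ETIMEDOUT on the later writes.
producer = Poseidon::Producer.new(["localhost:9092"], "timeout_repro")

5.times do |i|
  begin
    producer.send_messages([Poseidon::MessageToSend.new("test_topic", "msg #{i}")])
    puts "published batch #{i}"
  rescue => e
    # Before the fix, Errno::ETIMEDOUT surfaces here as a raw socket error
    # instead of being handled inside the connection layer.
    puts "#{e.class}: #{e.message}"
  end

  # Idle long enough for the underlying socket to time out.
  sleep 60 * 20
end
```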

StabbyCutyou avatar May 29 '15 04:05 StabbyCutyou

Coverage Status

Coverage remained the same at 92.35% when pulling 0b00f74c718c6b9cdb06bd058c19689a53063a9c on Tapjoy:fix/missing_timeout_exception_handling into dd74d9473692080cf49d3740eb3d7d929e79e1c9 on bpot:master.

coveralls avatar May 29 '15 04:05 coveralls