Stability issues on high-traffic mixed HTTP/HTTPS
I know that vague bug reports are about the worst thing for an open source project, but I'll give a shout before digging in.
I have set up templar for us on a fairly large Amazon machine (8 cores), and it proxies some of our web traffic to 3rd parties.
After a few hours it just freezes up; in the debug log (piped to a file) you can see that requests start but never end. (The app is configured with a timeout.)
The way I set it up is fairly simple: https://github.com/KensoDev/chef-templar.
Would love some pointers/suggestions.
Thanks!
@KensoDev Interesting... Is it still accepting connections but then not responding with any data? When it's locked up like that, send it a `kill -5` and then grab the stdout/stderr. We'll be able to see what it's up to and figure out why it's locked up.
@evanphx Yeah, you see the S (request start) logged but the E (end) never happens.
I'll try to reproduce it on staging/local with some benchmarking.
@KensoDev Thanks! Interesting that it's still accepting and processing requests but not sending any responses. Does this happen on requests without any Templar headers set too? If so, perhaps net/http is being strange (meaning I'm using it wrong).
Yeah
OK, we have 4 different vendors: 2 of them using HTTP (GET, POST) and 2 using HTTPS (GET, POST).
Obviously, for the HTTPS ones, I am sending `r['X-Templar-Upgrade'] = 'https' if https?`.
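Concretely, the request building looks roughly like this. A minimal sketch: the helper name and the https flag are assumptions from my setup, not templar's API — every request goes to the proxy over plain HTTP, and the header tells templar to upgrade the upstream connection to HTTPS.

```ruby
require 'net/http'
require 'uri'

# Build a request destined for the templar proxy. For HTTPS vendors the
# X-Templar-Upgrade header asks templar to speak TLS to the upstream.
# (Helper name and flag are illustrative, not part of templar itself.)
def build_vendor_request(uri, https)
  req = Net::HTTP::Get.new(uri)
  req['X-Templar-Upgrade'] = 'https' if https
  req
end

# An HTTPS vendor gets the upgrade header; a plain-HTTP vendor does not.
secure = build_vendor_request(URI('http://vendor-a.example/api'), true)
plain  = build_vendor_request(URI('http://vendor-b.example/api'), false)
```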
Actually, I am not sure it responds to requests at all; when I look at the log after it halts, that's the last thing I see, but I did not wait long enough to see if it catches up.
When it halted it just stopped working and did not recover on its own until I restarted it (using upstart; you can see the config in the chef repo I linked to).
The traffic to it is high, I would say thousands of requests a minute, and it has about 8 cores to use.
Yeah, that's some pretty decent traffic. :smile:
Is the memory footprint of it pretty stable? Also, if/when it's hung, check out what descriptors it's got open with `lsof`.
Memory footprint is stable. We're using Redis for the cache (but most requests are not cached because they're POSTs anyway).
I am trying to think of a way to pass production-volume data to it without having it die on prod again...
@KensoDev Maybe just simulate it with `ab` or `wrk` against an nginx? I might see about setting that up tonight as well.
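If `ab`/`wrk` aren't handy, a throwaway Ruby load generator works too. A sketch under stated assumptions: it stands up a trivial local backend as a stand-in for the proxy under test (in a real run you'd point the workers at templar instead), fires concurrent requests, and counts successes — a hang like the one described (S with no E) would show up as this script never finishing. Ports and thread/request counts are arbitrary.

```ruby
require 'socket'
require 'net/http'
require 'uri'

# Trivial single-threaded backend standing in for the proxy under test.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]
backend = Thread.new do
  loop do
    client = server.accept
    begin
      client.readpartial(4096) # drain the request head, then answer
      client.write "HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
    rescue IOError, Errno::ECONNRESET
      # ignore clients that vanish mid-request
    ensure
      client.close
    end
  end
end

# 8 worker threads x 25 requests each, one connection per request.
results = Queue.new
workers = 8.times.map do
  Thread.new do
    25.times do
      res = Net::HTTP.get_response(URI("http://127.0.0.1:#{port}/"))
      results << res.code
    end
  end
end
workers.each(&:join)
backend.kill

codes = []
codes << results.pop until results.empty?
success = codes.count('200')
puts "#{success}/200 requests succeeded"
```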
Oh yeah, that was a rhetorical question... I'll set it up.
Was there ever a conclusion on this? Does anyone run templar in a high-traffic environment right now?
@bmorton I stopped.
I am using Nginx with proxy_pass in order to do that.
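For anyone curious, the replacement is roughly this shape. A hypothetical config fragment — hostnames, ports, and paths are placeholders, not our real setup; nginx decides http vs https per upstream instead of a per-request upgrade header:

```nginx
# Placeholder sketch of the nginx stand-in for templar.
server {
    listen 8080;

    # plain-HTTP vendor
    location /vendor-a/ {
        proxy_pass http://vendor-a.example/;
    }

    # HTTPS vendor: nginx speaks TLS to the upstream
    location /vendor-b/ {
        proxy_pass https://vendor-b.example/;
        proxy_set_header Host vendor-b.example;
    }
}
```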
Just to give an idea of the numbers: I'm doing about 2000-3000 requests per second right now, which is far from the peak but much higher than the volume at which templar crashed on me.
I didn't get very far with the investigation, since this is a money-driving part of the system and I had to replace it with something that would work.