CPU usage creeps up with latest Raven client on Tornado
After upgrading one of our (SeatGeek) production services to the latest version of Raven (raven==5.24.3 at the time), we see CPU usage consistently spike to 100%, which eventually crashes the API workers.

After some profiling in New Relic, it appears to be somehow related to the new(ish) breadcrumbs feature.

Let me know if I can do anything else to help debug this.
Do you know in which version this started?
Unfortunately, I don't. I do know the bug doesn't exist in raven==5.15.0, as we have a different Tornado service running 5.15.0 just fine.
@zackkitzmiller So for now I would recommend disabling breadcrumb support; there is a parameter in the client ctor to disable it. I'm looking into supporting breadcrumbs there, but I think there might be further issues.
For what it's worth, I have not been able to reproduce this yet, but there are definitely general issues with the concurrency model.
If you could share anything about your Tornado setup (in particular which Raven transport you are using and which requirements you have imported), that would be great.
Sure, I'm going to stick to 5.15.0 for now.
requirements.txt:
amqp-dispatcher==0.8.0
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.3.2
boto==2.37.0
braintree==3.24.0
coverage==3.7.1
coveralls==0.5
decorator==3.3.2
docopt==0.6.2
ecdsa==0.11
elaphe==0.6.0
Fabric==1.8.3
Flask==0.10.1
freezegun==0.1.7
futures==2.1.6
Fuzzy==1.0
fuzzywuzzy==0.2
gevent==1.0.1
greenlet==0.4.7
gunicorn==18.0
haigha==0.9.0
hashids==1.1.0
httmock==1.2.3
httplib2==0.9
httpretty==0.8.3
intervaltree==2.1.0
iso8601==0.1.8
itsdangerous==0.24
Jinja2==2.8
lxml==3.4.1
MarkupSafe==0.23
mock==1.0.0
mockredispy==2.9.0.9
MySQL-python==1.2.3
ndg-httpsclient==0.4.0
newrelic==2.60.0.46
nose==1.3.3
nose-exclude==0.2.0
oauth2==1.5.211
objgraph==3.0.0
paramiko==1.12.4
paypalrestsdk==1.11.5
phpserialize==1.3
pika==0.10.0
Pillow==2.8.1
pyasn1==0.1.9
py-bcrypt==0.4
pycrypto==2.6.1
pycurl==7.43.0
pygeoip==0.2.7
pyOpenSSL==0.15.1
python-dateutil==1.5
python-Levenshtein==0.10.2
pytz==2011j
PyYAML==3.11
qrcode==5.1
Quadtree==0.1.2
raven==5.8.1
redis==2.10.3
repoze.lru==0.6
requests==2.4.3
six==1.9.0
Sift==1.1.2.6
SQLAlchemy==0.6.6
statsd==0.5.1
suds-jurko==0.6
testhttpserver==0.1.4
timelib==0.2.2
tornado==3.2
toro==1.0.1
ujson==1.34
Unidecode==0.4.9
urllib3==1.7.1
Werkzeug==0.10.4
xmltodict==0.8.6
fakeredis==0.6.2
We're running tornado with gunicorn (+ gevent).
We're using SentryMixin and AsyncSentryClient. As far as I know, we're using whatever the default transport is.
@zackkitzmiller I recommend sticking to the latest version but disabling the new features:
raven = Client(..., enable_breadcrumbs=False, hook_libraries=[], install_logging_hook=False)
There are important fixes since 5.15.0.
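Since you're on the Tornado integration, here is a minimal sketch of passing the same flags to AsyncSentryClient (which subclasses Client, so the keyword arguments should be forwarded; the DSN below is a placeholder, not a real value):

    from raven.contrib.tornado import AsyncSentryClient

    application.sentry_client = AsyncSentryClient(
        "https://public:secret@sentry.example.com/1",  # placeholder DSN
        enable_breadcrumbs=False,      # disable breadcrumb collection
        install_logging_hook=False,    # don't patch logging for breadcrumbs
        hook_libraries=[],             # don't install library breadcrumb hooks
    )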
@mitsuhiko I'll run that by everyone here.
@zackkitzmiller is there any chance you might be using multiple raven clients by accident?
I don't believe so, though I'm not sure how (1) that would happen or (2) I would find out.
Happy to look into this.
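One rough way to check from inside a worker process (a sketch using only the standard gc module; I'm assuming raven.base.Client is the common base class that AsyncSentryClient subclasses):

    import gc
    from raven.base import Client

    # Count live raven client instances; more than one may indicate
    # an accidental duplicate client.
    clients = [obj for obj in gc.get_objects() if isinstance(obj, Client)]
    print len(clients), clients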
I can't confirm whether it's the same issue, but we are experiencing a memory leak/build-up that seems related to breadcrumbs. I've taken the advice mentioned above to see if that solves our issue.
@mitsuhiko I do not believe that we have two clients installed at the same time.
@aequitas Were you able to disable breadcrumbs with the latest client without any issue?
@zackkitzmiller no issues so far.
FWIW I'm getting a similar issue caused by a memory leak originating from _wrap_logging_method.
There's something wrong with how new_func is created that prevents the CPython garbage collector from working correctly.
The issue goes away as soon as I set install_logging_hook to False.
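To confirm where the growth is, one option (a rough sketch using objgraph, which already appears in the requirements list above) is to print object-count growth periodically while the leak reproduces:

    import objgraph

    # Prints the object types whose instance counts grew since the last call;
    # call it from a debug endpoint or on a timer while traffic is flowing.
    objgraph.show_growth(limit=10)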
@charettes try again with 6.3.0. We had CPU issues with 6.2.5 (or a previous version).
@fyaretskiyso Thanks for mentioning it, but we are experiencing the same issue with 6.3.0.
We've been having what seems to be this issue on our servers for the past few months, causing a lot of instability and headaches while trying to figure out what was going on. For us it seems to be triggered when certain timeouts pile up: Raven starts consuming a lot of CPU and memory, which causes more timeouts and eventually hangs the server. I've created some example code to reproduce the problem:
import sys
import time

import tornado.gen
import tornado.httpclient
import tornado.httpserver
import tornado.ioloop
import tornado.web

from raven.contrib.tornado import AsyncSentryClient, SentryMixin

# Modify the time.time function so we can virtually fast-forward time.
old_time = time.time
inc_time = 0

def new_time(*args, **kwargs):
    return old_time() + inc_time

time.time = new_time

http_client = tornado.httpclient.AsyncHTTPClient()

class MainHandler(SentryMixin, tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        global inc_time
        # Simulate a long-running request here by adding 30 seconds to time.time.
        inc_time += 30
        # time.sleep(5)
        raise  # produce an error so SentryMixin reports it

application = tornado.web.Application([
    (r"/", MainHandler),
])
application.sentry_client = AsyncSentryClient(
    sentry_url_here,  # your Sentry DSN
    # enable_breadcrumbs=False,
)

def main():
    if sys.argv[1] == "0":
        application.listen(6000)
        tornado.ioloop.IOLoop.instance().start()
    elif sys.argv[1] == "1":
        count = 0
        while True:
            count += 1
            try:
                tornado.httpclient.HTTPClient().fetch("http://localhost:6000/")
            except KeyboardInterrupt:
                raise
            except:
                pass
            print count

if __name__ == '__main__':
    main()
Run python script.py 0 in one terminal and python script.py 1 in another to reproduce the results. When count reaches about 30 in the while loop, the server really starts to slow to a crawl. The CPU time is being spent in the Client.encode function in base.py, as the size of the JSON it's trying to output reaches hundreds of megabytes. Turning off breadcrumbs seems to fix the issue.
In the script I've modified the time.time function to simulate a long-running, blocking operation. This isn't entirely fair, since it doesn't allow any extra time for garbage collection, and using time.sleep(30) instead seems to make the problem either not happen or happen very slowly. I've found that in this example, using time.sleep(5) instead will reproduce the problem without modifying time.time at all (in this case, by count = 20 the size of the JSON reaches 100 KB, by count = 60 it's about 70 MB, and at count = 86 it's 1.4 GB, with the single Tornado process using 21 GB of memory).
Strangely enough, when both the time.sleep(5) and the modified time.time are removed, there is no issue and the JSON size doesn't go above 35 KB.
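To watch the payload grow, here is a rough sketch of instrumenting the client (assuming encode() receives the event dict before serialization; the subclass name is mine, not part of raven):

    import logging
    from raven.contrib.tornado import AsyncSentryClient

    class SizeLoggingSentryClient(AsyncSentryClient):
        """Hypothetical helper: log how large each encoded Sentry payload is."""
        def encode(self, data):
            encoded = super(SizeLoggingSentryClient, self).encode(data)
            logging.warning("sentry payload: %d bytes after encoding", len(encoded))
            return encoded

    # application.sentry_client = SizeLoggingSentryClient(sentry_url_here)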
The version of raven used above was 6.7.0, but the problem still seems to exist in older versions, even back to 5.15.0, though different versions behave slightly differently.
I'm having exactly the same issue as @BarryJRowe; we had to stop using Sentry for our services.
I've been looking into it, and AsyncSentryClient was trying to send multiple requests with 200 MB of data each, so processes were growing to several gigabytes of memory usage just for Sentry processing!