
CPU usage creeps with latest Raven client on Tornado

Open zackkitzmiller opened this issue 9 years ago • 16 comments

After upgrading one of our (SeatGeek) production services to use the latest version of Raven (raven==5.24.3 at the time), we experienced consistent CPU usage spikes to 100% that eventually crashed the API workers.

(screenshot)

After some profiling in New Relic, it appears to be somehow related to the new(ish) breadcrumbs feature.

(screenshot: New Relic profile, 2016-08-31 11:57 AM)

Let me know if I can do anything else to help debug this.

zackkitzmiller, Aug 31 '16 15:08

Do you know in which version this started?

mitsuhiko, Aug 31 '16 16:08

Unfortunately, I don't. I do know the bug doesn't exist in raven==5.15.0, as we have a different service on Tornado running 5.15.0 just fine.

zackkitzmiller, Aug 31 '16 17:08

@zackkitzmiller so for now I would recommend disabling the breadcrumb support. There is a parameter in the client constructor to disable it. I'm looking into supporting breadcrumbs there, but I think there might be further issues.

For what it's worth, I have not been able to reproduce this yet, but there are definitely general issues with the concurrency model.

If you have anything you can share about your Tornado setup (in particular which Raven transport you are using and which requirements you have imported), that would be great.

mitsuhiko, Aug 31 '16 17:08

Sure, I'm going to stick to 5.15.0 for now.

requirements.txt:

amqp-dispatcher==0.8.0
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.3.2
boto==2.37.0
braintree==3.24.0
coverage==3.7.1
coveralls==0.5
decorator==3.3.2
docopt==0.6.2
ecdsa==0.11
elaphe==0.6.0
Fabric==1.8.3
Flask==0.10.1
freezegun==0.1.7
futures==2.1.6
Fuzzy==1.0
fuzzywuzzy==0.2
gevent==1.0.1
greenlet==0.4.7
gunicorn==18.0
haigha==0.9.0
hashids==1.1.0
httmock==1.2.3
httplib2==0.9
httpretty==0.8.3
intervaltree==2.1.0
iso8601==0.1.8
itsdangerous==0.24
Jinja2==2.8
lxml==3.4.1
MarkupSafe==0.23
mock==1.0.0
mockredispy==2.9.0.9
MySQL-python==1.2.3
ndg-httpsclient==0.4.0
newrelic==2.60.0.46
nose==1.3.3
nose-exclude==0.2.0
oauth2==1.5.211
objgraph==3.0.0
paramiko==1.12.4
paypalrestsdk==1.11.5
phpserialize==1.3
pika==0.10.0
Pillow==2.8.1
pyasn1==0.1.9
py-bcrypt==0.4
pycrypto==2.6.1
pycurl==7.43.0
pygeoip==0.2.7
pyOpenSSL==0.15.1
python-dateutil==1.5
python-Levenshtein==0.10.2
pytz==2011j
PyYAML==3.11
qrcode==5.1
Quadtree==0.1.2
raven==5.8.1
redis==2.10.3
repoze.lru==0.6
requests==2.4.3
six==1.9.0
Sift==1.1.2.6
SQLAlchemy==0.6.6
statsd==0.5.1
suds-jurko==0.6
testhttpserver==0.1.4
timelib==0.2.2
tornado==3.2
toro==1.0.1
ujson==1.34
Unidecode==0.4.9
urllib3==1.7.1
Werkzeug==0.10.4
xmltodict==0.8.6
fakeredis==0.6.2

We're running tornado with gunicorn (+ gevent).

We're using SentryMixin and AsyncSentryClient. As far as I know, we're using whatever the default transport is.

zackkitzmiller, Aug 31 '16 17:08

@zackkitzmiller I recommend sticking with the latest version but disabling the new features:

raven = Client(..., enable_breadcrumbs=False, hook_libraries=[], install_logging_hook=False)

There are important fixes since 5.15.0.

mitsuhiko, Aug 31 '16 17:08
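Applied to the Tornado setup described above (SentryMixin plus AsyncSentryClient), a minimal sketch of that recommendation could look like the following; it assumes AsyncSentryClient accepts the same constructor keyword arguments as the plain Client, and the DSN and handler here are placeholders:

import tornado.web
from raven.contrib.tornado import AsyncSentryClient, SentryMixin

class MyHandler(SentryMixin, tornado.web.RequestHandler):
    def get(self):
        self.write("ok")

application = tornado.web.Application([(r"/", MyHandler)])
# Stay on a recent Raven release, but switch off breadcrumbs, library hooks,
# and the logging hook while the CPU/memory growth is being debugged.
application.sentry_client = AsyncSentryClient(
    "https://<key>@sentry.example.com/<project>",  # placeholder DSN
    enable_breadcrumbs=False,
    hook_libraries=[],
    install_logging_hook=False,
)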

@mitsuhiko I'll run that by everyone here.

zackkitzmiller, Aug 31 '16 17:08

@zackkitzmiller is there any chance you might be using multiple raven clients by accident?

mitsuhiko, Sep 02 '16 19:09

I don't believe so. I'm not sure how (1) that would happen, or (2) how I would find out.

Happy to look into this.

zackkitzmiller, Sep 02 '16 19:09
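One rough way to answer both questions, sketched here on the assumption that every Raven client (including AsyncSentryClient) is a subclass of raven.base.Client, is to walk the garbage collector's heap and count live client instances:

import gc
from raven.base import Client

def live_raven_clients():
    # AsyncSentryClient subclasses Client, so isinstance() catches it as well.
    return [obj for obj in gc.get_objects() if isinstance(obj, Client)]

clients = live_raven_clients()
print len(clients), clients  # more than one entry suggests duplicate clients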

I can't confirm whether it's due to the same issue, but we're experiencing a memory leak/buildup that seems related to breadcrumbs. I've taken the advice mentioned above to see if our issue is solved that way.

aequitas, Oct 12 '16 15:10

@mitsuhiko I do not believe that we have two clients installed at the same time.

@aequitas Were you able to disable breadcrumbs with the latest client without any issue?

zackkitzmiller, Oct 12 '16 15:10

@zackkitzmiller no issues so far.

aequitas, Oct 13 '16 13:10

FWIW, I'm getting a similar issue caused by a memory leak originating from _wrap_logging_method.

There's something wrong with how new_func is created that prevents the CPython garbage collector from working correctly.

The issue goes away as soon as I set install_logging_hook to False.

charettes, Oct 30 '17 04:10
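For anyone trying to confirm a leak like this, one rough check with objgraph (already pinned in the requirements list above) is to compare object counts by type before and after a batch of requests; this is only a diagnostic sketch, not a fix:

import gc
import objgraph

gc.collect()
objgraph.show_growth()  # establish a baseline of object counts by type

# ... exercise the handlers / loggers that go through raven's hooks ...

gc.collect()
objgraph.show_growth()  # types that keep growing between calls are leak suspects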

@charettes try again with 6.3.0. We had CPU issues with 6.2.5 (or a previous version).

fyaretskiyso, Nov 09 '17 18:11

@fyaretskiyso thanks for mentioning it but we are experiencing the same issue with 6.3.0.

charettes, Nov 09 '17 18:11

We've been having what seems to be this issue on our servers for the past few months, causing a lot of server instability and headaches trying to figure out what was going on. For us it seems to be triggered when certain timeouts pile up, which causes Raven to start consuming a lot of CPU and memory, causing more timeouts and eventually causing our server to hang. I've created some example code to reproduce the problem:

import sys
import time
import tornado.gen
import tornado.httpclient
import tornado.httpserver
import tornado.ioloop
import tornado.web
from raven.contrib.tornado import AsyncSentryClient, SentryMixin

# Modify the time.time function so we can virtually fast-forward time.
old_time = time.time
inc_time = 0
def new_time(*args, **kwargs):
    return old_time() + inc_time
time.time = new_time

http_client = tornado.httpclient.AsyncHTTPClient()

class MainHandler(SentryMixin, tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        global inc_time
        # Simulate a long-running request here by adding 30 seconds to time.time.
        inc_time += 30
        # time.sleep(5)
        raise  # deliberately fail so Sentry captures an error on every request

application = tornado.web.Application([
    (r"/", MainHandler),
])
application.sentry_client = AsyncSentryClient(
    sentry_url_here,  # replace with your Sentry DSN
#    enable_breadcrumbs=False,
)

def main():
    if sys.argv[1] == "0":
        application.listen(6000)
        tornado.ioloop.IOLoop.instance().start()
    elif sys.argv[1] == "1":
        count = 0
        while True:
            count += 1
            try:
                tornado.httpclient.HTTPClient().fetch("http://localhost:6000/")
            except KeyboardInterrupt:
                raise
            except Exception:
                pass
            print count

if __name__ == '__main__':
    main()

Run python script.py 0 in one terminal and python script.py 1 in another to reproduce the results. When count reaches about 30 in the while loop, the server really starts to slow to a crawl. The CPU time is being spent in the Client.encode function of base.py, as the size of the JSON it's trying to output reaches hundreds of megabytes. Turning off breadcrumbs seems to fix the issue.
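To watch that payload growth directly, one hypothetical option is a client subclass that prints the size of every serialized event; it assumes Client.encode(data), mentioned above, returns the serialized payload:

from raven.contrib.tornado import AsyncSentryClient

class SizeLoggingSentryClient(AsyncSentryClient):
    def encode(self, data):
        encoded = super(SizeLoggingSentryClient, self).encode(data)
        # Print how large each serialized event is; runaway breadcrumb
        # accumulation shows up here long before the process hangs.
        print 'sentry payload: %d bytes' % len(encoded)
        return encoded

Swapping this in for AsyncSentryClient in the repro above should show the payload size climbing with count.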

In the script I've modified the time.time function to simulate a long-running, blocking operation. This is not entirely fair, since it doesn't give any extra time for garbage collection, and using time.sleep(30) instead seems to cause the problem to either not happen, or at least happen very slowly. I've found that in this example, using a time.sleep(5) instead will reproduce the problem without having to modify time.time at all (in this case, by count = 20 the size of the JSON reaches 100 KB, by count = 60 it's about 70 MB, and at count = 86 it's 1.4 GB and the single Tornado application is taking up 21 GB of memory).

Strangely enough, when the time.sleep(5) and the modified time.time are removed, there is no issue and the JSON size doesn't seem to go above 35 KB.

The version of Raven used for the above was 6.7.0, but the problem still seems to exist for older versions, even back to 5.15.0, though different versions behave slightly differently.

BarryJRowe, May 11 '18 23:05

I'm having exactly the same issue as @BarryJRowe; we had to stop using Sentry for our services. I've looked into it, and AsyncSentryClient was trying to send multiple requests with 200 MB of data, so processes were growing to several gigabytes of memory usage just for Sentry processing!

fabiopedrosa, May 21 '18 21:05