GlideRecord.query has potential memory leak or unnecessary list accumulation
I'd like to use pysnc to iteratively (with fixed space/memory) consume a very large ServiceNow table. Based on what I saw in the docs and looking at the source code of PySNC, it appears GlideRecord.query performs pagination and uses an iterator to yield results.
However, I'm using pysnc==1.1.7 and experienced about 9% increase in memory use for every 5000 records queried. It's as if pysnc is storing all records in memory even after I've consumed or iterated past those records, which isn't very standard Python iterator/generator behaviour.
I'd like the issue resolved in way where the script below runs without crashing and consumes only static/fixed memory, perhaps only storing/yielding rows of a single "batch".
Here's a minimum repro script
#!/usr/bin/env python3
import pysnc
import dataclasses
import logging
import os
import typing as t
@dataclasses.dataclass(frozen=True)
class Filter:
key: str
val: str
def __str__(self) -> str:
return f"{self.key}={self.val}"
@dataclasses.dataclass(frozen=True)
class TableQuery:
table: str
filters: t.List[Filter]
limit: int
BATCH_SIZE = 5000
def query(client: pysnc.ServiceNowClient, params: TableQuery) -> t.Iterator[pysnc.GlideRecord]:
gr = client.GlideRecord(params.table, BATCH_SIZE)
gr.display_value = False
gr.limit = params.limit
for filter_ in params.filters:
gr.add_query(filter_.key, filter_.val)
gr.query()
for r in gr:
yield r
def consume_stream(records: t.Iterator[pysnc.GlideRecord]) -> None:
""" dummy consumer function kept short for brevity
in reality there is extra business logic before using csv.DictWriter """
for r in records:
r.serialize()
def main() -> None:
# Setup
logging.basicConfig(level=logging.DEBUG)
env = os.environ.copy()
username, password = env["SERVICE_NOW_TOKEN"].split(':', 1)
client = pysnc.ServiceNowClient(env["SERVICE_NOW_URL"], (username, password))
# Go!
params = TableQuery("sys_user", [Filter("active", "true")], 10**6)
stream = query(client, params)
consume_stream(stream)
if __name__ == "__main__":
main()
Script output
$ ./pysync_query-table-oom.py
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): ${REDACTED}.service-now.com:443
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=0&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=5000&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=10000&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=15000&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=20000&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=25000&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=30000&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:https://${REDACTED}.service-now.com:443 "GET /api/now/table/sys_user?sysparm_query=active%3Dtrue%5EORDERBYsys_id&sysparm_display_value=false&sysparm_limit=5000&sysparm_offset=35000&sysparm_exclude_reference_link=true&sysparm_suppress_pagination_header=true HTTP/1.1" 200 None
Killed
Condensed monitoring output: CSV size vs System Memory use
$ while true; do ps auxw && wc -l /root/Downloads/2024-06-09_sn-sys-users.csv; sleep 1; done
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6224 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 4.0 9.9 685540 400544 pts/0 S+ 21:44 0:02 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/app-4PlAip
0 /root/Downloads/2024-06-09_sn-sys-users.csv
...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6224 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 2.8 10.2 696184 411056 pts/0 S+ 21:44 0:02 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/app-4PlAip
5000 /root/Downloads/2024-06-09_sn-sys-users.csv
...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6272 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 3.2 19.0 1054144 764044 pts/0 S+ 21:44 0:04 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/
9999 /root/Downloads/2024-06-09_sn-sys-users.csv
...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6224 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 4.1 27.4 1389976 1100128 pts/0 S+ 21:44 0:06 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/app-4PlAip
root 6983 0.0 0.1 262124 7136 pts/1 R+ 21:47 0:00 /usr/bin/ps auxw
15000 /root/Downloads/2024-06-09_sn-sys-users.csv
...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6288 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 3.8 37.4 1794336 1503436 pts/0 S+ 21:44 0:08 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/app-4PlAip
root 7093 0.0 0.1 262124 7136 pts/1 R+ 21:48 0:00 /usr/bin/ps auxw
19995 /root/Downloads/2024-06-09_sn-sys-users.csv
...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6312 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 4.1 45.7 2134304 1838488 pts/0 S+ 21:44 0:10 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/app-4PlAip
root 7145 0.0 0.1 262124 7136 pts/1 R+ 21:48 0:00 /usr/bin/ps auxw
25000 /root/Downloads/2024-06-09_sn-sys-users.csv
...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6292 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 3.8 54.5 2489000 2188268 pts/0 S+ 21:44 0:13 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/app-4PlAip
root 7293 0.0 0.1 262124 6880 pts/1 R+ 21:50 0:00 /usr/bin/ps auxw
29994 /root/Downloads/2024-06-09_sn-sys-users.csv
...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 230208 6992 pts/0 Ss 08:48 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 5300 0.0 0.1 229888 6232 pts/1 Ss 21:17 0:01 /mnt/lima-rosetta/rosetta /usr/bin/bash
root 6698 3.9 63.2 2845000 2539212 pts/0 S+ 21:44 0:15 /mnt/lima-rosetta/rosetta /root/.local/share/virtualenvs/app-4PlAip
root 7303 0.0 0.1 262124 6272 pts/1 R+ 21:50 0:00 /usr/bin/ps auxw
34999 /root/Downloads/2024-06-09_sn-sys-users.csv
Good point, I could drop records on yield, as that would be the expected python behavior. I would likely keep .next() from dropping, and require a new .query() if you needed to start over.
This would also likely be a breaking change to many consumers, so it'll have to be major version update
Thank you for the quick response!
Does a major version upgrade imply a longer turnout time (bundling other breaking changes together) compared to regular fixes?
No, just indicates the API changed enough that it could break how people use it.
@naushadh what I've done is wrong, here, but I'm trying to see if I can avoid breaking changes. What do you think about this syntax instead:
def query(client: pysnc.ServiceNowClient, params: TableQuery) -> t.Iterator[pysnc.GlideRecord]:
with client.GlideRecord(params.table, BATCH_SIZE) as gr:
gr.display_value = False
gr.limit = params.limit
for filter_ in params.filters:
gr.add_query(filter_.key, filter_.val)
gr.query()
for r in gr:
yield r
We'd have the technically incorrect behavior as default but the normal expected behavior when using with -- though there's probably some things i need to think through here...
Hey @vetsin, I don’t mind using an alternate syntax to achieve the right behavior.
However, it might be challenging to reconcile the difference between streaming (new) and rewind (current) behavior in the same GlideRecord class.
Perhaps the two consumptions styles require 2 separate GlideRecord implementations?
Another approach is to update the client to support a “query” function which yields an iterator of GlideRecords. Current users get current behavior with GlideRecord query (and a deprecation warning), and new users get fixed behavior from the client.
Writing the following down for notes/myself:
I believe, in pythonic terms, this is what is what would truly be expected:
gr = client.GlideRecord(table)
gr.query()
for r in iter(gr):
assert r.sys_id, 'should pass'
for r in iter(gr):
assert r.sys_id, 'should pass'
while gr.next():
assert gr.sys_id, 'should pass'
for r in gr:
assert r.sys_id, 'should pass'
for r in gr:
assert r.sys_id, 'should fail'
A possibility would be query() to return an iterable that's different from self (perhaps even just iter(self)?), however the PEP does require that __iter__ return self
Been swamped so haven't got this working -- some work to be done on how .query() behaves with non-rewinding
python's built-in range keeps producing values as is, but wrapping it in iter makes it exhaustible.
>>> x = range(1,5)
>>> list(x)
[1, 2, 3, 4]
>>> list(x)
[1, 2, 3, 4]
>>> y = iter(x)
>>> list(y)
[1, 2, 3, 4]
>>> list(y)
[]
>>> list(x)
[1, 2, 3, 4]
Another possible API is also specifying to the GlideRecord rewind=False in the constructor. Existing users will get rewind=True as a default, and new users can opt-in to the single use behaviour.
released into 1.1.8 but didn't document the change -- mostly to get dependency updates out
@naushadh i presume by the lack of complaints it works-as-expected from you? If so i'll document it into the next release.