[Python] `pyarrow.Table.from_pandas()` causing memory leak
Describe the bug, including details regarding any error messages, version, and platform.
Issue Description
(continuing from https://github.com/pandas-dev/pandas/issues/55296)
`pyarrow.Table.from_pandas()` causes a memory leak on DataFrames containing nested structs. A sample problematic data schema and a compliant data generator are included in the Reproducible Example below.
Memory profiles from the Reproducible Example:
- 1st `pa.Table.from_pandas()` call:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
74 91.9 MiB 91.9 MiB 1 @profile
75 def convert_df_to_table(df: pd.DataFrame):
76 91.9 MiB 0.0 MiB 1 table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
- 2000th call:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
74 140.1 MiB 140.1 MiB 1 @profile
75 def convert_df_to_table(df: pd.DataFrame):
76 140.1 MiB 0.0 MiB 1 table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
- 10000th call:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
74 329.4 MiB 329.4 MiB 1 @profile
75 def convert_df_to_table(df: pd.DataFrame):
76 329.5 MiB 0.0 MiB 1 table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
Reproducible Example
import os
import string
import sys
from random import choice, randint
from uuid import uuid4
import pandas as pd
import pyarrow as pa
from memory_profiler import profile
sample_schema = pa.struct(
    [
        ("a", pa.string()),
        (
            "b",
            pa.struct(
                [
                    ("ba", pa.list_(pa.string())),
                    ("bc", pa.string()),
                    ("bd", pa.string()),
                    ("be", pa.list_(pa.string())),
                    (
                        "bf",
                        pa.list_(
                            pa.struct(
                                [
                                    (
                                        "bfa",
                                        pa.struct(
                                            [
                                                ("bfaa", pa.string()),
                                                ("bfab", pa.string()),
                                                ("bfac", pa.string()),
                                                ("bfad", pa.float64()),
                                                ("bfae", pa.string()),
                                            ]
                                        ),
                                    )
                                ]
                            )
                        ),
                    ),
                ]
            ),
        ),
        ("c", pa.int64()),
        ("d", pa.int64()),
        ("e", pa.string()),
        (
            "f",
            pa.struct(
                [
                    ("fa", pa.string()),
                    ("fb", pa.string()),
                    ("fc", pa.string()),
                    ("fd", pa.string()),
                    ("fe", pa.string()),
                    ("ff", pa.string()),
                    ("fg", pa.string()),
                ]
            ),
        ),
        ("g", pa.int64()),
    ]
)
def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


@profile
def convert_df_to_table(df: pd.DataFrame) -> None:
    table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


def generate_random_data():
    return {
        "a": [generate_random_string(128)],
        "b": [
            {
                "ba": [generate_random_string(128) for i in range(50)],
                "bc": generate_random_string(128),
                "bd": generate_random_string(128),
                "be": [generate_random_string(128) for i in range(50)],
                "bf": [
                    {
                        "bfa": {
                            "bfaa": generate_random_string(128),
                            "bfab": generate_random_string(128),
                            "bfac": generate_random_string(128),
                            "bfad": randint(0, 2**32),
                            "bfae": generate_random_string(128),
                        }
                    }
                ],
            }
        ],
        "c": [randint(0, 2**32)],
        "d": [randint(0, 2**32)],
        "e": [generate_random_string(128)],
        "f": [
            {
                "fa": generate_random_string(128),
                "fb": generate_random_string(128),
                "fc": generate_random_string(128),
                "fd": generate_random_string(128),
                "fe": generate_random_string(128),
                "ff": generate_random_string(128),
                "fg": generate_random_string(128),
            }
        ],
        "g": [randint(0, 2**32)],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        convert_df_to_table(df)  # memory leak


if __name__ == "__main__":
    main()
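The per-call profiles above can also be condensed into a single number. The following is a minimal sketch using only the standard library; `traced_growth` is a hypothetical helper name, not part of pyarrow or memory_profiler. It assumes the leaked memory is allocated through Python's own allocators (which `tracemalloc` tracks), so it would not see memory held only by Arrow's memory pool.

```python
import gc
import tracemalloc
from typing import Callable


def traced_growth(func: Callable[[], object], iterations: int = 1000) -> int:
    """Return the change in traced Python memory (bytes) across repeated calls.

    Hypothetical helper for illustration; the return value of each call
    to `func` is discarded immediately.
    """
    gc.collect()
    tracemalloc.start()
    try:
        func()  # warm-up call so one-time caches are not counted as growth
        gc.collect()
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            func()
        gc.collect()  # rule out garbage that is merely waiting for collection
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Possible usage with the reproducer (assumes `df` and `sample_schema` exist):
# growth = traced_growth(
#     lambda: pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
# )
# print(f"{growth / 1024:.1f} KiB retained after 1000 calls")
```

A result that keeps scaling with `iterations` would point to retained objects rather than allocator caching, which is what the commented-out `pa.jemalloc_set_decay_ms(0)` line in the reproducer was trying to rule out.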
Installed Versions
INSTALLED VERSIONS
------------------
python : 3.10.9.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Fri Sep 15 13:39:52 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8
pyarrow : 13.0.0
pandas : 2.1.1
numpy : 1.26.0
Component(s)
Python
@RizzoV thanks for the report and nice reproducer!
I can reproduce this by running your example with memray.
From the memray stats, it looks like the memory still held at the end comes mostly from the lists of strings, so the conversion to Arrow somehow seems to keep those list objects alive (I haven't yet looked at how that is possible, though). The pandas metadata conversion (the JSON dump) also seems to accumulate memory, although that's a bit strange (and I don't see it in the smaller reproducer below).
It seems to happen specifically when a list is nested inside another column (e.g. a struct of lists), so I can also reproduce the observation with this simplified example:
import string
from random import choice
import pandas as pd
import pyarrow as pa
sample_schema = pa.struct(
    [
        ("a", pa.struct([("aa", pa.list_(pa.string()))])),
    ]
)


def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


def generate_random_data():
    return {
        "a": [{"aa": [generate_random_string(128) for i in range(50)]}],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


if __name__ == "__main__":
    main()
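One way to test the hypothesis that the conversion keeps the inner lists alive is to compare the reference count of one such list before and after a call. This is a rough sketch with a hypothetical helper, `refcount_delta`, built only on `sys.getrefcount`; it has not been run against pyarrow here, and the commented usage assumes the names from the simplified example above.

```python
import sys
from typing import Callable


def refcount_delta(obj: object, func: Callable[[], object]) -> int:
    """Return how many extra references to `obj` remain after calling `func`.

    Hypothetical helper for illustration. The return value of `func` is
    discarded before the second measurement, so references held only by
    that result are not counted.
    """
    before = sys.getrefcount(obj)
    func()
    after = sys.getrefcount(obj)
    return after - before


# Possible usage with the simplified example (assumed names):
# data = generate_random_data()
# inner = data["a"][0]["aa"]          # the list nested inside the struct
# df = pd.DataFrame.from_dict(data)
# leaked = refcount_delta(
#     inner,
#     lambda: pa.Table.from_pandas(df, schema=pa.schema(sample_schema)),
# )
# A positive value would suggest the conversion left references behind.
```

Since the DataFrame holds the same references before and after the call, any nonzero delta would have to come from the conversion itself.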
@RizzoV / @jorisvandenbossche: is there any solution for the memory leak in to_parquet()? We have also been facing this issue for a long time.
@Ashokcs94 no solution from my side, sadly. We still have to work around it.
I believe I found a fix for this in https://github.com/apache/arrow/pull/40412, please take a look :)
Issue resolved by pull request 40412 https://github.com/apache/arrow/pull/40412