
[Python] `pyarrow.Table.from_pandas()` causing memory leak

Open RizzoV opened this issue 2 years ago • 4 comments

Describe the bug, including details regarding any error messages, version, and platform.

Issue Description

(continuing from https://github.com/pandas-dev/pandas/issues/55296)

pyarrow.Table.from_pandas() causes a memory leak on DataFrames containing nested structs. A sample problematic data schema and a compliant data generator are included in the Reproducible Example below.

From the Reproducible Example:

  • 1st pa.Table.from_pandas() call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74     91.9 MiB     91.9 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76     91.9 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
  • 2000th call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    140.1 MiB    140.1 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    140.1 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
  • 10000th call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    329.4 MiB    329.4 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    329.5 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
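For scale, the profiler numbers above imply a roughly constant per-call leak. A quick back-of-the-envelope calculation using the call-1, call-2000, and call-10000 figures reported above:

```python
# Back-of-the-envelope leak rate from the memory_profiler output above.
early = (140.1 - 91.9) * 1024 / 2000   # MiB -> KiB, calls 1..2000
late = (329.4 - 91.9) * 1024 / 10000   # MiB -> KiB, calls 1..10000
print(round(early, 1), round(late, 1))  # 24.7 24.3 -> roughly constant
```

Both intervals work out to roughly 24-25 KiB leaked per call, consistent with a fixed amount of per-row payload being retained on every conversion.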

Reproducible Example

import os
import string
import sys
from random import choice, randint
from uuid import uuid4

import pandas as pd
import pyarrow as pa
from memory_profiler import profile

sample_schema = pa.struct(
    [
        ("a", pa.string()),
        (
            "b",
            pa.struct(
                [
                    ("ba", pa.list_(pa.string())),
                    ("bc", pa.string()),
                    ("bd", pa.string()),
                    ("be", pa.list_(pa.string())),
                    (
                        "bf",
                        pa.list_(
                            pa.struct(
                                [
                                    (
                                        "bfa",
                                        pa.struct(
                                            [
                                                ("bfaa", pa.string()),
                                                ("bfab", pa.string()),
                                                ("bfac", pa.string()),
                                                ("bfad", pa.float64()),
                                                ("bfae", pa.string()),
                                            ]
                                        ),
                                    )
                                ]
                            )
                        ),
                    ),
                ]
            ),
        ),
        ("c", pa.int64()),
        ("d", pa.int64()),
        ("e", pa.string()),
        (
            "f",
            pa.struct(
                [
                    ("fa", pa.string()),
                    ("fb", pa.string()),
                    ("fc", pa.string()),
                    ("fd", pa.string()),
                    ("fe", pa.string()),
                    ("ff", pa.string()),
                    ("fg", pa.string()),
                ]
            ),
        ),
        ("g", pa.int64()),
    ]
)


def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


@profile
def convert_df_to_table(df: pd.DataFrame) -> None:
    table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


def generate_random_data():
    return {
        "a": [generate_random_string(128)],
        "b": [
            {
                "ba": [generate_random_string(128) for i in range(50)],
                "bc": generate_random_string(128),
                "bd": generate_random_string(128),
                "be": [generate_random_string(128) for i in range(50)],
                "bf": [
                    {
                        "bfa": {
                            "bfaa": generate_random_string(128),
                            "bfab": generate_random_string(128),
                            "bfac": generate_random_string(128),
                            "bfad": randint(0, 2**32),
                            "bfae": generate_random_string(128),
                        }
                    }
                ],
            }
        ],
        "c": [randint(0, 2**32)],
        "d": [randint(0, 2**32)],
        "e": [generate_random_string(128)],
        "f": [
            {
                "fa": generate_random_string(128),
                "fb": generate_random_string(128),
                "fc": generate_random_string(128),
                "fd": generate_random_string(128),
                "fe": generate_random_string(128),
                "ff": generate_random_string(128),
                "fg": generate_random_string(128),
            }
        ],
        "g": [randint(0, 2**32)],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        convert_df_to_table(df)  # memory leak


if __name__ == "__main__":
    main()

Installed Versions

INSTALLED VERSIONS
------------------
python              : 3.10.9.final.0
python-bits         : 64
OS                  : Darwin
OS-release          : 22.6.0
Version             : Darwin Kernel Version 22.6.0: Fri Sep 15 13:39:52 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_X86_64
machine             : x86_64
processor           : i386
byteorder           : little
LC_ALL              : None
LANG                : it_IT.UTF-8
LOCALE              : it_IT.UTF-8

pyarrow             : 13.0.0
pandas              : 2.1.1
numpy               : 1.26.0

Component(s)

Python

RizzoV avatar Oct 03 '23 13:10 RizzoV

@RizzoV thanks for the report and nice reproducer!

I can reproduce this running your example with memray:

[memray memory usage plot omitted]

From the memray stats, it looks like the memory being held at the end mostly comes from the lists of strings, so somehow the conversion to Arrow seems to keep those list objects alive (I haven't yet looked at how that is possible, though). The pandas metadata conversion (the JSON dump) also seems to accumulate memory, although that is a bit strange (and I don't see that in the smaller reproducer below).
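The "list objects kept alive" observation can also be confirmed with the stdlib alone, without memray, by counting reachable list objects before and after the loop. The snippet below is a minimal sketch: `leaky_convert` and `_hidden` are hypothetical stand-ins for whatever retains the reference inside pyarrow (so the snippet runs without pyarrow installed); in the real case you would call `pa.Table.from_pandas()` instead.

```python
import gc

_hidden = []  # stand-in for whatever keeps references alive in pyarrow


def leaky_convert(payload):
    # Simulated leak: the reference to the payload list is never dropped.
    _hidden.append(payload)


def count_live_lists():
    gc.collect()
    return sum(1 for o in gc.get_objects() if isinstance(o, list))


before = count_live_lists()
for _ in range(100):
    leaky_convert(["x" * 128 for _ in range(50)])
after = count_live_lists()
print(after - before >= 100)  # prints True: the lists are still reachable
```

For a non-leaking conversion the count delta would drop back to zero after `gc.collect()`; a persistent positive delta that scales with the number of calls is the signature memray is reporting here.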

It seems to happen specifically when a list is nested inside another column (e.g. a struct of lists), so I can reproduce the observation with this simplified example as well:

import string
from random import choice

import pandas as pd
import pyarrow as pa


sample_schema = pa.struct(
    [
        ("a", pa.struct([("aa", pa.list_(pa.string()))])),
    ]
)


def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


def generate_random_data():
    return {
        "a": [{"aa": [generate_random_string(128) for i in range(50)]}],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


if __name__ == "__main__":
    main()

jorisvandenbossche avatar Oct 03 '23 18:10 jorisvandenbossche

@RizzoV / @jorisvandenbossche: any solution for the memory leak in to_parquet()? We have also been facing this issue for a long time.

Ashokcs94 avatar Dec 06 '23 04:12 Ashokcs94

@Ashokcs94 no solution from my side sadly, we still have to work around it
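The thread does not spell out the workaround, but one common pattern for leaks of this kind is to run each conversion in a short-lived child process, so the leaked allocations are reclaimed by the OS when the child exits. The sketch below is an assumption, not the reporter's actual fix; `fake_convert` is a hypothetical stand-in for the real `pa.Table.from_pandas(...)` call (which would return a serialized form, e.g. Parquet bytes, to the parent).

```python
import multiprocessing as mp


def fake_convert(data, queue):
    # Stand-in for the real conversion; sends the result back via the queue.
    queue.put(len(data))


def convert_in_subprocess(data):
    # "fork" avoids pickling the target function and keeps the sketch
    # self-contained (Unix-only; use "spawn" with an importable worker
    # module on other platforms).
    ctx = mp.get_context("fork")
    queue = ctx.SimpleQueue()
    p = ctx.Process(target=fake_convert, args=(data, queue))
    p.start()
    result = queue.get()
    p.join()  # any memory leaked in the child is freed when it exits
    return result


if __name__ == "__main__":
    print(convert_in_subprocess(["row"] * 10))  # prints 10
```

The per-call process overhead is significant, so in practice one would batch many conversions per worker and recycle the worker periodically (e.g. `maxtasksperchild` on a `multiprocessing.Pool`) rather than fork per call.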

RizzoV avatar Dec 06 '23 08:12 RizzoV

I believe I found a fix for this in https://github.com/apache/arrow/pull/40412, please take a look :)

chunyang avatar Mar 07 '24 23:03 chunyang

Issue resolved by pull request 40412 https://github.com/apache/arrow/pull/40412

jorisvandenbossche avatar Mar 15 '24 15:03 jorisvandenbossche