
Significant memory increase when connecting to HDFS and uploading files via pyarrow

Open apache135 opened this issue 3 years ago • 1 comments

pyarrow version: 7.0.0. When I connect to HDFS through pyarrow to get a FileSystem object, memory increases from 64 MB to 254 MB. I then use this FileSystem object to upload a 2 GB file to HDFS, and memory increases to 289 MB. After 10 minutes the memory is still the same. How can I release the memory?

apache135 avatar Sep 28 '22 15:09 apache135

@apache135 can you share an example of the code you are working with?

AlenkaF avatar Oct 12 '22 06:10 AlenkaF

import os
import time
import psutil
import pyarrow as pa


def get_hdfs_client():
    ticket_path = "/tmp/krb5cc_3114"
    return pa.hdfs.connect(user=os.getenv('USER'), kerb_ticket=ticket_path)


def upload(hdfs_client):
    local_upload_path = "/tmp/lvtest/python399.zip"
    hdfs_path = "/tmp/python399.zip"
    try:
        with open(local_upload_path, "rb") as f_stream:
            hdfs_client.upload(hdfs_path, f_stream, buffer_size=128*1024*1024)
    except Exception as e:
        print(f"upload {hdfs_path} fail: {e}")
    else:
        # Report success only when the upload did not raise
        print("upload: ok")

    time.sleep(1)
    try:
        hdfs_client.rm(hdfs_path)
    except Exception as e:
        print(f"rm {hdfs_path} fail: {e}")


if __name__ == '__main__':
    print('========= test start ========')
    start_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024
    hdfs_client = get_hdfs_client()
    after_get_client_memory = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024
    print(fr"test done. start memory: {start_memory}, "
          fr"get hdfs client: end memory: {after_get_client_memory}")
    upload(hdfs_client)
    print(fr"finish upload,end memory: {psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024}")
    time.sleep(60)

Here is the result:

test done. start memory: 54.19921875, get hdfs client: end memory: 394.4609375
upload: ok
finish upload,end memory: 2731.8046875

apache135 avatar Nov 22 '22 06:11 apache135
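The script above takes one RSS reading after each step, so the growth per step has to be worked out by hand. A minimal sketch of a helper that records labeled readings and reports the difference between consecutive ones is shown here. `RssLog` and `current_rss_mib` are hypothetical names, not part of pyarrow or psutil, and the default reader assumes Linux, where `/proc/self/statm` is available. The usage at the bottom replays the three values reported above rather than measuring anything.

```python
import os
from typing import Callable, List, Tuple


def current_rss_mib() -> float:
    """Return this process's resident set size in MiB.

    Assumption: Linux. The second field of /proc/self/statm is the
    number of resident pages.
    """
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 1024 / 1024


class RssLog:
    """Collect labeled RSS readings and report growth between steps."""

    def __init__(self, read_rss: Callable[[], float] = current_rss_mib) -> None:
        self._read_rss = read_rss
        self.readings: List[Tuple[str, float]] = []

    def mark(self, label: str) -> None:
        """Take a reading now and store it under the given label."""
        self.readings.append((label, self._read_rss()))

    def deltas(self) -> List[Tuple[str, float]]:
        """Return (label, growth in MiB since the previous reading)."""
        return [
            (label, value - prev_value)
            for (_, prev_value), (label, value) in zip(self.readings, self.readings[1:])
        ]


if __name__ == "__main__":
    # Replay the readings reported in this thread instead of measuring.
    reported = iter([54.19921875, 394.4609375, 2731.8046875])
    log = RssLog(read_rss=lambda: next(reported))
    for step in ("start", "get hdfs client", "finish upload"):
        log.mark(step)
    for label, growth in log.deltas():
        print(f"{label}: +{growth:.1f} MiB")
```

With the reported values, getting the client accounts for roughly 340 MiB of growth and the upload for roughly 2337 MiB.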

The size of "/tmp/lvtest/python399.zip" is 3.1 GB.

Here is the process info at one point while the script was running:

PID      USER  PR  NI  VIRT   RES   SHR    S  %CPU  %MEM  TIME+    COMMAND
3146743  user  20  0   21.2g  3.7g  52912  S  17.3  5.8   0:13.43  python

apache135 avatar Nov 22 '22 06:11 apache135
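The RES figure above covers the whole process, including memory allocated by native code outside the Python heap. One possible way to narrow down where the growth comes from is `tracemalloc`, which only tracks allocations made through Python's own allocators. The sketch below is a diagnostic aid, not a fix, and `traced_python_memory` is a hypothetical helper name. If the traced peak stays small while RSS grows by gigabytes, the growth is probably not held by Python objects.

```python
import tracemalloc
from typing import Callable, Tuple


def traced_python_memory(step: Callable[[], object]) -> Tuple[int, int]:
    """Run step() and return (current, peak) bytes traced by tracemalloc.

    Only allocations made through Python's allocators are counted;
    memory allocated directly by native libraries is not visible here.
    """
    tracemalloc.start()
    try:
        # Keep the return value alive so it is still counted as "current".
        result = step()
        current, peak = tracemalloc.get_traced_memory()
        del result
    finally:
        tracemalloc.stop()
    return current, peak


if __name__ == "__main__":
    # Stand-in workload: a 10 MiB Python-level allocation.
    current, peak = traced_python_memory(lambda: bytearray(10 * 1024 * 1024))
    print(f"current: {current / 1024 / 1024:.1f} MiB, peak: {peak / 1024 / 1024:.1f} MiB")
```

In the script from this thread, the callable passed in would be something like `lambda: upload(hdfs_client)`, and the result can be compared with the RSS readings taken before and after.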

It may be similar to this issue: https://github.com/python/cpython/issues/104954

apache135 avatar Jun 09 '23 08:06 apache135
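One experiment that might help isolate the problem (not a confirmed fix) is to stream the file in small fixed-size chunks to a file-like destination, instead of calling `upload` with `buffer_size=128*1024*1024`, and then check whether RSS still grows with the file size. The sketch below is generic: `copy_in_chunks` is a hypothetical helper and the 8 MiB chunk size is an arbitrary assumption. With the newer `pyarrow.fs` API the destination could be a stream opened with `HadoopFileSystem.open_output_stream`, but that path is not used in this thread and is untested here.

```python
import io
from typing import BinaryIO


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int = 8 * 1024 * 1024) -> int:
    """Copy src to dst in chunks of at most chunk_size bytes.

    Returns the number of bytes copied. At most one chunk is held in
    Python memory at a time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes means end of input
            break
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the local file and the remote stream.
    source = io.BytesIO(b"x" * 1000 + b"y" * 24)
    destination = io.BytesIO()
    copied = copy_in_chunks(source, destination, chunk_size=100)
    print(f"copied {copied} bytes")
```

If RSS stays flat with this approach but grows with `upload`, the buffering inside the upload path is the more likely cause. If it grows either way, the memory is probably held elsewhere.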