Significant memory increase when connecting to HDFS and uploading files via pyarrow
pyarrow version: 7.0.0. After connecting to HDFS to get a FileSystem object via pyarrow, memory increased from 64 MB to 254 MB. I then used this FileSystem object to upload a 2 GB file to HDFS, and memory increased to 289 MB; 10 minutes later the memory was still the same. How can I release the memory?
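One way to narrow this down is to compare the process RSS with what Arrow's own allocator reports: if Arrow's pool stays near zero while RSS grows, the memory lives outside Arrow (for example in the JVM heap that libhdfs starts in-process, or in pages the C allocator has not returned to the OS). A minimal diagnostic sketch, assuming only pyarrow and psutil are installed:

```python
import os

import psutil
import pyarrow as pa

pool = pa.default_memory_pool()
rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

# Arrow-tracked allocations vs. total resident memory of the process
print(f"arrow pool: {pool.bytes_allocated() / 1024 / 1024:.1f} MB "
      f"(backend: {pool.backend_name}), process RSS: {rss_mb:.1f} MB")

# Some pyarrow versions can hand unused pool pages back to the OS
if hasattr(pool, "release_unused"):
    pool.release_unused()
```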
@apache135 can you share an example of the code you are working with?
```python
import os
import time

import psutil
import pyarrow as pa


def get_hdfs_client():
    # Kerberos ticket cache for the connecting user
    ticket_path = "/tmp/krb5cc_3114"
    return pa.hdfs.connect(user=os.getenv('USER'), kerb_ticket=ticket_path)


def upload(hdfs_client):
    local_upload_path = "/tmp/lvtest/python399.zip"
    hdfs_path = "/tmp/python399.zip"
    try:
        with open(local_upload_path, "rb") as f_stream:
            # upload with a 128 MB write buffer
            hdfs_client.upload(hdfs_path, f_stream, buffer_size=128 * 1024 * 1024)
    except Exception as e:
        print(f"upload {hdfs_path} fail: {e}")
    else:
        print("upload: ok")
    time.sleep(1)
    try:
        hdfs_client.rm(hdfs_path)
    except Exception as e:
        print(f"rm {hdfs_path} fail: {e}")


def rss_mb():
    # resident set size of this process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


if __name__ == '__main__':
    print('========= test start ========')
    start_memory = rss_mb()
    hdfs_client = get_hdfs_client()
    after_get_client_memory = rss_mb()
    print(f"test done. start memory: {start_memory}, "
          f"get hdfs client: end memory: {after_get_client_memory}")
    upload(hdfs_client)
    print(f"finish upload,end memory: {rss_mb()}")
    time.sleep(60)
```
Here is the result:

```
test done. start memory: 54.19921875, get hdfs client: end memory: 394.4609375
upload: ok
finish upload,end memory: 2731.8046875
```
"/tmp/lvtest/python399.zip" size is 3.1G
Here is the info from a certain time while the script is running:

```
PID      USER  PR  NI  VIRT   RES   SHR    S  %CPU  %MEM  TIME+    COMMAND
3146743  user  20  0   21.2g  3.7g  52912  S  17.3  5.8   0:13.43  python
```
It may be similar to this issue: https://github.com/python/cpython/issues/104954
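If it is the same effect as in that CPython issue (glibc keeping freed pages in its malloc arenas instead of returning them to the OS), a quick experiment is to force a trim after the upload and see whether RSS drops. A sketch, Linux/glibc only:

```python
import ctypes
import gc

gc.collect()  # drop Python-level references first
try:
    # glibc-specific: ask malloc to return free arena pages to the OS
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    pass  # not glibc (e.g. musl, macOS); nothing to trim this way
```

If RSS drops after malloc_trim(0), the growth is allocator retention rather than a pyarrow leak; if it does not, the memory is more likely held by the libhdfs JVM heap, whose size can be capped with JVM options in the LIBHDFS_OPTS environment variable (e.g. -Xmx512m).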