
Uploading with AsyncWriter can lead to OOM errors

Kaldie opened this issue 4 years ago • 3 comments

Currently, when using the AsyncWriter, it is possible to hit an OOM error because the internal queue can grow without bound.

For instance, this snippet fills the queue faster than the data can be sent via HTTPS to HDFS:

import csv
import random
import string

import hdfs

client = hdfs.InsecureClient(<valid arguments>)

with client.write("filename", encoding="utf-8") as file_handle:
  writer = csv.writer(file_handle)

  # writes 25 batches of 25 pseudo-random rows of CSV junk
  for element in [["".join(random.choice(string.ascii_letters) for _ in range(100)) for _ in range(25)] for _ in range(25)]:
    writer.writerows(element)

This leads to unmanageably large memory usage.
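
To make the mechanism concrete, here is a stripped-down model of what I believe happens (this is not the library's actual code): writes are handed to a background thread through a queue, and since an unbounded queue.Queue never blocks the producer, a slow consumer lets the backlog (and the memory holding it) grow without limit.

import queue
import threading
import time

q = queue.Queue()  # maxsize defaults to 0, i.e. unbounded

def slow_consumer():
    # Stands in for the upload thread: drains the queue one chunk
    # at a time, much slower than the producer below fills it.
    while True:
        chunk = q.get()
        if chunk is None:  # sentinel: stop consuming
            break
        time.sleep(0.001)  # simulate a slow HTTPS round trip

consumer = threading.Thread(target=slow_consumer)
consumer.start()

for _ in range(100_000):
    q.put(b"x" * 1024)  # never blocks, so the backlog keeps growing

print("backlog:", q.qsize())  # almost all chunks are still queued
q.put(None)
consumer.join()

A bounded queue (queue.Queue(maxsize=n)) would make put() block instead, throttling the producer to the consumer's pace.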

Would it be possible to set a limit on the queue size when creating a file_handle? If you like, I can create a PR with a possible solution.
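
At the call site, the knob I have in mind would look something like this (the queue_size keyword is purely hypothetical, only to illustrate the shape of the fix):

# Hypothetical sketch: "queue_size" is an invented keyword argument,
# not part of the actual client.write() signature.
with client.write("filename", encoding="utf-8", queue_size=128) as file_handle:
  writer = csv.writer(file_handle)
  # With a bounded internal queue, enqueueing blocks once 128 chunks
  # are buffered, capping memory at roughly queue_size * chunk_size.
  writer.writerows(element)  # same junk rows as in the snippet above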

Kaldie • Jun 23 '21 14:06

Hi @Kaldie. A PR for this would be welcome.

mtth • Jun 28 '21 14:06

Created a PR, but I can't seem to link it here :cry:

Kaldie • Jul 06 '21 06:07

Hi @mtth, could you have a look at the corresponding PR?

Kaldie • Sep 01 '21 11:09