Uploading more than once to a path adds it to subfolders
I am not sure if this is intended; please let me know if it is.
When uploading a directory for the first time, it adds the data to the correct spot, i.e. /hdfs-path/sub-folder. However, when trying to add more data to the same place, it outputs it to /hdfs-path/sub-folder/<local_name>/ instead.
If this is not the intended behavior, I believe the culprit is on line 553, where hdfs_path and local_name are joined. When I removed local_name from the join, all the data was uploaded into hdfs_path without creating any subfolders.
https://github.com/mtth/hdfs/blob/5b40065adbe1a5627b0b513daf13b41c9819a9be/hdfs/client.py#L553
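For context, here is a minimal sketch of the destination logic as I understand it; resolve_destination and hdfs_path_exists are illustrative names, not the library's actual code:

import os
import posixpath

def resolve_destination(hdfs_path, local_path, hdfs_path_exists):
    # When the target directory already exists, the local directory's
    # basename is appended, which is what produces
    # /hdfs-path/sub-folder/<local_name>/ on a second upload.
    if hdfs_path_exists:
        return posixpath.join(hdfs_path, os.path.basename(local_path))
    return hdfs_path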
EDIT
Code used:
# Repeatedly uploading the same local directory to the same HDFS path
for p in files:
    file_path = "sub_folder"
    upload_path = "%s/%s" % ("/hdfs-path", "sub_folder")
    client.upload(upload_path, file_path, overwrite=True, n_threads=0)
After a bit more debugging, I found that if the HDFS path already exists, the name of the folder the files come from is appended to it. I need the files to be added to the specified directory, not to the directory plus a subfolder. To remedy this, I created a new variable called use_existing; when True, it uses hdfs_path alone rather than hdfs_path joined with local_name.
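Roughly what I have in mind, sketched against the same illustrative helper as above (not the actual patch):

def resolve_destination(hdfs_path, local_path, hdfs_path_exists,
                        use_existing=False):
    # With use_existing=True, contents land directly in hdfs_path even
    # when it already exists; the default preserves current behavior.
    if hdfs_path_exists and not use_existing:
        return posixpath.join(hdfs_path, os.path.basename(local_path))
    return hdfs_path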
Again, let me know if my understanding is off, or if you would like a PR with the added variable.
Thanks for the detailed report. Your understanding is correct. It is implemented this way to be consistent with local commands:
# In an empty directory
$ mkdir src1 src2
$ cp -r src1 dst # Copies src1 as dst
$ cp -r src2 dst # Copies src2 as dst/src2
As you point out, there is a usability gap, though. You can achieve what you are trying to do locally by globbing (cp -r src2/* dst), but there is no equivalent here, at least until https://github.com/mtth/hdfs/issues/105. I think this justifies adding an option; if you send a PR, I would be happy to review it.
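In the meantime, here is a sketch of a workaround with the current API that mimics the glob: upload each entry of the local directory to an explicit destination (local_dir and hdfs_dir are placeholder names):

import os
import posixpath

local_dir = "sub_folder"
hdfs_dir = "/hdfs-path/sub-folder"
for name in os.listdir(local_dir):
    # Spelling out each destination path mirrors `cp -r src/* dst`,
    # so nothing gets nested under the local directory's name.
    client.upload(
        posixpath.join(hdfs_dir, name),
        os.path.join(local_dir, name),
        overwrite=True,
    )

Note this only covers top-level entries; a subdirectory that already exists remotely would still hit the same nesting behavior.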