bug: opendal Python binding can hang during file.write
Describe the bug
When calling file.write (uploading a file to Tencent COS via the s3 scheme), there is a small chance that the call hangs and never returns.
strace -p <pid> output:
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=27695447}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=32828162}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=37962838}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=43111763}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=48237693}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
......
futex(0x7ff1bc73fe58, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
....
Maybe file.write does not release the GIL, so all Python threads hang.
Steps to Reproduce
no
Expected Behavior
no
Additional Context
No response
Are you willing to submit a PR to fix this bug?
- [ ] Yes, I would like to submit a PR.
Hi @seewindcn! I'm Dosu and I’m helping the opendal team.
There are a few known issues in the opendal Python bindings that can cause hangs or blocking during file.write, especially with S3-compatible backends like Tencent COS. One common cause is the interaction between the underlying Tokio async runtime and Python's multiprocessing/fork behavior: if the GIL is not properly released, or if runtime resources are not reinitialized after a fork, lock contention can hang every Python thread. This is a broader problem affecting all Python crates that depend on Tokio, not just OpenDAL [source].
Another area to check is whether the file is properly closed after writing. Failing to call close or shutdown on the writer can leave resources hanging or uploads incomplete, especially for multipart uploads [source] [source].
To help narrow this down, could you share:
- The opendal and Python versions you're using
- Whether you're using multiprocessing or forking in your application
- If the hang happens with both sync and async APIs
- A Python stack trace captured during the hang (e.g. with faulthandler.dump_traceback_later; a sketch follows after this list)
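A minimal way to capture such a trace with only the standard library; the 60-second interval and the log path are arbitrary placeholders:

```python
import faulthandler

# Dump the traceback of every Python thread if the process is still
# running after 60 seconds, and keep repeating every 60 seconds.
# Writing to a dedicated file keeps the dump out of normal output.
trace_file = open("/tmp/opendal_hang_trace.log", "w")  # placeholder path
faulthandler.dump_traceback_later(60, repeat=True, file=trace_file)

# ... run the upload code that occasionally hangs ...

# Cancel the watchdog once the upload finishes normally.
faulthandler.cancel_dump_traceback_later()
```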
In the meantime, make sure to always call close/shutdown on the file after writing, and avoid forking after initializing opendal. If you are using multiprocessing, try to initialize opendal objects after the fork, not before. If you can reproduce the issue with a minimal script, that would also help pinpoint the cause.
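If multiprocessing is involved, here is a sketch of the "initialize after the fork" advice; the bucket, endpoint, and region values are placeholders for whatever your COS setup actually needs:

```python
import multiprocessing as mp

import opendal


def upload_worker(path: str, data: bytes) -> None:
    # Build the Operator inside the child process, after the fork, so the
    # Tokio runtime and its worker threads are not inherited from the parent.
    op = opendal.Operator(
        "s3",
        bucket="my-bucket",                   # placeholder
        endpoint="https://cos.example.com",   # placeholder
        region="ap-guangzhou",                # placeholder
    )
    op.write(path, data)


if __name__ == "__main__":
    p = mp.Process(target=upload_worker, args=("dir/file.bin", b"payload"))
    p.start()
    p.join()
```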
To reply, just mention @dosu.
There are a few known issues that can cause hangs or blocking in the opendal Python bindings when writing to S3-compatible backends like Tencent COS, even with the sync API and without multiprocessing or forking. The most common causes relate to multipart upload handling and resource cleanup. Always call close on the file writer after writing: missing this step can leave uploads incomplete or hang the process, especially for multipart uploads, where every part must be finalized before the operation completes. Simply dropping the writer or relying on garbage collection is not enough; an explicit close is required for proper cleanup and to avoid silent failures or hangs [source] [source].
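As a concrete illustration of the close requirement, a minimal sketch using the sync API (the operator `op` and the path are assumed to already exist):

```python
# Preferred: the context manager closes the writer even on error,
# which is what finalizes a multipart upload.
with op.open("dir/file.bin", "wb") as f:
    f.write(b"chunk-1")
    f.write(b"chunk-2")

# Equivalent explicit form: close() must run on every code path.
f = op.open("dir/file.bin", "wb")
try:
    f.write(b"chunk-1")
finally:
    f.close()
```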
Timeouts or hangs can also occur if the number of concurrent upload tasks exceeds the backend's or OpenDAL's concurrency limits, or if there are network issues while completing a multipart upload [source]. These issues can be intermittent and hard to reproduce.
There is no direct evidence of a bug in opendal==0.45.20 causing sync file.write to hang with S3 backends, but related issues with multipart upload handling and resource cleanup have been reported. If you continue to see hangs, try enabling more detailed logging or capturing a Python stack trace during the hang, and check if the issue correlates with large files or specific network conditions. If possible, share a minimal reproducible example or more details about the file sizes and backend behavior for deeper diagnosis.
To reply, just mention @dosu.
Hi @seewindcn, can you share the code you used?
def _writes():
    for chunk in resp.iter_content(chunk_size=10240):  # resp is a requests Response
        if chunk:
            f.write(chunk)
            if _sf:
                _sf.write(chunk)  # hang

with open(save_file, 'wb') as f:  # f is a local file object
    if _store:
        logging.info('[build]save_jenkins_file(%s) to %s@s3', save_file, C.STORE_OPS)
        _sf = _store.op.open(save_file, 'wb')  # _sf is opendal's sync file object
        with _sf:
            _writes()
    else:
        _writes()
@Xuanwo maybe it is blocked by the GIL: https://pyo3.rs/v0.27.1/parallelism.html
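One way to test the GIL hypothesis from pure Python is a heartbeat thread running next to the write: if the blocking call releases the GIL, the counter keeps advancing; if it stalls, the GIL is being held. `write_once` below is a placeholder for the real opendal write call:

```python
import threading
import time

beats = 0

def heartbeat() -> None:
    # Ticks every 100 ms, but only while this thread can acquire the GIL.
    global beats
    while True:
        beats += 1
        time.sleep(0.1)

threading.Thread(target=heartbeat, daemon=True).start()

before = beats
write_once()  # placeholder for the opendal file.write call under test
print("heartbeat ticks during write:", beats - before)
# Near-zero ticks across a long write suggest the GIL was held the whole time.
```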
@seewindcn can you provide a simple reproducer? I cannot reproduce it on my local machine.
Hi @yihong0618, I can't provide one.
There are two issues:
- GIL release: you have already fixed this one;
- file.write hangs, never returning or raising an error on ETIMEDOUT (it might be related to my use of temporary credentials);
Maybe a timeout parameter could be added to Operator?
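Until such a parameter exists, a caller-side timeout can be approximated with the standard library; note this only surfaces the hang if the binding releases the GIL while blocked, and the helper names below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=1)

def write_with_timeout(sf, chunk: bytes, timeout: float = 60.0) -> None:
    # Run the potentially hanging write in a worker thread and bound the
    # wait from the caller's side.
    future = executor.submit(sf.write, chunk)
    try:
        future.result(timeout=timeout)
    except TimeoutError:
        # The write is still stuck in the worker thread; at least the caller
        # gets an error instead of hanging forever.
        raise RuntimeError(f"opendal write timed out after {timeout}s")
```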
Can you double-check if my patch fixes your issue?