
bug: opendal python could hang during file.write

Open seewindcn opened this issue 2 months ago • 9 comments

Describe the bug

When calling file.write (uploading a file to Tencent COS via the s3 scheme), there is a very slight chance that it hangs and never returns. strace -p <pid> output:

futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=27695447}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=32828162}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=37962838}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=43111763}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ff1bc73fe58, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1bc73fe54, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=72667171, tv_nsec=48237693}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
......
futex(0x7ff1bc73fe58, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
....

Maybe file.write does not release the GIL, so all Python threads hang.

Steps to Reproduce

no

Expected Behavior

no

Additional Context

No response

Are you willing to submit a PR to fix this bug?

  • [ ] Yes, I would like to submit a PR.

seewindcn avatar Dec 03 '25 04:12 seewindcn

Hi @seewindcn! I'm Dosu and I'm helping the opendal team.

There are a few known issues in opendal Python bindings that could cause hangs or blocking during file.write, especially when using S3-compatible backends like Tencent COS. One common cause is interaction with the underlying Tokio async runtime and Python's multiprocessing or fork behavior, which can lead to resource and lock contention and hang all threads if the GIL is not properly released or if resources are not reinitialized after fork. This is a broader problem affecting all Python crates that depend on Tokio, not just OpenDAL [source].
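
For the fork interaction specifically, a minimal sketch of the "initialize after fork" pattern; the bucket and region values here are placeholders, not taken from this issue:

    import multiprocessing as mp

    import opendal

    _op = None

    def _init_worker():
        # Create the Operator inside each worker process, after the fork,
        # so no Tokio runtime state is inherited from the parent.
        global _op
        _op = opendal.Operator('s3', bucket='my-bucket', region='ap-guangzhou')

    def upload(path):
        _op.write(path, b'data')

    if __name__ == '__main__':
        with mp.Pool(processes=2, initializer=_init_worker) as pool:
            pool.map(upload, ['a.bin', 'b.bin'])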

Another area to check is whether the file is being properly closed after writing. In some cases, failing to call close or shutdown on the writer can leave resources hanging or uploads incomplete, especially for multipart uploads [source] [source].

To help narrow this down, could you share:

  • The opendal and Python versions you're using
  • Whether you're using multiprocessing or forking in your application
  • If the hang happens with both sync and async APIs
  • A Python stack trace (using e.g. faulthandler.dump_traceback_later, sketched below) during the hang
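
For the stack-trace item above, a minimal sketch using only the standard library (the 30-second interval is an arbitrary choice):

    import faulthandler
    import sys

    # Dump every Python thread's stack to stderr every 30 seconds until
    # cancelled; if the process hangs inside file.write, the dump shows
    # exactly where each thread is stuck.
    faulthandler.dump_traceback_later(30, repeat=True, file=sys.stderr)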

In the meantime, make sure to always call close/shutdown on the file after writing, and avoid forking after initializing opendal. If you are using multiprocessing, try to initialize opendal objects after the fork, not before. If you can reproduce the issue with a minimal script, that would also help pinpoint the cause.
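
For reference, a minimal sketch of the explicit-close pattern with the sync API (the path, payload, and Operator config are placeholders):

    import opendal

    op = opendal.Operator('s3', bucket='my-bucket', region='ap-guangzhou')

    # The context manager guarantees close() runs even if a write raises,
    # which finalizes any in-flight multipart upload.
    with op.open('path/to/object', 'wb') as f:
        f.write(b'chunk-1')
        f.write(b'chunk-2')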

To reply, just mention @dosu.



dosubot[bot] avatar Dec 03 '25 04:12 dosubot[bot]

@dosu

  • python 3.11.9, opendal==0.45.20
  • no multiprocessing or forking
  • use sync APIs

seewindcn avatar Dec 03 '25 04:12 seewindcn

There are a few known issues that can cause hangs or blocking in opendal Python bindings when writing to S3-compatible backends like Tencent COS, even when using sync APIs and without multiprocessing or forking. The most common causes are related to multipart upload handling and resource cleanup. It's important to always call close on the file writer after writing—missing this step can leave uploads incomplete or cause the process to hang, especially for multipart uploads where all parts must be finalized before the operation completes. Simply dropping the writer or relying on garbage collection is not enough; explicit close is required for proper cleanup and to avoid silent failures or hangs [source] [source].

Timeouts or hangs can also occur if the number of concurrent upload tasks exceeds the backend or OpenDAL's concurrency limits, or if there are network issues during multipart upload completion [source]. These issues can be intermittent and hard to reproduce.

There is no direct evidence of a bug in opendal==0.45.20 causing sync file.write to hang with S3 backends, but related issues with multipart upload handling and resource cleanup have been reported. If you continue to see hangs, try enabling more detailed logging or capturing a Python stack trace during the hang, and check if the issue correlates with large files or specific network conditions. If possible, share a minimal reproducible example or more details about the file sizes and backend behavior for deeper diagnosis.

To reply, just mention @dosu.



dosubot[bot] avatar Dec 03 '25 04:12 dosubot[bot]

Hi @seewindcn, can you share the code you used?

Xuanwo avatar Dec 03 '25 06:12 Xuanwo

    def _writes():
        for chunk in resp.iter_content(chunk_size=10240):  # resp is a requests Response
            if chunk:
                f.write(chunk)
                if _sf:
                    _sf.write(chunk)  # hang happens here

    _sf = None  # only set when also mirroring to object storage
    with open(save_file, 'wb') as f:  # f is the local file object
        if _store:
            logging.info('[build]save_jenkins_file(%s) to %s@s3', save_file, C.STORE_OPS)
            _sf = _store.op.open(save_file, 'wb')  # _sf is opendal's sync file object
            with _sf:
                _writes()
        else:
            _writes()

@Xuanwo

seewindcn avatar Dec 03 '25 10:12 seewindcn

Maybe it's blocked by the GIL: https://pyo3.rs/v0.27.1/parallelism.html
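
One way to test that hypothesis from pure Python is a heartbeat thread (a diagnostic sketch, not from this thread): if the heartbeat also goes silent during the hang, the native call is holding the GIL; if it keeps printing, the call released the GIL and is blocked elsewhere, e.g. on the network.

    import threading
    import time

    def _heartbeat():
        # Prints once per second from a daemon thread; it can only run
        # while the GIL is free, so silence during the hang implies a
        # native call is holding the GIL.
        while True:
            print('alive', time.monotonic(), flush=True)
            time.sleep(1)

    threading.Thread(target=_heartbeat, daemon=True).start()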

yihong0618 avatar Dec 03 '25 11:12 yihong0618

@seewindcn can you provide a simple reproducer? I cannot reproduce it on my local machine.

yihong0618 avatar Dec 03 '25 14:12 yihong0618

Hi @yihong0618, I can't provide one.
There are two issues here:

  1. GIL release: you have already fixed this one;
  2. file.write hangs, neither returning nor raising an error on ETIMEDOUT (it might be related to my use of temporary credentials).

Maybe a timeout parameter could be added to Operator? (A pure-Python workaround sketch follows.)
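
For reference, a pure-Python workaround sketch until something like that exists; this is not an opendal API, and it only helps if the blocking call releases the GIL:

    from concurrent.futures import ThreadPoolExecutor

    _pool = ThreadPoolExecutor(max_workers=1)

    def write_with_timeout(f, chunk, seconds=60):
        # Run the blocking write in a worker thread and bound the wait;
        # result() raises concurrent.futures.TimeoutError on expiry. The
        # worker may still be stuck afterwards, so this only lets the
        # caller give up and retry or alert instead of hanging forever.
        future = _pool.submit(f.write, chunk)
        return future.result(timeout=seconds)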

seewindcn avatar Dec 04 '25 01:12 seewindcn

> Hi @yihong0618, I can't provide one. There are two issues here:
>
>   1. GIL release: you have already fixed this one;
>   2. file.write hangs, neither returning nor raising an error on ETIMEDOUT (it might be related to my use of temporary credentials).
>
> Maybe a timeout parameter could be added to Operator?

Can you double-check if my patch fixes your issue?

yihong0618 avatar Dec 04 '25 01:12 yihong0618