GH-39968: [Python][FS][Azure] Minimal Python bindings for `AzureFileSystem`
Rationale for this change
We want to use the new AzureFileSystem in pyarrow.
What changes are included in this PR?
- Add minimal python bindings for `AzureFileSystem`. This includes just enough to run the python tests against azurite plus default credential auth to enable real use of this once this PR merges.
- Adding additional configuration options and remaining authentication options can be done as a follow up.
- I tried to copy the existing pybinds for GCS and S3
- Explicitly set `ARROW_AZURE=OFF` rather than relying on defaults. The defaults are different for builds vs tests so this was causing tests to be enabled while Azure was disabled during the build.
Are these changes tested?
Enabled the Python filesystem tests for the new filesystem. I had to skip Azure in a couple of the tests because they are not yet working on the C++ side. I created GitHub issues to resolve these, https://github.com/apache/arrow/issues/40025 and https://github.com/apache/arrow/issues/40026, and added TODO comments that reference them where relevant.
Are there any user-facing changes?
pyarrow users can now use the native AzureFileSystem to get much better reliability and performance compared to adlfs based options.
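For illustration, here is a minimal sketch of how the bindings might be pointed at a local Azurite instance. It assumes the constructor accepts `account_name`, `account_key`, and the `blob_storage_*` / `dfs_storage_*` overrides; the helper names `azurite_options` and `make_filesystem` are hypothetical and not part of this PR. The account name and key are Azurite's publicly documented development credentials.

```python
# Sketch only: helper names are hypothetical, and the keyword names are
# assumed to match the bindings added in this PR.

# Azurite's publicly documented development account.
AZURITE_ACCOUNT = "devstoreaccount1"
AZURITE_KEY = (
    "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/"
    "K1SZFPTOtr/KBHBeksoGMGw=="
)


def azurite_options(host="127.0.0.1", port=10000):
    """Build keyword arguments for an AzureFileSystem that targets Azurite."""
    authority = f"{host}:{port}"
    return {
        "account_name": AZURITE_ACCOUNT,
        "account_key": AZURITE_KEY,
        "blob_storage_authority": authority,
        "dfs_storage_authority": authority,
        "blob_storage_scheme": "http",
        "dfs_storage_scheme": "http",
    }


def make_filesystem(**options):
    """Create the filesystem; needs a pyarrow build with Azure enabled."""
    # Imported lazily so this module still loads without Azure support.
    from pyarrow.fs import AzureFileSystem

    return AzureFileSystem(**options)
```

With Azurite running, `make_filesystem(**azurite_options())` would return a filesystem for the emulator. Against a real storage account, passing only `account_name` should fall back to default credential auth, as described above.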
- Closes: #39968
- GitHub Issue: #39968
We may want to update `ci/scripts/python_*.sh` / `.github/workflows/python.yml` too for `PYARROW_WITH_AZURE` in this PR. Or we can do it in a separate PR to keep this PR minimal.
I updated .github/workflows/python.yml and ci/scripts/python_sdist_build.sh. I think these are the only ones I missed in https://github.com/apache/arrow/pull/39971. Probably I missed them because GCS was disabled.
The MATLAB builds seem to be having issues. I don't think these can be related to my changes.
Yes. MATLAB related failures are unrelated. Could you open an issue for it to ignore the failures in this PR?
@github-actions crossbow submit -g cpp -g wheel
Revision: e7a5df839415f3dc665f9a021f47ed053c7dc0f9
Submitted crossbow builds: ursacomputing/crossbow @ actions-47111d92a7
> Yes. MATLAB related failures are unrelated. Could you open an issue for it to ignore the failures in this PR?
Created an issue: https://github.com/apache/arrow/issues/40034
2 CI failures:
AppVeyor: Build execution time has reached the maximum allowed time for your plan (90 minutes).
C++ / AMD64 macOS 12 C++ (pull_request): 97/97 Test #73: arrow-s3fs-test ..............................***Timeout 300.06 sec
I think both are unrelated to this PR.
Can you check that this is actually tested in one of the CI python builds? I think right now it is being skipped in all the default builds here in PRs (and also if enabled, not sure if azurite is set up in those builds)
Yeah, pretty sure I missed this. It looks like the Python CI test suites are all based on the conda build plus one macOS build. I will enable Azure on these and install azurite.
I think I'm still going to need to install azurite in a couple of places. I'm just having a bad time trying to run the conda builds locally.
I assume you will have to add the Azure C++ SDK dependency to ci/conda_env_cpp.txt
I'm trying to work out these CI failures. I'm not 100% sure if they are related to my changes. They are all in conda builds which I have modified but the errors are all related to gcs testbench and I can't reproduce the failures locally.
Can reproduce the CI failures locally with PYTHON=3.9 docker-compose build conda-python assuming conda and conda-cpp have been built first.
I can also reproduce this error on main.
All this conda build stuff is turning into a bit of a mess. I should have included it in https://github.com/apache/arrow/pull/39971. I think at this point it's best if I start a new PR to handle conda builds separately.
Created https://github.com/apache/arrow/pull/40080 for the conda and other build stuff.
From my point of view this is ready to go. I'm confident the tests are running now, as we can see logs for the tests that are currently skipped on Azure:
SKIPPED [2] opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_fs.py:506: Not implemented yet in for Azure. See GH-40025
I see that some tests are skipped on macOS because azurite-blob cannot be found, is it expected?
https://github.com/apache/arrow/actions/runs/8068749515/job/22042241403?pr=40021#step:6:451
@github-actions crossbow submit -g python -g wheel
Revision: 15dc0eb8bbd3d00653339b2827b6c1150577a6c5
Submitted crossbow builds: ursacomputing/crossbow @ actions-14e098965e
> I see that some tests are skipped on macOS because `azurite-blob` cannot be found, is it expected? https://github.com/apache/arrow/actions/runs/8068749515/job/22042241403?pr=40021#step:6:451
The same problem can be seen on the wheel builds, and potentially other builds.
Perhaps we need to error out when PyArrow Azure testing is required and Azurite is not available?
> I see that some tests are skipped on macOS because `azurite-blob` cannot be found, is it expected? https://github.com/apache/arrow/actions/runs/8068749515/job/22042241403?pr=40021#step:6:451
It is not expected to me. I had noticed it before (perhaps I should have mentioned it) but I assumed it was expected because it looks like the same problem exists for minio for S3 tests.
Summary of CI failures:
Related to my changes, that I need to fix:
Azure build is not enabled on fedora or debian but the tests are still running and unsurprisingly fail
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=61513&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=6032
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=61512&view=logs&j=50a69d0a-7972-5459-cdae-135ee6ebe312&t=13df7b5c-76db-5c26-6592-75581a9ed64a&l=6087
I think unrelated, that I will not look into:
S3 test timeout https://github.com/apache/arrow/actions/runs/8084728188/job/22090642540?pr=40021
Looks like a problem introduced in the pandas nightly https://github.com/ursacomputing/crossbow/actions/runs/8083585629/job/22086967459
Something related to timestamp arrays https://github.com/ursacomputing/crossbow/actions/runs/8083584576/job/22086961342
> I think unrelated, that I will not look into:
Indeed, those are all unrelated and happening on main as well
> Perhaps we need to error out when PyArrow Azure testing is required and Azurite is not available?
I think ideally we still skip the tests if azurite is not available, even when you have an install that has AzureFileSystem available. At least for local testing I would prefer that. But our automatic skipping is not always great on CI, where you may not notice that all your tests are just being skipped.
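One way to reconcile the two preferences could be an opt-in "required" mode: skip by default, but fail when CI asks for Azure testing. A rough sketch, where the function name `azurite_test_action` and the idea of a "required" switch are assumptions for illustration rather than anything that exists in the test suite:

```python
import shutil


def azurite_test_action(require_azurite, which=shutil.which):
    """Decide what Azure tests should do: 'run', 'skip', or 'fail'.

    `which` is injectable so the decision can be checked without Azurite
    actually being installed.
    """
    if which("azurite-blob") is not None:
        return "run"
    # Azurite is missing: fail loudly only if the caller demanded it.
    return "fail" if require_azurite else "skip"
```

In a pytest fixture, `"skip"` would map to `pytest.skip(...)` and `"fail"` to `pytest.fail(...)`, with `require_azurite` driven by an environment variable that only CI sets.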
I think the Fedora and Debian builds should be fixed now. The problem was that `PYARROW_WITH_AZURE` defaults to OFF but `PYARROW_TEST_AZURE` defaults to ON when the `ARROW_AZURE` env var is not set (I just copied this from GCS and S3). To fix this, I have explicitly set `ARROW_AZURE=OFF` everywhere it was not previously set.
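The mismatch can be modeled in a few lines. This is a simplified sketch of the defaulting described above, not a copy of the CI scripts: `resolve_azure_flags` is a hypothetical helper, and it ignores any explicit `PYARROW_*` overrides.

```python
def resolve_azure_flags(env):
    """Model how the build and test flags derive from ARROW_AZURE.

    An unset or empty ARROW_AZURE falls back to a different default for
    the build flag (OFF) than for the test flag (ON).
    """
    arrow_azure = env.get("ARROW_AZURE")
    return {
        "PYARROW_WITH_AZURE": arrow_azure or "OFF",
        "PYARROW_TEST_AZURE": arrow_azure or "ON",
    }


if __name__ == "__main__":
    # Unset: tests are enabled for a build that has no Azure support.
    print(resolve_azure_flags({}))
    # Explicit OFF: build and tests agree.
    print(resolve_azure_flags({"ARROW_AZURE": "OFF"}))
```

Setting `ARROW_AZURE` explicitly removes the dependence on the two different fallbacks.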
@github-actions crossbow submit -g python -g wheel
Revision: e9313c4b2fad6345d8064460c0753b419f722e0a
Submitted crossbow builds: ursacomputing/crossbow @ actions-65e6a8e195
There seem to be some new CI failures after rebasing. These look unrelated to my changes.
> samples = [seed.replace(year=y) for y in range(1992, 2092)]
E ValueError: day is out of range for month
@Tom-Newton Leap year failures. This PyArrow test seems to fail on Feb 29. We should perhaps retry tomorrow to get a clearer view of the remaining issues :-)
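For context, the failure is easy to reproduce: replacing only the year of a Feb 29 date raises whenever the target year is not a leap year. A minimal sketch follows; the helper `replace_year_clamped` is hypothetical and only one possible workaround, not necessarily what the upstream fix does.

```python
from datetime import datetime


def replace_year_clamped(value, year):
    """Replace the year, falling back to Feb 28 when Feb 29 does not exist."""
    try:
        return value.replace(year=year)
    except ValueError:
        # Raised for Feb 29 when the target year is not a leap year.
        return value.replace(year=year, day=28)


if __name__ == "__main__":
    seed = datetime(2024, 2, 29)
    try:
        seed.replace(year=2023)
    except ValueError as exc:
        print("plain replace fails:", exc)
    # The clamped variant covers the same year range as the failing test.
    samples = [replace_year_clamped(seed, y) for y in range(1992, 2092)]
    print(len(samples))
```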
Actually, you can rebase from main now that https://github.com/apache/arrow/pull/40288 has been merged
@github-actions crossbow submit -g python -g wheel