GH-39968: [Python][FS][Azure] Minimal Python bindings for `AzureFileSystem`

Open Tom-Newton opened this issue 1 year ago • 35 comments

Rationale for this change

We want to use the new AzureFileSystem in pyarrow.

What changes are included in this PR?

  • Add minimal Python bindings for AzureFileSystem. This includes just enough to run the Python tests against Azurite, plus default credential auth so that the filesystem is usable for real workloads once this PR merges.
  • Additional configuration options and the remaining authentication options can be added as a follow-up.
  • I tried to copy the existing Python bindings for GCS and S3.
  • Explicitly set ARROW_AZURE=OFF rather than relying on defaults. The defaults differ between builds and tests, which caused tests to be enabled while Azure was disabled during the build.

Are these changes tested?

Enabled the Python filesystem tests for the new filesystem. I had to skip Azure in a couple of the tests, though, because they are not yet working on the C++ side. I created GitHub issues to resolve these, https://github.com/apache/arrow/issues/40025 and https://github.com/apache/arrow/issues/40026, and added TODO comments that reference these issues where relevant.

Are there any user-facing changes?

pyarrow users can now use the native AzureFileSystem to get much better reliability and performance compared to adlfs-based options.

  • Closes: #39968
  • GitHub Issue: #39968

Tom-Newton avatar Feb 09 '24 21:02 Tom-Newton

> We may want to update ci/scripts/python_*.sh and .github/workflows/python.yml too for PYARROW_WITH_AZURE in this PR. Or we can do it in a separate PR to keep this PR minimal.

I updated .github/workflows/python.yml and ci/scripts/python_sdist_build.sh. I think these are the only ones I missed in https://github.com/apache/arrow/pull/39971. Probably I missed them because GCS was disabled.

Tom-Newton avatar Feb 10 '24 21:02 Tom-Newton

The MATLAB builds seem to be having issues. I don't think these can be related to my changes.

Tom-Newton avatar Feb 10 '24 21:02 Tom-Newton

Yes, the MATLAB-related failures are unrelated. Could you open an issue for them so that we can ignore the failures in this PR?

kou avatar Feb 11 '24 06:02 kou

@github-actions crossbow submit -g cpp -g wheel

kou avatar Feb 11 '24 06:02 kou

Revision: e7a5df839415f3dc665f9a021f47ed053c7dc0f9

Submitted crossbow builds: ursacomputing/crossbow @ actions-47111d92a7

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

github-actions[bot] avatar Feb 11 '24 06:02 github-actions[bot]

> Yes. MATLAB related failures are unrelated. Could you open an issue for it to ignore the failures in this PR?

Created an issue: https://github.com/apache/arrow/issues/40034

Tom-Newton avatar Feb 11 '24 12:02 Tom-Newton

2 CI failures:

  • AppVeyor: Build execution time has reached the maximum allowed time for your plan (90 minutes).
  • C++ / AMD64 macOS 12 C++ (pull_request): 97/97 Test #73: arrow-s3fs-test ..............................***Timeout 300.06 sec

I think both are unrelated to this PR.

Tom-Newton avatar Feb 12 '24 09:02 Tom-Newton

> Can you check that this is actually tested in one of the CI python builds? I think right now it is being skipped in all the default builds here in PRs (and also if enabled, not sure if azurite is set up in those builds)

Yeah, pretty sure I missed this. It looks like the Python CI test suites are all based on the conda build plus one macOS build. I will enable Azure on these and install Azurite.

Tom-Newton avatar Feb 13 '24 14:02 Tom-Newton

I think I'm still going to need to install azurite in a couple of places. I'm just having a bad time trying to run the conda builds locally.

Tom-Newton avatar Feb 13 '24 15:02 Tom-Newton

I assume you will have to add the Azure C++ SDK dependency to ci/conda_env_cpp.txt

jorisvandenbossche avatar Feb 13 '24 15:02 jorisvandenbossche

I'm trying to work out these CI failures. I'm not 100% sure whether they are related to my changes. They are all in conda builds, which I have modified, but the errors are all related to the GCS testbench and I can't reproduce the failures locally.

Tom-Newton avatar Feb 14 '24 08:02 Tom-Newton

I can reproduce the CI failures locally with `PYTHON=3.9 docker-compose build conda-python`, assuming conda and conda-cpp have been built first.

I can also reproduce this error on main.

Tom-Newton avatar Feb 14 '24 10:02 Tom-Newton

All this conda build stuff is turning into a bit of a mess. I should have included it in https://github.com/apache/arrow/pull/39971. I think at this point it's best if I start a new PR to handle conda builds separately.

Tom-Newton avatar Feb 14 '24 10:02 Tom-Newton

Created https://github.com/apache/arrow/pull/40080 for the conda and other build stuff.

Tom-Newton avatar Feb 14 '24 16:02 Tom-Newton

From my point of view this is ready to go. I'm confident the tests are running now, as we can see log lines for the tests that are currently skipped on Azure:

SKIPPED [2] opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/tests/test_fs.py:506: Not implemented yet in for Azure. See GH-40025

Tom-Newton avatar Feb 27 '24 18:02 Tom-Newton

I see that some tests are skipped on macOS because azurite-blob cannot be found, is it expected? https://github.com/apache/arrow/actions/runs/8068749515/job/22042241403?pr=40021#step:6:451

pitrou avatar Feb 28 '24 16:02 pitrou

@github-actions crossbow submit -g python -g wheel

pitrou avatar Feb 28 '24 16:02 pitrou

Revision: 15dc0eb8bbd3d00653339b2827b6c1150577a6c5

Submitted crossbow builds: ursacomputing/crossbow @ actions-14e098965e

Task Status
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-python-3-amd64 Azure
test-debian-11-python-3-i386 GitHub Actions
test-fedora-39-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 GitHub Actions
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

github-actions[bot] avatar Feb 28 '24 16:02 github-actions[bot]

> I see that some tests are skipped on macOS because azurite-blob cannot be found, is it expected? https://github.com/apache/arrow/actions/runs/8068749515/job/22042241403?pr=40021#step:6:451

The same problem can be seen on the wheel builds, and potentially other builds.

Perhaps we need to error out when PyArrow Azure testing is required and Azurite is not available?

pitrou avatar Feb 28 '24 16:02 pitrou

> I see that some tests are skipped on macOS because azurite-blob cannot be found, is it expected? https://github.com/apache/arrow/actions/runs/8068749515/job/22042241403?pr=40021#step:6:451

I did not expect it. I had noticed it before (perhaps I should have mentioned it), but I assumed it was intentional because the same problem appears to exist for MinIO in the S3 tests.

Tom-Newton avatar Feb 28 '24 17:02 Tom-Newton

Summary of CI failures:

Related to my changes, that I need to fix:

  • The Azure build is not enabled on Fedora or Debian, but the tests still run and, unsurprisingly, fail:
    https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=61513&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=6032
    https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=61512&view=logs&j=50a69d0a-7972-5459-cdae-135ee6ebe312&t=13df7b5c-76db-5c26-6592-75581a9ed64a&l=6087

I think unrelated, that I will not look into:

  • S3 test timeout: https://github.com/apache/arrow/actions/runs/8084728188/job/22090642540?pr=40021
  • Looks like a problem introduced in the pandas nightly: https://github.com/ursacomputing/crossbow/actions/runs/8083585629/job/22086967459
  • Something related to timestamp arrays: https://github.com/ursacomputing/crossbow/actions/runs/8083584576/job/22086961342

Tom-Newton avatar Feb 28 '24 21:02 Tom-Newton

> I think unrelated, that I will not look into:

Indeed, those are all unrelated and happening on main as well

jorisvandenbossche avatar Feb 29 '24 08:02 jorisvandenbossche

> Perhaps we need to error out when PyArrow Azure testing is required and Azurite is not available?

I think ideally we still skip the tests if Azurite is not available, even when you have an install that has AzureFileSystem available. At least for local testing I would prefer that. But our system of automatically skipping is not always great on CI, where it is easy not to notice that all your tests are just being skipped.
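
To make the trade-off between skipping and erroring concrete, here is a minimal sketch of one possible policy. It is not code from this PR: the `require_azurite` flag is hypothetical (it could, for example, be set only on CI), and only the `azurite-blob` executable name comes from the logs linked above.

```python
import shutil


def azurite_decision(require_azurite, which=shutil.which):
    """Return "run", "skip", or "error" for the Azure filesystem tests.

    `require_azurite` is a hypothetical switch: local runs would leave it
    False (silently skip), while CI would set it True (fail loudly).
    `which` is injectable so the policy can be checked without Azurite.
    """
    if which("azurite-blob") is not None:
        # The emulator is on PATH, so the tests can run.
        return "run"
    # The emulator is missing: fail only when the caller insists on it.
    return "error" if require_azurite else "skip"
```

With a policy like this, a local developer without Azurite still gets skipped tests, while a CI job that sets the flag would fail instead of quietly skipping everything.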

jorisvandenbossche avatar Feb 29 '24 08:02 jorisvandenbossche

I think the Fedora and Debian builds should be fixed now. The problem was that PYARROW_WITH_AZURE defaults to OFF but PYARROW_TEST_AZURE defaults to ON when the ARROW_AZURE env var is not set (I just copied this from GCS and S3). To fix this, I have explicitly set ARROW_AZURE=OFF everywhere that it was not previously set.
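
The mismatch can be modelled in a few lines. This is only an illustrative sketch of the behaviour described above, not the actual CI scripts; the helper names `resolve_azure_flags` and `is_consistent` are made up for the example.

```python
def resolve_azure_flags(env):
    """Model how the build and test flags are derived from ARROW_AZURE.

    Assumption (taken from the description above): both flags follow
    ARROW_AZURE when it is set, but fall back to different defaults
    when it is unset.
    """
    arrow_azure = env.get("ARROW_AZURE")
    return {
        # Build side: Azure support is off unless explicitly requested.
        "PYARROW_WITH_AZURE": arrow_azure if arrow_azure is not None else "OFF",
        # Test side: Azure tests are on unless explicitly disabled.
        "PYARROW_TEST_AZURE": arrow_azure if arrow_azure is not None else "ON",
    }


def is_consistent(flags):
    # Tests must not be enabled for a feature that was not built.
    return not (
        flags["PYARROW_TEST_AZURE"] == "ON"
        and flags["PYARROW_WITH_AZURE"] == "OFF"
    )
```

With ARROW_AZURE unset, the model yields tests ON and build OFF, which is the failing combination; setting ARROW_AZURE to either ON or OFF makes the two flags agree.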

Tom-Newton avatar Feb 29 '24 10:02 Tom-Newton

@github-actions crossbow submit -g python -g wheel

pitrou avatar Feb 29 '24 10:02 pitrou

Revision: e9313c4b2fad6345d8064460c0753b419f722e0a

Submitted crossbow builds: ursacomputing/crossbow @ actions-65e6a8e195

Task Status
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-python-3-amd64 Azure
test-debian-11-python-3-i386 GitHub Actions
test-fedora-39-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 GitHub Actions
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

github-actions[bot] avatar Feb 29 '24 11:02 github-actions[bot]

There seem to be some new CI failures after rebasing. These look unrelated to my changes.

>   samples = [seed.replace(year=y) for y in range(1992, 2092)]
E   ValueError: day is out of range for month

Tom-Newton avatar Feb 29 '24 11:02 Tom-Newton

@Tom-Newton Leap year failures. This PyArrow test seems to fail on Feb 29. We should perhaps retry tomorrow to get a clearer view of the remaining issues :-)
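
For reference, the failure mode in the traceback above can be reproduced with the standard library alone. The sketch below is illustrative: `replace_year_safe` is a hypothetical helper showing one common workaround (clamping to Feb 28) and is not necessarily how the test was fixed upstream.

```python
import datetime


def replace_year_strict(seed, year):
    # Same pattern as the failing test: raises ValueError when the seed
    # is Feb 29 and the target year is not a leap year.
    return seed.replace(year=year)


def replace_year_safe(seed, year):
    """Replace the year, clamping Feb 29 to Feb 28 in non-leap years."""
    try:
        return seed.replace(year=year)
    except ValueError:
        # Only Feb 29 can become invalid when nothing but the year changes.
        return seed.replace(year=year, day=28)


seed = datetime.date(2024, 2, 29)
samples = [replace_year_safe(seed, y) for y in range(1992, 2092)]
```

Any test that derives dates from "today" and then changes only the year is exposed to this once every four years, so a fixed seed date is often the simpler fix.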

pitrou avatar Feb 29 '24 12:02 pitrou

Actually, you can rebase from main now that https://github.com/apache/arrow/pull/40288 has been merged.

pitrou avatar Feb 29 '24 13:02 pitrou

@github-actions crossbow submit -g python -g wheel

pitrou avatar Feb 29 '24 13:02 pitrou