
Allow using environment for configuring toolchains and pypi with bzlmod

rickeylev opened this issue 3 months ago • 5 comments

Over in JAX (and some related projects, like XLA), they're currently using WORKSPACE and have a pretty bespoke way of configuring their toolchains and pip settings.

They use environment variables to specify the python version, url, sha, and threading. Those generate some repos with the values, which eventually decide/feed into the python_repository/python_register_toolchains/pip_parse rules. The net effect is that they have one python version for the whole build, but are able to change it without modifying WORKSPACE. CI jobs and users can then set the values to change which python is used (both for the toolchain and for pip). Thus, they're able to have something resembling multi-version support. The basic logic of their WORKSPACE is something like:

load("env.bzl", "env")
env() # reads env vars, generates '@env'
load("@env//...", "HERMETIC_...")
if url:
  python_register_toolchains(name="python", TOOL_VERSIONS=<env url>)
else:
  python_register_toolchains(name="python", <env version>>)
pip_parse(version=<env var>, interpreter=@python//:interpreter)
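
A minimal sketch of what such an env-reading repo rule could look like (illustrative only, not JAX's actual implementation; the variable names just follow the HERMETIC_* pattern above):

# env.bzl (sketch)
def _env_impl(rctx):
    rctx.file("BUILD.bazel", "")
    lines = []
    for var in ["HERMETIC_PYTHON_VERSION", "HERMETIC_PYTHON_URL", "HERMETIC_PYTHON_SHA256"]:
        lines.append("{} = {}".format(var, repr(rctx.os.environ.get(var, ""))))
    rctx.file("env.bzl", "\n".join(lines) + "\n")

env = repository_rule(
    implementation = _env_impl,
    # Changing these env vars invalidates the repo, so the values stay fresh.
    environ = [
        "HERMETIC_PYTHON_VERSION",
        "HERMETIC_PYTHON_URL",
        "HERMETIC_PYTHON_SHA256",
    ],
)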

Something this env var setup allows that our built-in multi-version setup doesn't is letting the user easily specify an alternative python runtime. E.g., they simply set HERMETIC_PYTHON_VERSION=3.15 and HERMETIC_PYTHON_URL=file:///cpython-3.15.tar.gz, and then it's used for everything, including by pip_parse.

I think there are two basic needs this is trying to serve:

1. Allowing easily overriding the runtime. This allows custom-building Python (at head, with sanitizers, etc.) and using it.
2. pip_parse can be sensitive to the runtime used. This design helps ensure the right interpreter is used, in particular if a freethreading interpreter is used.

For (1), local toolchain rules should be able to handle this, mostly. The problem I see is that several bzlmod APIs want a string literal for the python version, but with such a toolchain, we don't know the version until it's run.

For (2), pip.parse can use python_interpreter_target to point to a local runtime. However, (a) python_version is required, which we don't know, and (b) pip.parse is particular about duplicate calls.

Sketching a MODULE.bazel, I came up with this:

local_runtime = use_repo_rule(...)
local_runtime(name="local_runtime", path="python3")
local_toolchain = use_repo_rule(...)
local_toolchain(name="local_toolchains", repos=["local_runtime"], TCW=<//:py=local>)
register_toolchains("@local_toolchains//...")

pip = use_extension(...)

pip.parse(
  python_version = ???,
  python_interpreter_target = "@local_runtime//:python3",
  requirements = "//:requirements-local.txt",
  config_settings = ???,
)

# Run
export PATH=$PATH:/cpython-src/build/python3.15
bazel build --@//:py=local //:foo

The python_version and config_settings parts of pip.parse are unclear.

Maybe add python_version_target="@local_runtime//:version.txt"? If we could get rid of the python_version attribute entirely, that might be better. Is it actually required (during the bzlmod/repo phase) if the interpreter target is given explicitly?
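
Roughly, something like this (python_version_target is a made-up attribute name here, and the hub name and lock file are just illustrative values):

pip.parse(
    hub_name = "pypi_local",
    # Hypothetical: resolve the version at repo-evaluation time from a file
    # generated by the local runtime repo, instead of a string literal here.
    python_version_target = "@local_runtime//:version.txt",
    python_interpreter_target = "@local_runtime//:python3",
    requirements_lock = "//:requirements-local.txt",
)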

I'm not sure what config_settings would be for pip.parse. Maybe just match what is set on the toolchain?

Some misc improvements to the local toolchain rules that might help:

  • Allow local toolchains to get python from a particular envvar. (modifying PATH seems invasive, potentially expensive)
  • Generate a bzl file with the detected python version. There are various contexts where a loading-phase string of the python version is needed (py_wheel, py_binary.python_version, among others). These should probably be updated to accept a label for the python version, where feasible (see the sketch below).
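
For example, everything here is hypothetical: the generated file name, the constants, and feeding them into py_wheel are all illustrative.

# Hypothetical @local_runtime//:version.bzl, written by the repo rule after
# probing the interpreter:
PYTHON_VERSION = "3.15.0"
PYTHON_VERSION_MINOR = "3.15"

# Hypothetical consumer in a BUILD.bazel file:
load("@local_runtime//:version.bzl", "PYTHON_VERSION_MINOR")
load("@rules_python//python:packaging.bzl", "py_wheel")

py_wheel(
    name = "my_wheel",
    distribution = "mypkg",
    version = "0.1.0",
    # Derive the tag from the detected version instead of hard-coding it.
    python_tag = "cp" + PYTHON_VERSION_MINOR.replace(".", ""),
)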

rickeylev · Sep 25 '25 17:09

One way of thinking: the python_version within pip.parse is needed to select the right wheels if we are using the downloader. I would say that pip.parse in the JAX and XLA case should be configured against all possible python configurations, and then we only select the toolchain using an env var. Selecting everything within pip.parse may just be too much work.
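
A sketch of that setup using only existing knobs, as far as I know (the python.defaults attribute name python_version_env is from memory; versions and lock file names are made up):

python = use_extension("@rules_python//python/extensions:python.bzl", "python")
# The env var (when set) picks the default toolchain version at build time.
python.defaults(
    python_version = "3.12",
    python_version_env = "HERMETIC_PYTHON_VERSION",
)
python.toolchain(python_version = "3.12")
python.toolchain(python_version = "3.13")

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
# One pip.parse per supported version; the right one is selected at build
# time based on the resolved python version.
pip.parse(
    hub_name = "pypi",
    python_version = "3.12",
    requirements_lock = "//:requirements_lock_3_12.txt",
)
pip.parse(
    hub_name = "pypi",
    python_version = "3.13",
    requirements_lock = "//:requirements_lock_3_13.txt",
)
use_repo(pip, "pypi")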

Another way of thinking: we allow pip.parse to be in a host mode where the python version is specified using a well-known env var (see below). With that we should be almost good; we still need to know where to get the python_interpreter_target from for whl_library consumption, but with pipstar the need would only be for sdists. And I am sure that even without pipstar we can do the whole wiring without too much trouble.

Third way of thinking: we already have this and it should work? The other overrides that you want to specify should also be possible, although I am not sure I agree that it is bzlmod-esque to use the environment to override big parts of MODULE.bazel entries. There is a way to include files via MODULE.bazel, and CI jobs, etc., could override/generate those files, so we would not need to make any invasive changes in rules_python to accommodate that.

I'm thinking of something like the following:

include("rules_python_host_config.MODULE.bazel")

# rules_python_host_config.MODULE.bazel
pip.single_version_override(
    ...
)

# if you chose to generate this as well.
pip.parse(
    ...
)
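
For example, a CI job could generate the included segment with fully concrete values before invoking bazel (the hub name, version, and lock file below are made up):

# rules_python_host_config.MODULE.bazel, generated by CI:
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pypi",
    python_version = "3.14.0",
    requirements_lock = "//:requirements_lock_3_14.txt",
)
use_repo(pip, "pypi")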

Fourth way of thinking: What makes it difficult to define all of the combinations in MODULE.bazel? Instead of doing this with env vars, we could be explicit. Would it break bazel query?

aignas · Sep 26 '25 00:09

Extra thoughts: we could default to the default python version when it is not specified in pip.parse, and given that we already have an env var for controlling the python version, we would be good, other than the URL override for the python version.

There is another ticket asking to configure all available python versions if none is specified (#1708). Maybe that would also be good for this: since there would be only one toolchain version specified, the use case described here would also work.

That said, sdist building in the repo phase would require somehow being able to switch the threadedness or libc of the interpreter, which is a little more involved, but as long as the python extension is doing the heavy lifting, maybe it would be OK?

aignas · Sep 26 '25 10:09

we already have python_version_env and it should work

Yes, that works for selecting the default version for the toolchain, but pip.parse doesn't consult that.

Hm -- maybe pip.parse should have a way to say "get the version from python_interpreter_target"? Allow python_version to be unset. If unset, it looks for a file in the python_interpreter_target directory (version.txt, version.json, whatever). If not found, it tries to run python_interpreter_target to get the version.

edit: ah, I see you suggested something similar :); I was looking at an old load of the page
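
A rough sketch of how that lookup could work inside the extension (module_ctx.path/execute are real methods; the helper itself and its wiring are assumptions, not actual rules_python code):

# Hypothetical helper; not actual rules_python code.
def _detect_python_version(module_ctx, python_interpreter_target):
    interpreter = module_ctx.path(Label(python_interpreter_target))
    result = module_ctx.execute([
        str(interpreter),
        "-c",
        "import sys; print('{0}.{1}.{2}'.format(*sys.version_info[:3]))",
    ])
    if result.return_code != 0:
        fail("could not determine python version from {}: {}".format(
            python_interpreter_target,
            result.stderr,
        ))
    return result.stdout.strip()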

What makes it difficult to define all of the combinations in MODULE.bazel?

A good example is if you want a custom build of Python and want to use it, in particular if it's a new version. I can write a generic local_toolchain definition in MODULE.bazel that would work for any locally installed python (custom built or otherwise). In particular, I don't have to specify the python version when I define it. This makes a raw toolchain definition work.

However, when we try to add pip dependencies, I have a problem -- pip.parse requires a python version. There isn't any valid value I can put in that works with an arbitrary local toolchain. Maybe today I'm testing Python 3.14.1 and tomorrow I'm testing Python 3.15. Maybe there is already a 3.14 pip.parse call, written assuming 3.14.0, not the 3.14.1 I'm testing with (i.e. I can't just add another pip.parse call, I have to replace the existing one).

What this means is: if I'm happy to use, or prefer, a local python toolchain, I can't use pip.parse.
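
To make that concrete, this is roughly the situation (versions and names are illustrative; the second call conflicts with the first per the duplicate-call behavior mentioned earlier):

# Already in MODULE.bazel, written assuming 3.14.0:
pip.parse(
    hub_name = "pypi",
    python_version = "3.14",
    requirements_lock = "//:requirements_lock_3_14.txt",
)

# What I'd want to add for a locally built 3.14.1 (or 3.15) -- there's no
# value of python_version that means "whatever my local toolchain is", and
# reusing "3.14" collides with the call above:
pip.parse(
    hub_name = "pypi",
    python_version = "3.14",
    python_interpreter_target = "@local_runtime//:python3",
    requirements_lock = "//:requirements-local.txt",
)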

Just to be clear, while this use-case came up from JAX, I don't want to frame this as "what does jax need?". Generalizing this, it's more about making it possible to use a local python, have external dependencies, and to do so with much less configuration.

rickeylev · Sep 26 '25 20:09

That said, sdist building in the repo phase would require somehow being able to switch the threadedness or libc of the interpreter, which is a little more involved, but as long as the python extension is doing the heavy lifting, maybe it would be OK?

Funny you should mention this -- JAX is doing something very close to that: building python from head with freethreading, then building some external deps manually from head for freethreading. This is part of their CI (they track upstreams closely and are really keen on freethreading). I didn't want to bring up sdists because building sdists in the repo phase is such a huge can of worms -- definitely a separate topic. Picking the right interpreter for the repo phase (threadedness, libc) is somewhat part of that. But I think that whole can of worms can be avoided if we just punt everything to local: if you just want to use your local system, that's fine. All we really need on the rules side is URLs and config setting names to wire up the toolchain settings, the pip hub download, and select().

rickeylev · Sep 26 '25 21:09

have pip get default from python extension

POC in https://github.com/bazel-contrib/rules_python/pull/3298

It was fairly easy to hack in (the hub_builder refactor made it much easier). The core changes are pretty small (basically just having the pypi extension load the default python version, then pass it along to hub_builder).

Overall I like the direction. There's still a lot of boilerplate config to put in MODULE.bazel, but it's mostly one-time config.
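
For reference, the user-facing shape would then be roughly the following (my reading of the POC, not necessarily what #3298 implements; the env-var attribute is the existing python_version_env knob):

python = use_extension("@rules_python//python/extensions:python.bzl", "python")
python.defaults(
    python_version = "3.13",
    # Assumption: the env var, when set, overrides the static default.
    python_version_env = "HERMETIC_PYTHON_VERSION",
)
python.toolchain(python_version = "3.13")

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pypi",
    # python_version omitted: the hub picks up whatever the python extension
    # resolved as the default version.
    requirements_lock = "//:requirements_lock.txt",
)
use_repo(pip, "pypi")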

I think the main thing to figure out is what to do if the same python_version occurs in separate pip.parse calls (once via static configuration, once via this new use-via-default).

rickeylev · Sep 27 '25 18:09