Canonical way to isolate binary info from `PyExecutableInfo`
🚀 feature request
Relevant Rules
Adding information to PyExecutableInfo (impacts py_binary, py_test).
Description
Adding something to PyExecutableInfo that helps communicate "here's extra stuff for the binary that wouldn't be present were this a library".
Right now I have a custom rule that takes deps which expose PyInfo (which py_binary does).
It then executes some custom logic stemming from something like this:
```starlark
inputs = depset(
    transitive = [dep[DefaultInfo].data_runfiles.files for dep in ctx.attr.deps] +
                 [dep[DefaultInfo].default_runfiles.files for dep in ctx.attr.deps],
)
```
After some more processing of that depset named `inputs`, it then exposes the files to other rules, which seek to cp or symlink those files around.
This fails due to errors such as:

```
FileNotFoundError: [Errno 2] No such file or directory: 'bazel-out/darwin_arm64-fastbuild/bin/tools/runnable/examples/python/_example_app_bin.venv/bin/python3'
```

This only occurs with `build --@rules_python//python/config_settings:bootstrap_impl=script`.
Describe the solution you'd like
It sounds like I generally need to re-work my rules, because accepting the heterogeneous inputs of PyInfo alone vs. PyInfo plus PyExecutableInfo will need better handling on my end. Ideally, though, there would be a canonical way for me to remove the extra things that are only present because something is a binary dependency.
Please see this Slack thread for extra context: https://bazelbuild.slack.com/archives/CA306CEV6/p1759447665927069?thread_ts=1759177072.316529&cid=CA306CEV6
If PyExecutableInfo were modified to add something like

```starlark
PyExecutableInfo = provider(
    fields = {
        # ... existing fields ...
        "content_files": """
:type: depset[File]

The actual content files (sources, data) without runtime scaffolding.
This excludes venv directories, bootstrap scripts, and executable wrappers.
""",
    },
)
```
then I think my use case would be unblocked relatively nicely.
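For illustration, a consuming rule might use that hypothetical field like this (the rule implementation and attrs are made up; `content_files` does not exist today):

```starlark
def _my_rule_impl(ctx):
    transitive = []
    for dep in ctx.attr.deps:
        if PyExecutableInfo in dep:
            # Proposed field: content without venv/bootstrap scaffolding.
            transitive.append(dep[PyExecutableInfo].content_files)
        else:
            # Library deps keep behaving exactly as they do today.
            transitive.append(dep[DefaultInfo].default_runfiles.files)
    inputs = depset(transitive = transitive)
    # ... cp/symlink `inputs` without hitting dangling venv links ...
```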
Describe alternatives you've considered
Right now I have added some workarounds to my custom rule: check for the presence of PyExecutableInfo, then do some heuristic filtering of paths based on patterns I've seen.
This seems to work but feels non-ideal for obvious reasons:
```starlark
def _is_runtime_artifact(file):
    """Identify files that are runtime scaffolding, not actual content.

    This cannot identify the actual shell script that `py_binary`
    generates, but that is largely static and small, and more importantly
    it exists at build time, so there is no real harm in including it.
    The bootstrap script is also harmless to keep, but we remove it for
    completeness.
    """
    path = file.short_path
    basename = file.basename

    # Venv artifacts added by `rules_python` with bootstrap_impl=script.
    if path.endswith(".venv/bin/python3"):
        return True

    # Bootstrap scripts added by `rules_python` with bootstrap_impl=script.
    if basename.endswith("_stage2_bootstrap.py"):
        return True

    return False
```
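For what it's worth, the heuristic can be sanity-checked outside Bazel in plain Python with a stand-in for Bazel's `File` (the `short_path`/`basename` attributes mirror the real API; the example paths are invented):

```python
class FakeFile:
    """Stand-in for Bazel's File, exposing short_path and basename."""

    def __init__(self, short_path):
        self.short_path = short_path
        self.basename = short_path.rsplit("/", 1)[-1]

def is_runtime_artifact(f):
    # Venv interpreter link added with bootstrap_impl=script.
    if f.short_path.endswith(".venv/bin/python3"):
        return True
    # Stage2 bootstrap script added with bootstrap_impl=script.
    if f.basename.endswith("_stage2_bootstrap.py"):
        return True
    return False

files = [
    FakeFile("examples/app/_app_bin.venv/bin/python3"),
    FakeFile("examples/app/app_stage2_bootstrap.py"),
    FakeFile("examples/app/main.py"),
]
content = [f.short_path for f in files if not is_runtime_artifact(f)]
print(content)  # ['examples/app/main.py']
```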
At a high level, my thinking is:
- PyInfo looks the same as if py_library had been used
- PyExecutableInfo has additional fields needed to re-compose the executable, i.e. given PyInfo and PyExecutableInfo, you should be able to (1) re-create what py_binary does and (2) derive a new one without expensive analysis-phase operations (e.g. depset flattening).

Where it gets tricky is DefaultInfo. It has to have the executable as a default output, and its runfiles have to contain the required executable files (the executable, stage2 bootstrap, venv files, Python runtime files, and transitive runfiles). The only way I can see to deal with this is for PyExecutableInfo to have a field with an alternative DefaultInfo, such that one can then do:
```starlark
runfiles = []
if PyExecutableInfo in t:
    runfiles.append(t[PEI].library_only_runfiles)
elif PyInfo in t:
    runfiles.append(t[DefaultInfo].runfiles)

# or

if PEI in t:
    outputs.append(t[PEI].library_only_default_outputs)
else:
    outputs.append(t[DefaultInfo].files)
```
I'm not entirely sure how the "binary specific files" need to be broken out on PEI, though. These are the "categories" of files I can think of:
- executable itself
- stage2 bootstrap
- venv bin/python3 file (this seems "special" compared to other bin/ files)
- venv bin/<other entries> files
- venv site-packages files
- venv _bazel_site_init.pth and bazel_site_init.py
- Python interpreter runtime

That's a lot of fields, though. I'm not sure which are too broad or too granular. Some are specific to the particular toolchain implementation (e.g. the stage2 bootstrap).
Something that might help are some concrete use cases for the types of transforms/derivations being done. Some things that have crossed my mind:
- "merging" multiple binaries into a single executable
- Creating a self-contained executable (of which there are many impls)
- Depending on a binary, but consuming it like a library
A few raw thoughts if it helps.
> executable itself
This is the shell script? Yeah, this is tough to remove automatically on my end, since it is named depending on the binary name, although perhaps I can do some extra magic inside my internal rule to derive it given detection of PEI. I probably wouldn't bother in my internal rules though, because that shell script is harmless enough for now.
> stage2 bootstrap
Easy to detect manually.
> venv bin/python3 file (this seems "special" compared to other bin/ files)
Yep it does seem special - do you have any idea why? This is the one that is always broken as "missing" if my rule tried to naively cp data around.
> venv bin/<other entries> files, venv site-packages files
Funnily enough, for my use case, I separate between internal deps and external + pypi deps (with the help of a rules_python patch), and so all the other stuff falls out pretty naturally as being "unowned" by my repo. So, for now it's not bothering me, and depending on how this is fixed it may actually break my custom rule (but I assume I can adapt).
> venv _bazel_site_init.pth and bazel_site_init.py
These are treated as owned by my repo and thus included in my artifacts, but seem harmless enough to include. But I think they can be detected statically if I feel strongly - need to double check.
> Python interpreter runtime
Don't think I've seen this one.
> Something that might help are some concrete use cases for the types of transforms/derivations being done. Some things that have crossed my mind:
> - "merging" multiple binaries into a single executable
> - Creating a self-contained executable (of which there are many impls)
> - Depending on a binary, but consuming it like a library
I can give you a high level description of my use cases. We have been treating py_binary as a canonical concept of a "runnable" that we then hand to an internal rule which dresses it up, exposes the underlying local py_binary for bazel run purposes, and then stubs out some providers for downstream rules to consume for use cases such as:
- Create a wheel that has all transitive internal deps in the src files, all transitive external deps in a `requirements.txt` file, and the `main` of the `py_binary` as the entrypoint. This is related to the internal vs external deps and `rules_python` patch described above.
- Bake the runnable/binary + main entrypoint into an image that we run as a container app
- Bake the runnable/binary + main entrypoint into an Azure Function Apps project (that has its own specific directory structure and deploy process)
The biggest alternative I have considered so far, and I am curious if you think this is just worth moving to in terms of simplifying everything, is to create a new macro that has all the same attrs as py_binary but makes sure to stub out a py_library internally, so that a lot of the headaches described in this issue go away. Maybe this is just the path forward. I like the idea of depending on a py_binary because it respects the foundational unit from rules_python, but if wrapping it internally clears away a lot of headaches, I am very open to it.
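A rough sketch of what that wrapper macro could look like (the macro name and attr plumbing here are hypothetical; a real version would need to forward the full py_binary attr set and handle main/srcs partitioning properly):

```starlark
load("@rules_python//python:defs.bzl", "py_binary", "py_library")

def runnable_py_binary(name, srcs = [], deps = [], **kwargs):
    # All content lives in a plain py_library; custom rules depend on
    # this target and never see venv/bootstrap scaffolding.
    py_library(
        name = name + ".lib",
        srcs = srcs,
        deps = deps,
    )

    # The py_binary exists purely for `bazel run` and packaging.
    py_binary(
        name = name,
        srcs = srcs,
        deps = [":" + name + ".lib"],
        **kwargs
    )
```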
```starlark
runfiles = []
if PyExecutableInfo in t:
    runfiles.append(t[PEI].library_only_runfiles)
elif PyInfo in t:
    runfiles.append(t[DefaultInfo].runfiles)

# or

if PEI in t:
    outputs.append(t[PEI].library_only_default_outputs)
else:
    outputs.append(t[DefaultInfo].files)
```
Something like this would be fine for me I think. I think it's a reasonable cost for custom rules if they want to support both types of inputs.
For the executable, you can use `DefaultInfo.files_to_run.executable` to identify it.
> venv bin/python3 gives errors if cp'd around
Under the hood, with bootstrap=script, relative symlinks are used to make bin/python3 refer back to the underlying python interpreter elsewhere in runfiles (e.g. $runfiles/foo.venv/bin/python3 -> ../../+python_linux_x86/bin/python). Similar is done for other entries in the venv (when --venvs_site_package_libs=yes). If you're on Bazel 8, you can identify such files with File.is_symlink().
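The dangling-link failure is easy to reproduce in plain Python, outside Bazel (the directory names below are invented; the layout just mimics a venv whose bin/python3 is a relative symlink into a sibling runfiles directory):

```python
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()
runfiles = os.path.join(tmp, "runfiles")
os.makedirs(os.path.join(runfiles, "toolchain", "bin"))
os.makedirs(os.path.join(runfiles, "app.venv", "bin"))

# The real interpreter lives in the toolchain directory...
open(os.path.join(runfiles, "toolchain", "bin", "python3"), "w").close()
# ...and the venv points at it via a relative symlink, as the script
# bootstrap does.
os.symlink(
    "../../toolchain/bin/python3",
    os.path.join(runfiles, "app.venv", "bin", "python3"),
)

# Copying only the venv tree preserves the symlink but not its target,
# so the copy dangles: exactly the FileNotFoundError failure mode.
dest = os.path.join(tmp, "staged.venv")
shutil.copytree(os.path.join(runfiles, "app.venv"), dest, symlinks=True)
link = os.path.join(dest, "bin", "python3")
print(os.path.islink(link), os.path.exists(link))  # True False
```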
> venv bin/python3 is special, but why
Ah, I think because the first stage bootstrap relies on it. It goes looking for the interpreter to use, so it has e.g. _main/foo.venv/bin/python3 embedded into it as the place to go looking. If one were going to derive a new executable, then you'd need to know the path to bin/python3 (for most cases, anyways).
Interesting. I wonder if my workaround can be pretty principled then: use DefaultInfo.files_to_run.executable for the executable, check for symlinks for the .venv/bin/python3, and rely on my existing separation of internal vs external deps for the site-packages stuff. It's a mouthful but might get me what I need. I am mainly focused on the "Create a wheel with internal vs external deps + entrypoint" case for now though, so maybe I am missing something about the other cases I described (which are more forward-looking for us).
@rickeylev I'd be open to taking a stab at a PR for this if you can share some guidance on what the providers should look like.