rules_python

Canonical way to isolate binary info from `PyExecutableInfo`

Open FrankPortman opened this issue 3 months ago • 6 comments

🚀 feature request

Relevant Rules

Adding information to PyExecutableInfo (impacts py_binary, py_test).

Description

Adding something to PyExecutableInfo that helps communicate "here's extra stuff for the binary that wouldn't be present were this a library".

Right now I have a custom rule that takes deps which expose PyInfo (which py_binary does).

It then executes some custom logic stemming from something like this:

    inputs = depset(
        transitive = [dep[DefaultInfo].data_runfiles.files for dep in ctx.attr.deps] +
                     [dep[DefaultInfo].default_runfiles.files for dep in ctx.attr.deps],
    )

After some more processing of that depset named "inputs", it then exposes them to other rules which seek to cp or symlink those files around.

This fails due to errors such as:

FileNotFoundError: [Errno 2] No such file or directory: 'bazel-out/darwin_arm64-fastbuild/bin/tools/runnable/examples/python/_example_app_bin.venv/bin/python3'

This only occurs with build --@rules_python//python/config_settings:bootstrap_impl=script.

Describe the solution you'd like

It sounds like I generally need to re-work my rules, because accepting the heterogeneous inputs of PyInfo alone vs PyInfo AND PyExecutableInfo will need some better handling on my end. Ideally, though, there would be a canonical way for me to remove any extra things that are only present because a dependency is a binary.

Please see this Slack thread for extra context: https://bazelbuild.slack.com/archives/CA306CEV6/p1759447665927069?thread_ts=1759177072.316529&cid=CA306CEV6

If PyExecutableInfo was modified to add something like

PyExecutableInfo = provider(
    fields = {
        # ... existing fields ...
        "content_files": """
        :type: depset[File]
        The actual content files (sources, data) without runtime scaffolding.
        This excludes venv directories, bootstrap scripts, and executable wrappers.
        """,
    }
)

then I think my use case would be unblocked relatively nicely.

Describe alternatives you've considered

Right now I have added some workarounds to my custom rule to check for the presence of PyExecutableInfo and then do some heuristic filtering of paths based on patterns that I've seen.

This seems to work but feels non-ideal for obvious reasons:

def _is_runtime_artifact(file):
    """
    Identify files that are runtime scaffolding, not actual content.

    This cannot identify the actual shell script that `py_binary`
    generates, but that is largely static and small, and more importantly it exists
    at build time, so there is no real harm in including it.

    The bootstrap script is also harmless to keep in but we remove it for completeness.
    """
    path = file.short_path
    basename = file.basename

    # Venv artifacts added by `rules_python` with bootstrap_impl=script
    if path.endswith(".venv/bin/python3"):
        return True

    # Bootstrap scripts added by `rules_python` with bootstrap_impl=script
    if basename.endswith("_stage2_bootstrap.py"):
        return True

    return False

FrankPortman avatar Oct 06 '25 19:10 FrankPortman

At a high level, my thinking is:

  • PyInfo looks the same as if py_library had been used
  • PyExecutableInfo has additional fields to re-compose the executable. i.e. given PyInfo and PyExecutableInfo, you should be able to (1) re-create what py_binary does and (2) derive a new one without having to do expensive analysis phase (e.g. depset-flattening) operations.

Where it gets tricky is DefaultInfo. That has to have a default output of the executable, and its runfiles have to contain the required executable files (executable, stage2 bootstrap, venv files, python runtime files, and transitive runfiles). The only way I can see to deal with this is for PyExecutableInfo to have a field with an alternative DefaultInfo, so that one can then do:

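# t is a dependency Target. The library_only_* fields are proposed, not existing.
runfiles = []
outputs = []

if PyExecutableInfo in t:
  runfiles.append(t[PyExecutableInfo].library_only_runfiles)
elif PyInfo in t:
  runfiles.append(t[DefaultInfo].default_runfiles)

# or
if PyExecutableInfo in t:
  outputs.append(t[PyExecutableInfo].library_only_default_outputs)
else:
  outputs.append(t[DefaultInfo].files)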

I'm not entirely sure how the "binary specific files" need to be broken out on PEI (PyExecutableInfo) though. There are the following "categories" of files I can think of:

  • executable itself
  • stage2 bootstrap
  • venv bin/python3 file (this seems "special" compared to other bin/ files)
  • venv bin/<other entries> files
  • venv site-packages files
  • venv _bazel_site_init.pth and bazel_site_init.py
  • Python interpreter runtime

That's a lot of fields, though. I'm not sure which are too broad or too granular. Some are specific to the particular toolchain implementation (e.g. stage2 bootstrap).

Something that might help are some concrete use cases for the types of transforms/derivations being done. Some things that have crossed my mind:

  • "merging" multiple binaries into a single executable
  • Creating a self-contained executable (of which there are many impls)
  • Depending on a binary, but consuming it like a library

rickeylev avatar Oct 08 '25 19:10 rickeylev

A few raw thoughts if it helps.

executable itself

This is the shell script? Yeah, this is tough to remove automatically on my end since it is named after the binary, although perhaps I can do some extra magic inside my internal rule to derive that once PEI is detected. I probably wouldn't bother in my internal rules though, because that shell script is harmless enough for now.

stage2 bootstrap

Easy to detect manually.

venv bin/python3 file (this seems "special" compared to other bin/ files)

Yep it does seem special - do you have any idea why? This is the one that is always broken as "missing" if my rule tried to naively cp data around.

venv bin/<other entries> files
venv site-packages files

Funnily enough, for my use case, I separate between internal deps and external + pypi deps (with the help of a rules_python patch), and so all the other stuff just falls out pretty naturally as being "unowned" by my repo. So, for now it's not bothering me, and depending on how it is fixed it may actually break my custom rule (but I assume I can adapt).

venv _bazel_site_init.pth and bazel_site_init.py

These are treated as owned by my repo and thus included in my artifacts, but seem harmless enough to include. But I think they can be detected statically if I feel strongly - need to double check.

Python interpreter runtime

Don't think I've seen this one.

Something that might help are some concrete use cases for the types of transforms/derivations being done. Some things that have crossed my mind:

  • "merging" multiple binaries into a single executable
  • Creating a self-contained executable (of which there are many impls)
  • Depending on a binary, but consuming it like a library

I can give you a high level description of my use cases. We have been treating py_binary as a canonical concept of a "runnable" that we then hand to an internal rule which dresses it up, exposes the underlying local py_binary for bazel run purposes, and then stubs out some providers for downstream rules to consume for use cases such as:

  • Create a wheel that has all transitive internal deps in the src files, all transitive external deps in a requirements.txt file, and the main of the py_binary as the entrypoint
    • This is related to the internal vs external deps and rules_python patch described above in this comment
  • Bake the runnable/binary + main entrypoint into an image that we run as a container app
  • Bake the runnable/binary + main entrypoint into an Azure Function Apps project (that has its own specific directory structure and deploy process)

The biggest alternative I have considered so far, and I am curious if you think this is just worth moving to in terms of simplifying everything, is to just create a new macro which has all the same attrs as py_binary, but makes sure to stub out a py_library internally, so that a lot of the headaches described in this issue go away. Maybe this is just the path forward - I like the idea of depending on a py_binary as it respects the foundational unit from rules_python, but if wrapping it internally clears away a lot of headaches, I am very open to it.

FrankPortman avatar Oct 08 '25 20:10 FrankPortman

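# t is a dependency Target. The library_only_* fields are proposed, not existing.
runfiles = []
outputs = []

if PyExecutableInfo in t:
  runfiles.append(t[PyExecutableInfo].library_only_runfiles)
elif PyInfo in t:
  runfiles.append(t[DefaultInfo].default_runfiles)

# or
if PyExecutableInfo in t:
  outputs.append(t[PyExecutableInfo].library_only_default_outputs)
else:
  outputs.append(t[DefaultInfo].files)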

Something like this would be fine for me I think. I think it's a reasonable cost for custom rules if they want to support both types of inputs.

FrankPortman avatar Oct 08 '25 20:10 FrankPortman

For the executable, you can use DefaultInfo.files_to_run.executable to identify it

venv bin/python3 gives errors if cp'd around

Under the hood, with bootstrap_impl=script, relative symlinks are used to make bin/python3 refer back to the underlying python interpreter elsewhere in runfiles (e.g. $runfiles/foo.venv/bin/python3 -> ../../+python_linux_x86/bin/python). The same is done for other entries in the venv (when --venvs_site_package_libs=yes). If you're on Bazel 8, you can identify such files with File.is_symlink.
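
To illustrate why a copied bin/python3 can show up as "missing", here is a minimal sketch in plain Python. It assumes the consuming tool copies the symlink itself rather than its target. The directory names (runtime, app.venv, staging) are made up for the demo; only the shape of the relative link follows the description above.

```python
import os
import shutil
import tempfile


def demo_dangling_relative_symlink():
    """Show that a relative symlink resolves only while its surrounding layout is kept."""
    root = tempfile.mkdtemp()

    # Hypothetical runfiles-like layout: an interpreter plus a venv that links to it.
    runtime_bin = os.path.join(root, "runtime", "bin")
    venv_bin = os.path.join(root, "app.venv", "bin")
    os.makedirs(runtime_bin)
    os.makedirs(venv_bin)
    with open(os.path.join(runtime_bin, "python"), "w") as f:
        f.write("#!/bin/sh\n")

    # app.venv/bin/python3 -> ../../runtime/bin/python (relative, like the venv link).
    link = os.path.join(venv_bin, "python3")
    os.symlink(os.path.join("..", "..", "runtime", "bin", "python"), link)
    original_ok = os.path.exists(link)

    # Copy only the link (not its target) into an unrelated directory.
    dest_dir = os.path.join(root, "staging", "deep", "bin")
    os.makedirs(dest_dir)
    copied = os.path.join(dest_dir, "python3")
    shutil.copy(link, copied, follow_symlinks=False)

    # os.path.exists follows the link; the relative target no longer resolves.
    copied_ok = os.path.exists(copied)

    shutil.rmtree(root)
    return original_ok, copied_ok
```

In this sketch the first value reports whether the link resolves in its original location, and the second whether it still resolves after being copied elsewhere. Opening the copied path would fail with the same kind of FileNotFoundError as in the report.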

venv bin/python3 is special, but why

Ah, I think because the first stage bootstrap relies on it. It goes looking for the interpreter to use, so it has e.g. _main/foo.venv/bin/python3 embedded into it as the place to go looking. If one were going to derive a new executable, then you'd need to know the path to bin/python3 (for most cases, anyways).

rickeylev avatar Oct 08 '25 21:10 rickeylev

Interesting - I wonder if my workaround can be pretty principled then by using DefaultInfo.files_to_run.executable for the executable, checking for symlinks for the .venv/bin/python3, and then relying on my existing separation of internal vs external deps for the site-packages stuff. It's a mouthful but might get me what I need. I am mainly focused on the "Create a wheel with internal vs external deps + entrypoint" case for now though, so maybe I am missing something about the other cases I described (which are more forward looking for us).
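
A rough sketch of that combined filter, written in the subset of Python that is also valid Starlark so it can be tried without Bazel. The function names are hypothetical, not part of rules_python. It assumes File.is_symlink is only available on Bazel 8 and treats it as False when absent.

```python
def is_binary_scaffolding(file, executable):
    """Return True if `file` looks like py_binary scaffolding rather than content."""

    # The launcher itself, as reported by DefaultInfo.files_to_run.executable.
    if executable != None and file.path == executable.path:
        return True

    # Declared symlinks such as <name>.venv/bin/python3. The attribute is
    # assumed to exist only on Bazel 8; older versions fall through as False.
    if getattr(file, "is_symlink", False):
        return True

    # Stage-two bootstrap generated with bootstrap_impl=script.
    if file.basename.endswith("_stage2_bootstrap.py"):
        return True

    return False


def content_files(files, executable):
    """Keep only the files that are not recognised as binary scaffolding."""
    return [f for f in files if not is_binary_scaffolding(f, executable)]
```

Inside a rule, `files` would presumably come from something like dep[DefaultInfo].default_runfiles.files.to_list() and `executable` from dep[DefaultInfo].files_to_run.executable. Note that to_list() flattens the depset during analysis, which is the expensive kind of operation mentioned earlier in the thread, so this stays a workaround rather than a replacement for a dedicated provider field.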

FrankPortman avatar Oct 08 '25 21:10 FrankPortman

@rickeylev I'd be open to taking a stab at a PR for this if you can share some guidance on what the providers should look like.

FrankPortman avatar Dec 02 '25 13:12 FrankPortman