rules_python icon indicating copy to clipboard operation
rules_python copied to clipboard

`py_binary` targets added to `py_wheel.deps` end up including the binary in the wheel. Sounds like WAI but it was surprising to me

Open dougthor42 opened this issue 8 months ago β€’ 8 comments

🐞 bug report

Affected Rule

  • py_wheel

Is this a regression?

Nope, not that I can tell at least.

Description

Adding a py_binary target as a dep to py_wheel results in both the binary's srcs being added to the wheel and the binary itself. It's this latter part that is unexpected (at least to me, haha).

Maybe I'm missing an option that causes the bazel binaries to be excluded?

πŸ”¬ Minimal Reproduction

py_binary(
    name = "foo",
    srcs = ["foo.py"],
)

py_wheel(
    ...,
    distribution = "our_tables",
    deps = [":foo"],
)
$ less bazel-bin/path/to/foo/foo-0.0.1-py3-none-any.whl | cat
Metadata-Version: 2.1
Name: our_tables
Version: 0.0.1

Archive:  bazel-bin/path/to/foo/our_tables-0.0.1-py3-none-any.whl
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   24451  Defl:N     7964  67% 1980-01-01 00:00 2d00cf53  our_tables/foo
    5124  Defl:N     1787  65% 1980-01-01 00:00 4846e197  our_tables/foo.py
      91  Defl:N       86   6% 1980-01-01 00:00 4f1122ec  our_tables-0.0.3.dist-info/WHEEL
    9031  Defl:N     3390  63% 1980-01-01 00:00 e00f51c6  our_tables-0.0.3.dist-info/METADATA
     698  Defl:N      445  36% 1980-01-01 00:00 8c16618f  our_tables-0.0.3.dist-info/RECORD
--------          -------  ---                            -------
   47408            16577  65%                            5 files

It's that our_tables/foo that's strange to me. IMO it should not be included by default.

πŸ”₯ Exception or Error

N/A

🌍 Your Environment

Operating System:

gLinux

Output of bazel version:

$ bazel version
Bazelisk version: v1.20.0
Build label: 7.6.1
Build target: @@//src/main/java/com/google/devtools/build/lib/bazel:BazelServer
Build time: Mon Mar 31 17:08:56 2025 (1743440936)
Build timestamp: 1743440936
Build timestamp as int: 1743440936

Rules_python version:

1.4.1

Anything else relevant?

dougthor42 avatar May 22 '25 03:05 dougthor42

In some sense I think this is working as intended, because the executable is in the default outputs group.

Maybe what we should change is to not allow py_binary and py_test rules to be added to the py_wheel, because actually what you want is only py_library files in the wheel. However, that is a breaking change.

Maybe the py_wheel should instead only depend on the PyInfo provider (or something similar) to make it just work. In some sense I imagine users using py_binary as a py_library and not being able to package your code without changing your project structure sounds like a huge rough corner.

That said, I would not call the current behaviour a bug.

aignas avatar May 23 '25 01:05 aignas

Yeah, I kinda figured it was WAI.

Maybe what we should change is to not allow py_binary and py_test rules to be added to the py_wheel, because actually what you want is only py_library files in the wheel. However, that is a breaking change.

Agreed, this is (a) breaking and (b) not the desired behavior. There are plenty of cases (legitimate and illegitimate) where a binary target is used as a library. For the illegitimate case, typically what happens is that a library has if __name__ slapped inside it for some debugging or local development work and now it's considered a "binary".

Maybe the py_wheel should instead only depend on the PyInfo provider (or something similar) to make it just work.

When you say "just" work - do you mean "barely" work? Or do you mean "it works and is not complicated"?

I imagine users using py_binary as a py_library and not being able to package your code without changing your project structure sounds like a huge rough corner.

Yup. I try to tell people to make a stub-style binary but it's an uphill battle to say the least.


My thought was maybe we could add a flag like py_wheel.include_binaries: bool = True. When False, the binaries (note: not the .py files) are stripped from the sdist/wheel contents before bundling. Does that sound like a reasonable (and feasible) solution?

dougthor42 avatar May 23 '25 03:05 dougthor42

When you say "just" work - do you mean "barely" work? Or do you mean "it works and is not complicated"?

Either it has to "work and not be complicated" or it has to not be here at all. πŸ˜†

Maybe we could use the PyInfo provider, but I can't remember if it includes the .so files and similar. We have a need in general to be able to tell the venv machinery to put certain files into certain places, so maybe this will be easier once we solve the venv building in #2156.

aignas avatar May 23 '25 04:05 aignas

I'm surprised that only the foo executable shows up. I would expect the entire Python runtime to also show up because the runfiles should be merged in. Is the posted example using workspace instead of bzlmod?

An executable can be detected by looking for PyExecutableInfo, however, that won't be sufficient. The binary's runfiles may contain the Python runtime, which can't easily be sorted out from the regular runfiles that should be included.

To do this properly, there has to be a stronger distinction between the "binary" and "library" parts of a py_binary target. Something like:

  • Make py_binary produce a PyInfo the same as if it was a py_library
  • Move the binary-specific parts into PyExecutableInfo
  • Both PyInfo and PyExecutableInfo need separate attributes for their respective runfiles
  • Consumers, like py_wheel, can then choose whether to use it as a library, binary, or generic target.

rickeylev avatar May 23 '25 23:05 rickeylev

Posted example is bzlmod.

I dug a little deeper. The binary our_tables/foo is the boilerplate that starts with:

#!/usr/bin/env python3

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys

# The Python interpreter unconditionally prepends the directory containing this
...

I would expect the entire Python runtime to also show up because the runfiles should be merged in.

It looks like the only files that are merged in are:

https://github.com/bazel-contrib/rules_python/blob/3aea414aa2e5f1e0e915f58a434c01428a90382c/python/private/py_wheel.bzl#L329

The direct_pyi_files are:

DEBUG: .../py_wheel.bzl:328:10: ctx.files.deps=[<source file src/doug/foo.py>, <generated file src/doug/foo>]
DEBUG: .../py_wheel.bzl:329:10: Direct pyi files=[]
DEBUG: .../py_wheel.bzl:333:10: inputs_to_package=depset([<source file src/doug/foo.py>, <generated file src/doug/foo>])

I don't yet know enough about internals, but maybe a very naive "drop if is generated file and does not have .py extension" check might be sufficient?

We'll want to be careful about generated files that are not generated from a py_binary, though. There are rare but legitimate causes for generating a python library file (but I'm not aware of any cases where the generated file would not have the .py extension and you'd still want it as part of the wheel).

dougthor42 avatar May 24 '25 03:05 dougthor42

So I have no idea what the impact of this is, but this patch results in no foo binary being added to the wheel (patch base: 1.4.1). Like, I'd imagine that this would probably break data dependencies being added, but it's a start!

diff --git a/python/private/py_wheel.bzl b/python/private/py_wheel.bzl
index c196ca6a..991d6555 100644
--- a/python/private/py_wheel.bzl
+++ b/python/private/py_wheel.bzl
@@ -337,6 +337,8 @@ def _py_wheel_impl(ctx):
     packageinputfile = ctx.actions.declare_file(ctx.attr.name + "_target_wrapped_inputs.txt")
     content = ""
     for input_file in inputs_to_package.to_list():
+        if not input_file.is_source:
+            continue
         content += _input_file_to_arg(input_file) + "\n"
     ctx.actions.write(output = packageinputfile, content = content)
     other_inputs.append(packageinputfile)
$ less bazel-bin/src/doug/our_tables-0.0.1-py3-none-any.whl | cat
Metadata-Version: 2.1
Name: our_tables
Version: 0.0.1

Archive:  bazel-bin/src/doug/our_tables-0.0.1-py3-none-any.whl
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      62  Defl:N       62   0% 1980-01-01 00:00 3431507e  doug/foo.py
      91  Defl:N       86   6% 1980-01-01 00:00 4f1122ec  our_tables-0.0.1.dist-info/WHEEL
      63  Defl:N       56  11% 1980-01-01 00:00 6d200e48  our_tables-0.0.1.dist-info/METADATA
     274  Defl:N      200  27% 1980-01-01 00:00 d92b0407  our_tables-0.0.1.dist-info/RECORD
--------          -------  ---                            -------
     490              404  18%                            4 files

dougthor42 avatar May 24 '25 03:05 dougthor42

I'd imagine that this would probably break data dependencies being added

Actually it looks like it's OK for the basic data cases such as filegroup targets or direct file targets:

$ bazel build //src/doug:wheel
$ less bazel-bin/src/doug/our_tables-0.0.1-py3-none-any.whl | cat
Metadata-Version: 2.1
Name: our_tables
Version: 0.0.1

Archive:  bazel-bin/src/doug/our_tables-0.0.1-py3-none-any.whl
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      10  Defl:N       12 -20% 1980-01-01 00:00 c801f919  doug/data.txt
      62  Defl:N       62   0% 1980-01-01 00:00 3431507e  doug/foo.py
      15  Defl:N       17 -13% 1980-01-01 00:00 d27ea5b1  doug/more_data.txt
      91  Defl:N       86   6% 1980-01-01 00:00 4f1122ec  our_tables-0.0.1.dist-info/WHEEL
      63  Defl:N       56  11% 1980-01-01 00:00 6d200e48  our_tables-0.0.1.dist-info/METADATA
     420  Defl:N      293  30% 1980-01-01 00:00 942311a4  our_tables-0.0.1.dist-info/RECORD
--------          -------  ---                            -------
     661              526  20%                            6 files

But data targets that are the result of a genrule or other generated file will be excluded. This is expected given the naivetΓ© of the patch.

Obviously the 4-point description that @rickeylev mentioned in https://github.com/bazel-contrib/rules_python/issues/2923#issuecomment-2906039765 is the more robust way to go about things. I just don't have the knowledge to implement such πŸ™ƒ.

dougthor42 avatar May 24 '25 05:05 dougthor42

Oh, huh. Looking at the py_wheel code, it doesn't take the run files? Just the immediate default outputs? That's surprising. It is beneficial for this situation though -- we don't have to worry about transitive run files merging so just have to split the immediate binary specific outputs out of PyInfo and into PyExecutableInfo

rickeylev avatar May 24 '25 05:05 rickeylev

Oh, huh. Looking at the py_wheel code, it doesn't take the run files? Just the immediate default outputs? That's surprising. It is beneficial for this situation though -- we don't have to worry about transitive run files merging so just have to split the immediate binary specific outputs out of PyInfo and into PyExecutableInfo

The problem with this is it then is missing any data dependencies which may be required e.g. a python C extension. In general you can't reproduce the correct configuration for the data attribute at the py_wheel stage and so really py_wheel and conversely py_package should be packaging runfiles / data as is as its the only way to get the correct configuration for these targets in the general case.

jsharpe avatar Jun 25 '25 13:06 jsharpe