rules_python icon indicating copy to clipboard operation
rules_python copied to clipboard

Runfiles files under `.venv/site-packages` can double in size

Open shayanhoshyari opened this issue 1 month ago • 5 comments

🐞 bug report

Affected Rule

py_binary with

common --@rules_python//python/config_settings:bootstrap_impl=script
common --@rules_python//python/config_settings:venvs_site_packages=yes
common --@rules_python//python/config_settings:venvs_use_declare_symlink=yes

Is this a regression?

Hard to tell. I think it might be unintended behavior.

Description

For building docker images, it is very common to make .tar files.

Currently, tar.bzl (used for aspect_rules_py, rules_oci, ... family) does not support symlinks, but there an ongoing thread about it: https://github.com/bazel-contrib/tar.bzl/issues/16

A suggestion by @rickeylev is to use the File.is_symlink field to determine if a file is a symlink and just write as-is as symlink and preserve the relative path.

I did implement a basic tar rule equivalent based on this idea at: https://github.com/bazel-contrib/rules_python/issues/3388#issuecomment-3615363362

However, with current result, the torch .so files return File.is_symlink = False, so we will end up writing the file to the tar twice (once the external runfile and once the one under .venv), and it can double the size of the tar (image layer).

I can workaround this in my tar rule, but was wondering if it is intended behavior in first place.

🔬 Minimal Reproduction

https://github.com/shayanhoshyari/issue-reports/tree/main/rules_python/venv_tar

🔥 Exception or Error

No exception, but files like this are returning File.is_symlink = False

- path: bazel-out/darwin_arm64-fastbuild/bin/_test.venv/lib/python3.13/site-packages/torch/_C.cpython-313-darwin.so
- short_path: _test.venv/lib/python3.13/site-packages/torch/_C.cpython-313-darwin.so
- is_symlink = False
- target: /private/var/tmp/_bazel_hoshyari/0915edf9aa0fc6e009ba2664be204381/execroot/_main/bazel-out/darwin_arm64-fastbuild/bin/_test.venv/lib/python3.13/site-packages/torch/_C.cpython-313-darwin.so

Other files (that are just folders and don't have file resolution) correctly have is_symlink = True

- path: bazel-out/darwin_arm64-fastbuild/bin/_test.venv/lib/python3.13/site-packages/filelock
- short_path: _test.venv/lib/python3.13/site-packages/filelock
- is_symlink = True
- target: ../../../../../rules_python++pip+pypi_313_filelock_py3_none_any_d38e3048/site-packages/filelock

🌍 Your Environment

Operating System: MacOS Sonoma (same is on Ubuntu 22.04) Output of bazel version: 8.4.2 Rules_python version: 1.7.0

More info

I tried https://github.com/bazel-contrib/rules_python/pull/3440, and that seems to return is_symlink = True but have same folder structure. How I tried it: link

shayanhoshyari avatar Dec 06 '25 06:12 shayanhoshyari

This is WAI to work around an issue with how the dynamic linker processes symlinks to shared libraries. Basically, if only the top-level directory is symlinked (site-packages/foo), then the dynamic linker computes the wrong directory for $ORIGIN and then e.g. libtorch.so looks in the wrong place for more shared libraries.

is_symlink returns False

This is WAI and a Bazel behavior. Though ctx.actions.symlink() is used, Bazel doesn't guarantee that a symlink will be used.

it doubles the size of the tar

This is unintended. What's probably happening is the set of runfiles returned by the library target has everything (e.g. including libtorch.so) (this is the historical behavior), when it should actually omit entries that will ultimately be put into the venv using ctx.actions.symlink().

rickeylev avatar Dec 06 '25 07:12 rickeylev

This is WAI to work around an issue with how the dynamic linker processes symlinks to shared libraries. Basically, if only the top-level directory is symlinked (site-packages/foo), then the dynamic linker computes the wrong directory for $ORIGIN and then e.g. libtorch.so looks in the wrong place for more shared libraries

Makes sense, and agreed. I think I suggested this myself too at https://github.com/bazel-contrib/rules_python/issues/3228#issuecomment-3560901945.

This is WAI and a Bazel behavior. Though ctx.actions.symlink() is used, Bazel doesn't guarantee that a symlink will be used.

It seems we don't use ctx.actions.symlink for these files, but instead just pass those files extra_runfiles + ctx.runfiles(symlinks=...). I tried to use ctx.actions.symlink here https://github.com/bazel-contrib/rules_python/pull/3440 just for demonstration purposes, I see the files start returning File.is_symlink=True.

I suspect there is a good (e.g. performance) reason for using ctx.runfiles(symlinks=...) that I am missing of course :)

shayanhoshyari avatar Dec 06 '25 10:12 shayanhoshyari

runfiles symlinks are used, presumably for performance reasons

Yes. The runfiles symlinks attribute is used because it lowers the number of actions run and artifacts created/tracked. Some users reported O(millions) of actions without it (quite plausible with O(hundreds) of tests, each with O(tens-of-thousands) of files). (I didn't profile the two, but not using runfiles.symlinks means there's a numTargets coefficient on how many files and artifacts are explicitly created and tracked at the starlark level).

You can detect these by looking at the runfiles.symlinks attribute, which is depset of SymlinkEntry objects (https://bazel.build/rules/lib/builtins/SymlinkEntry) (basically a (path, File) tuple). Since you want to put them into a tar, just copy the runfiles_root_path and relative_path logic in that block of code you modified to transform them into something to put into the tar file. If you're packaging just a single thing (i.e. a single binary), you could simplify by dereferencing the symlinks entirely (ie. store a copy of bin.venv/foo/bar.py instead of having it be a symlink to ../../+whatever+pypi_foo/site-packages/foo.py).

Anyways, going back to the original problem, this does point out a duplication issue, even separate from tar. I'd have to go check, but i'm pretty sure whats going to happen is this file structure gets materialized:

runfiles/
  pypi_torch/
    site-packages/
      libtorch.so -> symlink to actual file
  bin.venv/
    site-packages/
      libtorch.so -> symlink to actual file

But, that first entry under pypi_torch is unnecessary. It's just a symlink, but the same will happen for the other 10,000 files in torch (similar for airlock, or other packages that install into the same directory). More ideally, the pypi_torch directory would only be populated with files that a relative symlink (via declare_symlink) within the venv needs (files via runfiles.symlinks don't need it). Fixing that should be fairly straight forward (keep track of files a relative-symlink needs and compute the final runfiles later instead of propagating the full set of them).

rickeylev avatar Dec 08 '25 06:12 rickeylev

Thank you for the details :)

You can detect these by looking at the runfiles.symlinks attribute, which is depset of SymlinkEntry objects (https://bazel.build/rules/lib/builtins/SymlinkEntry) (basically a (path, File) tuple).

Today I learned!

Before knowing this I started matching the path after site-packages heuristic to decide if something has to be a symlink. (even if File.is_symlink for a runfile is false, check if a workspace=external runfile has same path after site-packages, and if so, still make a relative symlink to that)

shayanhoshyari avatar Dec 08 '25 15:12 shayanhoshyari

Forgot to add: I think a simple fix that might help this tar case is to switch the DSO venv-path-logic to always use link_to_path instead of link_to_file (basically set link_to_file=None at https://github.com/bazel-contrib/rules_python/blob/89c8f204daeb588bfb34e142555b2a322a3a7126/python/private/venv_runfiles.bzl#L271). That would force the file in the venv to be a relative symlink.

It won't completely eliminate the possibility of the problem, though. Ultimately, if overlapping paths are detected, it has to fallback to using runfiles.symlinks and linking files individually using "regular" symlinks instead of "raw" symlinks.

After https://github.com/bazel-contrib/rules_python/pull/3448, the incidence of that should be much less (and at the least, it won't cause torch to always conflict with itself and use runfiles.symlinks)

rickeylev avatar Dec 10 '25 17:12 rickeylev