bazel Runfile collisions across transitions

Description of the bug:

When building up runfiles there can be cases where the 'short_path' is identical yet the actual file is different due to transitions. So when a rule is building up runfiles using ctx.runfiles only one of the files that collides will be present in the runfiles.
The collision can be silent, giving no errors or indications it has occurred.

The attached zip contains a project workspace that recreates an issue with runfiles collisions when there are transitions involved.

For example:

DEBUG: /Users/****/repos/overwrote_runfiles/rules.bzl:92:18: File:<generated file hello_world.h> short_path:hello_world.h path:bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/hello_world.h
DEBUG: /Users/****/repos/overwrote_runfiles/rules.bzl:92:18: File:<generated file hello_world.h> short_path:hello_world.h path:bazel-out/darwin_arm64-fastbuild-ST-3af09e09b98c/bin/hello_world.h

This workspace is likely more complex than needs to be but it approximates a real world scenario that exhibited these collisions.

The basics of it are there is a reporting rule that has dependencies on test rules. These test rules run binaries but each test uses different transitions, one is the 'exec' and another a custom transition. So the outputs of the binary are dependant on the incoming transition.

An aspect collects the source from the test targets. The reporting rule in this example does not do anything with the sources, but in our real implementation the reporting gear uses sources, objects and test outputs to create a global report about all the tests.

The issue with the collisions causes the following to happen when runfiles are created:

Silent collisions: Last file in the set wins
Symlink 'overwrote runfile': When creating runfiles with 'symlinks' then the collisions are logged as a error but do not fail. See https://github.com/bazelbuild/bazel/issues/4965

For our implementation we were able to squelch the error by filtering before creating runfiles, detecting collisions, and only using the last File in the runfiles. This only works because the files that are produced as of the transitions happen to be identical. This may not be true for other situations as the transition might cause differences in the files created.

To recreate the issue

bazel run -s //:report

Debug statements show the File objects that ultimately end up as runfiles to the reporting rule.

DEBUG: /Users/****/repos/overwrote_runfiles/rules.bzl:92:18: File:<generated file hello_world.h> short_path:hello_world.h path:bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/hello_world.h
DEBUG: /Users/****/repos/overwrote_runfiles/rules.bzl:92:18: File:<generated file hello_world.h> short_path:hello_world.h path:bazel-out/darwin_arm64-fastbuild-ST-3af09e09b98c/bin/hello_world.h

If symlinks are used in the runfiles, the you will get the error that is not an error. See https://github.com/bazelbuild/bazel/issues/4965

ERROR: /Users/****/repos/overwrote_runfiles/BUILD.bazel:41:7: overwrote runfile hello_world.h, was symlink to bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/hello_world.h, now symlink to bazel-out/darwin_arm64-fastbuild-ST-3af09e09b98c/bin/hello_world.h

And in the action itself we can see only the last file added to the set of runfiles is symlinked into the actions space.

Running Report
total 0
lrwxr-xr-x  1 ****  staff  151 29 Sep 09:47 hello_world.h -> /Volumes/BazelOut/_bazel_****/42a766200e36a3e8369aec1f430cf379/execroot/__main__/bazel-out/darwin_arm64-fastbuild-ST-3af09e09b98c/bin/hello_world.h

In my specific scenario, I actually don't really need the runfiles, as I have all the paths to the actual outputs, but I do need the actions to produce the outputs based on the 'test' dependencies, so the report can access them.

Ultimately when having runfiles there should be no way to have collisions across transitions, however I do not see how that can be possible with the way runfiles are implemented today.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Using the attached workspace zip file

bazel run -s //:report

Which operating system are you running Bazel on?

MacOS

What is the output of `bazel info release`?

No response

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

release 5.0.0

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

No response

Have you found anything relevant by searching the web?

https://github.com/bazelbuild/bazel/issues/4965 is related but only for the error/waring Bazel emits.

Any other information, logs, or outputs that you want to share?

runfile_collisions_across_transisitions.zip

Oct 03 '22 17:10 octalthorpe

For an easy-to-reproduce example of this within Google, see //third_party/py/elfzipimport:cython_par_test. (or in general, put any 2 py_binary in data of a rule type that consumes default runfiles (not data runfiles) and turn stamping on).

The (Google variant of the) Python rules do symlinks={"<fixedpath>": <target-specific file>}. When multiple binaries are built together, runfiles merging causes them to want to use the same fixed symlink path, as OP describes. (I won't defend this implementation too much, nor has it been an issue in practice; it's just a convenient example).

Oct 27 '22 23:10 rickeylev

Assigning to me to triage.

Oct 27 '22 23:10 haxorz

Related: https://github.com/bazelbuild/bazel/issues/15029

Oct 28 '22 07:10 lberki

https://github.com/bazelbuild/bazel/issues/13281#issuecomment-1057001554 is also related. It is slightly simpler in that the consumer of the transitioned runfiles is an ordinary native rule, not an aspect.

Oct 28 '22 08:10 fmeum

#13281 is different, I think: it's not about runfiles, but about unnecessarily building things twice.

Oct 28 '22 12:10 lberki

@lberki The original problem described in the issue is different, but the particular comment I linked to starts a sub-discussion about another situation in which runfiles collide due to transitions. As to why that was brought up as part of the other issue I don't know.

Oct 28 '22 12:10 fmeum

The comment in #13281 that is worrisome

As an aside, I'm surprised to read about collision at https://github.com/aherrmann/bazel-transitions-demo#gotchas. I didn't look super-thoroughly, but that should be hard to actually do in practice (i.e. collisions shouldn't really happen unless someone is egregiously abusing a rule). Is that specific to pkg_tar?

IMHO I am not abusing a rule, this is fundamentally a problem with transitions and runfiles and the missing ability to expose these transitions. This is likely a really tough problem as there needs to be some way to isolate the runfiles under separate trees (for each transition), and then having some way to expose that to the action for consumption.

Oct 28 '22 13:10 octalthorpe

@octalthorpe I agree that you are not abusing anything, it's a shortcoming of the current runfiles system.

I don't think the solution necessarily has to be complicated though. We could roughly do the following:

When runfiles conflict, additionally stage them under repo/bazel-out/config/bin/pkg/file in the runfiles, where they don't conflict.
Pass this path into the binary (in this case the report tool) via location expansion in args or env or by injecting paths computed in Starlark, depending on whether it runs as a test or in an action.
Extend the runfiles libraries to fall back to stripping the bazel-out/config/bin part if they don't find a path that contains it.

Oct 28 '22 13:10 fmeum

Sorry, I haven't had time to fully research this. It's still on my list of things to do though.

Dec 02 '22 17:12 haxorz

Here's a minimal-ish repro I put together after just running into this in another context:

# BUILD.bazel
py_binary(
    name = "test-py-binary-1",
    main = "source.py",
    srcs = ["source.py"],
    deps = [],
    python_version = "PY3"
)

genrule(
    name = "my-genrule",
    outs = ["source.py"],
    cmd = select({
        "@rules_python//python:PY2": '''echo "print 'Py2!'" > $(location source.py)''',
        "@rules_python//python:PY3": '''echo "print('Py3!')" > $(location source.py)'''
    })
)


py_binary(
    name = "test-py-binary-2",
    main = "source.py",
    srcs = ["source.py"],
    deps = [],
    python_version = "PY2"
)

sh_test(
    name = "test-py-shell",
    srcs = ["test-py-shell.sh"],
    data = ["test-py-binary-1", "test-py-binary-2"],
    tags = ["no-sandbox"]
)

# test-py-shell.sh

echo "RUNNING PY2 BINARY"
bazel/artifacts/test-py-binary-2

echo "RUNNING PY3 BINARY"
bazel/artifacts/test-py-binary-1

$ bazel test //:test-py-shell 
RUNNING PY2 BINARY
Py2!
RUNNING PY3 BINARY
  File "bazel-out/k8-fastbuild/bin/test-py-shell.runfiles/universe/source.py", line 1
    print 'Py2!'
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print('Py2!')?

The specific issue here is that test-py-binary-1 and test-py-binary-2 have different transitions for python_version, so they depend on py3/py2 versions of my-genrule, which generates compatible versions of the source.py script for each one. However, when put together in the same test-py-shell, the the py3 and py2 versions of source.py both get stuffed into the same .runfiles folder at the same path, and only one can win.

In this case, we arbitrarily end up with the py2 version of source.py, which then is run by both Python2 and Python3 interpreters and ends up failing with a syntax error in the latter

Oct 05 '23 09:10 lihaoyi-databricks

Hi there, I am hitting the same issue. Minimal description is transitioning on a build property that is not the cpu, therefore hitting this same runfile collision.

Try my best, I can't come up with even a hacky workaround as a client. Any ideas for how this could be worked around from the rule author perspective?

I imagine tremendous hacks like:

Defining new cpu names
prefixing every output file to match its transition value
Something like described above:

When runfiles conflict, additionally stage them under repo/bazel-out/config/bin/pkg/file in the runfiles, where they don't conflict.

(however, I don't think 3. is accomplishable without changes to bazel itself)

Nov 16 '23 00:11 mjj48

So I've started taking a look at this issue (and sorry about the reference spam above, I've fixed that).

I have something that locally will generate separate runfile directory per transition configuration - that is a fairly surgicasl change on the runfiles directory path: https://github.com/toumorokoshi/bazel/commit/7951e6fd1b944de300dc6146a537a2fec447b2a5#diff-620b5e3a3825cd7a9fbdce26209c2f1d95976761717a612e3a354e6d80edbda8R768. I'm still working on fully verifying it fixes the issue locally.

The solution in https://github.com/bazelbuild/bazel/issues/16379#issuecomment-1295028175 makes sense too - but I wonder it it just makes sense to always use the full path for the runfiles directory as I've done above? I suppose it may accidentally duplicate content when the files themselves don't conflict - but looking at the code it seems like you would always have a conflict if you build a particular file under two different transitions.

Apr 15 '24 21:04 toumorokoshi

This issue also appears to affect source files (when they aren't being shadowed by generated files).

e.g. with https://github.com/aspect-build/bazel-lib/blob/c0607f67caeaf479023e5566b2f3ab58db124d7f/docs/write_source_files.md

write_source_files(
    name = "write_output",
    diff_test = True,
    files = {
        "dist": ":ts_prod_in_dist",
        #        ^ dist/ generated file directory (ctx.actions.declare_directory(...))
        # ^ dist/ source file directory
    },
)

Behind the scenes diff_test is given generated and source file versions of dist/, which are passed as runfiles for the test. Bazel silently ignores the collision, just like with generated files from different transitions/configurations.

May 01 '24 01:05 Silic0nS0ldier

Is there a plan to fix this problem? There are multiple queries about this problem in bazel slack as well. https://bazelbuild.slack.com/archives/CA31HN1T3/p1721768894441739

Aug 03 '24 09:08 tsiddharth

Needed to handle this in rules_appimage as well: https://github.com/lalten/rules_appimage/pull/179

Aug 04 '24 12:08 lalten

Quick minimal demo of the issue:

foo/defs.bzl:

def _write_conflicting_runfile(ctx, content):
    o = ctx.actions.declare_file("conflict")
    ctx.actions.write(o, content)
    return DefaultInfo(runfiles = ctx.runfiles(files = [o]))

def _compilation_mode_transition(mode):
    return transition(
        implementation = lambda _, __: {"//command_line_option:compilation_mode": mode},
        inputs = [],
        outputs = ["//command_line_option:compilation_mode"],
    )

a = rule(
    implementation = lambda ctx: _write_conflicting_runfile(ctx, "a"),
    cfg = _compilation_mode_transition("fastbuild"),
)

b = rule(
    implementation = lambda ctx: _write_conflicting_runfile(ctx, "b"),
    cfg = _compilation_mode_transition("opt"),
)

foo/BUILD:

load(":defs.bzl", "a", "b")

a(name = "a")

b(name = "b")

sh_binary(
    name = "bin",
    srcs = ["bin.sh"],
    data = [
        ":a",
        ":b",
    ],
)

foo/bin.sh:

echo The content of the conflicting runfile is: `cat $0.runfiles/_main/foo/conflict`

$ bazel run //foo:bin
The content of the conflicting runfile is: b
$ cat bazel-bin/foo/bin.runfiles/MANIFEST | grep conflict | wc -l
1

Only one version of the conflicting file makes its way to runfiles (last one wins).

The fundamental issue is that unlike bazel-out, paths under a .runfiles tree have no "mnemonic" (configuration-specific path uniquifier). I view this as a configurability issue with no obvious solution, so will triage it as such.

Aug 06 '24 15:08 justinhorvitz

Runfile collisions across transitions

Description of the bug:

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Which operating system are you running Bazel on?

What is the output of bazel info release?

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

What is the output of `bazel info release`?

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?