flyte icon indicating copy to clipboard operation
flyte copied to clipboard

[BUG] FlyteDirectory not copied to ContainerTask inputs

Open pryce-turner opened this issue 2 years ago • 14 comments

Describe the bug

When using a ContainerTask with a FlyteDirectory in the inputs, no inputs are actually copied to the running container. For example:

bt = ContainerTask(
    name="basic-test",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(indir=FlyteDirectory),
    outputs=kwtypes(),
    image="ghcr.io/flyteorg/rawcontainers-shell:v2",
    command=[
        "ls",
        "-la",
        "/var/inputs",
    ],
)

@task
def get_dir(dirpath: str) -> FlyteDirectory:
    fd = FlyteDirectory(path=dirpath)
    return fd

@workflow
def wf():
    fd = get_dir(dirpath='s3://my-s3-bucket/cv-in')
    bt(indir=fd)

This workflow returns an empty directory at /var/inputs according to the k8s logs despite there definitely being objects at the specified path.

Expected behavior

I would expect the above workflow to list either /var/inputs/indir with objects copied to indir or perhaps copy them directly i.e. /var/inputs/obj1, /var/inputs/obj2 as they are named in the object store.

Additional context to reproduce

This was run on a local sandbox via pyflyte run --remote basic_test.py wf

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

pryce-turner avatar May 01 '23 21:05 pryce-turner

Thank you for opening your first issue here! šŸ› 

welcome[bot] avatar May 01 '23 21:05 welcome[bot]

Confirm copilot has support for FlyteDirectory.

eapolinario avatar May 05 '23 17:05 eapolinario

Multipart blobs (which is what FlyteDirectory objects are mapped to) are not supported in Flyte copilot currently.

eapolinario avatar May 05 '23 21:05 eapolinario

I can confirm that I'm observing this issue. Any timeline by which we can expect this to be fixed ?

gakumar49606 avatar Aug 03 '23 13:08 gakumar49606

@gakumar49606 , this issue has been in the backlog for some time but it's really low priority, so I can't really offer any timeline. That said, this could be a great contribution and I'd be happy to review any PRs!

eapolinario avatar Aug 24 '23 17:08 eapolinario

Is there a way to use List[FlyteFile] as a workaround? For instance, if you passed an input called files of type List[FlyteFile] to the ContainerTask, would those files be handled by copilot?

djbutler avatar Oct 22 '23 15:10 djbutler

this is actually a simple change to use - https://github.com/flyteorg/flyte/blob/48f53f0995c82568fb10ebf3ffead16c10424e99/flytecopilot/data/download.go#L45C58-L45C76. The thing is this will need a list and then start goroutines. Someone could easily contribute.

kumare3 avatar Nov 10 '23 18:11 kumare3

Any updates on this issue? Just encountered this bug in my own workflows

cjidboon94 avatar Mar 18 '24 16:03 cjidboon94

One current workaround I am using, is to make the ContainerTask a python task with an ImageSpec that uses the ContainerTask's image as base_image and use the task-decorated function as an entry point/to call the applications in the container. e.g.

bt = ContainerTask(
    name="basic-test",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(indir=FlyteDirectory),
    outputs=kwtypes(),
    image="ghcr.io/flyteorg/rawcontainers-shell:v2",
    command=[
        "ls",
        "-la",
        "/var/inputs",
    ],
)

becomes (important to install flytekit python package if it's not already in the image, so that the task can run)

image_spec = ImageSpec(
    base_image="ghcr.io/flyteorg/rawcontainers-shell:v2",
    packages=["flytekit"],
)

@task(container_image=image_spec)
def bt(input_dir: FlyteDirectory) -> None:
    input_dir.download()
    cmd = f"ls -la {input_dir}"
    subprocess.check_call(cmd, shell=True)

Wonder if there's any drawbacks to this approach, other than the additional packages/python overhead introduced this way.

cjidboon94 avatar Mar 20 '24 09:03 cjidboon94

#take

Future-Outlier avatar Mar 21 '24 07:03 Future-Outlier

I am going to contribute to this, and please review the local execution implementation.

https://github.com/flyteorg/flytekit/pull/2258

Future-Outlier avatar Mar 21 '24 07:03 Future-Outlier

Doing it now. image

Future-Outlier avatar Jun 03 '24 13:06 Future-Outlier

Is there any update on this? This feature would be really useful

vasra-gh avatar Jun 26 '24 09:06 vasra-gh

When is this feature expected to be implemented? This limitation makes the functionality of container tasks a bit limited and this would improve our workloads a lot

srale avatar Aug 28 '24 14:08 srale

Sorry guys, I have other priorities but contribution is welcomed!

Future-Outlier avatar Aug 29 '24 13:08 Future-Outlier

Hi everyone,
I’m currently working on this issue, and most of the necessary modifications have been completed. Please follow the PR linked above for updates. Thank you!

Thanks to @davidmirror-ops for the mention!

wayner0628 avatar Sep 05 '24 16:09 wayner0628

This is implemented here Big thanks to @wayner0628

davidmirror-ops avatar Dec 02 '24 21:12 davidmirror-ops

Once this is out in the upcoming 1.14 release, please give it a try and share any feedback here

davidmirror-ops avatar Dec 02 '24 22:12 davidmirror-ops

@davidmirror-ops I think the original issue is not resolved by the last linked PR (different problem or mixup in this thread...)

My analysis of the original OPs problem: In these two lines: https://github.com/flyteorg/flytekit/blob/2632ca64c21dd893cf285cdc1da57150f2f08dd7/flytekit/core/container_task.py#L273-L274

        commands, volume_bindings = self._prepare_command_and_volumes(cmd_and_args, **kwargs)
        volume_bindings[output_directory] = {"bind": self._output_data_dir, "mode": "rw"}

of the execute function in the Containertask class, the volume bindings are computed, used for the docker invocation. You can see, that the output binding is added explicitly, however the input binding is based on the user command. I could verify this: If I add the inputs to the cmd ["/script.sh", "{{.inputs.a}}"], then a is mounted into the docker via the construction of volume_bindings. However, if a is just a primitive (e.g. int), then the value is just placed into the command but not binded as input file with the value written to it, like we would do for primitives into the output folder.

If i remove the input from the cmd=["/script.sh"] (no .inputs.a), then even if a is a flytedir, it is not bindmounted into the docker execution.

Suggestion, keep the primitive literal replacement in the command, i guess it can be very useful. But at the same time, construct an input folder in the pod (or local env) (similar to output folder), and fill it with all input values specified by the container task args, e.g. put flytedirs as well as files with serialized primitives in this input folder, and always bindmount this to the docker execution.

For now the simple workaround is, just to add all used inputs into the cmd, even if you ignore those arguments (ensure your docker executable can take additional args and ignore without failing or perhaps using a comment symbol works? (not tested)

cmd = ["/script.sh", "relevant_expected_args", "#", "{{.inputs.a}}", "{{.inputs.b}}", etc.]

demmerichs avatar Mar 26 '25 22:03 demmerichs

"Hello šŸ‘‹, this issue has been inactive for over 90 days. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! šŸ™"

github-actions[bot] avatar Jun 25 '25 00:06 github-actions[bot]