pytask BUG: using `pytask.task()(func_defined_in_other

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pytask.
[ ] (optional) I have confirmed this bug exists on the main branch of pytask.

Code Sample, a copy-pastable example

# lib.py
def get_name():
    return "jane"

# task_bug.py
from pathlib import Path

import pytask

from .lib import get_name

task_get_name = pytask.task(produces=Path("name.txt"))(get_name)

Problem description

It appears that inpytask.task(), the created task is stored in COLLECTED_TASKS, but is stored under COLLECTED_TASKS["lib.py"] (the place where the function was defined), not under COLLECTED_TASKS["task_bug.py"] (where the task was created). Then, when we actually go looking through COLLECTED_TASKS for tasks to execute, we miss it because we only look under "task_bug.py"

Expected Output

I think the created task should be stored under COLLECTED_TASKS["task_bug.py"] where the task was created.

Dec 05 '23 21:12 NickCrews

Ehh, actually the solution of using the module where pytask.task() was called will fall apart if I have some make_task() wrapper in my_tasks.py, since then it will always be "my_tasks.py". Why do we even need to store tasks in a dict? Is that only needed for automatic collection, but when using the programmatic interface should it work a different way? Can/should we add a param to task() so you can override the module where the task is supposed to live?

Dec 06 '23 05:12 NickCrews

As a workaround, does wrapping it in lambda work? Like lambda **x: get_name(**x).

Dec 06 '23 20:12 tobiasraabe

yes that does work, but long term it would be nice if that wasn't required.

Dec 07 '23 00:12 NickCrews

In #521 I added an error message that tells the user about tasks that have not been collected and that the tasks need to be wrapped with a lambda. So, you would have at least received an error in your case.

Why do we even need to store tasks in a dict?

It is to prevent garbage-collecting tasks defined in a loop that will be overwritten in the next iteration. The dictionary could be replaced by anything else that keeps the objects alive.

Can/should we add a param to task() so you can override the module where the task is supposed to live?

This could be an option. The injected path would be used to store the function, and it would be collected with the correct module. But I am unsure how I feel about it.

I am also wondering what the more reasonable assumption is to determine the task module of a function wrapped with @task. Is it the module where the function receives the decorator or where the function is defined? First, seems more natural.

What are the use cases we have?

We wrap a function imported from another module that should be declared in the task module as a task.
What is the wrapper you were talking about? Is it like a task generator that is called from the task module with a task function? And the task generator lives outside the task module and wraps the function with @task?

What if we walk up the call stack when @task is applied to the function until we hit a task module? This would cover both use cases, but it feels even more hacky.

Dec 09 '23 00:12 tobiasraabe

Can we take a step back and question: why does a task need to have a module attribute?

For CLI-discovered tasks (eg a def task_* inside a task_*.py file, found using pytask collect then it makes sense why every task needs a module attribute (so that the live rendering of tasks completing works, and maybe for other reasons?).

But for programmatically-defined tasks (eg created with pytask.task()), do they really need to know which module they were defined in? If not, maybe we can sidestep this whole problem?

Dec 12 '23 23:12 NickCrews

I agree that tasks passed via the programmatic interface pytask.build(tasks=[...]) or marked with the @task decorator should be collected regardless of the location.

The linked PR fixes that and allows setting the module name via the module argument to correct the displayed path.

I still have to see whether it conflicts with code in pytask-paralle. The path is used as a fallback if the function does not yield a path.

Dec 18 '23 00:12 tobiasraabe

Wait wait, if we can I really think we should try to avoid adding this argument to .task(). Why does a task need to have a module attribute? I'm wondering if we can just get rid of that requirement for programmatically defined tasks? I'm probably missing something obvious and there is a real reason they do, but I want to double check. Thanks!

Dec 18 '23 00:12 NickCrews

Hi! Belated happy holidays and a happy new year!

Sorry for skipping right to a change without answering your question. I appreciate it. Let's dig deeper and see whether we find a better solution.

And, sorry for the lengthy response. I am trying to figure out how to best look at the problem.

A bit of background on why there is a path attribute.

There are currently two kinds of tasks, PTaskWithPath not PTask, and only the first kind of tasks have a .path attribute.

PTask is used for tasks defined in Jupyter notebooks or the terminal without a real source file.

PTaskWithPath is the task protocol based on functions found in task modules.

A task decorated with @task can be converted into either of the two, depending on the situation.

The .path attribute is used in the following situations.

To detect whether a task has changed, pytask hashes the module in which a PTaskWithPath is defined. For PTask, pytask hashes the source code of the function found with inspect.getsource.
To create the signature of a task that uniquely identifies it among all other tasks, PTaskWithPath uses the path + function name + identifier when part of a repetition. PTask only uses the name + identifier.
For @task decorated tasks, the path is necessary to collect the task. In #529, I changed it and added a collection step for all tasks decorated with @task even if pytask does not step through their module.
Generally, .path is used when displaying the task in the execution table or elsewhere. Then, parts of the path are shown next to the task name to ease identification.

The amount of path parts shown depends on how many are needed to make the displayed name unique.

Here, your imported function might need some help because the task's path points to the imported module instead of the task module where the task is defined or that triggered the task's creation. Since this might confuse a user, I added patching the location with @task(module=...).

This is purely cosmetic and more helpful for the user, especially in the absence of vscode's support for terminal hyperlinks. Since we have the signature, the names do not need to be unique or long for pytask.
.path is also used in pytask-parallel as a fallback to register task modules with cloudpickle which allows to pickle them and send them to another process. But I wonder if it is essential.

What are possible solutions? 2-4 are more cosmetic, and five would be a bigger change.

Leave it as it is. The module name might be misleading, but everything is still working, and I think this use case will only happen sometimes.
Whenever the task name is set automatically, like ../task_module.py::task_name, and the module name is not prefixed with task_, we remove the path part from the task name. The name becomes just task_example.

It avoids confusion with the module but makes it harder for users to track down the source of the task.

As long as vscode does not support terminal links (https://github.com/microsoft/vscode/issues/176812) navigating to the source of the task without knowing the module is cumbersome.
Walking up the stack until we are in a task module and take that as the path module.
Allow users to patch the module name with @task(module=...). Probably the worst way. People will still be irritated at first; even if they want to solve it, they probably would not read about the module keyword easily.
Do we have to show the module or parts of the path to identify the task?

Again, vscode poses a problem for the short- to mid-run.

On the other hand, we could unify PTask and PTaskWithPath which is a weird differentiation.

If we would remove the .path attribute, what about the five points above?
1. We could fall back to inspect.getsource even for PTaskWithPath. Or, better, we treat the module, if it exists, as a special dependency of the task. 2. Without .path, the signature still needs the module path as a hash component to be unique within bigger projects. We could take it from some attribute for unique dependencies.
2. ignore
3. Use only the task name + identifier.
  
  For pytest, it makes more sense that modules are inherent in how tasks are structured.
  
  For pytask, the task name might capture all relevant information. But, usually, the module or sometimes the first parent directory can carry information as well.
4. ignore

Jan 03 '24 14:01 tobiasraabe

BUG: using `pytask.task()(func_defined_in_other_file)` doesn't work

Code Sample, a copy-pastable example

Problem description

Expected Output