suggestions/improvements to the split-apply-combine pipeline
A thread to discuss the new experimental pipeline.
My couple of minor suggestions:
- Why use the name "pipe" when there is the more standard name "apply"? Also, "pipe" implies that multiple functions are supposed to be applied to the data. Yet it's quite possible that a large share of "pipelines" will contain a single function, as this reduces the amount of time spent copying data when using multiprocessing.
- I find the usage pattern of the `data` argument to be... a bit raw/poorly defined/restrictive? I understand that `data` is trying to solve the patterns where different steps of the pipeline must create and pass extra information besides the chunks themselves. But there are several issues with the current implementation:
  - 2.1) Having an optional argument `data` poses a major block to creating reusable components, as they now have to come in two varieties: one taking chunk as an argument, another taking (chunk, data).
  - 2.2) More importantly, this extra `data` argument does not really solve the issue that, in complicated pipelines, different functions must be custom "fitted" to each other. There is no single `data` that functions can expect and pass downstream. Designing a library that would anticipate what kind of extra data is passed between functions is futile.
  - 2.3) Finally, the only place where `data` is currently used is during balancing, where it stores filtered pixel counts. Correct me if I'm wrong, but in this case it's actually fine to modify chunks, since the downstream functions do not use the original weights! I'd say modifying chunks is great, because it enables combinatorial composition of filtering and computing functions w/o custom interfaces.
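To make point 2.3 concrete, here is a minimal sketch of what I mean by combinatorial composition. Everything in it is hypothetical: the chunk layout (a dict with `n_bins` and a list of `(i, j, count)` pixels), the `compose` helper, and the function names are made up for illustration and are not the current API. The only point is that when every step takes a chunk and filters return a chunk of the same shape, any filter can be chained with any computing function.

```python
from functools import reduce


def compose(*funcs):
    """Chain single-argument functions left to right (hypothetical helper)."""
    return lambda chunk: reduce(lambda acc, f: f(acc), funcs, chunk)


def drop_diagonal(chunk):
    # Filtering step: returns a modified copy of the chunk with the same layout.
    pixels = [p for p in chunk["pixels"] if p[0] != p[1]]
    return {**chunk, "pixels": pixels}


def drop_low_counts(chunk, min_count=2):
    # Another filtering step with the same chunk-in, chunk-out interface.
    pixels = [p for p in chunk["pixels"] if p[2] >= min_count]
    return {**chunk, "pixels": pixels}


def marginal_sums(chunk):
    # Computing step: only looks at whatever pixels are left in the chunk.
    sums = [0] * chunk["n_bins"]
    for i, j, count in chunk["pixels"]:
        sums[i] += count
        sums[j] += count
    return sums


# Assumed toy chunk: 3 bins, pixels given as (bin1, bin2, count).
chunk = {"n_bins": 3, "pixels": [(0, 0, 5), (0, 1, 2), (1, 2, 1), (0, 2, 4)]}

# Filters and computations combine freely, without custom interfaces.
one_filter = compose(drop_diagonal, marginal_sums)(chunk)
two_filters = compose(drop_diagonal, drop_low_counts, marginal_sums)(chunk)
```

In this sketch the filters return a modified copy instead of mutating in place, but the interface argument is the same either way: no step needs to know about a separate `data` object.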
My proposal:
- drop prepare; if needed, developers themselves can design custom functions that take a chunk and output (chunk, extra_data).
- it's okay to modify chunks, unless I am missing something big here.
- use docs to teach the developers that the functions of their pipelines can generate extra data and pass it downstream.
- rename pipe -> apply
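
A rough sketch of how the proposal could look in user code. All names here are assumptions for illustration only: `apply` is a plain stand-in for the renamed method (no multiprocessing, no real chunking), and the chunk layout and both step functions are hypothetical. The first step returns `(chunk, extra_data)`, and the downstream step is written by the developer to expect exactly that pair, so the library itself does not need to know what the extra data is.

```python
def filter_and_count(chunk, min_count=2):
    # Custom step: modifies the chunk and reports extra data alongside it.
    kept = [p for p in chunk["pixels"] if p[2] >= min_count]
    extra_data = {"n_filtered": len(chunk["pixels"]) - len(kept)}
    return {**chunk, "pixels": kept}, extra_data


def total_with_report(payload):
    # Downstream step "fitted" by the developer to the (chunk, extra_data) pair.
    chunk, extra_data = payload
    total = sum(count for _, _, count in chunk["pixels"])
    return {"total": total, "n_filtered": extra_data["n_filtered"]}


def apply(chunks, func):
    # Hypothetical stand-in for the renamed method: one function per chunk.
    return [func(chunk) for chunk in chunks]


# Assumed toy chunks with pixels given as (bin1, bin2, count).
chunks = [
    {"pixels": [(0, 1, 2), (1, 2, 1)]},
    {"pixels": [(0, 2, 4), (2, 2, 3)]},
]

# A "pipeline" that is really a single function applied to each chunk.
results = apply(chunks, lambda c: total_with_report(filter_and_count(c)))
```

Note that the whole pipeline collapses into one function passed to `apply`, which is the single-function case I expect to be common with multiprocessing.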