Using datapipes on only parts of the input
🚀 The feature
I want to apply the functionality of an already existing datapipe to only parts of my input. Below I have listed some possible solutions. I would like to know which is the "best" solution for this problem.
Motivation, pitch
Simple example:
I have a tuple containing URLs and additional information (e.g. a text, id, ...).
I want to use the HttpReader to load images behind the URLs.
Currently the HttpReader takes URLs and yields tuples of URLs and Filestreams.
It does not accept additional information alongside the URLs in the datapipe.
Alternatives
- Use `Unzipper`, then apply the datapipe you need, and then zip the datapipes back together using either `Zipper` or `IterKeyZipper`. Depending on your use case you might also need `Forker`, or to copy a column beforehand, so that your key is present in both datapipes.
- Use `Mapper` with `input_col` and a function that does what I need. If I also want to drop elements (e.g. setting `skip_on_error = true` in `HTTPReader`), I also need to add a filter. This leads to some code redundancy, as the functionality already exists as a datapipe. In addition, the datapipe is tested while my function is not.
- "Add a new datapipe which accepts `(source_data_datapipe, function_datapipe, input_selector, output_merge_fn)`. Each batch `b` of `source_data_datapipe` gets passed into `input_selector(b)` to get a processed input `b'`. Then `b'` is passed into `function_datapipe` for processing to get output `c`. Finally, `output_merge_fn` takes `b` and `c` and combine them into any output." While this works well with my simple example, many datapipes are not compatible with this approach. In addition, the `output_merge_fn` might get quite complicated depending on your use case. "One option may be to restrict it to only work for DataPipes that do not change the cardinality of the data." Credits to @NivekT for coming up with this solution.
- Add an `input_col` parameter to the datapipes where necessary/applicable (a lot of work and maintenance...).
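To make the third alternative more concrete, here is a minimal pure-Python sketch of the idea. It does not use torchdata at all: `apply_to_part`, `pipe_factory` and `fake_reader` are hypothetical names made up for illustration, plain iterables stand in for datapipes, and `fake_reader` only imitates the "(URL, stream)" output shape without doing any network access. It also assumes the wrapped pipe is strictly 1-to-1, as suggested in the quote above.

```python
from itertools import tee
from typing import Any, Callable, Iterable, Iterator


def apply_to_part(
    source: Iterable[Any],
    pipe_factory: Callable[[Iterable[Any]], Iterable[Any]],
    input_selector: Callable[[Any], Any],
    output_merge_fn: Callable[[Any, Any], Any],
) -> Iterator[Any]:
    """Hypothetical helper: run a pipe on a selected part of each item.

    Assumes `pipe_factory` preserves cardinality and order. If the wrapped
    pipe drops or adds elements, the zip below silently misaligns.
    """
    # Duplicate the source so one branch keeps the full item (b)
    # while the other feeds the selected part (b') into the wrapped pipe.
    originals, selected = tee(source)
    processed = pipe_factory(input_selector(item) for item in selected)
    for original, result in zip(originals, processed):
        yield output_merge_fn(original, result)


def fake_reader(urls: Iterable[str]) -> Iterator[tuple]:
    """Stand-in for a reader datapipe: yields (url, payload), no network."""
    for url in urls:
        yield url, f"<stream for {url}>"


rows = [("http://a/img.png", "cat", 1), ("http://b/img.png", "dog", 2)]
merged = list(
    apply_to_part(
        rows,
        fake_reader,
        input_selector=lambda row: row[0],
        # Replace the URL with the payload and keep the extra columns.
        output_merge_fn=lambda row, out: (out[1], *row[1:]),
    )
)
```

Even in this toy version the merge function has to know the layout of both the original row and the pipe's output, which is the complexity concern mentioned above.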
Additional context
No response
My 2 cents: I have been thinking about categorizing all 1-to-1 map DataPipes so that they accept `input_col` and `output_col`. `HTTPReader` is just a special `Mapper` with extra arguments.
An abstract class with an abstract map function and a specific placeholder for those extra arguments would be sufficient.
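A rough sketch of what such an abstract class could look like, in plain Python rather than torchdata. Everything here is hypothetical: `OneToOneMap`, `FakeHttpReader` and the `timeout` argument are invented for illustration, the "placeholder for extra arguments" is modeled as `**extra`, and treating `output_col=-1` as "append" is an assumption borrowed from how I understand `Mapper` to behave.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, Iterator, Optional


class OneToOneMap(ABC):
    """Hypothetical base class for 1-to-1 map pipes with column selection."""

    def __init__(
        self,
        source: Iterable[Any],
        input_col: Optional[int] = None,
        output_col: Optional[int] = None,
        **extra: Any,
    ) -> None:
        self.source = source
        self.input_col = input_col
        self.output_col = output_col
        self.extra = extra  # placeholder for subclass-specific arguments

    @abstractmethod
    def map(self, value: Any) -> Any:
        """Transform a single value; subclasses implement this."""

    def __iter__(self) -> Iterator[Any]:
        for row in self.source:
            if self.input_col is None:
                # No column selected: map the whole item.
                yield self.map(row)
                continue
            result = self.map(row[self.input_col])
            new_row = list(row)
            if self.output_col == -1:
                new_row.append(result)  # assumed convention: -1 appends
            else:
                # Default: overwrite the input column.
                target = self.input_col if self.output_col is None else self.output_col
                new_row[target] = result
            yield tuple(new_row)


class FakeHttpReader(OneToOneMap):
    """Stand-in reader: no network, just shows where extra arguments go."""

    def map(self, url: str) -> str:
        return f"<stream for {url}, timeout={self.extra.get('timeout')}>"


rows = [("http://a/img.png", "cat"), ("http://b/img.png", "dog")]
replaced = list(FakeHttpReader(rows, input_col=0, timeout=5))
appended = list(FakeHttpReader(rows, input_col=0, output_col=-1))
```

With this shape, the additional columns travel through untouched, and each concrete pipe only has to implement `map`.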
Linking #562 as it is relevant for this issue.