Chainer/Concater from single datapipe?
The Concater datapipe takes multiple DPs as input. Is there a class that would take a single datapipe of iterables instead? Something like this:
class ConcaterIterable(IterDataPipe):
    def __init__(self, source_datapipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for iterable in self.source_datapipe:
            yield from iterable
Basically:
itertools.chain == Concater
itertools.chain.from_iterable == ConcaterIterable
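To make the analogy concrete, here is a minimal sketch using only the standard library (no DataPipes involved): `chain` takes each source as a separate argument, while `chain.from_iterable` takes one outer iterable that yields the inner ones. The sample lists are made up for illustration.

```python
from itertools import chain

# Concater-style: each source is passed as a separate argument.
separate = list(chain([1, 2], [3], [4, 5]))

# ConcaterIterable-style: one outer iterable that yields the inner iterables.
nested = list(chain.from_iterable([[1, 2], [3], [4, 5]]))

print(separate)  # [1, 2, 3, 4, 5]
print(nested)    # [1, 2, 3, 4, 5]
```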
Maybe a neat way of implementing this would be to keep a single Concater class, which would fall back to the ConcaterIterable behaviour if it's passed only one DP as input?
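A rough sketch of what that fallback could look like, written against plain Python iterables rather than real DataPipes. `ChainOrFlatten` is a hypothetical name used only for illustration; it is not a torchdata class.

```python
from itertools import chain


class ChainOrFlatten:
    """Hypothetical stand-in for a single Concater class with a fallback."""

    def __init__(self, *sources):
        if not sources:
            raise ValueError("expected at least one source")
        self.sources = sources

    def __iter__(self):
        if len(self.sources) == 1:
            # Single source: treat it as an iterable of iterables.
            return chain.from_iterable(self.sources[0])
        # Several sources: concatenate them one after another.
        return chain(*self.sources)


several = list(ChainOrFlatten([1, 2], [3]))
single = list(ChainOrFlatten([[1, 2], [3]]))
```

One caveat with this design: the one-argument case changes meaning, so a single source of plain (non-iterable) items would raise a `TypeError` during iteration instead of being passed through unchanged.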
Details: I need this for my benchmarking on manifold where each file is a big pickle archive of multiple images. My DP builder looks like this:
def make_manifold_dp(root, dataset_size):
    handler = ManifoldPathHandler()
    dp = IoPathFileLister(root=root)
    dp.register_handler(handler)
    dp = dp.shuffle(buffer_size=dataset_size).sharding_filter()
    dp = IoPathFileOpener(dp, mode="rb")
    dp.register_handler(handler)
    dp = PickleLoaderDataPipe(dp)
    dp = ConcaterIterable(dp)  # <-- Needed here!
    return dp
BTW, this is a nit, but has renaming Concater to Chainer been considered, to be a bit more consistent with itertools?
You can try .unbatch() for this; it is not as generic, but it might work in your case.
However, the proper solution would be to add a new DataPipe, and I would rather call it flatten.
Note: we already use flatten for horizontal/column operations. Perhaps we need rows_flatten or some other (much) better name.
Looks like unbatch() works for a datapipe that contains lists, but it doesn't work for a datapipe that contains datapipes, so in https://github.com/pytorch/data/issues/732 I still had to resort to something like ConcaterIterable above.
We will work to introduce the function for your case.
Edit: I think the new flatten operation in this open PR should be able to handle an IterDataPipe of iterables, depending on how we end up implementing it.
Looks like our final solution will be to allow flatmap to be used as a no-op, i.e. without a transforming function.
Meanwhile, you can use:
dp = dp.flatmap(fn=lambda x: x)
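To see why an identity function is enough here, below is a plain-Python sketch of the flatmap idea: apply a function to each item, then yield every element of the result. This `flatmap` generator is a hypothetical stand-in for illustration, not the torchdata implementation, and the archive contents are invented.

```python
def flatmap(source, fn=lambda x: x):
    """Hypothetical plain-Python stand-in for the flatmap idea."""
    for item in source:
        # fn must return an iterable; each of its elements is yielded in turn.
        yield from fn(item)


# Each inner list plays the role of one unpickled archive of images.
archives = [["img0", "img1"], ["img2"], []]

# With the identity function, flatmap simply flattens one level.
flat = list(flatmap(archives))

# With a real function, it maps first and then flattens.
repeated = list(flatmap([1, 2], lambda n: [n] * n))
```

With the identity function this behaves like `itertools.chain.from_iterable`, which is the ConcaterIterable behaviour requested at the top of the thread.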
This can be closed.
Closing this. Feel free to re-open if necessary.