KeyError when wrapping a dict with a SequenceWrapper
🐛 Describe the bug
I get the following errors.
from torchdata.datapipes.map import SequenceWrapper
tuple(SequenceWrapper({'a': 100, 'b': 200, 'c': 300, 'd': 400})) # produces KeyError: 0
tuple(SequenceWrapper({0: "100", 1: "200"})) # produces KeyError: 2
I expect them to return (100, 200, 300, 400) and ("100", "200"), respectively, instead.
Versions
torchdata 0.4.1
Can I know what is the use case?
You can do tuple(SequenceWrapper(...). to_iter_datapipe(indices=[...]))
A similar thing happens when you do:
dp = SequenceWrapper({'a': 100, 'b': 200, 'c': 300, 'd': 400})
print(list(dp))
> File "\lib\site-packages\torch\utils\data\datapipes\map\utils.py", line 46, in __getitem__
> return self.sequence[index]
> KeyError: 0
source_dp = IterableWrapper([(i+1, i) for i in range(10)])
map_dp = source_dp.to_map_datapipe()
print(list(map_dp))
> []
Funnily enough
source_dp = IterableWrapper([(i, i) for i in range(10)])
map_dp = source_dp.to_map_datapipe()
print(list(map_dp))
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
As present in the doc works fine. This prints the values instead of the keys.
This also works as expected:
source_dp = SequenceWrapper(range(5,11))
print(tuple(source_dp))
> (5, 6, 7, 8, 9, 10)
The problem is that while converting to a list/tuple, python does the following (at least according to my debugger): First __len__() is called and then __getitem__() with indices from 0 until an error occurs (not until len-1 ... why?).
Why the first throws a Key Error while the IndexError thrown by __getitem__() in IterToMapConverter or SequenceWrapper is "ignored" is beyond me.
Edit: this post explains the behavior: https://stackoverflow.com/questions/68244987/how-do-dunder-methods-getitem-and-len-provide-iteration
As mentioned by @ejguan this can be solved by first converting to a IterDataPipe.
Alternatively we could implement __iter__() in IterToMapConverter or SequenceWrapper but I am not sure if we want to do this as theses are MapDatapipes.
I personally feel it's not hard to implement. We just need to figure out the expected behavior.
Should we treat MapDataPipe as dict? If so, we should return keys when iter is called.
However, I would imagine some users building pipeline would expect data rather than keys when iter is called.
Personally I would expect the data.
Personally I would expect the data.
Yeah, I would think so for the most of users. @SvenDS9 How about this. Let's open a RFC issue to discuss to introduce this new behavior for all MapDataPipe. When iter is called, we always return the result for data if possible.
It might gives the possibility for the whole pipeline iterable as I would expect the starting point of the Map pipeline would be SequenceWrapper. If SequenceWrapper can provide result via __iter__, the subsequent MapDataPipes should be able to rely on for d in self.datapipe pattern to do iteration.
And, I personally thinks this behavior should always be warned because users should do mapdatapipe.to_iterdatapipe() to achieve iteration. MapDataPipe is kind not designed for iteration, it's for random accessing.
The main concern in my mind is Cython expects integer indices as input for iterator however we are changing this behavior like accepting any kind of indices.
https://github.com/python/cpython/blob/main/Objects/iterobject.c#L46-L77