data icon indicating copy to clipboard operation
data copied to clipboard

KeyError when wrapping a dict with a SequenceWrapper

Open philip-bl opened this issue 3 years ago • 6 comments

🐛 Describe the bug

I get the following errors.

from torchdata.datapipes.map import SequenceWrapper

tuple(SequenceWrapper({'a': 100, 'b': 200, 'c': 300, 'd': 400}))  # produces KeyError: 0
tuple(SequenceWrapper({0: "100", 1: "200"}))  # produces KeyError: 2

I expect them to return (100, 200, 300, 400) and ("100", "200"), respectively, instead.

Versions

torchdata 0.4.1

philip-bl avatar Sep 28 '22 11:09 philip-bl

Can I know what is the use case? You can do tuple(SequenceWrapper(...). to_iter_datapipe(indices=[...]))

ejguan avatar Sep 28 '22 13:09 ejguan

A similar thing happens when you do:

dp = SequenceWrapper({'a': 100, 'b': 200, 'c': 300, 'd': 400})
print(list(dp))

> File "\lib\site-packages\torch\utils\data\datapipes\map\utils.py", line 46, in __getitem__
>    return self.sequence[index]
> KeyError: 0

source_dp = IterableWrapper([(i+1, i) for i in range(10)])
map_dp = source_dp.to_map_datapipe()
print(list(map_dp))

> []

Funnily enough


source_dp = IterableWrapper([(i, i) for i in range(10)])
map_dp = source_dp.to_map_datapipe()
print(list(map_dp))
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

As present in the doc works fine. This prints the values instead of the keys.

This also works as expected:

source_dp = SequenceWrapper(range(5,11))
print(tuple(source_dp))
> (5, 6, 7, 8, 9, 10)

The problem is that while converting to a list/tuple, python does the following (at least according to my debugger): First __len__() is called and then __getitem__() with indices from 0 until an error occurs (not until len-1 ... why?). Why the first throws a Key Error while the IndexError thrown by __getitem__() in IterToMapConverter or SequenceWrapper is "ignored" is beyond me.

Edit: this post explains the behavior: https://stackoverflow.com/questions/68244987/how-do-dunder-methods-getitem-and-len-provide-iteration

As mentioned by @ejguan this can be solved by first converting to a IterDataPipe. Alternatively we could implement __iter__() in IterToMapConverter or SequenceWrapper but I am not sure if we want to do this as theses are MapDatapipes.

SvenDS9 avatar Feb 24 '23 17:02 SvenDS9

I personally feel it's not hard to implement. We just need to figure out the expected behavior. Should we treat MapDataPipe as dict? If so, we should return keys when iter is called. However, I would imagine some users building pipeline would expect data rather than keys when iter is called.

ejguan avatar Feb 24 '23 19:02 ejguan

Personally I would expect the data.

SvenDS9 avatar Feb 27 '23 08:02 SvenDS9

Personally I would expect the data.

Yeah, I would think so for the most of users. @SvenDS9 How about this. Let's open a RFC issue to discuss to introduce this new behavior for all MapDataPipe. When iter is called, we always return the result for data if possible. It might gives the possibility for the whole pipeline iterable as I would expect the starting point of the Map pipeline would be SequenceWrapper. If SequenceWrapper can provide result via __iter__, the subsequent MapDataPipes should be able to rely on for d in self.datapipe pattern to do iteration.

And, I personally thinks this behavior should always be warned because users should do mapdatapipe.to_iterdatapipe() to achieve iteration. MapDataPipe is kind not designed for iteration, it's for random accessing.

ejguan avatar Feb 27 '23 14:02 ejguan

The main concern in my mind is Cython expects integer indices as input for iterator however we are changing this behavior like accepting any kind of indices.

https://github.com/python/cpython/blob/main/Objects/iterobject.c#L46-L77

ejguan avatar Feb 27 '23 14:02 ejguan