RecTools icon indicating copy to clipboard operation
RecTools copied to clipboard

`get_raw_interactions()` NaN assumptions and edge case

Open DavidLandup0 opened this issue 1 year ago • 3 comments

What happened?

Hello! Great work on the library! I recently found an edge case in the get_raw_interactions() calls, which may or may not be intended, so I'm opening an issue for discussion.

Context

Namely, the get_raw_interactions() method is a couple of wrappers for to the convert_to_external() calls from each identifier, which uses get_from_series_by_index().

get_from_series_by_index() seems to have an explicit assumption that NaNs in the resulting dataframe are caused by the reindex() operation. The operation introduces NaNs when there is a mismatch between item IDs and user IDs:

https://github.com/MobileTeleSystems/RecTools/blob/main/rectools/utils/indexing.py#L123-L127

Edge Case

The reindex() operation isn't the only time NaNs can be introduced, though. For example, if a Dataset has a NaN by itself in its raw form - this exception is raised anyways. For example:

>>> import rectools
>>> import pandas as pd
>>> from rectools.dataset import Dataset

>>> data = {
...     'user_id': [1, 2, 3, None, 5, 1, 2, 3, 4, 5],  # There is a None in the User IDs
...     'item_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
...     'rating': [5, 3, 4, 2, 1, 4, 5, 3, 2, 1]
... }

>>> df = pd.DataFrame(data)
>>> df['weight'] = 1
>>> df['datetime'] = pd.to_datetime('2025-02-25')
>>> dataset = Dataset.construct(df)
>>> dataset.get_raw_interactions()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/dataset.py", line 248, in get_raw_interactions
    return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/interactions.py", line 181, in to_external
    Columns.User: user_id_map.convert_to_external(self.df[Columns.User].values),
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/dataset/identifiers.py", line 221, in convert_to_external
    result = get_from_series_by_index(self.to_external, internal, strict, return_missing)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david.landup/.pyenv/versions/3.11.10/lib/python3.11/site-packages/rectools/utils/indexing.py", line 127, in get_from_series_by_index
    raise KeyError("Some indices do not exist")
KeyError: 'Some indices do not exist'

While get_from_series_by_index() itself has a very useful strict mode that can be turned off, the get_raw_interactions() method doesn't allow passing this argument down to the external export calls.

Workaround

A simple workaround is to simply re-implement the constituent calls and pass strict=False in a utility module or patch the lib. However, it would be much nicer developer UX if we can simply pass it through the call natively.

Proposal

One backwards-compatible way to allow for this edge case would be to allow **kwargs in the get_raw_interactions() method and pass them to the underlying calls.

Alternatively, it would be nice UX to also raise which indices are missing or provide a bit of extra context in the error message to alleviate part of the debugging.

Thoughts?

Expected behavior

No response

Relevant logs and/or screenshots

No response

Operating System

MacOS

Python Version

3.11

RecTools version

0.8.0

DavidLandup0 avatar Feb 25 '25 02:02 DavidLandup0