
Thoughts on Wide-datasets

Open rsnirwan opened this issue 3 years ago • 8 comments

I have given wide-datasets some thoughts. Here are some suggestions on how to go about it. Related to #2151 .

1

One way is to create a new constructor for PandasDataset that takes as input a df of shape (T, M), with T timestamps as the index and M columns which can be targets or dynamic features. Static features don't make much sense in this case, because everything is shared across all targets (see example below). Creating a PandasDataset might look like this (the light version):

idx = pd.date_range("2021-01-01", periods=10, freq="1D")
df = pd.DataFrame(np.random.normal(size=(10, 5)), index=idx, columns=["A", "B", "C", "D", "E"])
PandasDataset.from_wide_dataset(df, targets=["A", "D"], feat_dynamic_real=["B", "C", "E"])
  • Dynamic features here have to be shared across all time series.
  • Static features would also need to be shared in this framework (if only one df provides all the information).

Another option, for more flexibility, would be to allow a mapping feat_dynamic_real={"A": ["B", "C"], "D": ["E", "C"]}. One could also allow another dataframe/dictionary containing static feature mappings, e.g. feat_static_cat={"A": [0, 1], "D": [1, 0]}. This brings me to the next point.
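As a sanity check on what such a mapping would have to produce, here is a rough sketch; resolve_wide is a hypothetical helper, not existing GluonTS code, and the entry fields (start/target/feat_dynamic_real/item_id) follow the usual GluonTS data-entry convention:

```python
import numpy as np
import pandas as pd

# Hypothetical helper: resolve a wide dataframe plus a per-target feature
# mapping into the per-series dicts that a dataset would iterate over.
def resolve_wide(df, targets, feat_dynamic_real):
    entries = []
    for target in targets:
        feature_cols = feat_dynamic_real.get(target, [])
        entries.append({
            "start": df.index[0],
            "target": df[target].to_numpy(),
            # one row per dynamic feature, shape (num_features, T)
            "feat_dynamic_real": df[feature_cols].to_numpy().T,
            "item_id": target,
        })
    return entries

idx = pd.date_range("2021-01-01", periods=10, freq="1D")
df = pd.DataFrame(np.random.normal(size=(10, 5)), index=idx,
                  columns=["A", "B", "C", "D", "E"])
entries = resolve_wide(df, targets=["A", "D"],
                       feat_dynamic_real={"A": ["B", "C"], "D": ["E", "C"]})
assert entries[0]["feat_dynamic_real"].shape == (2, 10)
```

Note that column "C" is shared by both targets here, so its values are simply copied into both entries.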

2

With the hashmaps we don't lose any flexibility. Even multivariate data can be represented (the heavier version):

idx = pd.date_range("2021-01-01", periods=10, freq="1D")
df = pd.DataFrame(np.random.normal(size=(10, 6)), index=idx, columns=["A", "B", "C", "D", "E", "F"])
univariate = PandasDataset.from_wide_dataset(
    df,
    targets=["A", "D"],
    feat_dynamic_real={"A": ["B", "C"], "D": ["E", "C"]},
    feat_dynamic_cat={"A": ["F"], "D": ["F"]},
    feat_static_real={"A": [0, 1], "D": [1, 0]},
    feat_static_cat={"A": [0.1], "D": [1.2]},
)
multivariate = PandasDataset.from_wide_dataset(
    df,
    targets=[("A", "D")],
    feat_dynamic_real={("A", "D"): ["B", "C"]},
    feat_dynamic_cat={("A", "D"): ["E"]},
    feat_static_real={("A", "D"): [0, 1]},
    feat_static_cat={("A", "D"): [0.1]},
)

The constructor has to do too much heavy lifting to bring the dataset into the proper internal representation so that PandasDataset can iterate through it. This might be too much for 'just another constructor'. So, the third proposal is to have a new iterator type that can leverage the structure of wide dataframes without any reshaping of the dataset itself, i.e., getting the DataEntrys by just indexing.

3

A third option is to provide different dataframes for targets and features. Basically, to separate the values to be modeled (targets) from the covariates. Here is a slightly altered version from #1930 . For the WideDataset we split targets and features into different dataframes and provide the appropriate dataframe directly as input (timestamps are the index and each column corresponds to a time series):

T, N = 10, 2
idx = pd.date_range("2021-01-01", periods=T, freq="1D")
target_df = pd.DataFrame(np.random.normal(size=(T, N)), index=idx, columns=["A", "B"])
feat_dynamic_real_1_df = pd.DataFrame(np.random.normal(size=(T, N)), index=idx, columns=["A", "B"])
feat_dynamic_real_2_df = pd.DataFrame(np.random.normal(size=(T, N)), index=idx, columns=["A", "B"])
feat_dynamic_cat_1_df = pd.DataFrame(np.random.normal(size=(T, N)), index=idx, columns=["A", "B"]) < 0  # boolean stand-in for categories
feat_static_real = pd.DataFrame(np.random.normal(size=(5, N)), columns=["A", "B"])  # 5 static real features
feat_static_cat = pd.DataFrame(np.random.normal(size=(5, N)), columns=["A", "B"]) < 0  # 5 static cat features

dataset = WidePandasDataset(
    target_df,
    feat_dynamic_real=[feat_dynamic_real_1_df, feat_dynamic_real_2_df],
    feat_dynamic_cat=[feat_dynamic_cat_1_df],
    feat_static_real=feat_static_real,
    feat_static_cat=feat_static_cat,
)

To obtain a DataEntry, we just index all dataframes by, e.g., "A". This leverages the wide data structure and is more time- and memory-efficient than the long format in many cases. Here are examples of target and feature dataframes:

Target

                         A     B
1750-01-01 00:00:00  -0.21   NaN
1750-01-01 01:00:00  -0.33  1.94
1750-01-01 02:00:00  -0.33  2.28

Dynamic features (cat and real)

Dynamic features are provided in separate dataframes of the same size as the target dataframe. For multiple features, multiple dataframes are provided.

                         A      B
1750-01-01 00:00:00   0.79    NaN
1750-01-01 01:00:00   0.59  -0.60
1750-01-01 02:00:00   0.39  -0.91

Static features (cat and real)

Static features are also provided in a separate dataframe. For multiple features we have multiple rows in the same dataframe.

              A  B
static_cat_1  0  1
static_cat_2  1  1
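A minimal sketch of what such a WidePandasDataset could look like; this is a hypothetical class, not existing GluonTS code, and it assumes the usual entry fields start/target/item_id:

```python
import numpy as np
import pandas as pd

# Sketch: every dataframe is column-indexed by the item name, so building a
# data entry requires no reshaping of the underlying wide dataframes.
class WidePandasDataset:
    def __init__(self, target_df, feat_dynamic_real=(), feat_static_real=None):
        self.target_df = target_df
        self.feat_dynamic_real = list(feat_dynamic_real)
        self.feat_static_real = feat_static_real

    def __iter__(self):
        for item_id in self.target_df.columns:
            entry = {
                "start": self.target_df.index[0],
                "target": self.target_df[item_id].to_numpy(),
                "item_id": item_id,
            }
            if self.feat_dynamic_real:
                # stack one row per feature dataframe: shape (num_features, T)
                entry["feat_dynamic_real"] = np.stack(
                    [df[item_id].to_numpy() for df in self.feat_dynamic_real]
                )
            if self.feat_static_real is not None:
                entry["feat_static_real"] = self.feat_static_real[item_id].to_numpy()
            yield entry

T, N = 10, 2
idx = pd.date_range("2021-01-01", periods=T, freq="1D")
target_df = pd.DataFrame(np.random.normal(size=(T, N)), index=idx, columns=["A", "B"])
feat_df = pd.DataFrame(np.random.normal(size=(T, N)), index=idx, columns=["A", "B"])
entries = list(WidePandasDataset(target_df, feat_dynamic_real=[feat_df]))
```

Since each entry is produced by plain column lookups, iteration stays cheap even for very wide dataframes.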

rsnirwan avatar Jul 11 '22 21:07 rsnirwan

@rsnirwan thanks for the detailed discussion of all these options! One note on the side: in general, we may not need to explicitly specify cat vs real to distinguish categorical vs numerical features, and can instead rely on the dtype of the columns, see here
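Such dtype-based inference could look like this (a sketch using plain pandas, not the actual GluonTS code):

```python
import pandas as pd

# Infer categorical vs numerical feature columns from dtypes, instead of
# requiring the user to label them explicitly.
df = pd.DataFrame({
    "temperature": [20.1, 21.3, 19.8],
    "holiday": pd.Categorical([0, 1, 0]),
    "weekday": pd.Categorical(["Mon", "Tue", "Wed"]),
})

real_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
cat_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.CategoricalDtype)]
```

Note that a categorical column with numeric categories (like "holiday") is still classified as categorical, since its dtype is not numeric.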

lostella avatar Jul 12 '22 10:07 lostella

Can we have some motivating examples where we can try out the different approaches? For me, the proposals are a bit hypothetical.

That said, what I probably want to do is something like this:

PandasDataset.from_wide(
    df,
    [
        {
            "target": "A",
            "feat_dynamic_real": ["B", "C"],
            "feat_dynamic_cat": ["F"],
            "feat_static_real": [0, 1],
            "feat_static_cat": [0.1],
        },
        {
            "target": "D",
            "feat_dynamic_real": ["E", "C"],
            "feat_dynamic_cat": ["F"],
            "feat_static_real": [1, 0],
            "feat_static_cat": [1.2],
        },
    ],
)

jaheba avatar Jul 12 '22 11:07 jaheba


I think that is exactly how I see it. The way in which you assign two different targets is a bit confusing to me, though. Is this basically saying that you have two targets, each with different covariates?

strakehyr avatar Jul 12 '22 13:07 strakehyr

Can we have some motivating examples where we can try out the different approaches? For me, the proposals are a bit hypothetical.

The question seems a bit too general. Do you mean use cases for wide datasets? For shared/non-shared dynamic features? Or for splitting dynamic features and targets into different dataframes, as in proposal 3? Or all three? :)

rsnirwan avatar Jul 12 '22 14:07 rsnirwan

Sorry, what I meant is that the examples in the proposals look a bit too constructed.

I would like to see the proposal using some real-world dataset :).

jaheba avatar Jul 12 '22 14:07 jaheba

I think that is exactly how I see it. The way in which you assign two different targets is a bit confusing to me, though. Is this basically saying that you have two targets, each with different covariates?

Yes, here we get a dataset consisting of two time-series, using the columns A and D respectively.

jaheba avatar Jul 12 '22 14:07 jaheba

Even though I use them every day at work, it's not easy to find publicly available time series in wide format (see an artificial example below). So let me answer the use-case question from another perspective.

Why would a user have data in a wide format?

Wide vs. long is a question of data alignment rather than of use case. All use cases for long also apply to wide, and vice versa. When users fetch data from upstream systems, they fetch it according to some data type (or schema). For data scientists using Python this is usually a pd.DataFrame, which serves as an intermediate representation. Depending on the problem statement, one would align the fetched data in long or wide format.

There are benefits to both; it's pretty much problem dependent. To name a couple of differences between long and wide:

  • If all time series share the same timestamps, aligning the data in the wide format can reduce memory significantly.
  • Finding a time series in the wide format is just indexing; in the long format it's filtering/masking.

That being said, by supporting the long format we already have a basis for tabular data in general, so PandasDataset.from_wide would be syntactic sugar. We could provide it, or just let the user convert the data to long format first and then use the PandasDataset.from_long constructor.
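For reference, a sketch of that wide-to-long conversion path (the column names timestamp/item_id/target are illustrative, not a fixed schema), which also shows the indexing-vs-masking difference mentioned above:

```python
import pandas as pd

# A tiny wide dataframe: timestamps as index, one column per series.
idx = pd.date_range("2021-01-01", periods=3, freq="1D")
wide = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=idx)

# Wide -> long: one (timestamp, item_id, target) row per observation.
long = (
    wide.rename_axis("timestamp")
        .reset_index()
        .melt(id_vars="timestamp", var_name="item_id", value_name="target")
)

# The two access patterns: plain column indexing vs boolean masking.
wide_series = wide["A"]
long_series = long.loc[long["item_id"] == "A", "target"]
assert (wide_series.to_numpy() == long_series.to_numpy()).all()
```

The long frame holds T×N rows plus the repeated timestamp and item_id columns, which is where the memory overhead relative to wide comes from when all series share the same index.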

From a maintainer's perspective, I understand that having too many exposed interfaces is not a good idea. If we decide to support a PandasDataset.from_wide constructor, I would go for the hashmap proposal 2 (equivalently, @jaheba's suggestion).

Wide dataset example

Say we want to model stock returns for AMZN and GOOG and provide extra features (twitter sentiment and holidays).

             AMZN   GOOG  Twitter_AMZN  Twitter_GOOG  Twitter_general  holiday
01-xx-yyyy  -0.01  -0.02           0.1           0.2              0.0        0
02-xx-yyyy   0.02   0.01           0.0          -0.1              0.2        0
03-xx-yyyy  -0.01   0.02          -0.1          -0.1             -0.1        1

example1 = PandasDataset.from_wide_dataset(
    df, 
    targets=["AMZN", "GOOG"], 
    feat_dynamic_real={
        "AMZN": ["Twitter_AMZN", "Twitter_general"],
        "GOOG": ["Twitter_GOOG", "Twitter_general"],
    },
    feat_dynamic_cat={"AMZN": ["holiday"], "GOOG": ["holiday"]},
    feat_static_cat={"AMZN": [0], "GOOG": [1]},
)
example2 = PandasDataset.from_wide_dataset(
    df,
    [
        {
            "target": "AMZN",
            "feat_dynamic_real": ["Twitter_AMZN", "Twitter_general"],
            "feat_dynamic_cat": ["holiday"],
            "feat_static_cat": [0],
        },
        {
            "target": "GOOG",
            "feat_dynamic_real": ["Twitter_GOOG", "Twitter_general"],
            "feat_dynamic_cat": ["holiday"],
            "feat_static_cat": [1],
        },
    ]
)

rsnirwan avatar Jul 12 '22 18:07 rsnirwan

Really like the look of both.

strakehyr avatar Jul 13 '22 11:07 strakehyr