Thoughts on wide datasets
I have given wide datasets some thought. Here are some suggestions on how to go about it. Related to #2151.
1
One way is to add a new constructor to PandasDataset that takes as input a dataframe of shape (T×M), with T timestamps as index and M columns that can be targets or dynamic features. Static features don't make much sense in this case because everything is shared across all targets (see example below). Creating a PandasDataset might look like:
(The light version)
```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=10, freq="1D")
df = pd.DataFrame(
    np.random.normal(size=(10, 5)), columns=["A", "B", "C", "D", "E"], index=idx
)
PandasDataset.from_wide_dataset(df, targets=["A", "D"], feat_dynamic_real=["B", "C", "E"])
```
- Dynamic features here have to be shared across all time series.
- Static features would also need to be shared in this setting (since a single df provides all the information).
Another option for more flexibility would be to allow for a mapping feat_dynamic_real={"A": ["B", "C"], "D": ["E", "C"]}.
One could also allow for another dataframe/dictionary containing static feature mappings, e.g. feat_static_cat={"A": [0, 1], "D": [1, 0]}.
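To make the mapping idea concrete, here is a sketch of what such per-target mappings could expand to internally, using plain pandas/NumPy. The field names follow GluonTS DataEntry conventions, but the expansion itself is only an illustration, not the actual constructor logic:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=10, freq="1D")
df = pd.DataFrame(np.random.normal(size=(10, 5)),
                  columns=["A", "B", "C", "D", "E"], index=idx)

feat_dynamic_real = {"A": ["B", "C"], "D": ["E", "C"]}  # "C" is shared
feat_static_cat = {"A": [0, 1], "D": [1, 0]}

# One entry per target, with its own dynamic and static features.
entries = [
    {
        "start": df.index[0],
        "target": df[target].to_numpy(),             # shape (T,)
        "feat_dynamic_real": df[cols].to_numpy().T,  # shape (num_features, T)
        "feat_static_cat": np.asarray(feat_static_cat[target]),
    }
    for target, cols in feat_dynamic_real.items()
]
```

Note that a column like "C" can appear in the feature list of several targets without being duplicated in the dataframe itself.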
This brings me to the next point.
2
With the hashmaps we are not losing any flexibility. Even multivariate data can be represented: (The heavier version)
```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=10, freq="1D")
df = pd.DataFrame(
    np.random.normal(size=(10, 6)), columns=["A", "B", "C", "D", "E", "F"], index=idx
)
univariate = PandasDataset.from_wide_dataset(
    df,
    targets=["A", "D"],
    feat_dynamic_real_map={"A": ["B", "C"], "D": ["E", "C"]},
    feat_dynamic_cat={"A": ["F"], "D": ["F"]},
    feat_static_real={"A": [0, 1], "D": [1, 0]},
    feat_static_cat={"A": [0.1], "D": [1.2]},
)
multivariate = PandasDataset.from_wide_dataset(
    df,
    targets=[("A", "D")],
    feat_dynamic_real_map={("A", "D"): ["B", "C"]},
    feat_dynamic_cat={("A", "D"): ["E"]},
    feat_static_real={("A", "D"): [0, 1]},
    feat_static_cat={("A", "D"): [0.1]},
)
```
The constructor would have to do a lot of heavy lifting to bring the dataset into the proper internal representation so that PandasDataset can iterate over it. This might be too much for 'just another constructor'. So, the third proposal is to have a new iterator type that can leverage the structure of wide dataframes without any reshaping of the dataset itself, i.e., producing the DataEntrys by just indexing.
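A minimal sketch of such an iterator type; the class name and entry layout are purely illustrative, not an existing GluonTS class. Each entry is materialized lazily by plain column indexing, and the wide dataframe is never reshaped:

```python
import numpy as np
import pandas as pd

class WideFrameIterator:
    """Yield one DataEntry-like dict per target column, built lazily by
    column indexing; the underlying wide dataframe is never reshaped."""

    def __init__(self, df, targets, feat_dynamic_real=None):
        self.df = df
        self.targets = targets
        self.feat_dynamic_real = feat_dynamic_real or {}

    def __len__(self):
        return len(self.targets)

    def __iter__(self):
        for name in self.targets:
            entry = {
                "start": self.df.index[0],
                "target": self.df[name].to_numpy(),
            }
            cols = self.feat_dynamic_real.get(name)
            if cols:
                entry["feat_dynamic_real"] = self.df[cols].to_numpy().T
            yield entry

idx = pd.date_range("2021-01-01", periods=10, freq="1D")
df = pd.DataFrame(np.random.normal(size=(10, 5)),
                  columns=["A", "B", "C", "D", "E"], index=idx)
ds = WideFrameIterator(df, targets=["A", "D"],
                       feat_dynamic_real={"A": ["B", "C"], "D": ["E", "C"]})
```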
3
A third option is to provide different dataframes for targets and features: basically, to separate the values being modeled (targets) from the covariates. Here is a slightly altered version of #1930. For the WideDataset we split targets and features into different dataframes and provide the appropriate dataframe directly as input (timestamps are the index and each column corresponds to one time series):
```python
import numpy as np
import pandas as pd

T, N = 10, 2
idx = pd.date_range("2021-01-01", periods=T, freq="1D")
target_df = pd.DataFrame(np.random.normal(size=(T, N)), columns=["A", "B"], index=idx)
feat_dynamic_real_1_df = pd.DataFrame(np.random.normal(size=(T, N)), columns=["A", "B"], index=idx)
feat_dynamic_real_2_df = pd.DataFrame(np.random.normal(size=(T, N)), columns=["A", "B"], index=idx)
feat_dynamic_cat_1_df = pd.DataFrame(np.random.normal(size=(T, N)), columns=["A", "B"], index=idx) < 0
feat_static_real = pd.DataFrame(np.random.normal(size=(5, N)), columns=["A", "B"])  # 5 static real features
feat_static_cat = pd.DataFrame(np.random.normal(size=(5, N)), columns=["A", "B"]) < 0  # 5 static cat features
dataset = WidePandasDataset(
    target_df,
    feat_dynamic_real=[feat_dynamic_real_1_df, feat_dynamic_real_2_df],
    feat_dynamic_cat=[feat_dynamic_cat_1_df],
    feat_static_real=feat_static_real,
    feat_static_cat=feat_static_cat,
)
```
To retrieve a DataEntry, we just index all dataframes by, e.g., "A". This leverages the wide data structure and is more time- and memory-efficient than the long format in many cases. Here are examples of target and feature dataframes:
Target
|  | A | B |
|---|---|---|
| 1750-01-01 00:00:00 | -0.21 | NaN |
| 1750-01-01 01:00:00 | -0.33 | 1.94 |
| 1750-01-01 02:00:00 | -0.33 | 2.28 |
Dynamic features (cat and real)
Dynamic features are provided in separate dataframes of the same size as the target dataframe. For multiple features, multiple dataframes are provided.
|  | A | B |
|---|---|---|
| 1750-01-01 00:00:00 | 0.79 | NaN |
| 1750-01-01 01:00:00 | 0.59 | -0.60 |
| 1750-01-01 02:00:00 | 0.39 | -0.91 |
Static features (cat and real)
Static features are also provided in a separate dataframe. For multiple features we have multiple rows in the same dataframe.
|  | A | B |
|---|---|---|
| static_cat_1 | 0 | 1 |
| static_cat_2 | 1 | 1 |
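Assembling the DataEntry for series "A" from such separate dataframes is then plain column indexing, as sketched below. The field names follow GluonTS DataEntry conventions, but the assembly itself is only an illustration of proposal 3, not an existing implementation:

```python
import numpy as np
import pandas as pd

T, N = 10, 2
idx = pd.date_range("2021-01-01", periods=T, freq="1D")
target_df = pd.DataFrame(np.random.normal(size=(T, N)), columns=["A", "B"], index=idx)
feat_dynamic_real_df = pd.DataFrame(np.random.normal(size=(T, N)), columns=["A", "B"], index=idx)
feat_static_real = pd.DataFrame(np.random.normal(size=(5, N)), columns=["A", "B"])  # 5 static features

# Building the entry for series "A" is pure indexing, no filtering/masking.
entry = {
    "start": target_df.index[0],
    "target": target_df["A"].to_numpy(),                            # shape (T,)
    "feat_dynamic_real": feat_dynamic_real_df[["A"]].to_numpy().T,  # shape (1, T)
    "feat_static_real": feat_static_real["A"].to_numpy(),           # shape (5,)
}
```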
@rsnirwan thanks for the detailed discussion of all these options! One note on the side: in general, we may not need to explicitly specify cat vs real to distinguish categorical from numerical features, and could instead rely on the dtype of the columns, see here
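As a sketch of that dtype-based idea (the helper name is made up; this is not how GluonTS actually does it), columns could be routed to cat or real buckets by inspecting their dtype:

```python
import pandas as pd

def split_by_dtype(df):
    """Split columns into (categorical, numerical) based on their dtype."""
    cat, real = [], []
    for col in df.columns:
        dtype = df[col].dtype
        if (
            isinstance(dtype, pd.CategoricalDtype)
            or dtype == object
            or pd.api.types.is_bool_dtype(dtype)
        ):
            cat.append(col)
        else:
            real.append(col)
    return cat, real

df = pd.DataFrame({
    "temperature": [20.1, 21.3, 19.8],     # float dtype -> real feature
    "holiday": pd.Categorical([0, 1, 0]),  # categorical dtype -> cat feature
})
cat_cols, real_cols = split_by_dtype(df)   # (["holiday"], ["temperature"])
```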
Can we have some motivating examples where we can try out the different approaches? For me, the proposals are a bit hypothetical.
That said, what I probably want to do is something like this:
```python
PandasDataset.from_wide(
    df,
    [
        {
            "target": "A",
            "feat_dynamic_real": ["B", "C"],
            "feat_dynamic_cat": ["F"],
            "feat_static_real": [0, 1],
            "feat_static_cat": [0.1],
        },
        {
            "target": "D",
            "feat_dynamic_real": ["E", "C"],
            "feat_dynamic_cat": ["F"],
            "feat_static_real": [1, 0],
            "feat_static_cat": [1.2],
        },
    ],
)
```
I think that is exactly how I see it. The way you assign two different targets is a bit confusing to me, though. Is this basically saying that you have two targets, each with its own covariates?
> Can we have some motivating examples where we can try out the different approaches? For me, the proposals are a bit hypothetical.
The question seems a bit too general. Do you mean use cases for wide datasets? for shared/non-shared dynamic features? or for splitting dynamic features and targets into different dataframes as in proposal 3? Or all three :) ?
Sorry, what I meant is that the examples in the proposals look a bit too constructed.
I would like to see the proposals applied to some real-world dataset :).
> I think that is exactly how I see it. It is a bit confusing for me the way in which you assign two different targets. Is this basically saying that you have two targets with different covariates each?
Yes, here we get a dataset consisting of two time series, using the columns A and D respectively.
Even though I am using them every day at work, it's not easy to find publicly available time series in wide format (see an artificial example below). So, let me answer the use-case question from another perspective.
Why would a user have data in a wide format?
Wide vs long is a data-alignment problem rather than a use-case-driven one. All use cases for long also apply to wide, and vice versa. When users fetch data from upstream systems, they fetch it according to a data type (or schema). For data scientists using Python this is usually a pd.DataFrame, which serves as an intermediate representation. Depending on the problem statement, one would align the fetched data in long or wide format.
There are benefits to both; it's pretty much problem-dependent. Just to name a couple of differences between long and wide:
- If all time series share the same timestamps, aligning the data in wide format can reduce memory significantly.
- Finding a time series in the wide case is just indexing; in the long case it's filtering/masking.
That being said, by supporting the long format we already have a basis for tabular data in general. So PandasDataset.from_wide would be syntactic sugar. We could provide it, or just let the user convert the data to long format first and then use the PandasDataset.from_long constructor.
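For reference, the wide-to-long conversion is essentially a one-liner with pandas' melt. The column names item_id/target below are just one possible convention, not a fixed GluonTS requirement:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=3, freq="1D")
wide = pd.DataFrame(np.random.normal(size=(3, 2)), columns=["A", "B"], index=idx)

# Wide -> long: one row per (timestamp, item) pair.
long_df = (
    wide.reset_index()
        .melt(id_vars="index", var_name="item_id", value_name="target")
        .rename(columns={"index": "timestamp"})
)
# long_df has columns ["timestamp", "item_id", "target"] and 3 * 2 = 6 rows.
```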
From the maintainers' perspective, I understand that having too many exposed interfaces is not a good idea. If we decide to support a PandasDataset.from_wide constructor, I would go for the hashmap proposal 2 (equivalently, @jaheba's suggestion).
Wide dataset example
Say we want to model stock returns for AMZN and GOOG and provide extra features (Twitter sentiment and holidays).
|  | AMZN | GOOG | Twitter_AMZN | Twitter_GOOG | Twitter_general | holiday |
|---|---|---|---|---|---|---|
| 01-xx-yyyy | -0.01 | -0.02 | 0.1 | 0.2 | 0.0 | 0 |
| 02-xx-yyyy | 0.02 | 0.01 | 0.0 | -0.1 | 0.2 | 0 |
| 03-xx-yyyy | -0.01 | 0.02 | -0.1 | -0.1 | -0.1 | 1 |
```python
example1 = PandasDataset.from_wide_dataset(
    df,
    targets=["AMZN", "GOOG"],
    feat_dynamic_real={
        "AMZN": ["Twitter_AMZN", "Twitter_general"],
        "GOOG": ["Twitter_GOOG", "Twitter_general"],
    },
    feat_dynamic_cat={"AMZN": ["holiday"], "GOOG": ["holiday"]},
    feat_static_cat={"AMZN": [0], "GOOG": [1]},
)
example2 = PandasDataset.from_wide_dataset(
    df,
    [
        {
            "target": "AMZN",
            "feat_dynamic_real": ["Twitter_AMZN", "Twitter_general"],
            "feat_dynamic_cat": ["holiday"],
            "feat_static_cat": [0],
        },
        {
            "target": "GOOG",
            "feat_dynamic_real": ["Twitter_GOOG", "Twitter_general"],
            "feat_dynamic_cat": ["holiday"],
            "feat_static_cat": [1],
        },
    ],
)
```
Really like the look of both.