[FEA] It is hard to get the schema when you use several workflows
🚀 Feature request
It would be great if we could better keep track of the schema, cardinality, and other statistics. Is there a better way to deal with columns than through the schema?
Motivation
I am a bit frustrated trying to feed the schema into Transformers4Rec. I am trying to adapt the code from the Coveo RecSys competition to Transformers4Rec.
Since it is a more complex example than the ones updated to NVTabular 0.7.x and using T4R in the examples, I am a bit lost trying to create/feed the schema for use in T4R.
The competition example uses several workflows and then keeps making changes in pandas, so keeping track of the schema is really hard. Would it be possible to get the dataset statistics (cardinality, max, min, categorical or continuous) automatically, without having to extract them from the workflows/nvt.Dataset()? That way we could keep making changes to the dataset without being limited to extracting the schema from there.
Maybe there is actually a proper way of getting the schema, but I have been a bit lost.
Your contribution
I am open to discussing and thinking through a way to automatically get the schema.
From one of the T4R notebooks: "Although in this tutorial we are defining the Schema manually, the next NVTabular release is going to generate the schema with appropriate types and tags automatically from the preprocessing workflow, allowing the user to set additional feature tags if needed."
So I think it will be automated in the future.
I have seen that comment, but I was not sure which version it referred to, because in v0.7.x, I think, you can save the schema by doing:
workflow.transform(dataset).to_parquet("./schema", num_partitions=1), or something similar, which saves the schema for you, but sometimes I was getting a schema with min, max and other stats empty, depending on whether it was the first workflow or a workflow that had some previous preprocessing steps done with cudf.
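For reference, this is roughly what I was doing (a minimal sketch with made-up column names and file paths, assuming NVTabular 0.7.x):

import nvtabular as nvt
from nvtabular import ops

# Made-up columns, just for illustration
cats = ["item_id", "category"] >> ops.Categorify()
conts = ["price"] >> ops.Normalize()

workflow = nvt.Workflow(cats + conts)
dataset = nvt.Dataset("sessions.parquet")

workflow.fit(dataset)
# Writing the transformed dataset saved a schema for me alongside the parquet output,
# but sometimes with empty min/max and other stats, as described above
workflow.transform(dataset).to_parquet("./schema")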
I wanted to emphasize the importance of getting the schema automatically: if we want to put this framework into production, we cannot be hard-coding a schema that is a prerequisite for training models.
Hopefully it will be automated in the near future, as I want to develop a production app with T4R.
@gaceladri yes, you will be able to generate the Schema file from an NVTabular workflow. There is an open PR for that, which will be merged soon. We will also create an example showing how to use that feature. Hope that helps.
Perfect. My concern was trying to reproduce a more complicated example than the ones that are up to date (v0.7.1) in the examples. I was trying to reproduce the example in the competitions repo and I was running into some trouble. The issue is that, as you can see in cell [51], after transforming the data with the workflow, @gabrielspmoreira merged a cudf dataframe with the workflow output. So my concern is how, after that merge, we can get the schema from that dataframe. Is this solved by this PR, @rnyak? Can you think of a way to get the schema from there? I was not able to get it on 0.7.1. Trying to get the schema from cell [42], which is a dataframe coming from a workflow, gave me a schema with empty statistics.
@gaceladri for now, whatever you do outside of an NVTabular workflow will not generate any new/updated schema. That means that after that merge, for example, we are not able to get the schema from that dataframe unless you do the operations with NVTabular. You could try doing the merge with NVTabular and testing the new schema file, maybe? As I wrote above, there is an open PR to generate a proper schema; it is not merged yet, but it will be soon.
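For example, doing the merge inside the workflow with the JoinExternal op might look something like this (an untested sketch; file and column names are made up):

import cudf
import nvtabular as nvt
from nvtabular import ops

# Hypothetical external table that would otherwise be merged with cudf/pandas afterwards
item_features = cudf.read_parquet("item_features.parquet")

# Join inside the workflow so the joined columns stay part of the workflow output
joined = ["session_id", "item_id"] >> ops.JoinExternal(item_features, on="item_id")
categorified = joined >> ops.Categorify()

workflow = nvt.Workflow(categorified)
train = nvt.Dataset("interactions.parquet")
workflow.fit(train)
workflow.transform(train).to_parquet("./processed")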
Hi @gaceladri. As @rnyak mentioned, NVTabular will soon be able to generate a schema compatible with Transformers4Rec. But the schema will correspond exactly to the columns output by NVTabular in the parquet files. That means that no further transformation outside the NVTabular preprocessing workflow will be considered.
You can create your schema manually for Transformers4Rec, so that it matches the columns of your final parquet files. You have two options for that:
- Create a protobuf text file like this in a text editor and load it with Schema().from_json(path)
- Instantiate a Schema object and define the schema in code, like this example:
s = schema.Schema(
[
schema.ColumnSchema.create_continuous("con_1"),
schema.ColumnSchema.create_continuous("con_2_int", is_float=False),
schema.ColumnSchema.create_categorical("cat_1", 1000),
schema.ColumnSchema.create_categorical(
"cat_2", 100, value_count=schema.ValueCount(1, 20)
),
]
)
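Once you have that Schema object, you can pass it straight to the input module, for example (a sketch; the parameter values here are arbitrary):

import transformers4rec.torch as tr

# Build the T4R input module from the manually defined schema above.
# max_sequence_length and aggregation are arbitrary example values.
inputs = tr.TabularSequenceFeatures.from_schema(
    s,
    max_sequence_length=20,
    aggregation="concat",
)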
Regarding the task of porting our solution for the Coveo Data Challenge to the new API, I have to say that the solution code used an old internal version of the library, before our refactoring to create the released PyTorch API. Furthermore, there are some missing building blocks in the current API that might prevent you from easily building the exact same architecture used for the competition, for example:
- Having multiple input sequences (user click features and user search event features)
- Using pre-trained embeddings (e.g. the image and text embeddings provided in the dataset) as input features
- Having a post-fusion context added to the Transformer output
But maybe you can create some custom building blocks for those.
In the future, we plan to work on an example for the PyTorch API based on the Coveo solution, including its missing building blocks.
Although you could try implementing the missing building blocks yourself, if you are just doing an exercise of reimplementing that architecture with the Transformers4Rec PyTorch API, I would recommend focusing on the interaction sequence tower (ignoring the search events tower, which is secondary). The paper reproducibility script is a good example of how to use the low-level building blocks of the PyTorch API to build custom architectures similar to the one in our Coveo dataset solution.
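As a rough sketch of what the interaction sequence tower could look like with those low-level blocks (hyperparameters are arbitrary, and it assumes your schema has a column tagged as the item id for the next-item prediction task):

import transformers4rec.torch as tr

# schema: the Schema for the interaction sequence features
# (generated by NVTabular or defined manually as above)
inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    d_output=64,
    masking="causal",  # requires an item-id-tagged column in the schema
)

transformer_config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
)

body = tr.SequentialBlock(
    inputs,
    tr.TransformerBlock(transformer_config, masking=inputs.masking),
)

model = tr.Model(
    tr.Head(body, tr.NextItemPredictionTask(weight_tying=True), inputs=inputs)
)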
Thanks for your answers @gabrielspmoreira & @rnyak. Looking forward to seeing that PR, and to finding a solution for the case where you have different source dataframes and merging them into one would exhaust memory.