fondant Validate `consumes` and infer `produces` for Lightweight Python components

When the user uses Lightweight Python components (https://github.com/ml6team/fondant/issues/558), we want to derive from the provided Python code all the information we currently get from the component spec.

For the consumes section, we can assume it matches the schema of the dataset the operation is applied to, possibly altered by the consumes argument passed to the apply method.

For the produces section, the user can either provide a schema via the produces argument on the apply method, or we can try to infer it by simulating the transform function. We could do this by generating dummy data based on the consumes schema and applying the transform method to it.

This only makes sense for Transform components since we always expect the user to provide a produces schema for a Read component, and a Write component doesn't produce anything.

Inferring the produces schema by simulation would also validate the consumes schema if the simulation succeeds. A failed simulation doesn't invalidate it though, since there can be multiple reasons for the failure: the consumes schema is incorrect, there's a bug in the component, or there's a bug in the dummy data generation.
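
To make the idea concrete, here is a minimal sketch of the simulation approach using plain pandas. It is not Fondant's API: `DUMMY_VALUES`, `make_dummy_dataframe`, `infer_produces` and the `add_ratio` transform are hypothetical names for illustration, and the schema is assumed to be a simple mapping from column name to pandas dtype string.

```python
from __future__ import annotations

import pandas as pd

# Assumption: a tiny lookup of dummy values per pandas dtype string.
# A real implementation would need to cover every supported type.
DUMMY_VALUES = {
    "int64": [1, 2],
    "float64": [1.0, 2.0],
    "bool": [True, False],
    "object": ["a", "b"],
}


def make_dummy_dataframe(consumes: dict[str, str]) -> pd.DataFrame:
    """Build a small dataframe whose columns match the consumes schema."""
    return pd.DataFrame(
        {
            name: pd.Series(DUMMY_VALUES[dtype], dtype=dtype)
            for name, dtype in consumes.items()
        }
    )


def infer_produces(transform, consumes: dict[str, str]) -> dict[str, str] | None:
    """Run the transform on dummy data and read the schema off the result.

    Returns None when the simulation fails. A failure is inconclusive:
    the consumes schema, the component, or the dummy data could be at fault.
    """
    dummy = make_dummy_dataframe(consumes)
    try:
        result = transform(dummy)
    except Exception:
        return None
    return {name: str(dtype) for name, dtype in result.dtypes.items()}


# Hypothetical transform used only to exercise the sketch.
def add_ratio(dataframe: pd.DataFrame) -> pd.DataFrame:
    return dataframe.assign(ratio=dataframe["width"] / dataframe["height"])


produces = infer_produces(add_ratio, {"width": "int64", "height": "int64"})
print(produces)
```

In this sketch the result contains every output column, including the consumed ones. Whether the inferred produces schema should only list new or changed columns is a design choice left open here.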

Jan 02 '24 15:01 RobbeSneyders

See this gist for a quick PoC to simulate transform components using pandera.

Jan 02 '24 15:01 RobbeSneyders

https://www.coiled.io/blog/dask-dtype-astype

Jan 24 '24 08:01 RobbeSneyders

Happy to hear additional opinions on #806. It implements produces inference for the PandasTransformer components, under the prerequisite that all required dependencies are installed on the local machine.

Jan 30 '24 08:01 mrchtr