eis_toolkit Add standardize and normalize

Mar 02 '24 08:03 nmaarnio

@em-t or @lehtonenp , would you have time to review this? Or perhaps @msmiyels or @nialov if you are not too busy.

Mar 04 '24 07:03 nmaarnio

@em-t or @lehtonenp , would you have time to review this? Or perhaps @msmiyels or @nialov if you are not too busy.

@nmaarnio, I got time to review. So, I'll review this one.

Mar 04 '24 09:03 lehtonenp

Initially we implemented private functions for all tools, but this aspect our style guide has loosened to the point where it is not required anymore. I personally started shifting towards implementing one or more private functions only when I felt they were useful / the function had more contents than just checks and simple call to a library function. Additionally, the gh-pages documentation site generated by mkdocs shows the source code of the functions and having only one function makes it show the full code of a function. This could be considered a pro, but I think few people will use it and directly open the source code in their code editor when they wish to inspect our implementations.

Here I felt they would not add much and was not sure which part(s) to separate. However, I am open to suggestions and not against refactoring this! Do you have a specific suggestion for diving the code?

Mar 04 '24 09:03 nmaarnio

I see that the private functions would not add much value to normalize and standardize for the given reasons. It is a plus that mkdocs reveals the implementation of the algorithms. Let's keep the code as you have implemented it. Approving.

Mar 04 '24 09:03 lehtonenp

@nmaarnio sorry for beeing late on this. We actually have the normalization and standardization in the transformation module. So it looks that at this point, we do have multiple functions for the same thing.

What we do not have there yet is handling of tabular data (and I don´t know the CLI status of those), since this was decided during the review process of the functions.

Mar 04 '24 10:03 msmiyels

@nmaarnio sorry for beeing late on this. We actually have the normalization and standardization in the transformation module. So it looks that at this point, we do have multiple functions for the same thing.

What we do not have there yet is handling of tabular data (and I don´t know the CLI status of those), since this was decided during the review process of the functions.

Could you point out the normalization and standardization functions? I as the reviewer was not aware of these existing. I have mostly worked on raster and vector processing so far.

Mar 04 '24 10:03 lehtonenp

No worries. I was aware that they can be performed with the existing transformations functions, but I thought it a good idea to create public functions that are called normalization and standardization (right now I think they are a bit "hidden" in the transformation functions). But in retrospect I now realize it's a bit silly to re-implement them – one option would have been to create public functions called standardize and normalize that simply call z_score_normalization and min_max_scaling with locked parameter values. EDIT: I noticed now that z_score_normalization does not take parameters but performs same operation as standardization always. I had not encountered this name before, is it commonly used @msmiyels ? In contrast, I have seen standardization widely used as a term referring to this rescaling operation.

There are some differences with the inputs and handling. I myself prefer that operations are defined for both vector and raster data if applicable for the plugin and CLI, but defining the core operation on data (Numpy array or Pandas DF) if it makes sense. Maybe we should give this a little thought now, should we merge this or not and how to proceed.

Mar 04 '24 10:03 nmaarnio

@nmaarnio I think it is quite easy to bring in confusion here, since there can be only very subtle differences between those terms and their purpose.

Generally:

scaling and normalize refer to a transformation of data into a new specific range, commonly [0,1]
standardize refers to the transformation of data based on statistical measures and data will not be forced into a specific range

Explicitly:

normalization is commonly the linear transformation into [0, 1]
standardization is commonly the transformation based on the z-score

So it looks that there are some inconsistencies among the naming of these functions, but all of these are transformations.

What about naming these functions f(x)_transform and provide kind of higher level public functions normalize and standardize which call the respective defaults for the [0, 1] min/max linear and the z-score transformations? You could use the exisiting private functions which only need a np.ndarray and parameters as input. "Tabularity" could be integrated in these new public functions as well.

Alternatively, change the existing public functions to accept tabular data? I know it's repetition, but this was explicitely excluded during the review process of the transformation functions 😵

Another thought that just came in 🤯: if it's only for the Plugin 📺 or even CLI ⚙️, we could just name it "stan...", "norm..." and call the existing functions under the hood.

What do you think?

Mar 04 '24 12:03 msmiyels

Hi @msmiyels , I am revisiting this now. So I agree that a good way to implement for example normalize is to just create a wrapper around min-max and enforce the (0, 1) range. Do you think this brings any additional convenience for users?

I guess accepting tabular data or just vector datasets is another thing. I don't know how many people wish the set of transformation tools to accept vector data input in QGIS plugin, but this would mean changes to all of the transformation functions. So at least for now it would make sense to either create a simple higher-level function for normalize or don't add it all.

Oct 22 '24 09:10 nmaarnio