sketch possible interface to implement data manipulation

Open shukryzablah opened this issue 5 years ago • 3 comments

Hi,

I would like to propose a possible interface to implement data manipulation. My goal is to start a conversation, discuss ideas, and take steps toward improving how we work with data in Lisp.

The interface is modeled on the successful dplyr package (https://dplyr.tidyverse.org/), although its verbs originally come from SQL.

The main interface comprises 6 functions: the five verbs listed below plus group_by(). From the dplyr documentation:

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

    mutate() adds new variables that are functions of existing variables
    select() picks variables based on their names.
    filter() picks cases based on their values.
    summarise() reduces multiple values down to a single summary.
    arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation “by group”. 

The package also contains verbs for joining tables.

Notice the initial phrase "grammar of data manipulation". This package can be the DSL of data manipulation, leaving plotting or reading files into tables for other packages in the future.
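To make the discussion concrete, here is a minimal sketch of what the five verbs could look like in plain Common Lisp. Everything in it is an assumption made for illustration: the data frame is represented as a list of plists, and the names `df-select`, `df-filter`, `df-mutate`, `df-arrange`, and `df-summarise` are hypothetical, not part of any existing package.

```lisp
;; Hypothetical representation: a data frame is a list of rows,
;; and each row is a plist mapping column keywords to values.
(defparameter *df*
  '((:name "a" :x 1 :y 10)
    (:name "b" :x 5 :y 2)
    (:name "c" :x 3 :y 4)))

(defun df-select (df &rest columns)
  "Keep only COLUMNS in every row."
  (mapcar (lambda (row)
            (loop for col in columns
                  append (list col (getf row col))))
          df))

(defun df-filter (df predicate)
  "Keep the rows for which PREDICATE returns true."
  (remove-if-not predicate df))

(defun df-mutate (df column fn)
  "Add COLUMN to every row, computed by calling FN on the row.
Assumes COLUMN does not already exist in the rows."
  (mapcar (lambda (row)
            (append row (list column (funcall fn row))))
          df))

(defun df-arrange (df column)
  "Return the rows sorted by numeric COLUMN in ascending order."
  (sort (copy-list df) #'< :key (lambda (row) (getf row column))))

(defun df-summarise (df column fn)
  "Reduce all values of COLUMN to a single value with the binary function FN."
  (reduce fn (mapcar (lambda (row) (getf row column)) df)))
```

For example, `(df-filter *df* (lambda (row) (< (getf row :x) (getf row :y))))` keeps the rows named "a" and "c", and `(df-summarise *df* :x #'+)` sums the `:x` column. A real implementation would probably store columns rather than rows, but the verbs would keep the same shape.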

What are your initial thoughts?

shukryzablah, Jun 27 '20 06:06

Also, it would be nice to implement a batch mode for operations like in dplyr:

(process df
    (filter (< col1 col2))
    (mutate (col4 (/ col2 col3)))
    (group-by col4))

With this form we would be able to apply some optimizations, for example avoiding the creation of intermediate data frames.
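As a rough sketch of the syntax layer only, `process` could be written as a threading macro that inserts the running result as the first argument of each step. The helper functions `keep` and `transform` are hypothetical stand-ins for real verbs, and a plain list stands in for a data frame.

```lisp
(defmacro process (df &body steps)
  "Thread DF through STEPS, inserting the running result
as the first argument of each step."
  (reduce (lambda (acc step)
            (destructuring-bind (op &rest args) step
              `(,op ,acc ,@args)))
          steps
          :initial-value df))

;; Hypothetical stand-ins for real verbs, operating on a plain list.
(defun keep (rows predicate)
  "Keep the elements of ROWS that satisfy PREDICATE."
  (remove-if-not predicate rows))

(defun transform (rows fn)
  "Apply FN to every element of ROWS."
  (mapcar fn rows))

;; Expands to (transform (keep '(1 2 3 4 5 6) #'evenp) (lambda (x) (* x x)))
(defparameter *result*
  (process '(1 2 3 4 5 6)
    (keep #'evenp)
    (transform (lambda (x) (* x x)))))
```

This version still builds every intermediate result. The point is that the macro sees all steps at once at expansion time, so it is the natural place to later fuse steps and skip intermediate data frames.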

svetlyak40wt, Jun 28 '20 14:06

Yes, a piping operation on top of all of this would be great.

A big part of this interface is that it relies heavily on vectorized operations:

(mutate
 (filter *df* (and (< col1 col2)
                   (string-equal col1 "foo")))
 ((col4 (* col1 col2))))

Vectorized operations make data manipulation easier to read and write. I will read the cookbook (https://lispcookbook.github.io/cl-cookbook/arrays.html) to familiarize myself with any prior work that might exist.
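One way to picture vectorized operations, assuming each column is stored as a simple vector: an element-wise map produces either a new column or a boolean mask, and the mask then selects rows. The names `vmap` and `vmask` are hypothetical and only illustrate the idea.

```lisp
(defun vmap (fn &rest columns)
  "Apply FN element-wise across COLUMNS and return a new vector."
  (apply #'map 'vector fn columns))

(defun vmask (mask column)
  "Keep the elements of COLUMN whose corresponding MASK entry is true."
  (coerce (loop for flag across mask
                for value across column
                when flag collect value)
          'vector))

(defparameter *col1* #(1 5 3))
(defparameter *col2* #(10 2 4))

;; Element-wise comparison yields a boolean mask: col1 < col2.
(defparameter *mask* (vmap #'< *col1* *col2*))

;; Element-wise product yields a new column: col1 * col2.
(defparameter *col4* (vmap #'* *col1* *col2*))

;; Values of the new column restricted to rows where col1 < col2.
(defparameter *filtered* (vmask *mask* *col4*))
```

A DSL such as the `filter` and `mutate` forms above could rewrite bare column names like `col1` into calls of this kind, so the user never writes an explicit loop.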

shukryzablah, Jun 28 '20 21:06

Also, I like to keep in mind this comparison of R's dplyr and pandas (https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html#quick-reference). It might be useful if you have prior experience with pandas.

shukryzablah, Jun 28 '20 21:06