RFC: Iterator definition language

Open badtuple opened this issue 5 years ago • 13 comments

This RFC proposes an Iterator definition language.

Feedback is requested, and this PR will be updated to reflect discussion and conclusions until agreement is reached that it's ready to be merged. Implementation won't start until the RFC is merged.

You can find the rendered version of the RFC here.

badtuple avatar Apr 18 '20 10:04 badtuple

I really like moving towards pipe and filter. It makes ad hoc queries easier to write imo.

Few questions:

  • Is filter going to be able to take functions?
  • You mentioned bin(). What is the full list of functions to make available?
  • Are you going to allow users to implement custom stream processing functions?

Side note: shouldn't this be 0001?

AndrewScibek avatar Apr 18 '20 22:04 AndrewScibek

Is filter going to be able to take functions?

Depends what you mean, but yes.

The expression inside the parens in a syntax like LOG_NAME | filter( msg.val == "thingy" ) is effectively an anonymous function with msg passed in. If you mean passing in user-defined Lua functions, then yes, we should allow that to support complex behavior.

The thing is, these are "Filters", not functions, which is a subtle difference. I think it'd end up looking more like this, where we allow recursive expressions so arguments can just be other pipelines:

MY_LOG | filter (
    select(.secret) | isFalse
) | select(.name) | toUppercase

Any Filter (user defined or otherwise) can be used as an argument to another Filter.
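To make the "Filters as arguments" idea concrete, here is a rough Rust sketch of the pipeline above. Nothing in it comes from the RFC itself: the Value type, the Filter alias, the chain helper, the msg helper, and the convention that returning None drops a message are all assumptions made for illustration.

```rust
use std::collections::HashMap;

// Hypothetical message value type; the RFC does not define one.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Bool(bool),
    Str(String),
    Map(HashMap<String, Value>),
}

// Assumed shape of a Filter: takes one value, passes one on or drops it.
type Filter = Box<dyn Fn(Value) -> Option<Value>>;

// Run filters left to right, like `a | b | c`; stop as soon as one drops the value.
fn chain(filters: Vec<Filter>) -> Filter {
    Box::new(move |v: Value| filters.iter().try_fold(v, |acc, f| f(acc)))
}

// `select(.field)`: pull a single field out of a map message.
fn select(field: &str) -> Filter {
    let field = field.to_string();
    Box::new(move |v: Value| match v {
        Value::Map(m) => m.get(&field).cloned(),
        _ => None,
    })
}

// `isFalse`: yields true exactly when the incoming value is the boolean false.
fn is_false() -> Filter {
    Box::new(|v: Value| Some(Value::Bool(v == Value::Bool(false))))
}

// `filter(arg)`: the argument is itself a Filter (here, a whole chain).
// The message is kept only if that argument yields true for it.
fn filter(pred: Filter) -> Filter {
    Box::new(move |v: Value| match pred(v.clone()) {
        Some(Value::Bool(true)) => Some(v),
        _ => None,
    })
}

// `toUppercase`: uppercase a string value.
fn to_uppercase() -> Filter {
    Box::new(|v: Value| match v {
        Value::Str(s) => Some(Value::Str(s.to_uppercase())),
        _ => None,
    })
}

// MY_LOG | filter( select(.secret) | isFalse ) | select(.name) | toUppercase
fn example_pipeline() -> Filter {
    chain(vec![
        filter(chain(vec![select("secret"), is_false()])),
        select("name"),
        to_uppercase(),
    ])
}

// Helper for building a test message with the two fields used above.
fn msg(name: &str, secret: bool) -> Value {
    let mut m = HashMap::new();
    m.insert("name".to_string(), Value::Str(name.to_string()));
    m.insert("secret".to_string(), Value::Bool(secret));
    Value::Map(m)
}
```

Calling example_pipeline() and feeding it msg("alice", false) yields the uppercased name, while a message with secret set to true is dropped. Since a chain is just another Filter, passing one as an argument needs no special casing.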

You mentioned bin(). What is the full list of functions to make available?

No idea. Part of this would be figuring out the minimal useful set of functions. I would imagine a grouping filter like bin, a window filter, a select filter to grab needed fields, filter, map, and reduce would cover most use cases? Take (to get batches of n messages), and Average seem like other contenders.

Are you going to allow users to implement custom stream processing functions?

Yes, as Filters. They will be able to define a Lua function and inside it there will be a builtin function to pull the next message and a function to return something as an outgoing message.
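As a sketch of that pull/emit model (in Rust rather than Lua, purely to show the shape): FilterCtx, next_msg, emit, and run are hypothetical names, and messages are plain integers to keep it short. The real builtins may look quite different.

```rust
use std::collections::VecDeque;

// Hypothetical context handed to a user-defined Filter.
struct FilterCtx {
    input: VecDeque<i64>,
    output: Vec<i64>,
}

impl FilterCtx {
    fn new(input: Vec<i64>) -> Self {
        FilterCtx { input: input.into(), output: Vec::new() }
    }

    // Builtin 1 (assumed name): pull the next incoming message, if any.
    fn next_msg(&mut self) -> Option<i64> {
        self.input.pop_front()
    }

    // Builtin 2 (assumed name): hand a value downstream as an outgoing message.
    fn emit(&mut self, msg: i64) {
        self.output.push(msg);
    }
}

// A user-defined Filter: sums messages in pairs, so it emits fewer
// messages than it pulls. A trailing unpaired message is passed through.
fn sum_pairs(ctx: &mut FilterCtx) {
    while let Some(a) = ctx.next_msg() {
        match ctx.next_msg() {
            Some(b) => ctx.emit(a + b),
            None => ctx.emit(a),
        }
    }
}

// Drive one filter over a fixed batch of messages and collect what it emits.
fn run(input: Vec<i64>, filter: fn(&mut FilterCtx)) -> Vec<i64> {
    let mut ctx = FilterCtx::new(input);
    filter(&mut ctx);
    ctx.output
}
```

Because the filter decides when to pull and when to emit, one function shape can cover map (one in, one out), filter (one in, zero or one out), and batching or reducing (many in, one out).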

Side note: shouldn't this be 0001?

Ah, holdover from Rust's RFC process. Rust RFCs can take months for a super complex feature, so they don't assign a number until it's ready to be merged. That way it's an incrementing number by acceptance into the language. We don't have enough RFCs for that to be an issue, but force of habit.

badtuple avatar Apr 18 '20 23:04 badtuple

Is filter going to be able to take functions?

Depends what you mean, but yes.

The expression inside the parens in a syntax like LOG_NAME | filter( msg.val == "thingy" ) is effectively an anonymous function with msg passed in. If you mean passing in user-defined Lua functions, then yes, we should allow that to support complex behavior.

Yeah, I meant user-defined functions, or really any function that is more than one line. I really think it is important to support that.

The thing is, these are "Filters", not functions, which is a subtle difference. I think it'd end up looking more like this, where we allow recursive expressions so arguments can just be other pipelines:

MY_LOG | filter (
    select(.secret) | isFalse
) | select(.name) | toUppercase

Any Filter (user defined or otherwise) can be used as an argument to another Filter.

👍

You mentioned bin(). What is the full list of functions to make available?

No idea. Part of this would be figuring out the minimal useful set of functions. I would imagine a grouping filter like bin, a window filter, a select filter to grab needed fields, filter, map, and reduce would cover most use cases? Take (to get batches of n messages), and Average seem like other contenders.

I feel like the math ones are all important, personally: avg, max, min, stddev, etc.

Are you going to allow users to implement custom stream processing functions?

Yes, as Filters. They will be able to define a Lua function and inside it there will be a builtin function to pull the next message and a function to return something as an outgoing message.

👍

Side note: shouldn't this be 0001?

Ah, holdover from Rust's RFC process. Rust RFCs can take months for a super complex feature, so they don't assign a number until it's ready to be merged. That way it's an incrementing number by acceptance into the language. We don't have enough RFCs for that to be an issue, but force of habit.

I don't have an opinion on that, so I'm good going the Rust way. Was just curious.

AndrewScibek avatar Apr 19 '20 01:04 AndrewScibek

I feel like the math ones are all important, personally: avg, max, min, stddev, etc.

Totally agreed. Nice thing about Filter additions is that they are "just functions" so adding more shouldn't be hard.

badtuple avatar Apr 19 '20 01:04 badtuple

Ok, I think I have a better representation of the language in the form of a full parser. This is the grammar in pest.rs syntax:

WHITESPACE = _{" " | NEWLINE }

Expression = !{FilterChain | Literal}
Identifier = @{ (ASCII_ALPHA | "_" ) ~ (ASCII_ALPHANUMERIC | "_")* }
Filter = { Identifier ~ ( "(" ~ Expression ~ ("," ~ Expression)* ~ ")" )? }
FilterChain = { Filter ~ ("|" ~ Filter)* }

// Literals
NilLiteral = @{ "nil" }
StringLiteral = @{ "\"" ~ (!"\"" ~ ANY)* ~ "\"" }
IntegerLiteral = @{ ("+" | "-")? ~ ASCII_DIGIT+ }
FloatLiteral = @{ ("+" | "-")? ~ ASCII_DIGIT+ ~ "." ~ ASCII_DIGIT+ }
BooleanLiteral = @{ "true" | "false" }
Literal = { NilLiteral | StringLiteral | FloatLiteral | IntegerLiteral | BooleanLiteral }

If you paste that into the editor at the bottom of https://pest.rs you can play around with it and see what the parsed output is for examples.

This is one parsable example:

LOG_NAME | filter1 | filter2 | filter3(
  subFilter1 | subFilter2,
  "string literal arg",
  secondSubFilter1 | secondSubFilter2
) | filter4

All of the identifiers are obviously placeholders but it shows how arguments work. filter3 takes 3 arguments. The first and third args are filter-chains, and the second is a string literal. Arguments are separated by a comma.

If we end up liking this, we can include that pest grammar and have a Rust macro take it and generate a parser for us. Pest parsers are supposed to be really fast, though since we only parse on creation of the iterator and not the use, it's not really a latency sensitive area.
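For anyone who wants to poke at the argument rules without pest, here is a hand-written recursive-descent sketch of a subset of the grammar. It is not generated from the grammar and is not how the real parser would be built. It only handles identifiers, filter chains, and string literals (numeric, nil, and boolean literals are left out), and the Expr, FilterCall, Parser, and parse names are made up for this example.

```rust
// AST mirroring the Filter / FilterChain / StringLiteral rules.
#[derive(Debug, PartialEq)]
enum Expr {
    Chain(Vec<FilterCall>),
    Str(String),
}

#[derive(Debug, PartialEq)]
struct FilterCall {
    name: String,
    args: Vec<Expr>,
}

struct Parser<'a> {
    src: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<u8> {
        self.src.get(self.pos).copied()
    }

    // Advance while the current byte satisfies `pred`; return the consumed text.
    fn take_while(&mut self, pred: fn(u8) -> bool) -> String {
        let start = self.pos;
        while self.peek().map_or(false, pred) {
            self.pos += 1;
        }
        String::from_utf8_lossy(&self.src[start..self.pos]).into_owned()
    }

    // Skip whitespace, then consume `c` if it is the next byte.
    fn eat(&mut self, c: u8) -> bool {
        self.take_while(|b: u8| b.is_ascii_whitespace());
        let hit = self.peek() == Some(c);
        if hit {
            self.pos += 1;
        }
        hit
    }

    // Expression: a string literal or a filter chain.
    fn expression(&mut self) -> Option<Expr> {
        if self.eat(b'"') {
            let text = self.take_while(|b: u8| b != b'"');
            return if self.eat(b'"') { Some(Expr::Str(text)) } else { None };
        }
        self.chain().map(Expr::Chain)
    }

    // Filter: an identifier, optionally followed by comma-separated arguments.
    fn filter(&mut self) -> Option<FilterCall> {
        self.take_while(|b: u8| b.is_ascii_whitespace());
        if !self.peek().map_or(false, |b| b.is_ascii_alphabetic() || b == b'_') {
            return None; // identifiers must start with a letter or underscore
        }
        let name = self.take_while(|b: u8| b.is_ascii_alphanumeric() || b == b'_');
        let mut args = Vec::new();
        if self.eat(b'(') {
            args.push(self.expression()?);
            while self.eat(b',') {
                args.push(self.expression()?);
            }
            if !self.eat(b')') {
                return None;
            }
        }
        Some(FilterCall { name, args })
    }

    // FilterChain: one or more filters separated by `|`.
    fn chain(&mut self) -> Option<Vec<FilterCall>> {
        let mut filters = vec![self.filter()?];
        while self.eat(b'|') {
            filters.push(self.filter()?);
        }
        Some(filters)
    }
}

// Parse a whole input; trailing garbage is treated as a failure.
fn parse(src: &str) -> Option<Expr> {
    let mut p = Parser { src: src.as_bytes(), pos: 0 };
    let expr = p.expression()?;
    p.take_while(|b: u8| b.is_ascii_whitespace());
    if p.pos == p.src.len() { Some(expr) } else { None }
}
```

Running parse on the filter3 example above gives a chain of five filters, where filter3 carries three arguments: a chain, a string, and another chain.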

badtuple avatar Apr 19 '20 06:04 badtuple

Would we still have support for map/reduce with this model? If I have a stream of arbitrarily complex data (i.e. encoded JSON objects coming from my data source) and I want to run multiple chained filters on something inside that data, it would be preferable if I could extract the needed value(s) once instead of in every filter. So like:

LOG_NAME | bin(20) | map(my_extract_val_func(msg.val)) | filter1 | filter2 | filter3

Instead of:

LOG_NAME | bin(20) | filter1(my_extract_val_func(msg.val)) | filter2(my_extract_val_func(msg.val)) | filter3(my_extract_val_func(msg.val))

volgorean avatar Apr 19 '20 07:04 volgorean

Each filter gets the return of the filter before it.

So if I understand your question correctly, you could define your custom filter @my_extract_val_func and use it like this:

LOG_NAME | @my_extract_val_func | filter1 | filter2

The way that'd work is each message would be passed to @my_extract_val_func, which would extract your val and return it. That return value would then be passed to filter1, and the output of that would be passed to filter2.

Since @my_extract_val_func is applied to every message, it's effectively a map. There's also no reason you couldn't keep state within it and reduce the value as well if you wanted to.
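A small Rust sketch of that point, with made-up names (Msg, apply, running_total) and a single integer field standing in for the decoded JSON: a stateless Filter applied to every message behaves like map, and one that keeps state between messages behaves like a running reduce.

```rust
// Hypothetical message with just the field we care about.
struct Msg {
    val: i64,
}

// Apply one Filter to every message in order, as the pipeline would.
fn apply<I, O>(input: Vec<I>, f: impl FnMut(I) -> O) -> Vec<O> {
    input.into_iter().map(f).collect()
}

// Stateless Filter: effectively a map that extracts `val` once.
fn my_extract_val_func(m: Msg) -> i64 {
    m.val
}

// Stateful Filter: keeps a running total across messages, i.e. a reduce
// that reports its accumulator after every message.
fn running_total() -> impl FnMut(i64) -> i64 {
    let mut total = 0;
    move |v| {
        total += v;
        total
    }
}
```

Applying my_extract_val_func and then running_total() to messages with vals 3, 4, 5 gives 3, 7, 12, and every later filter only ever sees the extracted value.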

badtuple avatar Apr 19 '20 07:04 badtuple

Ah ok, so filter is map, not actually filter? Or does an empty/null response not get passed to the next filter?

volgorean avatar Apr 19 '20 07:04 volgorean

My bad. The filter1/filter2 stuff was just a placeholder for the language-level Filter item, not the higher-level function "filter" that we've been using to discuss Iterators previously. The function filter can be a Filter, but none of the placeholders were meant to be that. Sorry, I know that was confusing.

A Filter is just a function that takes a message, either from a log or returned by an earlier Filter.

A more realistic Iterator (using an actual filter Filter) would look something like this:

GAME_EVENTS | filter( @isFromPlayerNumber(1) ) | select("points") | take(5)

That would get all the events from player number 1 (using a user-defined Lua filter), get just the "points" field from each message, and then batch them up in groups of 5. So when you call the iterator you get 5 at a time.
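Roughly the same thing written with plain Rust iterators, just to pin down the semantics. The Event struct and the player_points_batches function are invented for this sketch, and emitting a final partial batch is an assumption; the thread doesn't say what take should do with leftovers.

```rust
// Hypothetical event shape for GAME_EVENTS.
struct Event {
    player: u32,
    points: i64,
}

// GAME_EVENTS | filter( @isFromPlayerNumber(n) ) | select("points") | take(batch)
// `batch` must be non-zero: slice::chunks panics on a chunk size of 0.
fn player_points_batches(events: &[Event], player: u32, batch: usize) -> Vec<Vec<i64>> {
    let points: Vec<i64> = events
        .iter()
        .filter(|e| e.player == player) // filter( @isFromPlayerNumber(n) )
        .map(|e| e.points) // select("points")
        .collect();
    // take(batch): group into batches; the last one may be short (assumption).
    points.chunks(batch).map(|c| c.to_vec()).collect()
}
```

With seven events from player 1 and a batch size of 5, this returns one full batch of 5 followed by a batch of 2.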

badtuple avatar Apr 19 '20 07:04 badtuple

@badtuple status?

AndrewScibek avatar May 03 '20 05:05 AndrewScibek

@AndrewScibek great question. So there aren't any pipeline languages like we were talking about floating around, but I went ahead and wrote https://github.com/badtuple/pipelang . This is a very small interpreter that handles parsing, running pipelines, and allows you to write your own filters. In fact, it doesn't have a stdlib at all, so what's left for us to do here is figure out the minimum filters we need for our initial use case.

I can integrate Pipelang into Remits. But we need to know what Filters to write and then write them. I think something as simple as a Batch, Window, and Filter function would be enough for now?

I can either merge this RFC or we can keep it open to discuss the default Filters.

If there's any hesitation around adopting Pipelang, let me know. Happy to go over what's there...it's a very small codebase. A little more than 300 lines not including tests, ~550 including tests. And absolutely zero dependencies.

badtuple avatar May 03 '20 05:05 badtuple

I looked at the code over in pipelang. It looks good to me. I am not against it.

For what to implement, I am good with batch, window, filter, but I think it would be worthwhile to have a simple math one as well. Like avg, sum, or min/max.

AndrewScibek avatar May 05 '20 03:05 AndrewScibek

Just kinda wrapping this up. I'm definitely incorporating Pipelang into the project; in fact, I'm considering moving the crate into this repo as part of the workspace. Going forward with the following initial filters:

  • [ ] batch
  • [ ] window
  • [ ] filter
  • [ ] avg
  • [ ] sum
  • [ ] min
  • [ ] max

Maybe not the most exciting filters, but it's enough to prove out some solid use cases beyond just a message queue.
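For reference, the math filters are small enough to sketch in a few lines of Rust. This is not Pipelang code: the function names, the choice of f64 messages, returning None for empty input, and treating window as a sliding window are all assumptions.

```rust
// Sum of all values; an empty input sums to 0.0.
fn sum(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

// Arithmetic mean, or None when there is nothing to average.
fn avg(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        None
    } else {
        Some(sum(xs) / xs.len() as f64)
    }
}

// Smallest and largest value, or None for empty input.
fn min(xs: &[f64]) -> Option<f64> {
    xs.iter().copied().reduce(f64::min)
}

fn max(xs: &[f64]) -> Option<f64> {
    xs.iter().copied().reduce(f64::max)
}

// `window(size) | avg`: average over each sliding window of `size` values.
// `size` must be non-zero: slice::windows panics on a window size of 0.
fn window_avg(xs: &[f64], size: usize) -> Vec<f64> {
    xs.windows(size).filter_map(|w| avg(w)).collect()
}
```

Something like window_avg(&[1.0, 2.0, 3.0, 4.0], 2) gives one average per overlapping pair. A tumbling (non-overlapping) window would use chunks instead of windows, which overlaps with what batch does.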

badtuple avatar Dec 11 '21 01:12 badtuple