featurebase Add support for SUM() to GroupBy query

Description

This request is probably best explained by Method 1 of #1605. In short, it might be useful for a GroupBy query to return the SUM() of relevant int columns (currently it just returns the COUNT() of columns).

Success criteria (What criteria will consider this ticket closeable?)

A query like that in #1605:

curl http://localhost:10101/index/records/query -d 'GroupBy(Rows(field=user), Rows(field=campaign), filter=Row(20180801 <= day <= 20180820))'

could include something like

Sum(field=impressions)

Jan 14 '19 19:01 travisturner

@travisturner When I read this ticket at first, it sounded like SUM() needed to able to go in place of Rows(). I got part way through that implementation but it was starting to not make sense.

However, looking back at the original issue again, it seems like we need to have a separate having argument in GroupBy() that executes the condition for every row.

Something like:

GroupBy(
    Rows(field=user),
    Rows(field=campaign),
    filter=Row(20180801 <= day <= 20180820),
    having=Sum(field=impressions) >= 10
)

Is that correct or am I still off base?

Feb 01 '19 23:02 benbjohnson

Sorry @benbjohnson I realize now this wasn't very clear. I think you're right about how a having field might work (and I realize that the original ticket described that scenario in SQL), but my thinking here was to first add support to GroupBy allowing it to return something other than count. In this case, that would be sum.

For example, when I run this:

curl http://localhost:10101/index/records/query -d 'GroupBy(Rows(field=user), Rows(field=campaign))'

I get a result like this:

{
  "results": [
    [
      {
        "group": [
          {
            "field": "user",
            "rowID": 1
          },
          {
            "field": "campaign",
            "rowID": 1
          }
        ],
        "count": 1
      },
      {
        "group": [
          {
            "field": "user",
            "rowID": 2
          },
          {
            "field": "campaign",
            "rowID": 1
          }
        ],
        "count": 2
      },
      {
        "group": [
          {
            "field": "user",
            "rowID": 3
          },
          {
            "field": "campaign",
            "rowID": 1
          }
        ],
        "count": 2
      }
    ]
  ]
}

But I think the goal is to be able to get a result like this (note the sum instead of count):

{
  "results": [
    [
      {
        "group": [
          {
            "field": "user",
            "rowID": 1
          },
          {
            "field": "campaign",
            "rowID": 1
          }
        ],
        "sum": 1
      },
      {
        "group": [
          {
            "field": "user",
            "rowID": 2
          },
          {
            "field": "campaign",
            "rowID": 1
          }
        ],
        "sum": 11
      },
      {
        "group": [
          {
            "field": "user",
            "rowID": 3
          },
          {
            "field": "campaign",
            "rowID": 1
          }
        ],
        "sum": 11
      }
    ]
  ]
}

Currently, GroupBy is effectively doing a SELECT COUNT(*)..., but in order to get the result with sums, it needs to do the equivalient of SELECT SUM(impressions)....

I supposed that's going to require a way to specify the aggregation (count, sum, average, etc). Do you think it makes sense to add an argument to GroupBy() to specify the aggregation function? So something like:

GroupBy(
    Rows(field=user),
    Rows(field=campaign),
    filter=Row(20180801 <= day <= 20180820),
    aggregation=Count()
)


GroupBy(
    Rows(field=user),
    Rows(field=campaign),
    filter=Row(20180801 <= day <= 20180820),
    aggregation=Sum(field=impressions)
)

I think this is basically what you were suggesting with your having example, but I was thinking about making the computed aggregate part of the result (and at that point it could be used as a having filter as well).

Feb 02 '19 04:02 travisturner

I think I basically agree with Travis, but I'll restate.

Right now, group by is implicitly doing a count(*), but we would like a way to specify an arbitrary aggregation query on any field. From an implementation standpoint, this amounts to doing that aggregation query on the field for each group with that group's bitmap (including any top level filter) passed as the ROW_CALL.

HAVING would be a slightly different issue. Right now we have basically hard coded HAVING count > 0. This gets a bit trickier when you go beyond that because it can no longer be processed on a per-shard basis. We'd need to compute whether something belongs in the result after aggregating all shards. (unless it's something like HAVING count < 10, in which case you could exclude results as soon as you have >= 10, but I digress).

Interestingly, I don't think there's any reason you can't have multiple aggregations, or a boolean combination of HAVING clauses. This gets into needing to support arbitrary result types and might be cumbersome to specify without major PQL changes.

So... bottom line, I think baseline functionality is to support an aggregation on one field and a single HAVING condition. Let's keep the more general functionality in mind during implementation and have some idea of what it would take to support that when we're done.

Feb 04 '19 19:02 jaffee