featurebase Reduce memory footprint on GroupBy

Description

Just a quick reminder ticket to look at the redundant storage overhead of GroupBy.

Success criteria (What criteria will consider this ticket closeable?)

Jan 09 '19 15:01 tgruben

+1 to this. I think high memory usage is the main thing impeding GroupBy's potential usefulness.

We finished our round of testing GroupBy on Set fields. We were trying to understand its value add compared to querying for counts group by group. It's very fast on small fields: 20k groups return in about 5 minutes, which is about 2.5 times faster than running 20k separate Count() queries. But when we try to run GroupBy over 4M groups or more, memory problems start. GroupBy cannot complete at all, even after we implemented quite granular GroupBy paging logic. It OOMs every time, whereas 4M Count() queries do complete (although it takes quite a long time, ~40 hours).
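For a rough sense of scale, the figures above can be turned into per-second rates. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, the inputs are the approximate numbers quoted in this comment, and the last line assumes GroupBy would scale linearly to 4M groups if memory were not a problem, which has not been measured.

```python
# Approximate figures quoted in the comment; nothing here is a new measurement.
GROUPBY_GROUPS = 20_000        # groups returned by the small GroupBy run
GROUPBY_MINUTES = 5            # "about 5 mins"
SPEEDUP_OVER_COUNT = 2.5       # GroupBy vs. 20k separate Count() queries
LARGE_GROUPS = 4_000_000       # size of the run where GroupBy OOMs
LARGE_COUNT_HOURS = 40         # "~40 hours" for 4M Count() queries


def per_second(items: float, seconds: float) -> float:
    """Return throughput in items per second."""
    return items / seconds


# Implied duration of the 20k Count() run, derived from the 2.5x speedup.
count_minutes_small = GROUPBY_MINUTES * SPEEDUP_OVER_COUNT

groupby_rate = per_second(GROUPBY_GROUPS, GROUPBY_MINUTES * 60)
count_rate_small = per_second(GROUPBY_GROUPS, count_minutes_small * 60)
count_rate_large = per_second(LARGE_GROUPS, LARGE_COUNT_HOURS * 3600)

# ASSUMPTION: linear scaling of GroupBy to 4M groups (hypothetical, untested).
groupby_hours_large = LARGE_GROUPS / groupby_rate / 3600

print(f"GroupBy, 20k groups:  {groupby_rate:.1f} groups/s")
print(f"Count(), 20k queries: {count_rate_small:.1f} queries/s")
print(f"Count(), 4M queries:  {count_rate_large:.1f} queries/s")
print(f"GroupBy, 4M groups if linear: {groupby_hours_large:.1f} hours")
```

Read this way, the Count() rate is roughly the same in the small and large runs (about 27 queries per second), so the 2.5x advantage would only carry over to 4M groups if GroupBy could finish within available memory.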

GroupBy right now seems like a much faster but significantly more memory-intensive alternative to querying group by group. When memory is the main cluster bottleneck and the main driver of cluster size, a more memory-intensive operation becomes a lot less valuable. At least for our use case with a huge dataset...

Apr 04 '19 03:04 dmibor

This ticket was actually talking about the overhead in the format we use for the results of group by queries (I only know that because I was present when it was created). I'm surprised to hear that you got OOMs with granular paging. Do you know where memory was being used inside Pilosa?

Apr 04 '19 12:04 jaffee