
Feature: Row Scaling

Open tgruben opened this issue 7 years ago • 3 comments

Description

Explore new methods for improving how Pilosa handles massive rowsets. Resources could be managed more efficiently with smarter fragment management.

Success criteria (What criteria will consider this ticket closeable?)

Demonstrate responsive queries against a multi-billion-row dataset

tgruben avatar Feb 13 '18 21:02 tgruben

Todd: put rows in SQLite BLOB fields...

Seebs: Need to be careful of running into the 64-bit boundary with high-cardinality fields and large shard widths.
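A rough illustration of that 64-bit concern (this paraphrases Pilosa's position-encoding scheme as rowID * shardWidth + columnOffset; the exact implementation details and the shard width used here are assumptions for the sketch):

```python
# Sketch: a fragment addresses each bit at (rowID * shardWidth + columnOffset).
# With enough rows and a wide shard, that product exceeds what an unsigned
# 64-bit integer can hold. SHARD_WIDTH here is a hypothetical value.
SHARD_WIDTH = 2**20  # hypothetical wide shard

# Largest row ID whose positions still fit in a uint64:
max_row_before_overflow = (2**64 - 1) // SHARD_WIDTH
print(max_row_before_overflow)  # 17592186044415 (~1.76e13 rows)
```

So widening shards shrinks the usable row-ID space; the two knobs trade off against each other inside the same 64-bit position space.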

jaffee avatar Apr 15 '19 15:04 jaffee

What is the goal here? Are we trying to restrict the fragment size in such a way that will allow it to fit in memory? Reducing the shard width will allow for more rows; in theory you could reduce shard width to 2^16, or one container, so your fragment would effectively be made up of one container per row... which I guess means the number of rows you could support is the number of containers that fit in memory (and not exceeding 2^47).
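A back-of-envelope version of that bound (the container size and host RAM figure are assumptions, using the worst case where every container is a fully materialized bitmap):

```python
# At a shard width of 2^16, each row contributes exactly one roaring
# container to a fragment, so row count is bounded by how many containers
# fit in memory.
SHARD_WIDTH = 2**16
CONTAINER_BYTES = SHARD_WIDTH // 8  # 8 KiB worst case (bitmap container)
RAM_BYTES = 64 * 2**30              # e.g. a 64 GiB host (assumption)

max_dense_rows = RAM_BYTES // CONTAINER_BYTES
print(max_dense_rows)  # 8388608 fully-dense rows per 64 GiB
```

Sparse rows compress far better than this worst case, which is why the in-memory limit in practice depends heavily on data density.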

With this ticket, are we talking about splitting up the fragment data (i.e. introducing a sharding strategy more granular than a fragment)? Or are we just looking for a different and/or more efficient way to represent a fragment?

travisturner avatar Apr 16 '19 20:04 travisturner

I think the goal of this ticket is for Pilosa to be horizontally scalable in terms of the number of rows in a given field.

Thinking about it a bit more, I don't think we need to do anything drastic for row scaling to support the traditional analytics use case of Pilosa. Very high cardinality fields tend to be extremely sparse which naturally limits the size of fragments.

The container shrinking work has made very sparse data much more viable, and the roaring extensions that we've been kicking around for extra wide containers would go the rest of the way toward making super sparse data about as performant as it can be.

Where row scaling really comes into play, I think, is for the genomics and bioinformatics use cases where rows are really the main entities, and each row represents one "thing" like a genome, rather than each row representing whether billions of things have a particular value.

In the case of rows representing genomes, we might not only want many billions of rows, but they would be quite densely populated—more than 25% of bits set in the representations we've used previously. Even at a shard width of 2^16, a billion rows would be ~8 TB (assuming all bitmap containers). I think it's clear that in this case, we'd want to be able to distribute single fragments across multiple hosts.
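Checking that estimate (assuming, as above, that a bitmap container covering 2^16 bits is stored as a raw 8 KiB array):

```python
# Verify the ~8 TB figure: a billion dense rows at shard width 2^16,
# one bitmap container per row.
SHARD_WIDTH = 2**16
BITMAP_CONTAINER_BYTES = SHARD_WIDTH // 8  # 8 KiB per container

rows = 10**9
total_bytes = rows * BITMAP_CONTAINER_BYTES
print(total_bytes / 1e12)  # 8.192 -> roughly 8 TB of fragment data
```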

Because this breaks some of Pilosa's basic assumptions (e.g. that each column has all of its rows/fields colocated), it will probably be a fairly significant change. My gut is that we should defer work on this ticket until we have a strong motivating use case.

jaffee avatar Apr 17 '19 13:04 jaffee