
Support no forward index for column

Open kkrugler opened this issue 5 years ago • 8 comments

Currently a text column can be created without any forward index, which is useful when using the column only for filtering. In this situation, the raw (original) text data is not needed, only the text index (see https://github.com/apache/incubator-pinot/pull/6284/).

There are other situations for non-text columns where this same functionality is useful to reduce the size of the column. In our particular use case, we're generating unique terms for a (large) string field, which we save as a multi-value STRING column. We need an inverted index for fast filtering, but we do not need the forward index, which (leaving aside the inverted index, which is built at load time) accounts for more than 80% of the total segment size.

@kishoreg suggested "having a empty forward Index reader impl" as a way of implementing this.
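One way to picture that suggestion is a reader that holds no data and rejects every read, so any code path that still expects raw values fails loudly. This is only a sketch: the class and method names below are hypothetical and are not Pinot's actual reader interface.

```python
class ForwardIndexDisabledError(RuntimeError):
    """Raised when a query path tries to read a forward index that was not built."""


class EmptyForwardIndexReader:
    """Hypothetical stand-in reader for a column stored without a forward index."""

    def __init__(self, column_name):
        self.column_name = column_name

    def _fail(self):
        raise ForwardIndexDisabledError(
            f"column '{self.column_name}' has no forward index; "
            "it can only be used for filtering"
        )

    def get_dict_id(self, doc_id):
        # Single-value read: there is nothing to return.
        self._fail()

    def get_dict_ids(self, doc_id):
        # Multi-value read: same behaviour.
        self._fail()

    def close(self):
        # No buffers were opened, so there is nothing to release.
        pass
```

Filtering would go through the inverted index and never touch this reader; only selection or aggregation on the column would hit the error.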

We could possibly handle the configuration of this via a new noFwdIndexColumns table config field, similar to the noDictionaryColumns config setting.

There would be situations where specifying no forward index for a column would trigger a table config error, for example doing this for a metrics column (or so I assume).
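A rough sketch of what such a check might look like, assuming the proposed noFwdIndexColumns field sits next to invertedIndexColumns in the table index config. The field name is only the proposal above, the schema is simplified to a column-to-type map, and both rules (reject metric columns, require an inverted index) are assumptions rather than confirmed Pinot behaviour.

```python
def validate_no_fwd_index_columns(table_index_config, column_types):
    """Return error strings for columns that should not drop their forward index.

    table_index_config: dict with optional 'noFwdIndexColumns' (hypothetical)
                        and 'invertedIndexColumns' lists.
    column_types:       dict mapping column name -> 'DIMENSION' or 'METRIC'.
    """
    errors = []
    inverted = set(table_index_config.get("invertedIndexColumns", []))
    for column in table_index_config.get("noFwdIndexColumns", []):
        column_type = column_types.get(column)
        if column_type is None:
            errors.append(f"{column}: not present in the schema")
        elif column_type == "METRIC":
            # Assumed rule: metric values must stay readable for aggregation.
            errors.append(f"{column}: metric columns need a forward index")
        elif column not in inverted:
            # Assumed rule: without any index the column would be unusable.
            errors.append(f"{column}: no inverted index to filter on")
    return errors
```

For example, listing both a dimension with an inverted index and a metric column would produce a single error, for the metric column.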

I'm also not sure whether it would be valid to have a column that has no index/dictionary/forward index; does this mean ignore the field in the input data?

kkrugler avatar Jan 21 '21 23:01 kkrugler

What's the size of the forward index for the multi-value column? Dictionary IDs in the forward index are bit-encoded. It looks like the column has very high cardinality, and there must be several million rows per segment for the forward index to add a noticeable size overhead.

siddharthteotia avatar Jan 30 '21 17:01 siddharthteotia

Hi @siddharthteotia - yes, one example segment is 2,637,935 rows, and metadata.properties for the column of interest (creativeText_terms) has cardinality of 48,591 (though that's lower than what I was expecting).

column.creativeText_terms.cardinality = 48591
column.creativeText_terms.totalDocs = 2637935
column.creativeText_terms.dataType = STRING
column.creativeText_terms.bitsPerElement = 16
column.creativeText_terms.lengthOfEachEntry = 60
column.creativeText_terms.columnType = DIMENSION
column.creativeText_terms.isSorted = false
column.creativeText_terms.hasNullValue = false
column.creativeText_terms.hasDictionary = true
column.creativeText_terms.textIndexType = NONE
column.creativeText_terms.hasInvertedIndex = true
column.creativeText_terms.hasFSTIndex = false
column.creativeText_terms.hasJsonIndex = false
column.creativeText_terms.isSingleValues = false
column.creativeText_terms.maxNumberOfMultiValues = 49
column.creativeText_terms.totalNumberOfEntries = 14628086
column.creativeText_terms.isAutoGenerated = false
column.creativeText_terms.minValue = 0.01
column.creativeText_terms.maxValue = \u1EE9ng
column.creativeText_terms.defaultNullValue = null

The dictionary is 2.9MB, and the forward index is 31MB:

creativeText_terms.dictionary.startOffset = 1648876
creativeText_terms.dictionary.size = 2915468
creativeText_terms.forward_index.startOffset = 4564344
creativeText_terms.forward_index.size = 31110427
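As a sanity check on those numbers, the sketch below recomputes the bit width from the cardinality and the packed size from totalNumberOfEntries. All inputs are copied from the metadata above; attributing the leftover bytes to per-document offsets and headers for the multi-value layout is my assumption, not something the metadata states.

```python
import math

# Values copied from metadata.properties and the index map above.
cardinality = 48_591
total_entries = 14_628_086        # totalNumberOfEntries
total_docs = 2_637_935            # totalDocs
dictionary_bytes = 2_915_468
forward_index_bytes = 31_110_427

# Bits needed to address every dictionary ID (0 .. cardinality - 1).
bits_per_element = max(1, (cardinality - 1).bit_length())

# Size of the bit-packed dictionary IDs alone.
packed_bytes = math.ceil(total_entries * bits_per_element / 8)

# Whatever is left over (assumed: multi-value offsets and headers).
overhead_bytes = forward_index_bytes - packed_bytes

avg_values_per_doc = total_entries / total_docs
forward_share = forward_index_bytes / (forward_index_bytes + dictionary_bytes)

print(f"bits per element:    {bits_per_element}")
print(f"packed dict IDs:     {packed_bytes:,} bytes")
print(f"remaining overhead:  {overhead_bytes:,} bytes")
print(f"avg values per doc:  {avg_values_per_doc:.2f}")
print(f"forward index share: {forward_share:.1%} of dictionary + forward index")
```

The computed width matches the reported bitsPerElement of 16, and the packed IDs alone come to 29,256,172 bytes, so nearly all of the 31MB is the encoded values themselves rather than overhead.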

kkrugler avatar Feb 19 '21 22:02 kkrugler

Related issue https://github.com/apache/pinot/issues/7870

siddharthteotia avatar May 23 '22 17:05 siddharthteotia

@somandal is working on this.

siddharthteotia avatar May 23 '22 18:05 siddharthteotia

I'm going to start working on this

somandal avatar Jul 19 '22 18:07 somandal

Part 1 to add support for skipping forward index (during segment generation) and making all other code paths (load, query processing) aware of it has been merged in https://github.com/apache/pinot/pull/9333

Subsequent PRs will focus on changes to support regeneration of forward index from dict and inverted index and toggling this feature.

siddharthteotia avatar Oct 12 '22 05:10 siddharthteotia

Here's a document which discusses the reload problem and how to solve it for forwardIndexDisabled columns. Please take a look and leave your feedback. cc @Jackie-Jiang @siddharthteotia @vvivekiyer

Just a note that a few details still need to be figured out and I will update the document as and when we figure them out.

somandal avatar Oct 13 '22 01:10 somandal

User docs - https://docs.pinot.apache.org/basics/indexing/forward-index#disabling-the-forward-index (thanks @somandal)
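For quick reference, here is a sketch of the table config fragment as I recall it from the linked docs, built as a Python dict and printed as JSON. Treat the exact property name and placement (forwardIndexDisabled under a fieldConfigList entry's properties) as an assumption to verify against the docs; the column name is just the one from this issue.

```python
import json

# Assumption: key names follow the linked user docs; check them before use.
field_config = {
    "name": "creativeText_terms",
    "encodingType": "DICTIONARY",
    "indexTypes": ["INVERTED"],
    "properties": {"forwardIndexDisabled": "true"},
}

table_config_fragment = {"fieldConfigList": [field_config]}

print(json.dumps(table_config_fragment, indent=2))
```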

siddharthteotia avatar Oct 14 '22 05:10 siddharthteotia

Part 2, to disable / delete the forward index for an existing column on the reload path, has been merged in https://github.com/apache/pinot/pull/9740

Part 3 will be to regenerate / re-enable the forward index for an existing column on the reload path using the dictionary and inverted index.

FYI - @walterddr @Jackie-Jiang

siddharthteotia avatar Nov 10 '22 18:11 siddharthteotia

With the latest PR merged, support for the following is complete:

  • No forward index during segment generation.
  • Delete the forward index during reload on an existing column.
  • Rebuild the forward index on a noForwardIndex column during reload using the dictionary and inverted index, and change dependent indexes if needed.

Support for derived columns and duplicates is pending and will be done as follow-ups as needed.
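To illustrate the rebuild idea in general terms, here is a simplified sketch that inverts an inverted index back into per-document dictionary ID lists. It is not the merged implementation; the function names and the in-memory dict/list layout are purely illustrative.

```python
def build_inverted_index(forward_index):
    """Map each dictId to the sorted list of docIds that contain it."""
    inverted = {}
    for doc_id, dict_ids in enumerate(forward_index):
        for dict_id in set(dict_ids):          # a posting list records a doc once
            inverted.setdefault(dict_id, []).append(doc_id)
    return inverted


def rebuild_forward_index(inverted_index, num_docs):
    """Recreate a multi-value forward index from dictId -> docIds postings."""
    forward = [[] for _ in range(num_docs)]
    for dict_id in sorted(inverted_index):
        for doc_id in inverted_index[dict_id]:
            forward[doc_id].append(dict_id)
    return forward


original = [[2, 0, 2], [1], [0, 1]]            # doc 0 repeats dictId 2
inverted = build_inverted_index(original)
rebuilt = rebuild_forward_index(inverted, num_docs=len(original))
print(rebuilt)                                 # [[0, 2], [1], [0, 1]]
```

Note that doc 0 comes back as [0, 2] rather than [2, 0, 2]: a posting list only records that a document contains a value, so repeated values and their original order within a multi-value entry cannot be recovered this way. That may be related to why duplicates are listed as pending above.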

siddharthteotia avatar Dec 12 '22 16:12 siddharthteotia

@somandal - I think you may want to update the user docs, open follow-up issues for the pending work, and link them here.

siddharthteotia avatar Dec 12 '22 16:12 siddharthteotia

Opened issues: https://github.com/apache/pinot/issues/9972 and https://github.com/apache/pinot/issues/9973

somandal avatar Dec 12 '22 18:12 somandal

User docs updated: https://docs.pinot.apache.org/basics/indexing/forward-index#disabling-the-forward-index

somandal avatar Dec 12 '22 22:12 somandal