chronon icon indicating copy to clipboard operation
chronon copied to clipboard

[CHIP-1] Cache batch IRs in the Fetcher

Open caiocamatta-stripe opened this issue 2 years ago • 2 comments

Summary

This PR adds a batch IR cache to the Chronon fetcher.

  • In the fetcher, we now cache the decoded IRs for batch GetRequests. When a request comes in, the BaseFetcher will first try to get the IRs from this cache before sending a KV Store request.
  • This cache is shared across all GroupBys being served by the Fetcher.

During our tests, we saw roughly a 4GB increase in memory utilization with a 10,000-element cache, but this number will vary significantly across Chronon users.

This work is part 0, 1, and 2 of CHIP-1.

Enabling the cache

To enable the cache,

  1. Chronon users must set the Java arg ai.chronon.fetcher.batch_ir_cache_size to a non-zero number. This will create the cache object within Chronon.
  2. Add enable_caching=True to their GroupBy definition.

Having a two-step process allows for safer rollouts.

Why / Goal

Motivation: decrease feature serving latency.

  • At Stripe, we saw up to a 43% decrease in p99 feature serving latency and up to a 30% decrease in CPU utilization after enabling this cache.

CHIP-1 – Online IR and GetRequest Caching & Discussion

Test Plan

  • [X] Added Unit Tests
  • [X] Integration tested (existing FetcherTests)

QA Testing

Beyond the code tests, we also tested this change extensively in our QA environment that has a variety of GroupBys.

  • Using some of the additional logging that we have on our version of Chronon, I tested a few different scenarios and confirmed that all the feature values and logs were correct when caching was enabled.
    • Fetching a key that only has batch data
    • Fetching a key that only has streaming data
    • Fetching a key that has both batch and streaming data
    • Fetching a key that has no data
  • Confirmed that no cache gets created or used when the ai.chronon.fetcher.batch_ir_cache_size arg is false or not set (status quo).
  • Only the GroupBys with enable_caching set get cached.

Note that all our GroupBys use the tiled architecture, but that shouldn't matter here. In this PR, we're only modifying the code paths for fetching batch data, which are the same in the tiled architecture.

Online-Offline Consistency

Using two different Joins with 10-15 GroupBys each, we also confirmed that online-offline consistency metrics remained the same before and after enabling caching for all GroupBys.

Load tests

We use a long-running large-scale load test (5K requests/sec, 15 GroupBys in the Join) to confirm that these changes are stable. We did not see any error or latency spikes during a multi-day load test with caching enabled.

Checklist

  • [ ] Documentation update

Reviewers

@nikhilsimha (feel free to assign others instead) @piyushn-stripe

caiocamatta-stripe avatar Feb 12 '24 21:02 caiocamatta-stripe

@nikhilsimha and @piyushn-stripe - requesting a first-pass review here. Feel free to assign other people to review it instead.

caiocamatta-stripe avatar Apr 02 '24 17:04 caiocamatta-stripe

Hey @nikhilsimha, this PR is ready for a second pass. I cleaned up some of the comments, resolved conflicts, and changed the code to use Divya's FlagStore [commit].

Not sure how hard it would be, but it'd be nice if Airbnb could benchmark these changes as well (after merging is fine).

caiocamatta-stripe avatar May 13 '24 14:05 caiocamatta-stripe

should Add enable_caching=True to their GroupBy definition actually be

Add enable_fetcher_batch_ir_cache=True to their GroupBy definition?

One gap here is how the flagStore is created. I assume it is from metadata?

pengyu-hou avatar Jun 14 '24 04:06 pengyu-hou

Hey @pengyu-hou, I addresses your comments. Could you take another look?

should Add enable_caching=True to their GroupBy definition actually be Add enable_fetcher_batch_ir_cache=True to their GroupBy definition?

I updated the PR description. This Add enable_caching=True to their GroupBy definition comment was related to the previous version of this PR which used customJson in the GroupBy instead of the FlagStore.

caiocamatta-stripe avatar Jun 24 '24 23:06 caiocamatta-stripe