query `setQueryData` is slow with large datasets

Describe the bug

setQueryData iterates over all entries in the cache to find matches. This leads to performance issues when priming large datasets.

Your minimal, reproducible example

https://codesandbox.io/p/sandbox/react-query-setquerydata-complexity-xl4d3t

Steps to reproduce

const myData = range(1, 10000) // eg. a list of search responses received from the server
for (const i of myData) {
  queryClient.setQueryData(i, i)
}

Internally, setQueryData uses find, which iterates over each entry in the cache. Assuming the cache has already been populated, the above function quadratic.

The issue can be avoided by getting the Query-Object directly from the hash-table:

// this returns the cache for this key if existing, or creates a new one with default options:
// Unlike `setQueryData`, it will use the hashtable to find the entry.
const query = queryClient.getQueryCache().build(client, { queryKey: /* ... */ }) 
query.setData(data)

Expected behavior

I'd suggest that setQueryData should either...

be updated to only work with exact matches, since I believe most use-cases will only want to update a single entry
be provided with an exact -filter option that uses the hash-table internally

How often does this bug happen?

Every time

Screenshots or Videos

Platform

MacOS, Chrome 23.0.6262.5

Tanstack Query adapter

vanilla

TanStack Query version

5.17.19

TypeScript version

No response

Additional context

This behaviour has also been observed by users in https://github.com/TanStack/query/issues/1633. However, it was previously believed to be due to structural sharing.

Jan 27 '24 23:01 SebKranz

good find, thanks.

The issue can be avoided by getting the Query-Object directly from the hash-table:

yes, but we cannot call .build before we've executed the functionalUpdate, because it can return undefined, and in that case we don't want to create the cache entry.

I think this diff should fix it though:

index 5cc764fd..18516ccc 100644
--- a/packages/query-core/src/queryClient.ts
+++ b/packages/query-core/src/queryClient.ts
@@ -165,14 +165,6 @@ export class QueryClient {
     >,
     options?: SetDataOptions,
   ): TInferredQueryFnData | undefined {
-    const query = this.#queryCache.find<TInferredQueryFnData>({ queryKey })
-    const prevData = query?.state.data
-    const data = functionalUpdate(updater, prevData)
-
-    if (typeof data === 'undefined') {
-      return undefined
-    }
-
     const defaultedOptions = this.defaultQueryOptions<
       any,
       any,
@@ -181,6 +173,16 @@ export class QueryClient {
       QueryKey
     >({ queryKey })
 
+    const query = this.#queryCache.get<TInferredQueryFnData>(
+      defaultedOptions.queryHash,
+    )
+    const prevData = query?.state.data
+    const data = functionalUpdate(updater, prevData)
+
+    if (typeof data === 'undefined') {
+      return undefined
+    }
+
     return this.#queryCache
       .build(this, defaultedOptions)
       .setData(data, { ...options, manual: true })

do you want to maybe contribute a PR with some tests ?

Jan 28 '24 08:01 TkDodo

do you want to maybe contribute a PR with some tests ?

Yes I would be happy to.

I'm a bit unsure on how to best test this. Do you think, measuringperformance.now() with an arbitrary threshold of 500ms would suffice?

Jan 28 '24 19:01 SebKranz

not sure how to best test this tbh

Jan 28 '24 20:01 TkDodo

I think there is no problem writing the test by measuring an arbitrary threshold using performance.now(). This PR on @tanstack/table did this kind of test https://github.com/TanStack/table/pull/4495

I've tried the fix and test locally and it works fine.

Before fix with tests

 FAIL  |@tanstack/query-core| src/tests/queryClient.test.tsx > queryClient > setQueryData > should set 10k data in less than 500ms
AssertionError: expected 29102.459416 to be less than 500
 ❯ src/tests/queryClient.test.tsx:371:27
    369|       const end = performance.now()
    370|       expect(queryClient.getQueryData([...key, totalRun])).toBe(totalRun)
    371|       expect(end - start).toBeLessThan(500)
       |                           ^
    372|     })
    373|   })

After fix with tests

  Test Files  17 passed (17)
      Tests  318 passed (318)
Type Errors  no errors
   Start at  02:24:05
   Duration  3.28s

wdyt? @TkDodo @SebKranz

Feb 06 '24 19:02 milhamm

Nice 👍

Feb 06 '24 19:02 TkDodo

I would love to help with the PR, I can just push my current changes @TkDodo

Feb 06 '24 19:02 milhamm

@milhamm sounds good, thanks.

It looks like there are actually three functions that call find when they usually don't have to: setQueryData getQueryData and getQueryState. I was therefore tempted to move the Optimisation directly into QueryCache.find, but that is impossible since we don't have access to the hashFn there.

@TkDodo, you're using the default hashFn from the QueryClient. Wouldn't that run into a subtle bug if the user attempts to set data for a Query that was created with a custom hashFn like so?:

useQuery({ 
   // ...
   queryKeyHashFn: () => ...
})

Feb 06 '24 20:02 SebKranz

@SebKranz Correct me if I'm wrong, but I think that's the behavior I normally would expect. I expect the queryKeyHashFn called from useQuery would not override the client queryKeyHasFn function unless they set it when instantiating the initial query client with new QueryClient() or using the queryClient.setDefaultOptions.

But I could be wrong tho.

Feb 06 '24 20:02 milhamm

@milhamm No, I agree. TBH I don't even see the use-case for setting the hashFn on a per-query basis in the first place.

However, currently find computes the query hash by taking the hashFn from each individual Query. I just wanted to make sure that you're okay with changing that behaviour.

Feb 06 '24 21:02 SebKranz

Fair point, it would have a subtle behavior change @SebKranz

Feb 07 '24 03:02 milhamm

@TkDodo, you're using the default hashFn from the QueryClient. Wouldn't that run into a subtle bug if the user attempts to set data for a Query that was created with a custom hashFn like so?

inside find, we are using matchQuery, which will match the passed queryKey against all queries. In doing so, it will hash the passed queryKey with each hashFn of each query. So if two queries have a different hashFn, both will be used on the passed key to produce a match.

now technically, it would be good if find would accept { queryKey, queryKeyHashFn }, but that isn't a valid filter. So I think the current behaviour could be slightly off if you indeed have a custom hashFn for key A, and then try to match A against B and C, because A will be hashed with the hashFns from B and C. In practice, nobody has done that in 4 years :)

Feb 13 '24 12:02 TkDodo