Unexpected sturdyc.ErrInvalidType returned by generic GetOrFetch

Open ladydascalie opened this issue 8 months ago • 4 comments

Describe the bug Occasionally, the generic sturdyc.GetOrFetch function fails its type assertion and returns sturdyc.ErrInvalidType.

Expected behaviour It should not do that, or it should at least produce an error message that is far more informative about why the assertion failed.

To Reproduce I can't reproduce it outside my production environment, but I can show the code that hits this snag. I've shortened type and function names for readability, but the code is otherwise equivalent.

result, err := sturdyc.GetOrFetch(ctx, shortLived, key, func(ctx context.Context) ([]activeNames, error) {
	return getActiveNames(playerNameLookup, gameID)
})

// Workaround: retry without the cache when the type assertion fails.
if err != nil {
	if !errors.Is(err, sturdyc.ErrInvalidType) {
		return []names{}, err
	}

	result, err = getActiveNames(playerNameLookup, gameID)
	if err != nil {
		return []names{}, err
	}
}

// more error handling etc...

Screenshots

Additionally, we're seeing these errors, which I believe might be related:

(Screenshot: "sturdyc: error unmarshalling key" log messages)

I'm unsure how to proceed with debugging this, or how to figure out whether I'm using the library wrong or hitting a bug. Any advice would be helpful!

PS: I realised I forgot to include the configuration I'm using.

Cache instances are created like so:

cacheInstance := sturdyc.New[any](
	defaults.Capacity,
	defaults.NumShards,
	defaults.TTL,
	defaults.EvictionPercentage,
	sturdyc.WithEarlyRefreshes(
		defaults.EarlyRefreshes.MinRefreshDelay,
		defaults.EarlyRefreshes.MaxRefreshDelay,
		defaults.EarlyRefreshes.SynchronousRefreshDelay,
		defaults.EarlyRefreshes.RetryBaseDelay,
	),
	sturdyc.WithRefreshCoalescing(3, defaults.TTL),
	sturdyc.WithMetrics(newMemoryCacheMetrics(cacheName)),
	sturdyc.WithDistributedStorageEarlyRefreshes(newRedisStore(client, defaults.TTL), 10*time.Second),
)

The defaults look like this:

// defaults is the default configuration for the cache.
var defaults = &baseConfig{
	Capacity:           10000,
	NumShards:          10,
	TTL:                10 * time.Second,
	EvictionPercentage: 10,
	EarlyRefreshes: earlyRefreshConfig{
		MinRefreshDelay:         time.Millisecond * 10,
		MaxRefreshDelay:         time.Millisecond * 30,
		SynchronousRefreshDelay: time.Second * 10,
		RetryBaseDelay:          time.Millisecond * 10,
	},
}

And the TTL gets overridden based on need for transient, short or long-lived caches.
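
Roughly, the per-tier override looks something like this (a simplified sketch rather than our exact code; the helper and tier TTLs here are just illustrative):

// Hypothetical helper: copy the shared defaults and override the TTL per cache tier.
func configFor(ttl time.Duration) *baseConfig {
	cfg := *defaults // shallow copy; EarlyRefreshes is a value type, so it's copied too
	cfg.TTL = ttl
	return &cfg
}

var (
	transientConfig  = configFor(5 * time.Second)
	shortLivedConfig = configFor(30 * time.Second)
	longLivedConfig  = configFor(10 * time.Minute)
)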

Additionally, the distributed storage is implemented as follows:


var _ sturdyc.DistributedStorageWithDeletions = &redisStore{}

type redisStore struct {
	database.CachemereClient
	ttl time.Duration
}

func newRedisStore(client database.CachemereClient, ttl time.Duration) *redisStore {
	if ttl == 0 {
		ttl = 5 * time.Minute
	}
	return &redisStore{
		CachemereClient: client,
		ttl:             ttl,
	}
}

// Get implements sturdyc.DistributedStorage.
func (r *redisStore) Get(ctx context.Context, key string) ([]byte, bool) {
	val, err := r.Client.Get(ctx, key).Bytes()
	if err != nil {
		slog.DebugContext(ctx, "failed to get key from redis", "key", key, "err", err)
		return nil, false
	}
	return val, true
}

// GetBatch implements sturdyc.DistributedStorage.
func (r *redisStore) GetBatch(ctx context.Context, keys []string) map[string][]byte {
	records := make(map[string][]byte)
	for _, key := range keys {
		val, ok := r.Get(ctx, key)
		if ok {
			records[key] = val
		}
	}
	return records
}

// Set implements sturdyc.DistributedStorage.
func (r *redisStore) Set(ctx context.Context, key string, value []byte) {
	// Write with the configured TTL (newRedisStore defaults it to 5 minutes).
	if err := r.Client.Set(ctx, key, value, r.ttl).Err(); err != nil {
		slog.DebugContext(ctx, "failed to set key in redis", "key", key, "err", err)
	}
}

// SetBatch implements sturdyc.DistributedStorage.
func (r *redisStore) SetBatch(ctx context.Context, records map[string][]byte) {
	for key, val := range records {
		r.Set(ctx, key, val)
	}
}

// Delete implements sturdyc.DistributedStorageWithDeletions.
func (r *redisStore) Delete(ctx context.Context, key string) {
	if err := r.Del(ctx, key).Err(); err != nil {
		slog.DebugContext(ctx, "failed to delete key in redis", "key", key, "err", err)
	}
}

// DeleteBatch implements sturdyc.DistributedStorageWithDeletions.
func (r *redisStore) DeleteBatch(ctx context.Context, keys []string) {
	if err := r.Del(ctx, keys...).Err(); err != nil {
		slog.DebugContext(ctx, "failed to delete keys in redis", "keys", keys, "err", err)
	}
}

ladydascalie avatar May 15 '25 09:05 ladydascalie

Hi, it's difficult to say exactly what's going on without being able to run some code to reproduce the issue. However, the sturdyc: error unmarshalling key message should never appear, so I think that's a valuable clue.

Is it possible that some of your cache keys are colliding? That could happen if you're writing different types of objects under the same Redis keys.

Or could it be a versioning issue around deploys, where the data structure for activeNames has changed but the old representation is still stored in Redis under the same key? In that case, the conversion would fail when the cache tries to unmarshal those old objects into the newly updated struct.
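
To illustrate the collision scenario, here's a rough, self-contained sketch (the key, types, and cache settings are made up, and whether the mismatch surfaces exactly as sturdyc.ErrInvalidType may depend on the version you're running):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/viccon/sturdyc"
)

type playerProfile struct{ ID string }

func main() {
	ctx := context.Background()
	cache := sturdyc.New[any](1000, 10, time.Minute, 10)

	// First caller stores a []string under the key "player:123".
	_, _ = sturdyc.GetOrFetch(ctx, cache, "player:123", func(ctx context.Context) ([]string, error) {
		return []string{"alice", "bob"}, nil
	})

	// Second caller reuses the same key but asks for a playerProfile, so the
	// cached value can no longer be asserted to the requested type.
	_, err := sturdyc.GetOrFetch(ctx, cache, "player:123", func(ctx context.Context) (playerProfile, error) {
		return playerProfile{ID: "123"}, nil
	})
	fmt.Println(errors.Is(err, sturdyc.ErrInvalidType))
}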

viccon avatar May 22 '25 13:05 viccon

Is it possible that some of your cache keys are colliding? That could happen if you're writing different types of objects under the same Redis keys.

In this case, we are using a sha256 hash of the values to generate part of our key, so the likelihood of a collision is extremely low, I would say.

Additionally, concerning the versioning comment, I don't believe this is the case, as we've seen these errors crop up immediately after the first deployment.

Are there any particular quirks to be aware of in the way keys are compared? Is there an issue with the fact that we're using these:types:of:separators? We do this to follow Redis keyspace recommendations.
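
For reference, the keys we build look roughly like this (illustrative only, the real namespaces differ):

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// Roughly how we build keys: colon-separated namespaces plus a sha256 of the lookup values.
func cacheKey(gameID string, values []string) string {
	sum := sha256.Sum256([]byte(strings.Join(values, ",")))
	return fmt.Sprintf("players:active-names:%s:%x", gameID, sum)
}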

I will investigate both collisions and versioning anyway, to be totally sure.

ladydascalie avatar May 22 '25 13:05 ladydascalie

The function that produces that log is only called here: https://github.com/viccon/sturdyc/blob/main/distribution.go#L197 and here: https://github.com/viccon/sturdyc/blob/main/distribution.go#L103

In both cases, it looks like we're checking whether the distributed storage returned true (indicating that the key exists). Therefore, I believe this log should only occur if the Redis client indicates that the key exists, but the unmarshalling fails:

func unmarshalRecord[V any](bytes []byte, key string, log Logger) (distributedRecord[V], error) {
	var record distributedRecord[V]
	unmarshalErr := json.Unmarshal(bytes, &record)
	if unmarshalErr != nil {
		log.Error("sturdyc: error unmarshalling key: " + key)
	}
	return record, unmarshalErr
}

The way I can see this happening (off the top of my head) is as follows:

  • The type for V has changed since it was written to Redis (these logs would likely cluster around deploys); see the sketch after this list.
  • You're somehow writing different structs using the same key.
  • You're manually writing data to the Redis cache. This won't work because the cache writes a distributedRecord[V], which includes additional fields beyond just V.
  • The underlying data source is flaky or returns different types of responses without it resulting in an error.
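
To make the first point concrete, here's a standalone illustration of how a changed field type makes the unmarshalling fail (the structs are made up, and in reality the payload is wrapped in a distributedRecord[V]):

package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical shape that was serialized before a deploy.
type activeNamesV1 struct {
	Names string `json:"names"` // comma-separated string
}

// Hypothetical shape the new binary tries to unmarshal into.
type activeNamesV2 struct {
	Names []string `json:"names"` // now a slice
}

func main() {
	old, _ := json.Marshal(activeNamesV1{Names: "alice,bob"})

	var current activeNamesV2
	err := json.Unmarshal(old, &current)
	// Prints: json: cannot unmarshal string into Go struct field activeNamesV2.names of type []string
	fmt.Println(err)
}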

To avoid the first two issues, I usually write my clients like this:

// You can grep for this variable to make sure it's only used in a single place.
const moviesByIDsCacheKeyPrefix = "v1-movies-by-ids" 

type MoviesByIDsOpts struct {
	IncludeUpcoming bool
	IncludeUpsell   bool
}

func (c *Client) MoviesByIDs(ctx context.Context, ids []string, opts MoviesByIDsOpts) (map[string]Movie, error) {
	cacheKeyFunc := c.cache.PermutatedBatchKeyFn(moviesByIDsCacheKeyPrefix, opts)
	fetchFunc := func(ctx context.Context, cacheMisses []string) (map[string]Movie, error) {
		timeoutCtx, cancel := context.WithTimeout(ctx, c.timeout)
		defer cancel()

		var response map[string]Movie
		err := requests.URL(c.baseURL).
			Path("/movies").
			Param("ids", strings.Join(cacheMisses, ",")).
			Param("include_upcoming", strconv.FormatBool(opts.IncludeUpcoming)).
			Param("include_upsell", strconv.FormatBool(opts.IncludeUpsell)).
			ToJSON(&response).
			Fetch(timeoutCtx)
		return response, err
	}
	return sturdyc.GetOrFetchBatch(ctx, c.cache, ids, cacheKeyFunc, fetchFunc)
}

viccon avatar May 22 '25 14:05 viccon

One way to debug this, by the way, could be to vendor your dependencies and then modify the unmarshalRecord function with some additional logging to gather more clues. I tried to create a small example to reproduce this, but I wasn't successful.
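
Something along these lines in the vendored copy would tell you the concrete target type, the error, and the raw payload (assuming the file can import fmt):

func unmarshalRecord[V any](bytes []byte, key string, log Logger) (distributedRecord[V], error) {
	var record distributedRecord[V]
	unmarshalErr := json.Unmarshal(bytes, &record)
	if unmarshalErr != nil {
		// Extra debugging: include the target type, the error, and the raw bytes in the log.
		log.Error(fmt.Sprintf("sturdyc: error unmarshalling key %s into %T: %v, payload: %s",
			key, record, unmarshalErr, string(bytes)))
	}
	return record, unmarshalErr
}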

viccon avatar May 22 '25 14:05 viccon

Hi @ladydascalie I was just wondering if you made any progress on this?

viccon avatar May 27 '25 06:05 viccon

@ladydascalie I'm likely going to close this issue because I can't reproduce it and therefore can't make any progress without your help!

viccon avatar Jun 17 '25 09:06 viccon

Hello @viccon, we're still seeing this sporadically, but we've implemented a retry in the places where it occurs to mitigate it for now.

I've tried reproducing this in a sandboxed environment but can't get it to happen there either, which would support the idea that something in our application state is getting mangled.

Therefore, I'd say you are free to close this issue, as I don't think we'll manage to make much progress either.

ladydascalie avatar Jun 19 '25 08:06 ladydascalie

Hello again @viccon, I've actually managed to reproduce the issue now, and put together a repository for you: https://github.com/ladydascalie/sturdyc-bug-repro

It seems like the issue was fixed in 1.1.5, but it does appear in 1.1.4. It looks like you've already handled my issue :)

ladydascalie avatar Jun 23 '25 13:06 ladydascalie

Hi @ladydascalie, I'm happy to hear that! I'll go ahead and close this issue then.

viccon avatar Jun 26 '25 08:06 viccon