
I'm using FusionCache on a .NET 8 backend hosted in AKS with multiple pods. The cache is backed by Azure Redis (C3 Standard). Everything was working fine in dev/staging, but since going live in production, I’ve been facing intermittent timeouts and latency spikes on Redis.

This is what's happening: I get a burst of warnings in the logs from FusionCache (10+ entries within the same millisecond) for the same key on the same pod:

FusionCache: an error occurred while trying to update a memory entry from a distributed entry

Then a RedisTimeoutException occurs shortly after:

 StackExchange.Redis.RedisTimeoutException:
Timeout performing HMGET (30000ms), next: HMGET some-key, inst: 1, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle,
in: 62160, last-in: 0, cur-in: 0, sync-ops: 64, async-ops: 63,
serverEndpoint: redis-x.redis.cache.windows.net:6380,
conn-sec: 41.38, mc: 1/1/0, IOCP: (Busy=0,Free=1000), WORKER: (Busy=36,Free=32731), POOL: (Threads=36,QueuedItems=66,CompletedItems=1197,Timers=19)

Sometimes this results in 504 Gateway Timeout errors from an upstream HTTP endpoint, like:

Status code: GatewayTimeout. Message: Response status code does not indicate success: 504 (Gateway Timeout)

FusionCache configuration:

options
.SetDuration(TimeSpan.FromSeconds(5)) // Memory cache duration
.SetJittering(TimeSpan.FromSeconds(1)) // Avoid synchronized expiry
.SetFailSafe(true, TimeSpan.FromMinutes(5), TimeSpan.FromSeconds(30)) // Enable fail-safe
.SetFactoryTimeouts(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30)) // Soft/hard factory timeouts
.SetDistributedCacheDuration(TimeSpan.FromMinutes(5)) // Redis duration
.SetDistributedCacheTimeouts(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30)) // Timeout for Redis operations
.AllowBackgroundBackplaneOperations(true)
.AllowBackgroundDistributedCacheOperations(true);

Redis client configuration:

configurationOptions.EndPoints.Add(redisOptions.Endpoint);
configurationOptions.ConnectTimeout = 30000;
configurationOptions.SyncTimeout = 30000;
configurationOptions.AbortOnConnectFail = false;
configurationOptions.KeepAlive = 60;
await configurationOptions.ConfigureForAzureWithTokenCredentialAsync(new DefaultAzureCredential());

In Program.cs

// Add cache services
builder.Services.AddMemoryCache();

builder.Services.AddSingleton<Task<ConnectionMultiplexer>>(async serviceProvider =>
{
    var redisConfigService = serviceProvider.GetRequiredService<IRedisConfigurationService>();
    var logger = serviceProvider.GetRequiredService<ILogger<Program>>();

    var config = await redisConfigService.GetRedisConfigurationAsync(redisOption);

    logger.LogInformation("[REDIS] Connecting to Redis...");

    var connection = await ConnectionMultiplexer.ConnectAsync(config);

    if (connection.IsConnected)
    {
        logger.LogInformation("[REDIS] Connection established successfully!");
    }
    else
    {
        logger.LogError("[REDIS] Connection established but not active!");
    }

    return connection;
});

builder.Services.AddFusionCache()
    .WithSerializer(new FusionCacheNewtonsoftJsonSerializer())    
    .WithDistributedCache(new RedisCache(new RedisCacheOptions
    {
        ConnectionMultiplexerFactory = async () =>
        {
            var multiplexerTask = builder.Services.BuildServiceProvider().GetRequiredService<Task<ConnectionMultiplexer>>();
            var multiplexer = await multiplexerTask;
            return multiplexer;
        },
        InstanceName = "Flow1:"
    }))
    .WithBackplane(new RedisBackplane(new RedisBackplaneOptions
    {
        ConnectionMultiplexerFactory = async () =>
        {
            var multiplexerTask = builder.Services.BuildServiceProvider().GetRequiredService<Task<ConnectionMultiplexer>>();
            var multiplexer = await multiplexerTask;
            return multiplexer;
        }
    }));

Context:

  1. I'm running multiple pods on AKS, and I suspect that when a cached key expires (after 5s), all pods hit Redis at the same time, causing a thundering herd issue.
  2. Even though I use SetJittering, it might not be enough to avoid concurrent distributed reads.
  3. I also noticed that Redis metrics (async-ops, queued items, etc.) shoot up around the time of the timeouts.

Questions

  1. How can I prevent multiple pods from triggering Redis fetches simultaneously for an expired key?
  2. Is my fail-safe throttle duration (30s) too long and delaying fallback unnecessarily?
  3. Should I reduce the soft timeout (e.g., from 2s to 1s) to fail faster and let FusionCache serve from memory/fail-safe sooner?
  4. Is the combination of SetDuration(5s) and SetDistributedCacheDuration(5min) valid or misleading?
  5. Is it a good idea to enable a backplane (e.g., Redis pub/sub or other) to prevent parallel rebuilds across pods?

Goal:

I'm looking for a resilient FusionCache configuration suitable for AKS workloads that:

  1. Minimizes pressure on Redis
  2. Prevents thundering herd effects
  3. Falls back to memory cache or fail-safe quickly
  4. Avoids cascading HTTP 504s caused by slow Redis or cache misses

1 Answer

FusionCache creator here: in general your configuration looks good.

Regarding your questions:

How can I prevent multiple pods from triggering Redis fetches simultaneously for an expired key?

I was about to suggest adding some jittering, but I see you are already doing that.
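For what it's worth, within a single pod FusionCache already coalesces concurrent factory executions for the same key (its cache stampede protection), so jittering mostly helps desynchronize expirations across pods. A minimal sketch of that combination, assuming your existing setup (the key, the Product type and FetchProductAsync are hypothetical names for illustration):

```csharp
// Within one pod, concurrent GetOrSetAsync calls for the same key are
// coalesced by FusionCache, so only one factory execution runs per key.
// Across pods, a wider jitter window spreads out the expiration instants.
var product = await cache.GetOrSetAsync<Product>(
    "product:42",
    async (ctx, ct) => await FetchProductAsync(42, ct), // runs once per key per pod
    options => options
        .SetDuration(TimeSpan.FromSeconds(5))
        .SetJittering(TimeSpan.FromSeconds(2))); // wider jitter desynchronizes pods
```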

Is my fail-safe throttle duration (30s) too long and delaying fallback unnecessarily?

I wouldn't say that, no.

Should I reduce the soft timeout (e.g., from 2s to 1s) to fail faster and let FusionCache serve from memory/fail-safe sooner?

You mean the soft timeout for the factory or the distributed cache?

Keep in mind that, if for some infra reason your Redis instance is slow, a soft timeout that is too low would not give Redis enough time to respond, making FusionCache always skip it. 2s/1s looks reasonable (a reasonably sized Redis instance usually answers in 5-50ms), but don't set it too low (e.g. < 100ms, imho).
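As a sketch of what that might look like (the values are illustrative, not a recommendation):

```csharp
// Illustrative only: a shorter soft timeout makes FusionCache stop waiting
// for Redis sooner and serve from memory/fail-safe instead, while the hard
// timeout still bounds how long the distributed operation may run overall.
options
    .SetDistributedCacheTimeouts(
        TimeSpan.FromSeconds(1),   // soft: stop waiting after 1s, fall back
        TimeSpan.FromSeconds(10)); // hard: abort the operation outright after 10s
```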

Is the combination of SetDuration(5s) and SetDistributedCacheDuration(5min) valid or misleading?

It's valid, but ask yourself why you are doing it: every 5s each pod's memory cache will go back to the distributed cache. Is the reason that you want the freshest version possible? If so, the backplane does exactly that, so there's no need to keep the memory cache duration at 5s.

If instead you want to keep memory allocation as low as possible, then it's a different matter.
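If freshness is the motivation, one hedged alternative, assuming a backplane is in place as in your setup, is to let memory entries live longer and rely on backplane notifications for cross-pod updates (durations are illustrative):

```csharp
// Illustrative: with a backplane, memory entries can live as long as the
// distributed ones, since updates reach the other pods via pub/sub
// notifications instead of frequent re-reads from Redis.
options
    .SetDuration(TimeSpan.FromMinutes(5))                 // memory duration ~ Redis duration
    .SetDistributedCacheDuration(TimeSpan.FromMinutes(5)) // unchanged
    .SetJittering(TimeSpan.FromSeconds(30));              // still desynchronize expirations
```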

Is it a good idea to enable a backplane (e.g., Redis pub/sub or other) to prevent parallel rebuilds across pods?

A backplane is basically always a good idea in a distributed environment: it allows for almost instantaneous updates across all nodes, so yeah I would definitely keep it.

PS: occasionally hitting timeouts with StackExchange.Redis is not, let's say, such a rare thing (see here).
