
I'm using FusionCache on a .NET 8 backend hosted in AKS with multiple pods. The cache is backed by Azure Redis (C3 Standard). Everything was working fine in dev/staging, but since going live in production, I’ve been facing intermittent timeouts and latency spikes on Redis.

This is what's happening: I get a burst of warnings in the logs from FusionCache (10+ entries within the same millisecond) for the same key on the same pod:

FusionCache: an error occurred while trying to update a memory entry from a distributed entry

Then a RedisTimeoutException occurs shortly after:

 StackExchange.Redis.RedisTimeoutException:
Timeout performing HMGET (30000ms), next: HMGET some-key, inst: 1, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle,
in: 62160, last-in: 0, cur-in: 0, sync-ops: 64, async-ops: 63,
serverEndpoint: redis-x.redis.cache.windows.net:6380,
conn-sec: 41.38, mc: 1/1/0, IOCP: (Busy=0,Free=1000), WORKER: (Busy=36,Free=32731), POOL: (Threads=36,QueuedItems=66,CompletedItems=1197,Timers=19)

Sometimes this results in 504 Gateway Timeout errors from an upstream HTTP endpoint, like:

Status code: GatewayTimeout. Message: Response status code does not indicate success: 504 (Gateway Timeout)

FusionCache configuration:

options
.SetDuration(TimeSpan.FromSeconds(5)) // Memory cache duration
.SetJittering(TimeSpan.FromSeconds(1)) // Avoid synchronized expiry
.SetFailSafe(true, TimeSpan.FromMinutes(5), TimeSpan.FromSeconds(30)) // Enable fail-safe
.SetFactoryTimeouts(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30)) // Soft/hard factory timeouts
.SetDistributedCacheDuration(TimeSpan.FromMinutes(5)) // Redis duration
.SetDistributedCacheTimeouts(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30)) // Timeout for Redis operations
.AllowBackgroundBackplaneOperations(true)
.AllowBackgroundDistributedCacheOperations(true);

Redis client configuration:

configurationOptions.EndPoints.Add(redisOptions.Endpoint);
configurationOptions.ConnectTimeout = 30000;
configurationOptions.SyncTimeout = 30000;
configurationOptions.AbortOnConnectFail = false;
configurationOptions.KeepAlive = 60;
await configurationOptions.ConfigureForAzureWithTokenCredentialAsync(new DefaultAzureCredential());

In Program.cs

// Add cache services
builder.Services.AddMemoryCache();

builder.Services.AddSingleton<Task<ConnectionMultiplexer>>(async serviceProvider =>
{
    var redisConfigService = serviceProvider.GetRequiredService<IRedisConfigurationService>();
    var logger = serviceProvider.GetRequiredService<ILogger<Program>>();

    var config = await redisConfigService.GetRedisConfigurationAsync(redisOption);

    logger.LogInformation("[REDIS] Connecting to Redis...");

    var connection = await ConnectionMultiplexer.ConnectAsync(config);

    if (connection.IsConnected)
    {
        logger.LogInformation("[REDIS] Connection established successfully!");
    }
    else
    {
        logger.LogError("[REDIS] Connection established but not active!");
    }

    return connection;
});

builder.Services.AddFusionCache()
    .WithSerializer(new FusionCacheNewtonsoftJsonSerializer())    
    .WithDistributedCache(new RedisCache(new RedisCacheOptions
    {
        ConnectionMultiplexerFactory = async () =>
        {
            var multiplexerTask = builder.Services.BuildServiceProvider().GetRequiredService<Task<ConnectionMultiplexer>>();
            var multiplexer = await multiplexerTask;
            return multiplexer;
        },
        InstanceName = "Flow1:"
    }))
    .WithBackplane(new RedisBackplane(new RedisBackplaneOptions
    {
        ConnectionMultiplexerFactory = async () =>
        {
            var multiplexerTask = builder.Services.BuildServiceProvider().GetRequiredService<Task<ConnectionMultiplexer>>();
            var multiplexer = await multiplexerTask;
            return multiplexer;
        }
    }));

Context:

  1. I'm running multiple pods on AKS, and I suspect that when a cached key expires (after 5s), all pods hit Redis at the same time, causing a thundering herd issue.
  2. Even though I use SetJittering, it might not be enough to avoid concurrent distributed reads.
  3. I also noticed that Redis metrics (async-ops, queued items, etc.) shoot up around the time of the timeouts.

Questions

  1. How can I prevent multiple pods from triggering Redis fetches simultaneously for an expired key?
  2. Is my fail-safe throttle duration (30s) too long and delaying fallback unnecessarily?
  3. Should I reduce the soft timeout (e.g., from 2s to 1s) to fail faster and let FusionCache serve from memory/fail-safe sooner?
  4. Is the combination of SetDuration(5s) and SetDistributedCacheDuration(5min) valid or misleading?
  5. Is it a good idea to enable a backplane (e.g., Redis pub/sub or other) to prevent parallel rebuilds across pods?

Goal:

I'm looking for a resilient FusionCache configuration suitable for AKS workloads that:

  1. Minimizes pressure on Redis
  2. Prevents thundering herd effects
  3. Falls back to memory cache or fail-safe quickly
  4. Avoids cascading HTTP 504s caused by slow Redis or cache misses

1 Answer

FusionCache creator here: in general your configuration looks good.

Regarding your questions:

How can I prevent multiple pods from triggering Redis fetches simultaneously for an expired key?

I was about to suggest adding some jittering, but I see you are already doing that.
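For what it's worth, within a single pod FusionCache already coalesces concurrent factory executions for the same key (its cache stampede protection), so jittering mostly helps desynchronize expirations across pods. A minimal sketch of that combination, assuming your existing setup (the key, the Product type and FetchProductAsync are hypothetical names for illustration):

```csharp
// Within one pod, concurrent GetOrSetAsync calls for the same key are
// coalesced by FusionCache, so only one factory execution runs per key.
// Across pods, a wider jitter window spreads out the expiration instants.
var product = await cache.GetOrSetAsync<Product>(
    "product:42",
    async (ctx, ct) => await FetchProductAsync(42, ct), // runs once per key per pod
    options => options
        .SetDuration(TimeSpan.FromSeconds(5))
        .SetJittering(TimeSpan.FromSeconds(2))); // wider jitter desynchronizes pods
```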

Is my fail-safe throttle duration (30s) too long and delaying fallback unnecessarily?

I wouldn't say that, no.

Should I reduce the soft timeout (e.g., from 2s to 1s) to fail faster and let FusionCache serve from memory/fail-safe sooner?

You mean the soft timeout for the factory or the distributed cache?

Keep in mind that, if for some infra reason your Redis instance is slow, a soft timeout that is too low would not give Redis enough time to respond, making FusionCache always skip it. 2s/1s looks reasonable (a reasonably sized Redis instance usually answers in 5-50ms), but don't set it too low (e.g. < 100ms, imho).
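As a sketch of what that might look like (the values are illustrative, not a recommendation):

```csharp
// Illustrative only: a shorter soft timeout makes FusionCache stop waiting
// for Redis sooner and serve from memory/fail-safe instead, while the hard
// timeout still bounds how long the distributed operation may run overall.
options
    .SetDistributedCacheTimeouts(
        TimeSpan.FromSeconds(1),   // soft: stop waiting after 1s, fall back
        TimeSpan.FromSeconds(10)); // hard: abort the operation outright after 10s
```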

Is the combination of SetDuration(5s) and SetDistributedCacheDuration(5min) valid or misleading?

It's valid, but ask yourself why you are doing it: every 5s each pod's memory cache will go back to the distributed cache. Is the reason that you want the freshest version possible? If so, the backplane does exactly that, so there's no need to keep the memory cache duration at 5s.

If instead you want to keep memory allocation as low as possible, then it's a different matter.
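If freshness is the motivation, one hedged alternative, assuming a backplane is in place as in your setup, is to let memory entries live longer and rely on backplane notifications for cross-pod updates (durations are illustrative):

```csharp
// Illustrative: with a backplane, memory entries can live as long as the
// distributed ones, since updates reach the other pods via pub/sub
// notifications instead of frequent re-reads from Redis.
options
    .SetDuration(TimeSpan.FromMinutes(5))                 // memory duration ~ Redis duration
    .SetDistributedCacheDuration(TimeSpan.FromMinutes(5)) // unchanged
    .SetJittering(TimeSpan.FromSeconds(30));              // still desynchronize expirations
```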

Is it a good idea to enable a backplane (e.g., Redis pub/sub or other) to prevent parallel rebuilds across pods?

A backplane is basically always a good idea in a distributed environment: it allows for almost instantaneous updates across all nodes, so yeah I would definitely keep it.

PS: occasionally hitting timeouts with StackExchange.Redis is not, let's say, such a rare thing (see here).
