I'm using FusionCache on a .NET 8 backend hosted in AKS with multiple pods. The cache is backed by Azure Redis (C3 Standard). Everything was working fine in dev/staging, but since going live in production, I’ve been facing intermittent timeouts and latency spikes on Redis.
Here's what happens: I get a burst of warnings in the logs from FusionCache (10+ entries within the same millisecond) for the same key on the same pod:
FusionCache: an error occurred while trying to update a memory entry from a distributed entry
Then a RedisTimeoutException occurs shortly after:
StackExchange.Redis.RedisTimeoutException:
Timeout performing HMGET (30000ms), next: HMGET some-key, inst: 1, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle,
in: 62160, last-in: 0, cur-in: 0, sync-ops: 64, async-ops: 63,
serverEndpoint: redis-x.redis.cache.windows.net:6380,
conn-sec: 41.38, mc: 1/1/0, IOCP: (Busy=0,Free=1000), WORKER: (Busy=36,Free=32731), POOL: (Threads=36,QueuedItems=66,CompletedItems=1197,Timers=19)
Sometimes this results in 504 Gateway Timeout errors from an upstream HTTP endpoint, like:
Status code: GatewayTimeout. Message: Response status code does not indicate success: 504 (Gateway Timeout)
FusionCache configuration:
options
    .SetDuration(TimeSpan.FromSeconds(5)) // Memory cache duration
    .SetJittering(TimeSpan.FromSeconds(1)) // Avoid synchronized expiry
    .SetFailSafe(true, TimeSpan.FromMinutes(5), TimeSpan.FromSeconds(30)) // Enable fail-safe
    .SetFactoryTimeouts(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30)) // Soft/hard factory timeouts
    .SetDistributedCacheDuration(TimeSpan.FromMinutes(5)) // Redis duration
    .SetDistributedCacheTimeouts(TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(30)) // Soft/hard timeouts for Redis operations
    .AllowBackgroundBackplaneOperations(true)
    .AllowBackgroundDistributedCacheOperations(true);
Redis client configuration:
configurationOptions.EndPoints.Add(redisOptions.Endpoint);
configurationOptions.ConnectTimeout = 30000;
configurationOptions.SyncTimeout = 30000;
configurationOptions.AbortOnConnectFail = false;
configurationOptions.KeepAlive = 60;
await configurationOptions.ConfigureForAzureWithTokenCredentialAsync(new DefaultAzureCredential());
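Since the 30s SyncTimeout matches the 30000ms in the RedisTimeoutException above, one thing I'm experimenting with is much tighter client-side timeouts so a slow Redis call fails fast and FusionCache can fall back sooner. A sketch (the exact values are guesses I'd still need to tune, not recommendations):

```csharp
// Sketch: tighter timeouts so a slow Redis call fails fast instead of
// holding a request for up to 30s. Values below are assumptions to tune.
configurationOptions.ConnectTimeout = 5000;  // was 30000
configurationOptions.SyncTimeout = 2000;     // was 30000
configurationOptions.AsyncTimeout = 2000;    // async ops like the failing HMGET use this
configurationOptions.AbortOnConnectFail = false;
configurationOptions.KeepAlive = 60;
```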
In Program.cs:
// Add cache services
builder.Services.AddMemoryCache();
builder.Services.AddSingleton<Task<ConnectionMultiplexer>>(async serviceProvider =>
{
    var redisConfigService = serviceProvider.GetRequiredService<IRedisConfigurationService>();
    var logger = serviceProvider.GetRequiredService<ILogger<Program>>();
    var config = await redisConfigService.GetRedisConfigurationAsync(redisOption);
    logger.LogInformation("[REDIS] Connecting to Redis...");
    var connection = await ConnectionMultiplexer.ConnectAsync(config);
    if (connection.IsConnected)
    {
        logger.LogInformation("[REDIS] Connection established successfully!");
    }
    else
    {
        logger.LogError("[REDIS] Connection established but not active!");
    }
    return connection;
});
builder.Services.AddFusionCache()
    .WithSerializer(new FusionCacheNewtonsoftJsonSerializer())
    .WithDistributedCache(new RedisCache(new RedisCacheOptions
    {
        ConnectionMultiplexerFactory = async () =>
        {
            var multiplexerTask = builder.Services.BuildServiceProvider().GetRequiredService<Task<ConnectionMultiplexer>>();
            var multiplexer = await multiplexerTask;
            return multiplexer;
        },
        InstanceName = "Flow1:"
    }))
    .WithBackplane(new RedisBackplane(new RedisBackplaneOptions
    {
        ConnectionMultiplexerFactory = async () =>
        {
            var multiplexerTask = builder.Services.BuildServiceProvider().GetRequiredService<Task<ConnectionMultiplexer>>();
            var multiplexer = await multiplexerTask;
            return multiplexer;
        }
    }));
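Side note: I'm aware that calling builder.Services.BuildServiceProvider() inside each factory builds a separate container on every call. A variant I'm considering shares one lazily-created multiplexer between the distributed cache and the backplane instead (a sketch; BuildRedisConfigurationAsync is a hypothetical stand-in for the ConfigurationOptions setup shown above):

```csharp
// Sketch: create the multiplexer once, lazily, and share it between the
// distributed cache and the backplane, instead of calling
// builder.Services.BuildServiceProvider() inside each factory.
var lazyMux = new Lazy<Task<IConnectionMultiplexer>>(async () =>
{
    // Hypothetical helper: builds the same ConfigurationOptions as above.
    var config = await BuildRedisConfigurationAsync();
    return await ConnectionMultiplexer.ConnectAsync(config);
});

builder.Services.AddFusionCache()
    .WithSerializer(new FusionCacheNewtonsoftJsonSerializer())
    .WithDistributedCache(new RedisCache(new RedisCacheOptions
    {
        ConnectionMultiplexerFactory = () => lazyMux.Value,
        InstanceName = "Flow1:"
    }))
    .WithBackplane(new RedisBackplane(new RedisBackplaneOptions
    {
        ConnectionMultiplexerFactory = () => lazyMux.Value
    }));
```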
Context:
- I'm running multiple pods on AKS, and I suspect that when a cached key expires (after 5s), all pods hit Redis at the same time, causing a thundering herd issue.
- Even though I use SetJittering, it might not be enough to avoid concurrent distributed reads.
- I also noticed that Redis metrics (async-ops, queued items, etc.) shoot up around the time of the timeouts.
Questions:
- How can I prevent multiple pods from triggering Redis fetches simultaneously for an expired key?
- Is my fail-safe throttle duration (30s) too long and delaying fallback unnecessarily?
- Should I reduce the soft timeout (e.g., from 2s to 1s) to fail faster and let FusionCache serve from memory/fail-safe sooner?
- Is the combination of SetDuration(5s) and SetDistributedCacheDuration(5min) valid or misleading?
- Is it a good idea to enable a backplane (e.g., Redis pub/sub or other) to prevent parallel rebuilds across pods?
Goal:
- I’m looking for resilient FusionCache configuration suitable for AKS workloads that:
- Minimizes pressure on Redis
- Prevents thundering herd effects
- Falls back to memory cache or fail-safe quickly
- Avoids cascading HTTP 504s caused by slow Redis or cache misses
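For reference, here is the direction I'm currently considering for the entry options (a sketch only; the eager-refresh call and all values are my assumptions, not a verified fix):

```csharp
// Sketch of adjusted defaults I'm considering (all values are guesses to tune):
options
    .SetDuration(TimeSpan.FromSeconds(30))                // longer memory duration to shield Redis
    .SetJittering(TimeSpan.FromSeconds(3))                // more jitter across pods
    .SetEagerRefresh(0.8f)                                // refresh in background before expiry
    .SetFailSafe(true, TimeSpan.FromMinutes(5), TimeSpan.FromSeconds(10)) // shorter throttle
    .SetFactoryTimeouts(TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(10)) // fall back to fail-safe faster
    .SetDistributedCacheDuration(TimeSpan.FromMinutes(5))
    .SetDistributedCacheTimeouts(TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(5))
    .AllowBackgroundBackplaneOperations(true)
    .AllowBackgroundDistributedCacheOperations(true);
```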