.NET Amazon Elasticache - Redis cluster random timeout errors

Question

Recently we started storing user session data in Amazon ElastiCache through Redis. I was a little worried about the speed and latency of this solution, since the session data was previously stored in the server's memory, but to my surprise the Redis cache is actually very quick, and I don't think we really took a performance hit.

We are looking to add more web servers, and we already have a load balancer (mostly for security at the moment), so we wanted to store user session data somewhere else so that users who got directed to another server in between requests wouldn't notice anything.

About a month after releasing this solution, a few users reported timeouts. It doesn't happen very often, and we've only gotten a few reports, but the users were very frustrated because the work they were doing on those pages was completely lost.

This timeout is in line with the configured operation timeout of 5000 ms:

<sessionState mode="Custom" customProvider="Custom_RedisSessionStateStore" timeout="60">
  <providers>
    <add name="Custom_RedisSessionStateStore"
         type="Microsoft.Web.Redis.RedisSessionStateProvider"
         settingsClassName="AWS.SessionStateRedisSettings"
         settingsMethodName="ConnectionString"
         operationTimeoutInMilliseconds="5000"
    />
  </providers>
</sessionState>

The class “SessionStateRedisSettings” supplies the connection information for the session store, because that information is stored in AWS Secrets Manager and is pulled at startup of the web application.

namespace AWS
{
    public static class SessionStateRedisSettings
    {
        public static string RedisConnectionString = string.Empty;

        public static void Initialize()
        {
            RedisConnectionString = string.Format("{0}:{1},password={2},ssl=True", SecretsCache.SecretsDictonary["RedisHost"], SecretsCache.SecretsDictonary["RedisPort"], SecretsCache.SecretsDictonary["RedisPass"]);
        }

        public static string ConnectionString()
        {
            return RedisConnectionString;
        }
    }
}

I used the following link from the error message to try to find the root cause: https://stackexchange.github.io/StackExchange.Redis/Timeouts

Are you getting network or CPU bound?
I don’t think it’s the network or CPU, here are some metrics during the timeout

Are there commands taking a long time to process on the redis-server?
Using “log insights” to query SlowLog in AWS, we can look to verify:

First query to get all the logs, we have over 1k at the moment:

fields @timestamp, @message

Second query is to get all of them that are like EVALSHA, which is almost the entire data set.

fields @timestamp, @message 
| filter Command like /EVALSHA/

Third query to get all the logs that are NOT like EVALSHA

fields @timestamp, @message 
| filter Command not like /EVALSHA/

Fourth query to see which ones took longer than 5 seconds. Duration (us) is measured in microseconds, and there is nothing longer than 5 seconds.

fields @timestamp, @message
| filter `Duration (us)` > 5000000

Fifth query to prove that the filter on Duration does work; the longest duration of a command is about 0.3 seconds

fields @timestamp, @message
| filter `Duration (us)` > 50000

Was there a big request preceding several small requests to the Redis that timed out?
I thought this was the case initially, but seeing how the “qs” value in the error is 0, it doesn’t seem to be the case.

Are you seeing a high number of busyio or busyworker threads in the timeout exception?
It does not seem that way. The IOCP (Runtime Global Thread Pool IO Threads) has 0 busy threads. The WORKER (Runtime Global Thread Pool Worker Threads) has 13/32767 busy threads
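To compare those numbers against what the app itself sees, a minimal sketch like the one below could log the thread pool counts from inside the web application. `ThreadPoolSnapshot`, `Describe`, and `Flag` are hypothetical names of my own, not part of StackExchange.Redis or the session state provider. As I read the Timeouts page linked above, the value that matters is Busy relative to Min rather than Max, because the pool adds threads slowly once Busy passes Min, so 13 busy workers out of a 32767 maximum does not settle the question by itself.

```csharp
using System;
using System.Threading;

// Hypothetical helper: summarizes the runtime thread pool in the same
// Busy/Min/Max terms that the Redis timeout message uses.
public static class ThreadPoolSnapshot
{
    // Builds a one-line summary for the worker and IOCP pools.
    public static string Describe()
    {
        int minWorker, minIocp, maxWorker, maxIocp, freeWorker, freeIocp;
        ThreadPool.GetMinThreads(out minWorker, out minIocp);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIocp);
        ThreadPool.GetAvailableThreads(out freeWorker, out freeIocp);

        // Busy is derived: the maximum minus what is currently available.
        int busyWorker = maxWorker - freeWorker;
        int busyIocp = maxIocp - freeIocp;

        return string.Format(
            "WORKER: Busy={0},Min={1},Max={2}{3} | IOCP: Busy={4},Min={5},Max={6}{7}",
            busyWorker, minWorker, maxWorker, Flag(busyWorker, minWorker),
            busyIocp, minIocp, maxIocp, Flag(busyIocp, minIocp));
    }

    // Marks a pool whose busy count has gone past its configured minimum.
    public static string Flag(int busy, int min)
    {
        return busy > min ? " (above Min)" : string.Empty;
    }
}
```

Writing `ThreadPoolSnapshot.Describe()` to the application log on a timer, or whenever a timeout is caught, would show whether Busy climbs past Min around the times users report problems.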

Overall the metrics on the redis server, the slow logs, and the values returned in the error message all look good, so I’m not entirely sure where to continue to look to find the source of the problem.

I did end up optimizing our session usage on the specific page that was getting the redis timeouts. (That page was riddled with session variables, so I thought that would help)

charlie arehart · Accepted Answer · 2024-06-22 06:19:21Z

You've asserted there's no network issue, but you've then focused on what's going on INSIDE redis. :-) There may well be network connection issues at play, whether due to matters on your end (the ms web app deployment location) or in getting out and then to the Azure cache service. That operationTimeoutInMilliseconds setting could fire for both kinds of delay.

Of course, when they're intermittent/infrequent, that can be challenging to understand.

Have you at least considered any kind of tool to test connections from the client env to redis? Whether from lowly ping to more capable network monitors that better track things over time? You may see hiccups that coincide with the timeouts (which may seem of marginal new value), but you may also discover some pattern of smaller but still significant delays. Those may serve as clues.
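As a low-tech stand-in for a network monitor, a sketch along these lines could run on the web server for a few hours and log only the slow or failed attempts. It is an illustration under assumptions, not a finished tool: `RedisLatencyProbe` and its methods are names I made up, the host and port are whatever your Redis endpoint is, and it times only the TCP connect, so TLS negotiation and Redis command processing are not covered.

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;
using System.Threading;

// Hypothetical probe: repeatedly times a TCP connect to the Redis endpoint.
public static class RedisLatencyProbe
{
    // Returns the connect time in milliseconds, or -1 if the connect
    // failed or did not finish within timeoutMs.
    public static double MeasureConnectMs(string host, int port, int timeoutMs)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            using (var client = new TcpClient())
            {
                var connect = client.ConnectAsync(host, port);
                if (!connect.Wait(timeoutMs))
                {
                    return -1; // still pending after the timeout
                }
            }
        }
        catch (Exception)
        {
            // Wait() surfaces socket errors wrapped in an AggregateException.
            return -1;
        }
        return sw.Elapsed.TotalMilliseconds;
    }

    // Takes `samples` measurements, pausing intervalMs between them, and
    // prints only failures and connects slower than slowMs.
    public static void Run(string host, int port, int samples, int intervalMs, double slowMs)
    {
        for (int i = 0; i < samples; i++)
        {
            double ms = MeasureConnectMs(host, port, 5000);
            if (ms < 0 || ms > slowMs)
            {
                Console.WriteLine("{0:o} connect to {1}:{2} -> {3}",
                    DateTime.UtcNow, host, port,
                    ms < 0 ? "FAILED or timed out" : ms.ToString("F1") + " ms");
            }
            Thread.Sleep(intervalMs);
        }
    }
}
```

For example, `RedisLatencyProbe.Run(host, port, 720, 5000, 100)` would sample every 5 seconds for about an hour and report any connect slower than 100 ms. The timestamps can then be lined up against the times of the reported timeouts.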

Further, if you want to test BOTH the network connectivity AND processing in Redis, you may want to consider running the redis-benchmark cli tool (included with Redis or installable on the client machine--and yes, even if it's Windows if that's indeed what you're running). It's very configurable, so you could even run it for extended periods. More on it here.

It might also be helpful to hear where that client web app is running relative to the redis deployment (is it on-prem perhaps, or on Azure also? etc). And it never hurts to do a tracert from the client to the server to see the hops involved. Sometimes people are shocked by what they find.

Finally, consider that something (on either end) may be throttling requests only very occasionally. It could even be something like a firewall or other network/security software on the client machine or its network.

2 Comments

nyduss Over a year ago

The web server is a Windows machine, and the client web app is running on its own AWS EC2 instance, and then the Redis Cluster is set up through Amazon's ElastiCache. The Redis Cluster does not exist on the EC2 instance, and we wanted that by design so we can add more web servers and the user's session data would exist on the redis server; that way the load balancer can direct a user to a different server in between requests, and they won't notice a thing. The redis-benchmark cli tool and the tracert from the client to the redis server are interesting ideas. I'll follow up with those.

nyduss Over a year ago

I've also had issues connecting to the Redis cluster through the redis cli from a windows machine when I enable the "Encryption in transit" when setting up the Redis Cluster through Amazon's Elasticache, so I'm not sure this benchmark tool will work for me because it looks to use the same interface.
