
We have a CCTV system where we use NVIDIA GPUs for video decoding. Our current requirement is to monitor GPU decoding and memory usage, and if the usage reaches 80%, we need to automatically switch new streams to the next available GPU.

We have implemented GPU monitoring using NVML, but when multiple streams are initiated simultaneously, they all tend to go to the same GPU. We are looking for an effective strategy or best practices to distribute the streams evenly across multiple GPUs when they are opened concurrently.

Any advice or suggestions on how to achieve this load balancing effectively would be greatly appreciated.

Thank you!

1 Answer


Don't monitor - estimate the load. If you try to measure, you will find that the reported load fluctuates heavily due to various external factors (e.g. stalled uploads delaying the decoder, accidentally sampling utilization right in between frames, etc.), and you will almost certainly under- or overshoot the intended load level.

The load is almost proportional to the frame rate and video resolution - the latter rounded up to a multiple of 128 pixels in both dimensions. This rounding is due to an undocumented implementation detail of the video decoder: it processes videos in tiles of this granularity.

Bitrate and specific encoding details (used or unused codec features) have little to no impact at all. There is a correction factor for entire codec families only (e.g. H264 vs H265 vs VC-1 vs VP9), but they all compete for resources from the same pool, so you can sum the per-stream costs up trivially.
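A minimal sketch of that per-stream estimate, assuming the 128-pixel rounding and per-codec correction factors described above (the enum, the helper names and the factor values are mine for illustration; the factors are placeholders you would have to calibrate yourself):

```cpp
#include <cstdint>

enum class Codec { H264, H265, VC1, VP9 };

// Round a dimension up to the next multiple of 128 pixels (decoder tile size).
constexpr uint32_t RoundUp128(uint32_t v) { return (v + 127u) & ~127u; }

// Per codec-family correction factor. The values are placeholders; calibrate
// them against your own reference measurements.
inline double CodecFactor(Codec c) {
    switch (c) {
        case Codec::H264: return 1.0;
        case Codec::H265: return 1.0;  // placeholder
        case Codec::VC1:  return 1.0;  // placeholder
        case Codec::VP9:  return 1.0;  // placeholder
    }
    return 1.0;
}

// Estimated decoder load of a single stream, in "pixels per second".
inline double EstimateStreamCost(uint32_t width, uint32_t height,
                                 double fps, Codec codec) {
    return static_cast<double>(RoundUp128(width)) * RoundUp128(height)
           * fps * CodecFactor(codec);
}
```

For a 1080p30 stream this comes out to 1920 x 1152 x 30 ≈ 66 million pixels per second (before the codec factor).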

The same amount of resources is available on all models of a generation; it does not scale with clock speeds. The only exception is when a chip explicitly has multiple video decoder units, in which case you can simply multiply the available decoding budget. There has actually been very little difference between GPUs since the introduction of the Pascal family (3rd gen NVDEC) either; only the feature set has been extended, not the performance per unit.

Check https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new , in the rows "Total # of NVDEC" and "NVDEC Generation". Those are the only factors determining the available throughput.

You will have to make a reference measurement for the GPU families you use to determine a reference value in "pixels per second" peak throughput for the video codec relevant to you - I can no longer recall what the exact numbers were. Use a single 4k video stream for the reference measurement, as it scales slightly worse than a bunch of concurrent lower-resolution streams.

You can generally run the video decoder unit at up to 95% of the peak throughput rate measured this way without losing real-time decoding capability.
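Putting the reference measurement, the NVDEC count from the support matrix and the 95% headroom together, a per-GPU budget could be tracked roughly like this (a sketch; peakPpsPerNvdec is whatever your own reference run yields, not a published figure):

```cpp
// Tracks the estimated decode load on one GPU against its budget.
struct GpuBudget {
    double peakPpsPerNvdec;  // pixels/s from your own reference measurement
    int    numNvdec;         // "Total # of NVDEC" from the support matrix
    double usedPps = 0.0;    // sum of EstimateStreamCost() of assigned streams

    // Keep ~5% headroom so real-time decoding is never lost.
    double Capacity()  const { return peakPpsPerNvdec * numNvdec * 0.95; }
    double Remaining() const { return Capacity() - usedPps; }
    bool   Fits(double streamCost) const { return streamCost <= Remaining(); }
};
```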

Video decoder throughput is independent of compute or graphics loads on the shader units.

Don't try to apply this logic to any of the models with a 64bit GDDR4 or slower memory interface - they don't have enough memory bandwidth to achieve full throughput on the decoder unit. Likewise, you generally want to avoid saturating the memory bandwidth with shader work; either condition will stall the video decoder unit.

"We are looking for an effective strategy or best practices to distribute the streams evenly across multiple GPUs when they are opened concurrently."

There is really no benefit to distributing eagerly. You will find that if you correctly predict the utilization such that it is guaranteed to stay below 100%, you will achieve the same user experience but at a lower power consumption.
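As a sketch of that placement policy, building on the hypothetical GpuBudget and EstimateStreamCost helpers above: keep filling a GPU while the predicted load fits its budget, and only spill to the next one when it no longer does. Reserving the budget at assignment time - rather than waiting for measured utilization to catch up - is what keeps concurrently opened streams from all landing on the same GPU.

```cpp
#include <vector>

// Returns the index of the GPU the new stream should be opened on,
// or -1 if no GPU has enough spare decode budget left.
// Serialize calls to this function (e.g. with a mutex) if streams are
// opened from multiple threads, so the reservations don't race.
int PickGpu(std::vector<GpuBudget>& gpus, double streamCost) {
    for (size_t i = 0; i < gpus.size(); ++i) {
        if (gpus[i].Fits(streamCost)) {
            gpus[i].usedPps += streamCost;  // reserve the budget immediately
            return static_cast<int>(i);
        }
    }
    return -1;  // every GPU is at its decode budget
}
```

When a stream is closed, subtract its cost from usedPps again.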


4 Comments

Thanks @Ext3h for the response! Your suggestion to estimate load based on resolution and frame rate, rather than monitoring, is helpful. Could you clarify: (1) Estimation: should I test by assigning different resolutions/frame rates to GPUs, or is there a better approach? (2) API/Tools: are there NVIDIA APIs/tools to estimate GPU load based on stream parameters without monitoring? (3) Load calculation: how do I sum varying resolutions/codecs (e.g., H264, H265) to predict GPU load? Any formulas? Appreciate any insights.
You can measure it as a number of pixels per second, and apply a correction factor for the different codec families. Just convert your video size and frame rate into that number, and track the available spare budget on the GPU. Anything that can display the GPU's video decoder unit utilization will suffice for that initial calibration. Different codecs are a plain correction factor, simply multiplied onto the pixels-per-second number.
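Since you already use NVML: its decoder utilization counter is enough for that one-time calibration (it is not needed at runtime once you estimate instead of monitor). A minimal sketch, linked against nvidia-ml:

```cpp
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) == NVML_SUCCESS) {
        unsigned int util = 0, periodUs = 0;
        // NVDEC utilization in percent over the last sampling period.
        if (nvmlDeviceGetDecoderUtilization(dev, &util, &periodUs) == NVML_SUCCESS)
            std::printf("NVDEC utilization: %u%% (sampled over %u us)\n", util, periodUs);
    }
    nvmlShutdown();
    return 0;
}
```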
Thanks for explaining the use of pixels per second with a codec correction factor. However, I’m still unsure how to measure or calculate the pixels per second. Could you clarify this with an example? For instance, if I have a 1920x1080 video at 30fps running on an NVIDIA P1000, how would I calculate the pixels per second? What tools or steps should I use to measure this for different streams? Appreciate your help!
You have to push it to 100% utilization first: try to decode that 1080p video without throttling and measure the stable frame rate you can achieve - it should be far in excess of 200 FPS. Don't forget that the GPU has actually been processing it as a 1920 x 1152 video surface, not 1080 pixels in height. So 1920 x 1152 x <achieved framerate> is your measured peak throughput in pixels per second.
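To make that concrete with a hypothetical measured value - say the unthrottled 1080p calibration run sustains 240 FPS (measure your own number on the P1000, it will differ):

```cpp
#include <cstdio>

int main() {
    const double measuredFps = 240.0;                         // hypothetical calibration result
    const double peakPps     = 1920.0 * 1152.0 * measuredFps; // ~531 Mpixel/s peak throughput
    const double cost1080p30 = 1920.0 * 1152.0 * 30.0;        // ~66 Mpixel/s per 1080p30 stream
    const double budgetPps   = peakPps * 0.95;                // 95% headroom
    std::printf("~%.0f concurrent 1080p30 streams per NVDEC\n", budgetPps / cost1080p30);
    return 0;
}
```

With these placeholder numbers that is roughly 7-8 concurrent 1080p30 streams per NVDEC unit before you spill to the next GPU.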
