I have a TorchScript model (traced from a PyTorch model) and want to use multiple GPUs for an inference task. The idea is simple:
- Create multiple independent model instances from the same TorchScript file and move one to each GPU.
- Prepare an input batch for each GPU.
- Run all the models in parallel.
- Finally, gather all the outputs.
The input batch size is 8, since that gives the best performance on a single one of my GPUs. But the measured performance surprised me:
- One GPU: inference speed ~40 FPS
- Two GPUs: ~66 FPS
- Three GPUs: ~75 FPS
I was expecting ~80 FPS from two GPUs and ~120 FPS from three, because if I run two instances of the test program simultaneously on different GPUs, each gives ~40 FPS (~80 FPS total), and with three instances each still gives ~40 FPS (~120 FPS total).
To summarize: inference in multiple threads of one process gives low performance, while multiple independent processes perform as expected. I suspect the problem is on the CUDA side. How can I fix it? (I'll try a multi-process approach later.)
Environment:
VS2019
pytorch=1.13.1+cu117
libtorch-win-shared-with-deps-1.13.1+cu117
Windows 10
GPU NVIDIA RTX A4000
The test program:
#include <future>
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <format>
#include <torch/script.h>
struct InferModel {
    torch::jit::Module model;
    int device_id;
};

auto cuda_device(int id) {
    return at::Device(torch::kCUDA, static_cast<at::DeviceIndex>(id));
}

// Copy the same model to each device
std::vector<InferModel> load_models(const std::string &model_file,
                                    const std::vector<int> &device_ids) {
    // Limit JIT re-specialization so the timed loop is not hit by recompilation
    torch::jit::FusionStrategy bailout = {
        {torch::jit::FusionBehavior::STATIC, 0},
        {torch::jit::FusionBehavior::DYNAMIC, 0}};
    torch::jit::setFusionStrategy(bailout);
    auto models = std::vector<InferModel>{};
    for (auto id: device_ids) {
        auto model = torch::jit::load(model_file, torch::kCPU);
        model.eval();
        model.to(cuda_device(id));
        models.push_back({model, id});
    }
    return models;
}
void model_profile(const std::string &model_file,
                   const std::vector<int> &device_ids,
                   bool model_parallel = true,
                   int batch_size = 8,
                   int input_size = 256,
                   int n_loop = 100) {
    auto models = load_models(model_file, device_ids);
    std::cout << "Model load done. Start warming up the models...";
    // Warm up the models
    for (auto &model: models) {
        // optimize_for_inference returns the optimized module, so keep the result
        model.model = torch::jit::optimize_for_inference(model.model);
        for (int i = 0; i < 4; ++i) {
            auto t = torch::rand({1, 3, input_size, input_size}).to(cuda_device(model.device_id));
            model.model({t});
        }
    }
    std::cout << "Model warming up done.\n";
    // Infer loop
    using namespace std::chrono;
    steady_clock::duration total_time{};
    for (int i = 0; i < n_loop; ++i) {
        auto loop_i_time = decltype(total_time){};
        std::vector<std::future<void>> infer_futures;
        // Run the models in parallel (std::launch::async) or in sequence (std::launch::deferred)
        for (auto &model: models) {
            infer_futures.push_back(std::async(
                model_parallel ? std::launch::async : std::launch::deferred,
                [&model, batch_size, input_size] {
                    auto no_grad = torch::NoGradGuard{};
                    auto batch = torch::rand(
                        {batch_size, 3, input_size, input_size}).to(
                        cuda_device(model.device_id));
                    // Modify the input and output handling according to the test model
                    auto output = model.model({batch}).toTuple();
                    auto t0 = output->elements()[0].toTensor().to(torch::kCPU);
                    auto t1 = output->elements()[1].toTensor().to(torch::kCPU);
                }));
        }
        auto start = steady_clock::now();
        for (auto &f: infer_futures) {
            try { f.get(); }
            catch (const std::exception &e) {
                std::cout << e.what() << std::endl;
                exit(1);
            }
        }
        loop_i_time = steady_clock::now() - start;
        total_time += loop_i_time;
        if (i % 10 == 0)
            std::cout << std::format("Loop {}, time {} ms\n", i,
                                     duration_cast<milliseconds>(loop_i_time).count());
    }
    auto fps = n_loop * batch_size * device_ids.size() / duration_cast<seconds>(total_time).count();
    std::cout << std::format("Infer on {} GPU(s). Is model parallel: {}. Speed: {} FPS\n",
                             device_ids.size(), model_parallel, fps);
}
int main(int argc, char **argv) {
    std::string model_file{"model.jit.pt"};
    std::vector<int> ids{};
    for (int i = 1; i < argc; ++i) {
        ids.push_back(std::stoi(argv[i]));
    }
    if (ids.size() == 1) {
        std::cout << "Use GPU " << ids[0] << std::endl;
        model_profile(model_file, {ids[0]});
    } else if (ids.size() >= 2) {
        // Check single GPU performance first
        model_profile(model_file, {ids.at(0)});
        // All GPUs in sequence; the FPS is expected to be almost the same as the single-GPU FPS
        model_profile(model_file, ids, false);
        // All GPUs in parallel; the FPS is expected to scale with the number of GPUs
        model_profile(model_file, ids, true);
    } else {
        std::cout << "Usage: test_multi_gpu gpu_id [gpu_id, ...]" << std::endl;
    }
}
The following are later updates:
model.eval() and torch::NoGradGuard have been added; I didn't notice any performance change. 80 FPS is the best performance I can get on 4 GPUs, although a single GPU reaches 40 FPS.
I printed the start and end time of each inference call. It looks like the models are running sequentially, as if there were a "GIL":
Loop 90, time 807 ms
thread-16412 , start infer 7934243ms, end infer 7934644ms
thread-16100 , start infer 7934243ms, end infer 7935047ms
thread-16412 , start infer 7935051ms, end infer 7935454ms
thread-16100 , start infer 7935051ms, end infer 7935851ms
thread-16100 , start infer 7935855ms, end infer 7936264ms
thread-16412 , start infer 7935855ms, end infer 7936666ms
thread-16100 , start infer 7936670ms, end infer 7937073ms
thread-16412 , start infer 7936670ms, end infer 7937474ms
thread-16412 , start infer 7937477ms, end infer 7937886ms
thread-16100 , start infer 7937477ms, end infer 7938291ms
thread-16412 , start infer 7938295ms, end infer 7938702ms
thread-16100 , start infer 7938295ms, end infer 7939104ms
thread-16412 , start infer 7939108ms, end infer 7939512ms
thread-16100 , start infer 7939108ms, end infer 7939913ms
thread-16412 , start infer 7939917ms, end infer 7940327ms
thread-16100 , start infer 7939917ms, end infer 7940724ms
thread-16100 , start infer 7940728ms, end infer 7941131ms
thread-16412 , start infer 7940728ms, end infer 7941535ms
Infer on 2 GPU(s). Is model parallel: true. Speed: 40 FPS
There is always a ~400 ms gap between the two threads' end-of-inference times.
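For reference, timing lines like those above can be produced by instrumenting the std::async lambda roughly as follows (a sketch, not the exact code used; it additionally needs #include <thread>):

[&model, batch_size, input_size] {
    auto now_ms = [] {  // milliseconds on the steady clock, as in the log above
        using namespace std::chrono;
        return duration_cast<milliseconds>(
            steady_clock::now().time_since_epoch()).count();
    };
    auto no_grad = torch::NoGradGuard{};
    auto batch = torch::rand({batch_size, 3, input_size, input_size})
                     .to(cuda_device(model.device_id));
    auto start_ms = now_ms();
    auto output = model.model({batch}).toTuple();
    auto t0 = output->elements()[0].toTensor().to(torch::kCPU);
    auto t1 = output->elements()[1].toTensor().to(torch::kCPU);
    auto end_ms = now_ms();
    std::cout << "thread-" << std::this_thread::get_id()
              << " , start infer " << start_ms << "ms, end infer " << end_ms << "ms\n";
}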
Well, I found that .to(torch::kCPU) is very slow (although the CPU-to-GPU copy is fast)! If the tensor is not copied back to the CPU, the FPS is what I was expecting. I have no idea how to fix it.
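One direction that might help (an untested sketch, using the libtorch CUDA stream API as I understand it and the cuda_device helper from the test program): give each worker thread its own CUDA stream, copy the outputs into pre-allocated pinned host tensors with non-blocking copies, and then synchronize only that stream, so the device-to-host transfers of different GPUs do not serialize on the default stream.

#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

// Sketch: replacement for the body of the per-model std::async lambda
void infer_on_own_stream(InferModel &model, int batch_size, int input_size) {
    auto no_grad = torch::NoGradGuard{};
    // Each thread gets its own stream so its kernels and copies don't queue behind other threads
    auto stream = c10::cuda::getStreamFromPool(
        /*isHighPriority=*/false, static_cast<c10::DeviceIndex>(model.device_id));
    c10::cuda::CUDAStreamGuard guard(stream);

    auto batch = torch::rand({batch_size, 3, input_size, input_size})
                     .to(cuda_device(model.device_id));
    auto output = model.model({batch}).toTuple();
    auto g0 = output->elements()[0].toTensor();
    auto g1 = output->elements()[1].toTensor();

    // Pinned (page-locked) host tensors allow truly asynchronous device-to-host copies
    auto pinned = torch::TensorOptions().device(torch::kCPU).pinned_memory(true);
    auto c0 = torch::empty(g0.sizes(), pinned.dtype(g0.dtype()));
    auto c1 = torch::empty(g1.sizes(), pinned.dtype(g1.dtype()));
    c0.copy_(g0, /*non_blocking=*/true);
    c1.copy_(g1, /*non_blocking=*/true);

    // Wait only on this thread's stream instead of synchronizing the whole device
    stream.synchronize();
}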
Here is some comparison data:
| Processes | Threads per Process | GPUs Used | Copy Result to CPU? | FPS |
|---|---|---|---|---|
| 1 | 1 | 1 | yes | 38 |
| 1 | 1 | 1 | no | 44 |
| 1 | 1 | 1 | yes | 38 |
| 1 | 2 | 2 | yes | 63 |
| 1 | 2 | 2 | no | 86 |
| 2 | 1 | 2 | yes | 77 |
| 1 | 3 | 3 | yes | 82 |
| 1 | 3 | 3 | no | 123 |
| 3 | 1 | 3 | yes | 116 |
| 1 | 4 | 4 | yes | 107 |
| 1 | 4 | 4 | no | 160 |
| 4 | 1 | 4 | yes | 157 |
It seems a multi-process approach is the solution.
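Since independent processes scale as expected, a small launcher that starts one single-GPU instance of the test program per device is probably the simplest workaround. A minimal sketch (it assumes the executable name from the usage message above; gathering results between processes is not shown):

#include <cstdlib>
#include <future>
#include <string>
#include <vector>

// Launch one single-GPU inference process per device and wait for all of them
int main() {
    std::vector<int> device_ids{0, 1, 2, 3};
    std::vector<std::future<int>> workers;
    for (int id : device_ids) {
        workers.push_back(std::async(std::launch::async, [id] {
            // Each child process owns one GPU; results could be gathered via
            // files, pipes, or shared memory as needed.
            return std::system(("test_multi_gpu.exe " + std::to_string(id)).c_str());
        }));
    }
    int rc = 0;
    for (auto &w : workers) rc |= w.get();
    return rc;
}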
Comments:
- model.eval() and torch::NoGradGuard missed... (Reply: I'll check the performance again.)
- std::async uses multiple threads? How long do you run your test for? You're not starting to measure time until after the tasks have already started, so your results aren't completely accurate. I'd measure the time inside your std::async task rather than outside, so that you're only measuring actual execution time.
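Following that last suggestion, the timing could be moved inside each task so that only the actual inference work is measured; per loop, the slowest task then bounds the parallel wall time. A sketch of the changed part of model_profile (needs <algorithm> for std::max):

std::vector<std::future<steady_clock::duration>> infer_futures;
for (auto &model : models) {
    infer_futures.push_back(std::async(std::launch::async,
        [&model, batch_size, input_size] {
            auto t0 = steady_clock::now();  // start the clock inside the task
            auto no_grad = torch::NoGradGuard{};
            auto batch = torch::rand({batch_size, 3, input_size, input_size})
                             .to(cuda_device(model.device_id));
            auto output = model.model({batch}).toTuple();
            auto out0 = output->elements()[0].toTensor().to(torch::kCPU);
            auto out1 = output->elements()[1].toTensor().to(torch::kCPU);
            return steady_clock::now() - t0;  // only this task's execution time
        }));
}
steady_clock::duration loop_i_time{};
for (auto &f : infer_futures)
    loop_i_time = std::max(loop_i_time, f.get());  // slowest task bounds the parallel loop
total_time += loop_i_time;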