I have a TorchScript model (traced from a PyTorch model) and want to use multiple GPUs for an inference task. The idea is simple:
- Create multiple independent model instances from the same TorchScript file and move one to each GPU.
- Prepare an input batch for each GPU.
- Run all the models in parallel.
- Finally, gather all the outputs.
The input batch size is 8, since that gives the best performance on a single one of my GPUs. But the measured performance surprised me:
- One GPU: inference speed ~40 FPS
- Two GPUs: ~66 FPS
- Three GPUs: ~75 FPS
I was expecting ~80 FPS from two GPUs and ~120 FPS from three, because if I run two instances of the test program simultaneously on different GPUs, each gives ~40 FPS (~80 FPS total), and with three instances each still gives ~40 FPS (~120 FPS total).
To summarize: inference in multiple threads of one process gives low performance, while multiple independent processes perform as expected. I suspect the problem is on the CUDA side. How can I fix it? (I'll try a multi-process approach later.)
Environment:
VS2019
pytorch=1.13.1+cu117
libtorch-win-shared-with-deps-1.13.1+cu117
Windows 10
GPU NVIDIA RTX A4000
The test program:
#include <future>
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <format>
#include <torch/script.h>
struct InferModel {
    torch::jit::Module model;
    int device_id;
};

auto cuda_device(int id) {
    return at::Device(torch::kCUDA, static_cast<at::DeviceIndex>(id));
}

// Copy the same model to each device
std::vector<InferModel> load_models(const std::string &model_file,
                                    const std::vector<int> &device_ids) {
    // Limit JIT re-specialization so the timed loop is not hit by recompilation
    torch::jit::FusionStrategy bailout = {
        {torch::jit::FusionBehavior::STATIC, 0},
        {torch::jit::FusionBehavior::DYNAMIC, 0}};
    torch::jit::setFusionStrategy(bailout);
    auto models = std::vector<InferModel>{};
    for (auto id: device_ids) {
        auto model = torch::jit::load(model_file, torch::kCPU);
        model.eval();
        model.to(cuda_device(id));
        models.push_back({model, id});
    }
    return models;
}
void model_profile(const std::string &model_file,
                   const std::vector<int> &device_ids,
                   bool model_parallel = true,
                   int batch_size = 8,
                   int input_size = 256,
                   int n_loop = 100) {
    auto models = load_models(model_file, device_ids);
    std::cout << "Model load done. Start warming up the models...";
    // Warm up the models
    for (auto &model: models) {
        // optimize_for_inference returns the optimized module, so keep the result
        model.model = torch::jit::optimize_for_inference(model.model);
        for (int i = 0; i < 4; ++i) {
            auto t = torch::rand({1, 3, input_size, input_size}).to(cuda_device(model.device_id));
            model.model({t});
        }
    }
    std::cout << "Model warming up done.\n";
    // Infer loop
    using namespace std::chrono;
    steady_clock::duration total_time{};
    for (int i = 0; i < n_loop; ++i) {
        auto loop_i_time = decltype(total_time){};
        std::vector<std::future<void>> infer_futures;
        // Run the models in parallel (std::launch::async) or in sequence (std::launch::deferred)
        for (auto &model: models) {
            infer_futures.push_back(std::async(
                model_parallel ? std::launch::async : std::launch::deferred,
                [&model, batch_size, input_size] {
                    auto no_grad = torch::NoGradGuard{};
                    auto batch = torch::rand(
                        {batch_size, 3, input_size, input_size}).to(
                        cuda_device(model.device_id));
                    // Modify the input and output handling according to the test model
                    auto output = model.model({batch}).toTuple();
                    auto t0 = output->elements()[0].toTensor().to(torch::kCPU);
                    auto t1 = output->elements()[1].toTensor().to(torch::kCPU);
                }));
        }
        auto start = steady_clock::now();
        for (auto &f: infer_futures) {
            try { f.get(); }
            catch (const std::exception &e) {
                std::cout << e.what() << std::endl;
                exit(1);
            }
        }
        loop_i_time = steady_clock::now() - start;
        total_time += loop_i_time;
        if (i % 10 == 0)
            std::cout << std::format("Loop {}, time {} ms\n", i,
                                     duration_cast<milliseconds>(loop_i_time).count());
    }
    auto fps = n_loop * batch_size * device_ids.size() / duration_cast<seconds>(total_time).count();
    std::cout << std::format("Infer on {} GPU(s). Is model parallel: {}. Speed: {} FPS\n",
                             device_ids.size(), model_parallel, fps);
}
int main(int argc, char **argv) {
    std::string model_file{"model.jit.pt"};
    std::vector<int> ids{};
    for (int i = 1; i < argc; ++i) {
        ids.push_back(std::stoi(argv[i]));
    }
    if (ids.size() == 1) {
        std::cout << "Use GPU " << ids[0] << std::endl;
        model_profile(model_file, {ids[0]});
    } else if (ids.size() >= 2) {
        // Check single GPU performance first
        model_profile(model_file, {ids.at(0)});
        // All GPUs in sequence; the FPS is expected to be almost the same as the single-GPU FPS
        model_profile(model_file, ids, false);
        // All GPUs in parallel; the FPS is expected to scale with the number of GPUs
        model_profile(model_file, ids, true);
    } else {
        std::cout << "Usage: test_multi_gpu gpu_id [gpu_id, ...]" << std::endl;
    }
}
The following are later updates:
model.eval() and torch::NoGradGuard have been added; I didn't notice any performance change. 80 FPS is the best performance I can get on 4 GPUs, although a single GPU reaches 40 FPS.
I printed the start and end time of each inference call. It looks like the models are running sequentially, as if there were a "GIL":
Loop 90, time 807 ms
thread-16412 , start infer 7934243ms, end infer 7934644ms
thread-16100 , start infer 7934243ms, end infer 7935047ms
thread-16412 , start infer 7935051ms, end infer 7935454ms
thread-16100 , start infer 7935051ms, end infer 7935851ms
thread-16100 , start infer 7935855ms, end infer 7936264ms
thread-16412 , start infer 7935855ms, end infer 7936666ms
thread-16100 , start infer 7936670ms, end infer 7937073ms
thread-16412 , start infer 7936670ms, end infer 7937474ms
thread-16412 , start infer 7937477ms, end infer 7937886ms
thread-16100 , start infer 7937477ms, end infer 7938291ms
thread-16412 , start infer 7938295ms, end infer 7938702ms
thread-16100 , start infer 7938295ms, end infer 7939104ms
thread-16412 , start infer 7939108ms, end infer 7939512ms
thread-16100 , start infer 7939108ms, end infer 7939913ms
thread-16412 , start infer 7939917ms, end infer 7940327ms
thread-16100 , start infer 7939917ms, end infer 7940724ms
thread-16100 , start infer 7940728ms, end infer 7941131ms
thread-16412 , start infer 7940728ms, end infer 7941535ms
Infer on 2 GPU(s). Is model parallel: true. Speed: 40 FPS
There is always a ~400 ms gap between the two threads' end-of-inference times.
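For reference, timing lines like those above can be produced by instrumenting the std::async lambda roughly as follows (a sketch, not the exact code used; it additionally needs #include <thread>):

[&model, batch_size, input_size] {
    auto now_ms = [] {  // milliseconds on the steady clock, as in the log above
        using namespace std::chrono;
        return duration_cast<milliseconds>(
            steady_clock::now().time_since_epoch()).count();
    };
    auto no_grad = torch::NoGradGuard{};
    auto batch = torch::rand({batch_size, 3, input_size, input_size})
                     .to(cuda_device(model.device_id));
    auto start_ms = now_ms();
    auto output = model.model({batch}).toTuple();
    auto t0 = output->elements()[0].toTensor().to(torch::kCPU);
    auto t1 = output->elements()[1].toTensor().to(torch::kCPU);
    auto end_ms = now_ms();
    std::cout << "thread-" << std::this_thread::get_id()
              << " , start infer " << start_ms << "ms, end infer " << end_ms << "ms\n";
}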
Well, I found that .to(torch::kCPU) is very slow (although the CPU-to-GPU copy is fast)! If the tensor is not copied back to the CPU, the FPS is what I was expecting. I have no idea how to fix it.
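One direction that might help (an untested sketch, using the libtorch CUDA stream API as I understand it and the cuda_device helper from the test program): give each worker thread its own CUDA stream, copy the outputs into pre-allocated pinned host tensors with non-blocking copies, and then synchronize only that stream, so the device-to-host transfers of different GPUs do not serialize on the default stream.

#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

// Sketch: replacement for the body of the per-model std::async lambda
void infer_on_own_stream(InferModel &model, int batch_size, int input_size) {
    auto no_grad = torch::NoGradGuard{};
    // Each thread gets its own stream so its kernels and copies don't queue behind other threads
    auto stream = c10::cuda::getStreamFromPool(
        /*isHighPriority=*/false, static_cast<c10::DeviceIndex>(model.device_id));
    c10::cuda::CUDAStreamGuard guard(stream);

    auto batch = torch::rand({batch_size, 3, input_size, input_size})
                     .to(cuda_device(model.device_id));
    auto output = model.model({batch}).toTuple();
    auto g0 = output->elements()[0].toTensor();
    auto g1 = output->elements()[1].toTensor();

    // Pinned (page-locked) host tensors allow truly asynchronous device-to-host copies
    auto pinned = torch::TensorOptions().device(torch::kCPU).pinned_memory(true);
    auto c0 = torch::empty(g0.sizes(), pinned.dtype(g0.dtype()));
    auto c1 = torch::empty(g1.sizes(), pinned.dtype(g1.dtype()));
    c0.copy_(g0, /*non_blocking=*/true);
    c1.copy_(g1, /*non_blocking=*/true);

    // Wait only on this thread's stream instead of synchronizing the whole device
    stream.synchronize();
}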
Here is some comparison data:
| Processes | Threads per Process | GPUs Used | Copy Result to CPU? | FPS |
|---|---|---|---|---|
| 1 | 1 | 1 | yes | 38 |
| 1 | 1 | 1 | no | 44 |
| 1 | 1 | 1 | yes | 38 |
| 1 | 2 | 2 | yes | 63 |
| 1 | 2 | 2 | no | 86 |
| 2 | 1 | 2 | yes | 77 |
| 1 | 3 | 3 | yes | 82 |
| 1 | 3 | 3 | no | 123 |
| 3 | 1 | 3 | yes | 116 |
| 1 | 4 | 4 | yes | 107 |
| 1 | 4 | 4 | no | 160 |
| 4 | 1 | 4 | yes | 157 |
It seems a multi-process approach is the solution.
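Since independent processes scale as expected, a small launcher that starts one single-GPU instance of the test program per device is probably the simplest workaround. A minimal sketch (it assumes the executable name from the usage message above; gathering results between processes is not shown):

#include <cstdlib>
#include <future>
#include <string>
#include <vector>

// Launch one single-GPU inference process per device and wait for all of them
int main() {
    std::vector<int> device_ids{0, 1, 2, 3};
    std::vector<std::future<int>> workers;
    for (int id : device_ids) {
        workers.push_back(std::async(std::launch::async, [id] {
            // Each child process owns one GPU; results could be gathered via
            // files, pipes, or shared memory as needed.
            return std::system(("test_multi_gpu.exe " + std::to_string(id)).c_str());
        }));
    }
    int rc = 0;
    for (auto &w : workers) rc |= w.get();
    return rc;
}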
Comments:
- model.eval() and torch::NoGradGuard missed... (Reply: I'll check the performance again.)
- std::async uses multiple threads? How long do you run your test for? You're not starting to measure time until after the tasks have already started, so your results aren't completely accurate. I'd measure the time inside your std::async task rather than outside, so that you're only measuring actual execution time.
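Following that last suggestion, the timing could be moved inside each task so that only the actual inference work is measured; per loop, the slowest task then bounds the parallel wall time. A sketch of the changed part of model_profile (needs <algorithm> for std::max):

std::vector<std::future<steady_clock::duration>> infer_futures;
for (auto &model : models) {
    infer_futures.push_back(std::async(std::launch::async,
        [&model, batch_size, input_size] {
            auto t0 = steady_clock::now();  // start the clock inside the task
            auto no_grad = torch::NoGradGuard{};
            auto batch = torch::rand({batch_size, 3, input_size, input_size})
                             .to(cuda_device(model.device_id));
            auto output = model.model({batch}).toTuple();
            auto out0 = output->elements()[0].toTensor().to(torch::kCPU);
            auto out1 = output->elements()[1].toTensor().to(torch::kCPU);
            return steady_clock::now() - t0;  // only this task's execution time
        }));
}
steady_clock::duration loop_i_time{};
for (auto &f : infer_futures)
    loop_i_time = std::max(loop_i_time, f.get());  // slowest task bounds the parallel loop
total_time += loop_i_time;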