Situation:
I am writing a solver for a partial differential equation, specifically its parallel implementation via domain decomposition. The code produces the correct result, but I am having trouble with OpenMP, since I have to execute multiple for-loops inside a surrounding while-loop. I had no prior experience with OpenMP before this, so I am a bit uncertain. I am testing the code on an example with as little load imbalance as possible; during testing with 4 threads I observed almost identical usage of all CPU cores.
I have an Intel i5-6500 (4 cores @ 3.2 GHz) and am compiling on Windows using CLion with Cygwin 2.11.1 (GCC), CMake 3.19, C++20, and the compiler flag -fopenmp.
The main part of the program looks like this:
FunctionData data_array[N];
bool flag = true;
for (int i = 0; i < N; ++i) {
    function1(data_array[i]);
}
while (flag) {
    if (cancellation_criterium(data_array)) {
        flag = false;
    }
    for (int i = 0; i < N; ++i) {
        function2(data_array[i]);
    }
    for (int i = 0; i < N; ++i) {
        function3(data_array[i]);
    }
}
The data for each subdomain is in the respective entry of data_array. There should be no data races: each function is a void function that modifies data_array[i], which is passed by reference. I checked all of the results and they were correct.
In cancellation_criterium I want to find the minimum over some member variables of FunctionData, so I need access to the whole array.
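For illustration, cancellation_criterium does something along these lines (a minimal sketch; residual and tol are placeholder names, not the real members):

// Sketch only: "residual" and "tol" stand in for the actual member/tolerance.
bool cancellation_criterium(const FunctionData* data) {
    double min_val = data[0].residual;    // minimum over all subdomains
    for (int i = 1; i < N; ++i) {
        if (data[i].residual < min_val) {
            min_val = data[i].residual;
        }
    }
    return min_val < tol;                 // stop once the minimum is small enough
}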
The while-loop gets executed many times, and I am testing an example where there should be very few problems with load balancing.
Each function can be executed in parallel, but the loops have to happen one after the other (first all iterations of function2, then all of function3). I made sure that there are no data races; each function only needs its own data_array[i].
Since I am new to OpenMP, my first idea was to just add #pragma omp parallel for before each for-loop. However, this drastically increased the runtime of the program (2-3x). After talking to some friends and doing some research online, I concluded that constantly creating and joining threads in the while-loop could be responsible.
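So the first attempt was simply (same structure as above, with the pragmas added):

#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    function1(data_array[i]);
}
while (flag) {
    if (cancellation_criterium(data_array)) {
        flag = false;
    }
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        function2(data_array[i]);
    }
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        function3(data_array[i]);
    }
}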
My second approach was to wrap everything in a surrounding #pragma omp parallel region. The resulting code was:
FunctionData data_array[N];
bool flag = true;
#pragma omp parallel shared(data_array, flag)
{
    #pragma omp for
    for (int i = 0; i < N; ++i) {
        function1(data_array[i]);
    }
    #pragma omp barrier
    while (flag) {
        #pragma omp master
        {
            if (cancellation_criterium(data_array)) {
                flag = false;
            }
        }
        #pragma omp barrier
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            function2(data_array[i]);
        }
        #pragma omp barrier
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            function3(data_array[i]);
        }
        #pragma omp barrier
    }
}
This suffered a performance loss similar to the first attempt. I also tried to have the parallel region around only the for-loops with function2 and function3, but to no avail.
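Roughly, that variant opened one parallel region per while-iteration, spanning both loops (the implicit barrier at the end of the first omp for keeps function2 finishing before function3 starts):

while (flag) {
    if (cancellation_criterium(data_array)) {
        flag = false;
    }
    #pragma omp parallel shared(data_array)
    {
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            function2(data_array[i]);
        }
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            function3(data_array[i]);
        }
    }
}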
In my last attempt I kept the surrounding parallel region but used #pragma omp parallel for loops inside it. Even though I thought this should not work, since I believe it creates nested parallel regions, I consistently observed a slight performance increase.
Here is the code:
FunctionData data_array[N];
bool flag = true;
#pragma omp parallel shared(data_array, flag)
{
    #pragma omp master
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        function1(data_array[i]);
    }
    #pragma omp master
    while (flag) {
        if (cancellation_criterium(data_array)) {
            flag = false;
        }
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            function2(data_array[i]);
        }
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            function3(data_array[i]);
        }
    }
}
Questions
1. Is there a reason why attempt 3 performs so much better than attempt 1, or do you have any ideas why it should? For one, I cannot understand how the last attempt creates and joins fewer threads than the first version, and even if it does, how that makes it faster.
2. In the second attempt, when the code reaches while(flag), does each thread in the parallel region execute its own copy of the while-loop? I suspected that this could happen; nevertheless, the result of the program is correct.
3. Is there a smarter way to parallelize this problem with OpenMP?
The original code is quite long, but if you want to have a look I can gladly give you the link to the GitHub page.
Edit: added #pragma omp parallel for in the first attempt; N is 2 or 4.
Comment: How big is N, and how much time do function1..function3 take? You wrote that you put #pragma omp parallel before each for-loop; is it a typo that the for is missing? If not, as a first try use #pragma omp parallel for before each for-loop, with schedule(dynamic) and/or schedule(guided), and see if there's any improvement.
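(Spelled out, that suggestion applied to e.g. the function2 loop would be:)

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < N; ++i) {
    function2(data_array[i]);
}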