
Situation:

I am writing a solver for a partial differential equation, specifically its parallel implementation via domain decomposition. The code produces the correct result, but I am having some trouble with OpenMP, since I have to execute multiple for-loops inside another while-loop. I had no prior experience with OpenMP, so I am a bit uncertain. I am testing the code on an example where there should be as little load imbalance as possible, and during testing with 4 threads I observed almost identical usage of all CPU cores.

I have an Intel i5-6500 (4 cores at 3.2 GHz) and am compiling on Windows using CLion with Cygwin 2.11.1 (gcc), CMake 3.19, C++20, and the compiler flag -fopenmp.

The main part of the program looks like this:

FunctionData data_array[N];
bool flag = true;

for (int i = 0; i < N; ++i){
    function1(data_array[i]);
}

while(flag){

    if(cancellation_criterium(data_array)){
        flag = false;
    }

    for (int i = 0; i < N; ++i){
        function2(data_array[i]);
    }

    for (int i = 0; i < N; ++i){
        function3(data_array[i]);
    }

}

The data for each subdomain is stored in the corresponding entry of data_array. There should be no data races: each function is a void function that modifies data_array[i], which is passed by reference. I checked all of the results; they were all correct.

In cancellation_criterium I want to find the minimum over some member variables of FunctionData, so I need access to the whole array. The while-loop is executed many times, and I am testing an example where there should be very few load-balancing problems. Each loop can be executed in parallel, but the loops have to run one after the other (first all iterations of function2, then all of function3). I made sure that there are no data races; each function only needs access to its own data_array[i].
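
For illustration, the criterion has roughly this shape (a minimal sketch; the member residual and the tolerance eps are placeholder names, not the actual ones):

#include <algorithm>

// Minimal sketch: `residual` and `eps` are placeholder names.
bool cancellation_criterium(const FunctionData* data_array, int n, double eps){
    double min_residual = data_array[0].residual;
    for (int i = 1; i < n; ++i){
        min_residual = std::min(min_residual, data_array[i].residual);
    }
    return min_residual < eps;  // stop once the minimum drops below the tolerance
}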

Since I am new to OpenMP, my first idea was to just add #pragma omp parallel for before each for-loop. This, however, drastically (2-3x) increased the runtime of the program. After talking to some friends and doing some research online, I concluded that constantly creating and joining threads in the while-loop could be responsible for this.
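
In other words, the first attempt had this structure:

#pragma omp parallel for
for (int i = 0; i < N; ++i){
    function1(data_array[i]);
}

while(flag){

    if(cancellation_criterium(data_array)){
        flag = false;
    }

    #pragma omp parallel for
    for (int i = 0; i < N; ++i){
        function2(data_array[i]);
    }

    #pragma omp parallel for
    for (int i = 0; i < N; ++i){
        function3(data_array[i]);
    }

}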

My second approach was to create a surrounding #pragma omp parallel region. The resulting code was:

FunctionData data_array[N];
bool flag = true;

#pragma omp parallel shared(data_array, flag)
{
    #pragma omp for
    for (int i = 0; i < N; ++i){
        function1(data_array[i]);
    }
    #pragma omp barrier

    while(flag){
        #pragma omp master
        {
            if(cancellation_criterium(data_array)){
                flag = false;
            }
        }
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < N; ++i){
            function2(data_array[i]);
        }
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < N; ++i){
            function3(data_array[i]);
        }
        #pragma omp barrier
    }
}

This suffered a performance loss similar to the first attempt. I also tried having the parallel region only around the for-loops with function2 and function3 (sketched below), but to no avail.
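
That variant had this structure, with the fork/join happening once per while-iteration instead of once per loop:

while(flag){

    if(cancellation_criterium(data_array)){
        flag = false;
    }

    #pragma omp parallel shared(data_array)
    {
        #pragma omp for
        for (int i = 0; i < N; ++i){
            function2(data_array[i]);
        }

        #pragma omp for
        for (int i = 0; i < N; ++i){
            function3(data_array[i]);
        }
    }

}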

In my last attempt I kept the parallel region but used #pragma omp parallel for loops inside of it. Even though I thought this should not work, since I believe it creates nested parallel regions, I consistently observed a slight performance increase.

Here is the code:

FunctionData data_array[N];
bool flag = true;

#pragma omp parallel shared(data_array, flag)
{
    #pragma omp master
    #pragma omp parallel for
    for (int i = 0; i < N; ++i){
        function1(data_array[i]);
    }

    #pragma omp master
    while(flag){

        if(cancellation_criterium(data_array)){
            flag = false;
        }

        #pragma omp parallel for
        for (int i = 0; i < N; ++i){
            function2(data_array[i]);
        }

        #pragma omp parallel for
        for (int i = 0; i < N; ++i){
            function3(data_array[i]);
        }

    }
}

Questions

1. Is there a reason why attempt 3 performs so much better than attempt 1, or do you have any ideas why that might be? For one, I cannot understand how the last attempt creates and joins fewer threads than the first version, and even if it does, how it can be faster.

2. In the second attempt, when the code reaches while(flag), is the while-loop executed by each thread in the parallel region? I suspected that this could happen; nevertheless, the result of the program is correct.

3. Is there a smarter way to parallelize this problem with OpenMP?

The original code is quite long, but if you want to have a look I can gladly give you the link to the GitHub page.

Edit: added #pragma omp parallel for in the first attempt; N is 2 or 4.

  • How big is N, and how long do function1..3 take? You wrote #pragma omp parallel before each for-loop. Is it a typo that the for is missing? If not, as a first try use #pragma omp parallel for before each for-loop. Commented Aug 16, 2021 at 18:18
  • You said you think the load should be balanced, but it never hurts to test different scheduling. Try it with schedule(dynamic) and/or schedule(guided) and see if there's any improvement. Commented Aug 16, 2021 at 18:31
  • @Laci Yes, it was a typo, thanks. N is 2 or 4, so pretty small Commented Aug 16, 2021 at 18:36
  • Well, there's your issue then. How would you expect to distribute iterations between threads if you only have 2-4 loop iterations? Commented Aug 16, 2021 at 18:39
  • I did some measuring and realized that the loop with function2 takes 3-4 times longer with #pragma omp parallel, while the runtime of the other loops decreases. It also accounts for >95% of the total running time. Commented Aug 16, 2021 at 21:59

2 Answers


I will just answer question 3: I think the best way to parallelize the code is by using tasks:

#pragma omp parallel
#pragma omp single nowait
{
    bool flag = true;

    // one task per subdomain; the other threads in the team execute them
    for (int i = 0; i < N; ++i){
        #pragma omp task
        function1(data_array[i]);
    }

    while(flag){
        // wait for all tasks of the previous round before checking the criterion
        #pragma omp taskwait
        if(cancellation_criterium(data_array)){
            flag = false;
        }

        for (int i = 0; i < N; ++i){
            #pragma omp task
            {
                function2(data_array[i]);
                function3(data_array[i]);
            }
        }
    }
}
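
This way the thread team is created only once: a single thread generates the tasks, the remaining threads execute them as they become available, and the taskwait between rounds replaces the per-loop barriers of the worksharing versions above.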

2 Comments

@weiskohlmoe Have you tried task-based parallelization?
Thanks for your suggestion; as I pointed out in my answer, this was not the real problem. However, I now have a version with tasks which (apart from function2) worked really well :)

I found my problem; unfortunately, it had nothing to do with the original question. Part of FunctionData was a std::vector on which function2 regularly called push_back. As I found out, allocating memory in a parallel region is very expensive. See here:

C++ dynamic memory allocation is slower in OpenMP, even for non-parallel sections of code
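
For illustration, a minimal sketch of the fix; the member values, expected_size and compute_value are placeholder names, not the ones from my actual code:

#include <cstddef>
#include <vector>

// Placeholder sketch: `values`, `expected_size` and `compute_value`
// are illustrative names only.
constexpr std::size_t expected_size = 1024;

double compute_value(std::size_t k){ return static_cast<double>(k); }  // stub

struct FunctionData {
    std::vector<double> values;
};

void function2(FunctionData& d){
    d.values.clear();
    d.values.reserve(expected_size);  // one allocation up front ...
    for (std::size_t k = 0; k < expected_size; ++k){
        d.values.push_back(compute_value(k));  // ... so push_back never reallocates
    }
}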

2 Comments

Probably you should consider using a different algorithm (or a numerical library).
I just used vector::reserve on everything and it is already ~1.7 times faster than the sequential version (with N=2, so that's pretty good). Thank you for the suggestions!
