
I have the following function that calculates the maximum value of a 2D/3D array in a nested for loop. I used the reduction clause to gain some additional speedup; however, I am not getting a good speedup and I am wondering how to fix this.

Example Function

double maxFunc(double arr2D[]){
    double max_val = 0.;
    #pragma omp parallel for reduction(max:max_val)
    for (int i = 0; i < nx; i++){
        for (int j = 0; j < nyk; j++){
            if (arr2D[j + nyk*i] > max_val){
                max_val = arr2D[j + nyk*i];
            }
        }
    }
    return max_val;
}

Main Code:

static const int nx = 1024;
static const int ny = 1024;
static const int nyk = ny/2 + 1;

double *Array;
Array = (double*) fftw_malloc(nx*ny*sizeof(double)); 

for (int i = 0; i < nx; i++){
    for (int j = 0; j < ny; j++){
        Array[j + ny*i] = //Initialize array to some values;
    }
}

//test maxFunc with different number of threads
for (int nThreads = 1; nThreads <= 16; nThreads++){
    double start_time, run_time;
    start_time = omp_get_wtime();

    omp_set_num_threads(nThreads);
    double max_val = 0.;
    #pragma omp parallel for reduction(max:max_val)
    for (int i = 0; i < nx; i++){
        for (int j = 0; j < nyk; j++){
            if (Array[j + nyk*i] > max_val){
                max_val = Array[j + nyk*i];
            }
        }
    }
    run_time = omp_get_wtime() - start_time;
    cout << "Threads: " << nThreads << "Parallel Time in s: " << run_time << "s\n";
}

The output I get looks like:

Threads: 1Parallel Time in s: 0.0003244s
Threads: 2Parallel Time in s: 0.0003887s
Threads: 3Parallel Time in s: 0.0002579s
Threads: 4Parallel Time in s: 0.0001945s
Threads: 5Parallel Time in s: 0.000179s
Threads: 6Parallel Time in s: 0.0001456s
Threads: 7Parallel Time in s: 0.0002081s
Threads: 8Parallel Time in s: 0.000135s
Threads: 9Parallel Time in s: 0.0001262s
Threads: 10Parallel Time in s: 0.0001161s
Threads: 11Parallel Time in s: 0.0001499s
Threads: 12Parallel Time in s: 0.0002939s
Threads: 13Parallel Time in s: 0.0002982s
Threads: 14Parallel Time in s: 0.0002399s
Threads: 15Parallel Time in s: 0.0002283s
Threads: 16Parallel Time in s: 0.0002268s

My PC has 6 cores with 12 logical processors, so I was sort of expecting a 6x speedup in the best case. Thanks!

  • Those times are too short: you are probably completely drowning in OpenMP overhead. You should at least run a dummy parallel loop to give OpenMP the chance to create the threads before you time your actual loop. But really: use 1000 times as much data. I don't trust times this short. (A minimal timing sketch along these lines follows these comments.) Commented Sep 17, 2023 at 20:27
  • omp_set_num_threads before the parallel loop causes the OpenMP runtime to create new threads and delete the previous ones. Since threads are (pretty expensive) kernel resources, you need system calls to do that (typically done sequentially here). More specifically, at least 2 syscalls per thread and probably more in practice (e.g. sync + config). Each such syscall usually takes at least several microseconds on a mainstream Linux PC (often more on Windows). This means certainly >100 us to create 16 threads. With sequential work lasting 324 us, you cannot expect more than a 3x speedup with 16 threads. Commented Sep 17, 2023 at 21:38
  • Note that the CPU architecture matters. AMD processors are NUMA ones, so you should care about NUMA effects or use only 1 NUMA node (i.e. CCX/CCD?). Besides, if your code is memory-bound, more threads will not make it (significantly) faster. You need to check the speed of your RAM/caches to see what the limiting factor is (I guess it is the RAM here). Your sequential code operates at 12 GiB/s; the parallel one reaches up to 33.6 GiB/s. This is good for a mainstream PC, especially if it is not a very good/recent one. (The bandwidth arithmetic is spelled out after these comments.) Commented Sep 17, 2023 at 21:41
  • You have been given several correct explanations of why your code doesn't scale beyond 3x, and one of them is that the code quickly becomes memory-bound when adding threads. That is, the computations are too simple and the bottleneck is no longer the CPU but rather the transfer rate between the CPU and the RAM: using more cores/threads cannot help in this case. Commented Sep 18, 2023 at 8:15
  • "Is this a situation where it depends on the 'calculations' done inside the for loop?" Certainly, yes. This is very common actually, since the RAM is often the limiting factor in optimized codes. Note that the 3x is approximate since I do not know your hardware/OS precisely. I do not believe false sharing is an issue here. Do you have any clues/facts showing that? Unless you do, I advise you to focus on the memory issue (and check it, for example using advanced profiling). Commented Sep 18, 2023 at 15:01

