I have the following function that computes the maximum value of a 2D/3D array (stored in a flattened layout) using a nested for loop. I added a reduction clause to gain some additional speedup, but I am not getting good scaling and I am wondering how to fix this.
Example Function
double maxFunc(double arr2D[]){
    // nx and nyk are file-scope constants (defined below)
    double max_val = 0.;
    #pragma omp parallel for reduction(max:max_val)
    for (int i = 0; i < nx; i++){
        for (int j = 0; j < nyk; j++){
            if (arr2D[j + nyk*i] > max_val){
                max_val = arr2D[j + nyk*i];
            }
        }
    }
    return max_val;
}
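As a side note, `maxFunc` relies on the file-scope `nx` and `nyk`, and starting from `0.` silently fails on all-negative data. A self-contained variant (a sketch; the parameter names `rows`/`cols` are mine) passes the dimensions explicitly:

```cpp
#include <cfloat>

// Variant of maxFunc with the dimensions as parameters instead of globals.
// Starting from -DBL_MAX instead of 0. keeps it correct for all-negative data.
double maxFunc(const double *arr2D, int rows, int cols){
    double max_val = -DBL_MAX;
    #pragma omp parallel for reduction(max:max_val)
    for (int i = 0; i < rows; i++){
        for (int j = 0; j < cols; j++){
            if (arr2D[j + cols*i] > max_val){
                max_val = arr2D[j + cols*i];
            }
        }
    }
    return max_val;
}
```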
Main Code:
static const int nx = 1024;
static const int ny = 1024;
static const int nyk = ny/2 + 1;
double *Array;
Array = (double*) fftw_malloc(nx*ny*sizeof(double));
for (int i = 0; i < nx; i++){
    for (int j = 0; j < ny; j++){
        Array[j + ny*i] = //Initialize array to some values;
    }
}
//test maxFunc with different number of threads
for (int nThreads = 1; nThreads <= 16; nThreads++){
    double start_time, run_time;
    start_time = omp_get_wtime();
    omp_set_num_threads(nThreads);
    double max_val = 0.;
    #pragma omp parallel for reduction(max:max_val)
    for (int i = 0; i < nx; i++){
        for (int j = 0; j < nyk; j++){
            if (Array[j + nyk*i] > max_val){
                max_val = Array[j + nyk*i];
            }
        }
    }
    run_time = omp_get_wtime() - start_time;
    cout << "Threads: " << nThreads << "Parallel Time in s: " << run_time << "s\n";
}
The output I get looks like:
Threads: 1Parallel Time in s: 0.0003244s
Threads: 2Parallel Time in s: 0.0003887s
Threads: 3Parallel Time in s: 0.0002579s
Threads: 4Parallel Time in s: 0.0001945s
Threads: 5Parallel Time in s: 0.000179s
Threads: 6Parallel Time in s: 0.0001456s
Threads: 7Parallel Time in s: 0.0002081s
Threads: 8Parallel Time in s: 0.000135s
Threads: 9Parallel Time in s: 0.0001262s
Threads: 10Parallel Time in s: 0.0001161s
Threads: 11Parallel Time in s: 0.0001499s
Threads: 12Parallel Time in s: 0.0002939s
Threads: 13Parallel Time in s: 0.0002982s
Threads: 14Parallel Time in s: 0.0002399s
Threads: 15Parallel Time in s: 0.0002283s
Threads: 16Parallel Time in s: 0.0002268s
My PC has 6 cores with 12 logical processors, so I would expect roughly a 6x speedup in the best case. Thanks!
Calling `omp_set_num_threads` before each parallel loop causes the OpenMP runtime to create new threads and delete the previous ones. Since threads are (pretty expensive) kernel resources, system calls are needed to do that (and they are typically done sequentially here). More specifically, at least 2 syscalls per thread, and probably more in practice (e.g. synchronization + configuration). Each syscall usually takes at least several microseconds on a mainstream Linux PC (often more on Windows). This means certainly >100 us just to create 16 threads. With sequential work lasting 324 us, you cannot expect more than a ~3x speedup with 16 threads.