
I am trying to parallelize the following function (an iterative solver) that has a while loop and a nested for loop inside. The code looks like:

static const int nx = 128;
static const int ny = 128;
static const int nz = 128;
//Initialize
Eigen::Tensor<std::complex<double>, 3> xTerm(nx,ny,nz);  
Eigen::Tensor<std::complex<double>, 3> yTerm(nx,ny,nz);  
Eigen::Tensor<std::complex<double>, 3> zTerm(nx,ny,nz);  

Eigen::Tensor<std::complex<double>, 3> RHS(nx,ny,nz); 
 
double UpdatedSol_max, OldSol_max;
double it_error;

    // Begin counter for the number of iterations it takes to calculate phi
    int count = 0;
    // Begin while loop
    do{         
        // Calculate some thing
        
        
        #pragma omp parallel for collapse(3)
        for(int i = 0; i< nx; i++){
            for(int j = 0; j< ny; j++){
                for(int k = 0; k< nz; k++){ 
                    RHS(k,i,j) = Source(k,i,j) - xTerm(k,i,j) - yTerm(k,i,j) - zTerm(k,i,j);
                }
            }
        }

        
        // Some calculations
        // Increase iteration count by 1
        count = count + 1; 
        
        // Calculate error
        it_error = fabs((UpdatedSol_max-OldSol_max)/UpdatedSol_max);    
        
     }while( it_error > err_max && count  <= max_iter && UpdatedSol_max > err_max); 

     return count;

So a simple #pragma omp parallel for collapse(3) actually makes the code slower than the serial version. Since there is some addition in that nested for loop, I tried using a reduction; this is my attempt:

#pragma omp parallel for reduction(+: Terms)
        for(int i = 0; i< nx; i++){
            for(int j = 0; j< ny; j++){
                for(int k = 0; k< nz; k++){ 
                    Terms(k,i,j) += xTerm(k,i,j) + yTerm(k,i,j) + zTerm(k,i,j);
                    RHS(k,i,j) = Source(k,i,j) - Terms(k,i,j);

                }
            }
        }

This however returns the error:

 error: user defined reduction not found for ‘Terms’

which I tried to fix by specifying how many elements are to be reduced: #pragma omp parallel for reduction(+: Terms[:nx]), but I am still getting the error:

error: ‘Terms’ does not have pointer or array type

I am somewhat confused: can I not use an Eigen tensor here? How can I rewrite this better? Thanks.

Edit: I am including a small reproducible example that people can compile and test themselves. This is an iterative solver, so some things are initialized to zero just to make it easier to compile:

#include <complex>
#include <iostream>
#include <omp.h>
#include <unsupported/Eigen/CXX11/Tensor>

static const int nx = 128;
static const int ny = 128;
static const int nz = 128;

using namespace Eigen;

// Forward declaration of the solver defined further down.
int Solver3D(Eigen::Tensor<std::complex<double>, 3>& invnk,
             Eigen::Tensor<std::complex<double>, 3>& dndxk,
             Eigen::Tensor<std::complex<double>, 3>& dndyk,
             Eigen::Tensor<std::complex<double>, 3>& dndzk,
             Eigen::Tensor<std::complex<double>, 3>& Sol,
             Eigen::Tensor<std::complex<double>, 3>& Source,
             Eigen::Tensor<double, 3>& kx,
             Eigen::Tensor<double, 3>& ky,
             Eigen::Tensor<double, 3>& kz,
             Eigen::Tensor<double, 3>& factor,
             double err_max, int max_iter);

int main(){
    double err_max = 1.e-8;
    int phi_iter_max = 500;
    int Soliter = 0;

    Eigen::initParallel();
    Eigen::setNbThreads(omp_get_max_threads());

    Eigen::Tensor<std::complex<double>, 3> invnk(nx,ny,nz);
    invnk.setZero();
    Eigen::Tensor<std::complex<double>, 3> dndxk(nx,ny,nz);
    dndxk.setZero();
    Eigen::Tensor<std::complex<double>, 3> dndyk(nx,ny,nz);
    dndyk.setZero();
    Eigen::Tensor<std::complex<double>, 3> dndzk(nx,ny,nz);
    dndzk.setZero();
    Eigen::Tensor<std::complex<double>, 3> Sol(nx,ny,nz);
    Sol.setConstant(1.0);
    Eigen::Tensor<std::complex<double>, 3> Source(nx,ny,nz);
    Source.setZero();
    Eigen::Tensor<double, 3> KX(nx,ny,nz);
    KX.setZero();
    Eigen::Tensor<double, 3> KY(nx,ny,nz);
    KY.setZero();
    Eigen::Tensor<double, 3> KZ(nx,ny,nz);
    KZ.setZero();
    Eigen::Tensor<double, 3> factor(nx,ny,nz);
    factor.setZero();

    double start_time1, run_time1;
    start_time1 = omp_get_wtime();
    //clock_t start_time1 = clock();

    Soliter = Solver3D(invnk, dndxk, dndyk, dndzk, Sol, Source, KX, KY, KZ, factor, err_max, phi_iter_max);
    //clock_t end1 = clock();

    run_time1 = omp_get_wtime() - start_time1;

    std::cout << "Parallel Time in SEC: " << run_time1 << "s\n";
    //std::cout << "Serial Time in SEC: " << (double)(end1 - start_time1) / CLOCKS_PER_SEC << "s\n";
}

The iterative solver function looks like:

int Solver3D(Eigen::Tensor<std::complex<double>, 3>& invnk,
             Eigen::Tensor<std::complex<double>, 3>& dndxk,
             Eigen::Tensor<std::complex<double>, 3>& dndyk,
             Eigen::Tensor<std::complex<double>, 3>& dndzk,
             Eigen::Tensor<std::complex<double>, 3>& Sol,
             Eigen::Tensor<std::complex<double>, 3>& Source,
             Eigen::Tensor<double, 3>& kx,
             Eigen::Tensor<double, 3>& ky,
             Eigen::Tensor<double, 3>& kz,
             Eigen::Tensor<double, 3>& factor,
             double err_max, int max_iter){

    
    Eigen::Tensor<std::complex<double>, 3> xTerm(nx,ny,nz);  
    xTerm.setZero(); 
    Eigen::Tensor<std::complex<double>, 3> yTerm(nx,ny,nz);  
    yTerm.setZero(); 
    Eigen::Tensor<std::complex<double>, 3> zTerm(nx,ny,nz);  
    zTerm.setZero();
    
    Eigen::Tensor<std::complex<double>, 3> RHS(nx,ny,nz);  
    RHS.setZero();
    
    
    double UpdatedSol_max , OldSol_max;
    double it_error;

    // Begin counter for the number of iterations it takes to calculate phi
    int count = 0;
    
    // Begin while loop

    do{         
        
        
        #pragma omp parallel for collapse(3)
        for(int i = 0; i< nx; i++){
            for(int j = 0; j< ny; j++){
                for(int k = 0; k< nz; k++){ 
                    RHS(k,i,j) = Source(k,i,j) - xTerm(k,i,j) - yTerm(k,i,j) - zTerm(k,i,j);
                }
            }
        }
        
        // Calculate maximum of absolute value of previous solution
        Eigen::Tensor<double, 0> AbsMaxAsTensor = Sol.abs().maximum();
        OldSol_max = AbsMaxAsTensor(0);
        Sol = RHS * factor;

        // Calculate maximum of absolute value of updated solution
        Eigen::Tensor<double, 0> AbsMaxAsTensor1 = Sol.abs().maximum();
        UpdatedSol_max = AbsMaxAsTensor1(0); 
        // Increase iteration count by 1
        count = count + 1;
        
        // Calculate error
        it_error = fabs((UpdatedSol_max-OldSol_max)/UpdatedSol_max);
        
     }while( it_error > err_max && count  <= max_iter && UpdatedSol_max > err_max); 
     return count;

}

I run the following command to obtain the "parallel" time:

g++ ParallelTest.cpp -lfftw3 -lfftw3_threads -fopenmp -lm -O3 -Ofast -o ./test.out

And the following for the "serial" time, using clock_t for the wall time:

g++ ParallelTest.cpp -lfftw3 -lfftw3_threads -lm -O3 -Ofast -o ./test.out

Edit 1: Since it seems difficult to reproduce my example, I am including some runtimes for different numbers of threads, measured in the main code:

for (int nThreads = 1; nThreads <= 16; nThreads++){

    double start_time1, run_time1;
    start_time1 = omp_get_wtime();

    omp_set_num_threads(nThreads);
    Soliter = Solver3D(invnk, dndxk, dndyk, dndzk, Sol, Source, KX, KY, KZ, factor, err_max, phi_iter_max);

    run_time1 = omp_get_wtime() - start_time1;

    std::cout << "Threads: " << nThreads << "   Parallel Time in SEC: " << run_time1 << "s\n";
}

This returns:

Threads: 1   Parallel Time in SEC: 0.444323s
Threads: 2   Parallel Time in SEC: 0.292054s
Threads: 3   Parallel Time in SEC: 0.261521s
Threads: 4   Parallel Time in SEC: 0.260612s
Threads: 5   Parallel Time in SEC: 0.25353s
Threads: 6   Parallel Time in SEC: 0.258204s
Threads: 7   Parallel Time in SEC: 0.275174s
Threads: 8   Parallel Time in SEC: 0.280042s
Threads: 9   Parallel Time in SEC: 0.274214s
Threads: 10   Parallel Time in SEC: 0.271887s
Threads: 11   Parallel Time in SEC: 0.274815s
Threads: 12   Parallel Time in SEC: 0.273562s
Threads: 13   Parallel Time in SEC: 0.283543s
Threads: 14   Parallel Time in SEC: 0.302721s
Threads: 15   Parallel Time in SEC: 0.317697s
Threads: 16   Parallel Time in SEC: 0.290503s

I have 6 cores with 12 logical processors, so using up to 16 threads is really just a test.

5 Comments

  • 1) How do you time the execution? 2) Please post the execution times with 1, 2, 4, 8... threads. 3) 128^3 is not that much, and the OpenMP overheads may hide the benefits of a parallel execution. 4) In addition, this kind of code is mostly memory-bound on usual hardware, so no significant speed-up can be obtained. 5) Try swapping the loops on i and j. Commented Apr 18, 2023 at 17:45
  • Please post all parts relevant to reproduce the problem, including the compilation flags (i.e., post a minimal reproducible example)! Commented Apr 18, 2023 at 20:45
  • 1) I can't compile your code. 2) Please post the execution times! 3) 128^3 is not that much, and the OpenMP overheads may hide the benefits of a parallel execution. 4) In addition, this kind of code is mostly memory-bound on usual hardware, so no significant speed-up can be obtained. 5) Try swapping the loops on i and j. Commented Apr 19, 2023 at 6:43
  • 6) Are you sure that this loop represents a significant part of the runtime? Have you timed it specifically? If it does not, you cannot expect any significant speed-up. Commented Apr 19, 2023 at 6:45
  • So your timing shows that the parallel version is not "slower", contrary to your initial claim... OK, it is not as fast as you probably expected, but there are plenty of good reasons for that, which I have already detailed (128^3 is not that large, the parallel code is memory-bound, and these loops are just a part of the whole solver code). Commented Apr 21, 2023 at 8:04
1 Answer


First of all: if the collapse(3) makes the loop slower, try another compiler; some compilers are not great with the index calculations in collapsed loops. You can also try collapse(2), or even leave the collapse out entirely if the outer loop has enough iterations. Speaking of which: how many i,j,k iterations are there combined? That number probably needs to be at least in the thousands.
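For instance, a minimal sketch of the collapse(2) variant, using the tensors from your question (untested; with nx*ny = 16384 collapsed iterations there is plenty of parallelism for a handful of threads):

// Only the two outer loops are collapsed; each thread keeps the
// whole inner k loop as one contiguous chunk of work.
#pragma omp parallel for collapse(2)
for(int i = 0; i < nx; i++){
    for(int j = 0; j < ny; j++){
        for(int k = 0; k < nz; k++){
            RHS(k,i,j) = Source(k,i,j) - xTerm(k,i,j) - yTerm(k,i,j) - zTerm(k,i,j);
        }
    }
}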

More about those loops: you index by (k,i,j), but your loops are nested in the order i,j,k. Since (as @PierU points out) Eigen arrays are column-major by default, the first index (k) varies fastest in memory, so you should order your loops with j outermost, i in the middle, and k innermost.
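Something like this sketch (untested, same assumptions as above), which keeps the stride-1 index k in the innermost loop:

// j is the slowest-varying index of RHS(k,i,j), k the fastest,
// so this nesting walks memory contiguously (column-major layout)
#pragma omp parallel for collapse(2)
for(int j = 0; j < ny; j++){
    for(int i = 0; i < nx; i++){
        for(int k = 0; k < nz; k++){
            RHS(k,i,j) = Source(k,i,j) - xTerm(k,i,j) - yTerm(k,i,j) - zTerm(k,i,j);
        }
    }
}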

About your attempt to formulate this as a reduction: you don't have a reduction. In a reduction, some quantity (usually a scalar) appears on both the left-hand and right-hand side and is accumulated across iterations. Here every element of Terms is written by exactly one iteration, so there is nothing to reduce.
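For contrast, here is what a genuine reduction looks like: a single scalar accumulated across all iterations. This (untested) sketch computes the maximum absolute value that your convergence check uses, assuming your Sol tensor is in scope; reduction(max:) needs OpenMP 3.1 or later, and <algorithm> and <complex> must be included:

double sol_max = 0.0;
#pragma omp parallel for collapse(2) reduction(max: sol_max)
for(int j = 0; j < ny; j++){
    for(int i = 0; i < nx; i++){
        for(int k = 0; k < nz; k++){
            // every iteration folds into the single scalar sol_max
            sol_max = std::max(sol_max, std::abs(Sol(k,i,j)));
        }
    }
}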

About your compile error: you can only reduce on scalars or C-style arrays, which is what the "user defined reduction not found" message is about. To reduce on a C++ container you could extract its .data() member; I'm not sure whether Eigen supports that. Alternatively, Eigen would need to support operator+ on whatever object you have here. But, as noted, you don't have a reduction; I just thought I'd explain what that error message means.
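Purely for illustration, this is the array-section syntax your second error message refers to, applied to a plain double buffer (hypothetical data, not your solver's; array sections in reduction clauses need OpenMP 4.5 or later):

std::vector<double> acc(nz, 0.0);
double* acc_ptr = acc.data();
// each of the nz elements is its own + reduction across the i loop
#pragma omp parallel for reduction(+: acc_ptr[:nz])
for(int i = 0; i < nx; i++){
    for(int k = 0; k < nz; k++){
        acc_ptr[k] += 1.0;
    }
}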


4 Comments

I am using g++ to compile; I guess I don't see how the choice of compiler slows things down (I am trying to learn). I just edited my code to include nx, ny, nz to show how many iterations I have (pretty large). Eigen does support .data(), but I'm not sure about operator+; it seems I don't have a reduction here and I was misunderstanding how to write one.
In collapsed loops the compiler needs to convert between a single index space and a three-dimensional one. Some compilers don't do this very cleverly.
Here's another suggestion: you index by k,i,j but your loops are i,j,k. If Eigen's indexing is like ordinary C/C++ array indexing, then you should order your loops j,i,k from outer to inner.
@VictorEijkhout I think that by default Eigen arrays are column major, while C arrays are row major
