0

I am trying to benchmark the blas routines dgemv and dgemm in Fortran. For that I have written this simple codes:

matmul.f90:

program test                                                                                                                                                                                  
implicit none                                                                   
integer :: i, s                                                                 
real(8), allocatable :: a(:,:), b(:,:), c(:,:), x(:), y(:)                      
real(8) :: t1, t2                                                               
                                                                                
print *, "Matrix-Matrix multiplication"                                         
do i=1,9                                                                        
    s = 2**i                                                                    
    if (allocated(a)) deallocate(a)                                             
    if (allocated(b)) deallocate(b)                                             
    if (allocated(c)) deallocate(c)                                             
                                                                                
    allocate(a(s,s), b(s,s), c(s,s))                                            
                                                                                
    call random_number(a)                                                       
    call random_number(b)                                                       
                                                                                
    call cpu_time(t1)                                                           
    c = matmul(a, b)                                                            
    call cpu_time(t2)                                                           
    write(*,*) s, t2-t1                                                         
end do                                                                          
                                                                                
print *, "Matrix-vector multiplication"                                         
do i=1,9                                                                        
    s = 2**i                                                                    
    if (allocated(a)) deallocate(a)                                             
    if (allocated(x)) deallocate(x)                                             
    if (allocated(y)) deallocate(y)                                             
                                                                                
    allocate(a(s,s), x(s), y(s))                                                
                                                                                
    call random_number(a)                                                       
    call random_number(x)                                                       
                                                                                
    call cpu_time(t1)                                                           
    y = matmul(a, x)                                                            
    call cpu_time(t2)                                                           
    write(*,*) s, t2-t1                                                         
end do                                                                                                                                                                                    
end program

blas.f90:

program test                                                                                                                                                                                  
implicit none                                                                   
integer :: i, s                                                                 
real(8), allocatable :: a(:,:), b(:,:), c(:,:), x(:), y(:)                      
real(8) :: t1, t2                                                               
                                                                                
print *, "Matrix-Matrix multiplication"                                         
do i=1,9                                                                        
    s = 2**i                                                                    
    if (allocated(a)) deallocate(a)                                             
    if (allocated(b)) deallocate(b)                                             
    if (allocated(c)) deallocate(c)                                             
                                                                                
    allocate(a(s,s), b(s,s), c(s,s))                                            
                                                                                
    call random_number(a)                                                       
    call random_number(b)                                                       
                                                                                
    call cpu_time(t1)                                                           
    c = matmul(a, b)                                                            
    call cpu_time(t2)                                                           
    print *, s, t2-t1                                                           
end do                                                                          
                                                                                
print *, "Matrix-vector multiplication"                                         
do i=1,9                                                                        
    s = 2**i                                                                    
    if (allocated(a)) deallocate(a)                                             
    if (allocated(x)) deallocate(x)                                             
    if (allocated(y)) deallocate(y)                                             
                                                                                
    allocate(a(s,s), x(s), y(s))                                                
                                                                                
    call random_number(a)                                                       
    call random_number(x)                                                       
                                                                                
    call cpu_time(t1)                                                           
    call dgemv('n', s, s, 1.d0, a, s, x, 1, 0.d0, y, 1)                         
    call cpu_time(t2)                                                           
    print *, s, t2-t1                                                           
end do                                                                                                                                                                                    
end program

Note that in the blas program i am still using matmul and i have only changed the matrix-vector multiplication

I compile both codes with:

gfortran -o matmul matmul.f90

gfortran -o blas blas.f90 -lblas

I would expect the times of the matrix-matrix multiplication to be the same, but the results I get are (i write them side-by-side):

Matrix size     matmul.f90                    blas.f90
       2   4.0000000000000105E-005     1.7969999999999930E-003
       4   1.0999999999999725E-005     1.2780000000000014E-003
       8   2.3000000000000017E-005     8.0599999999997340E-004
      16   4.8000000000000299E-005     2.4819999999999842E-003
      32   1.1799999999999962E-004     5.3250000000000242E-003
      64   3.4200000000000029E-004     1.7175999999999969E-002
     128   1.2079999999999999E-003     5.3132000000000013E-002
     256   4.7889999999999999E-003    0.20821500000000004     
     512   1.2796999999999999E-002     1.1222690000000000  

Apparently the behavior of the matmul subroutine has changed in the second program and i am not really sure why. Could someone clarify this behavior to me?

Thanks!

4
  • 1
    Can you check the residuals c - matmul(a,b) to see which one given the incorrect result. Commented Aug 9, 2024 at 15:14
  • 1
    Are you aware of gfortran's -fexternal-blas option? It will map a few Fortran subprograms such as matmul to a BLAS routines. There is also the -finline-matmul-limit=N option that controls the size of matrices/arrays for which gfortran will do inlining. Commented Aug 9, 2024 at 16:19
  • Could you double-check your timings? I tested your code (with gfortran 10 on Linux) and got pretty much the same timings for the matrix-matrix multiplication with the intrinsic matmul with or without -lblas Commented Aug 12, 2024 at 9:56
  • 1
    @PierU; I performed this test in three different computers. Two of them get almost the same timings with matmulor blasas you are finding. However, in the third computer where i was testing i keep getting the result i posted above. It is definitely something on my side and i will have to investigate what could have gone wrong... Thanks! Commented Sep 6, 2024 at 7:23

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.