Fortran LAPACK: high CPU %sys usage with DSYEV - no parallelization - normal?

Question

See further update below

I am observing quite high system CPU usage when running my Fortran code. User CPU usage takes up about one core (the system is an Intel i7 with 4 cores / 8 threads, running Linux), whilst system CPU usage eats up about two cores (hence overall CPU usage of about 75%). Can anyone explain where this is coming from and whether this is "normal" behaviour?

I compile the code with gfortran (optimization turned off -O0, though that part doesn't seem to matter) and link against BLAS, LAPACK and some (other) C-functions. My own code is not using any parallelization and neither does the linked code (as far as I can tell). At least I am not using any parallelized library versions.

The code itself assembles and solves finite element systems. It does a lot (?) of allocation and makes many calls to intrinsic functions (matmul, dot_product), though the overall RAM usage is pretty low (~200MB). I don't know if this information is sufficient or useful, but I hope someone knows what is going on there.

Best regards, Ben

UPDATE I think I have tracked down (part of) the problem to a call to DSYEV from LAPACK (which computes the eigenvalues of a real symmetric matrix A, in my case 3x3).

program test

implicit none

integer, parameter :: ndim = 3
real(8) :: tens(ndim,ndim)

integer :: mm
real(8), dimension(ndim,ndim) :: eigvec
real(8), dimension(ndim)      :: eigval

character, parameter      :: jobz = 'v'   ! Compute eigenvalues and eigenvectors
character, parameter      :: uplo = 'u'   ! Use the upper triangle of the matrix
integer, parameter        :: lwork = 102  ! Length of work array
real(8), dimension(lwork) :: work         ! Work array
integer :: info

tens(1,:) = [1.d0, 2.d0, 3.d0]
tens(2,:) = [2.d0, 5.d0, 1.d0]
tens(3,:) = [3.d0, 1.d0, 1.d0]

do mm = 1, 5000000
   ! DSYEV overwrites its input matrix, so restore it in every iteration
   eigvec = tens
   call dsyev(jobz, uplo, ndim, eigvec, ndim, eigval, work, lwork, info)
enddo

write(*,*) eigvec
write(*,*) int(work(1))   ! Optimal lwork as reported by DSYEV

end program test

The compiling and linking is done with

gfortran test.f90 -o test -llapack

This program gives me very high %sys CPU usage. Can anyone verify this (obviously LAPACK is necessary to run the code)? Is this "normal" behaviour, or is something wrong with my code/system/libraries?

UPDATE 2 Encouraged by @roygvib's comment, I ran the code on another system. On the second system, the high system CPU usage could not be reproduced. Comparing the two systems, I can't find where this is coming from. Both run the same OS version (Ubuntu Linux), the same gfortran version (4.8), and the same kernel, LAPACK and BLAS versions. The "major" difference: the processor is an i7-4770 on the buggy system and an i7-870 on the other. Running the test code on the buggy system gives me about 16s of user time and 28s of sys time. On the i7-870 it is 16s user and 0s sys. Running the code four times (in parallel) gives an overall timing for each process of about 18s on the other system and 44s on the buggy system. Any ideas what else I could look for?

UPDATE 3 I think we are getting closer: Building the test program on the other system with a static link to the LAPACK and BLAS library,

gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -Wl,--allow-multiple-definition

and running that code on the buggy system gives me a %sys time of about 0 (as desired). On the other hand, building the test program with static links to LAPACK and BLAS on the buggy system and running the code on the other system returns high %sys CPU usage as well! So obviously, the libraries seem to differ, right? Building the static version on the buggy system results in a file size of about 18MB(!), on the other system 100KB. Additionally, I have to include the

-Wl,--allow-multiple-definition

option only on the other system (otherwise the linker complains about multiple definitions of xerbla), whilst on the buggy system I have to (explicitly) link against libpthread:

gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -lpthread -o test

The interesting thing is that

apt-cache policy liblapack*

returns the same versions and repo destinations for both systems (same goes for libblas*). Any further ideas? Maybe there is some other command to check library version that I don't know of?

See the comment from the nmon developer embedded in this answer: stackoverflow.com/a/5738139/620097. Without a way to reproduce your problem, this question is likely to be considered off-topic (interesting as it may be). Good luck.
– shellter, Commented Mar 10, 2016 at 21:20
The model I'm currently working on only has about 4000 elements. But the size doesn't seem to matter, as I could reproduce the behaviour with even smaller models. The thing is: when using commercial FEM code based on Fortran, truly only a single core is used (0% sys CPU usage). So I don't see the point of the comment @shellter hints at.
– PrinceOfMe, Commented Mar 10, 2016 at 21:55
Is your LAPACK threaded? Where do your LAPACK and BLAS implementations come from?
– Vladimir F Героям слава, Commented Mar 11, 2016 at 11:30
LAPACK and BLAS come from the repo (liblapack3 and libblas3). To my knowledge, those are not threaded, are they?
– PrinceOfMe, Commented Mar 11, 2016 at 13:41

Vladimir F Героям слава · Accepted Answer · 2016-03-16 12:02:32Z

My interpretation of the slowdown:

A threaded (probably OpenMP) version of LAPACK and BLAS was used. These libraries try to launch several threads to solve the linear algebra problem in parallel. That often speeds up the computation.

However in this case

do mm=1,5000000    
   eigvec=tens
   call dsyev(jobz,uplo,ndim,eigvec,ndim,eigval,work,lwork,info)
enddo

This calls the library a huge number of times for a very small problem (a 3x3 matrix). Such a problem cannot be solved efficiently in parallel because the matrix is too small. The overhead of synchronizing the threads dominates the solution time. The synchronization (if not even the thread creation) is done 5000000 times!

Remedies:

Use a non-threaded BLAS and LAPACK.
If the parallelization is done using OpenMP, set OMP_NUM_THREADS=1, which means that only one thread is used.
Do not use LAPACK at all, because for the special case of 3x3 matrices there are specialized algorithms available: https://en.wikipedia.org/wiki/Eigenvalue_algorithm#3.C3.973_matrices
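
As an illustration of the last remedy, here is a minimal sketch of the closed-form trigonometric method for real symmetric 3x3 matrices, as described on the linked Wikipedia page. The module name sym3_eig and the subroutine name sym3_eigenvalues are made up for this example. It returns only the eigenvalues (in ascending order, like DSYEV), whereas the DSYEV call in the question with jobz='v' also returns eigenvectors. It has not been checked against LAPACK for nearly degenerate matrices, so treat it as a starting point rather than a drop-in replacement.

```fortran
module sym3_eig
  implicit none
  private
  public :: sym3_eigenvalues

  integer, parameter :: dp = kind(1.0d0)

contains

  ! Eigenvalues of a real symmetric 3x3 matrix in ascending order.
  ! Only the upper triangle of a is read (like uplo='u' in DSYEV).
  pure subroutine sym3_eigenvalues(a, w)
    real(dp), intent(in)  :: a(3,3)
    real(dp), intent(out) :: w(3)

    real(dp), parameter :: pi = 3.14159265358979323846_dp
    real(dp) :: p1, p2, p, q, r, phi
    real(dp) :: b11, b22, b33, b12, b13, b23

    ! Sum of the squared off-diagonal entries of the upper triangle
    p1 = a(1,2)**2 + a(1,3)**2 + a(2,3)**2

    if (p1 == 0.0_dp) then
      ! Matrix is already diagonal: sort the diagonal entries
      w = [a(1,1), a(2,2), a(3,3)]
      call swap_if_greater(w(1), w(2))
      call swap_if_greater(w(2), w(3))
      call swap_if_greater(w(1), w(2))
      return
    end if

    q  = (a(1,1) + a(2,2) + a(3,3)) / 3.0_dp          ! trace(A)/3
    p2 = (a(1,1) - q)**2 + (a(2,2) - q)**2 + (a(3,3) - q)**2 + 2.0_dp*p1
    p  = sqrt(p2 / 6.0_dp)

    ! Entries of the scaled, shifted matrix B = (A - q*I)/p
    b11 = (a(1,1) - q) / p
    b22 = (a(2,2) - q) / p
    b33 = (a(3,3) - q) / p
    b12 = a(1,2) / p
    b13 = a(1,3) / p
    b23 = a(2,3) / p

    ! r = det(B)/2, expanded for a symmetric matrix
    r = 0.5_dp * ( b11*(b22*b33 - b23*b23)   &
                 - b12*(b12*b33 - b23*b13)   &
                 + b13*(b12*b23 - b22*b13) )

    ! In exact arithmetic -1 <= r <= 1; clamp to guard against rounding
    if (r <= -1.0_dp) then
      phi = pi / 3.0_dp
    else if (r >= 1.0_dp) then
      phi = 0.0_dp
    else
      phi = acos(r) / 3.0_dp
    end if

    w(3) = q + 2.0_dp*p*cos(phi)                      ! largest
    w(1) = q + 2.0_dp*p*cos(phi + 2.0_dp*pi/3.0_dp)   ! smallest
    w(2) = 3.0_dp*q - w(1) - w(3)                     ! trace = sum of eigenvalues
  end subroutine sym3_eigenvalues

  ! Exchange x and y if they are out of ascending order
  pure subroutine swap_if_greater(x, y)
    real(dp), intent(inout) :: x, y
    real(dp) :: t
    if (x > y) then
      t = x
      x = y
      y = t
    end if
  end subroutine swap_if_greater

end module sym3_eig
```

In the test program from the question, the loop body would then become call sym3_eigenvalues(tens, eigval) after adding use sym3_eig at the top. No work array or copy of the matrix is needed, since the input is not overwritten.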


1 Comment

PrinceOfMe Over a year ago

Thank you very much for providing the answer. I am sticking with OpenBLAS, though I now set OPENBLAS_NUM_THREADS=1. The test program I provided is just a small example, and there are more calls to LAPACK/BLAS than just solving for the eigenvalues of a 3x3 matrix. Nevertheless, I did change the calls for those small systems from LAPACK to a faster solver. Thanks for the hint!
