935 questions
0
votes
0
answers
56
views
JAX script fails with INTERNAL: No BLAS support for stream when running multiple processes in parallel on a shared server
I'm facing a frustrating JAX runtime error on a multi-GPU server. My script works fine for a simple test but fails with a No BLAS support for stream error when I try to run multiple instances of it in ...
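A hedged note on this kind of error: it is often reported when several JAX processes share a GPU and each one tries to preallocate most of the device memory. The excerpt does not confirm that this is the cause here. Assuming it is, a minimal sketch of turning preallocation off through JAX's documented environment variables follows. The helper name `configure_jax_gpu_memory` is hypothetical, and the variables must be set before `jax` is imported.

```python
import os

def configure_jax_gpu_memory(preallocate=False, mem_fraction=None):
    """Set JAX GPU memory environment variables; call before importing jax.

    Assumption: the failure is caused by GPU memory contention between
    processes, which the original question does not confirm.
    """
    # "false" makes JAX allocate GPU memory on demand instead of up front.
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        # Fraction of GPU memory to preallocate when preallocation is enabled.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"

configure_jax_gpu_memory(preallocate=False)
# import jax  # import only after the environment is configured
```

Each parallel instance would call the helper (or export the same variables in its shell) before its first JAX import.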
Advice
0
votes
1
replies
73
views
BLAS speed much worse on one (supposedly homogeneous) compute node
We have a small local compute cluster consisting of 5 compute nodes (all supposedly having the same hardware and software) and a login/storage node. I'm running in-house Fortran software that uses ...
4
votes
1
answer
276
views
Why is Eigen C++ int matrix multiplication 10x slower than float multiplication (even slower than naive n^3 algorithm) when compiled with AVX512
I'm testing int matrix multiplication, but I found that it's extremely slow everywhere (python numpy using BLAS backend is also just as slow). Int matmul being slower than float matmul is ...
3
votes
1
answer
147
views
Build numpy 2.3+ without accelerated libraries
Related post: Compile numpy WITHOUT Intel MKL/BLAS/ATLAS/LAPACK
Recent versions of NumPy use Meson for build configuration. I can build NumPy from source, but I failed to exclude the BLAS/LAPACK/... deps.
...
1
vote
1
answer
76
views
OpenBLAS gemm 2x slower in Lisp CFFI compared to direct C calls with same BLAS library
I'm experiencing a significant performance difference where OpenBLAS matrix multiplication runs 2x slower when called through Lisp CFFI compared to direct C calls, despite using the exact same ...
0
votes
0
answers
63
views
Undefined reference to BLAS
I'm trying to install the HurdleNormal R package as a dependency for another package (COZINE), and I'm getting the following error:
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe:
...
0
votes
0
answers
18
views
BLAS/LAPACK compatibility
I've been trying to figure out whether newer versions of BLAS/LAPACK are backward compatible with older releases, but I can't find anything on the netlib website or in the docs.
Are they compatible ...
1
vote
1
answer
70
views
"Invalid read of size 8" warning from Valgrind when calling zhemv blas function in C++
I'm multiplying a Hermitian (self-adjoint) matrix by a complex vector using ZHEMV in BLAS, calling the function from a C++ interface. The problem I see is getting an "...
1
vote
0
answers
145
views
Ifx cannot find modern generic MKL routines like GEMM_F95
I am compiling Fortran code with the ifx compiler (version 2025.0.4) on Windows. I have the Intel MKL library downloaded as well and I am trying to compile a program using it, like this:
ifx test.f90 ...
1
vote
1
answer
311
views
MKL and OpenBLAS interactions - a question about linking
I'm using a binary (R) that dynamically links to a generic version of BLAS;
in many cases this is OpenBLAS.
Now, inside R, I'm dynamically loading another shared library (...
1
vote
2
answers
127
views
Undefined reference to cblas_* with cmake on windows
I'm working on a project that uses SAF (Spatial Audio Framework), which has OpenBLAS and LAPACK as dependencies. (The project includes a lot of libraries, so I only show the code that relates to my problem:...
1
vote
0
answers
47
views
Confused about cblas_dgemm arguments
Say I want to calculate x^T * Y, where x is an n by 1 matrix and Y is an n by n matrix:
cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const ...
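To make the dimension bookkeeping concrete: GEMM computes C = op(A) * op(B), where op(A) is M x K, op(B) is K x N, and C is M x N. For x^T * Y that gives M = 1, N = n, K = n. The sketch below is a plain-Python reference loop, not the CBLAS call itself. The name `gemm_reference` is hypothetical, and it assumes row-major flat storage, where the n contiguous values of x can be read directly as the 1 x n row x^T.

```python
def gemm_reference(M, N, K, A, B):
    """Plain-Python reference for C = A * B using row-major flat lists.

    A is M x K, B is K x N, and the returned C is M x N. This only
    illustrates how the three dimensions relate; it is not CBLAS.
    """
    C = [0.0] * (M * N)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for p in range(K):
                acc += A[i * K + p] * B[p * N + j]
            C[i * N + j] = acc
    return C

n = 3
x = [1.0, 2.0, 3.0]        # n contiguous values, read as the 1 x n row x^T
Y = [1.0, 0.0, 0.0,
     0.0, 2.0, 0.0,
     0.0, 0.0, 3.0]        # n x n matrix, row-major

# x^T * Y is a (1 x n) times (n x n) product: M = 1, N = n, K = n.
result = gemm_reference(1, n, n, x, Y)
print(result)  # [1.0, 4.0, 9.0]
```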
5
votes
2
answers
204
views
crossprod(m1, m2) is running slower than t(m1) %*% m2 on my machine
Why does t(mat1) %*% mat2 run faster than crossprod(mat1, mat2)? Isn't the whole point of the latter that it calls a more efficient low-level routine?
r$> mat1 <- array(rnorm(100 * 600), dim = ...
5
votes
1
answer
159
views
How to control (BLAS?) parallelization when using mgcv::gam
I am running some fairly large gam models and don't want to parallelize the computations, or at least want to be able to control the degree of parallelization. (Besides not wanting to fry my machine ...
2
votes
1
answer
86
views
Parallelize operations on arrays and merge results into one array using OpenMP
I am trying to speed up a function that, given a complex-valued array arr with n entries, calculates the sum of m operations on that array using BLAS routines. Finally, it replaces the values of arr.
...
0
votes
0
answers
98
views
Unexpected behaviour of matmul when compiled with blas in Fortran
I am trying to benchmark the BLAS routines dgemv and dgemm in Fortran. For that I have written these simple programs:
matmul.f90:
program test ...
1
vote
0
answers
125
views
How to use BLAS in C, using gcc on Linux?
On Linux, in the file a.c, I do #include <cblas.h> and later I do cblas_sgemm(...). Compiling with
gcc -O2 -march=native -fopenmp a.c
or with
gcc -O2 -march=native -lblas -fopenmp a.c
results in ...
0
votes
1
answer
161
views
Problems evaluating CUDNN for SGEMM
I used cudnn to test sgemm for C[stride x stride] = A[stride x stride] x B[stride x stride] below,
Configuration
GPU: T1000/SM_75
cuda-12.0.1/driver-535 installed (via the multiverse repos on ubuntu-...
0
votes
0
answers
140
views
How can I use multithreaded BLAS from a single-threaded Eigen C++ application?
I'm trying to speed up Eigen dense matrix * matrix operations by using multithreaded BLAS library calls.
I've achieved a 100% speed increase using the AMD AOCL-BLAS library from within Eigen. But I seem ...
0
votes
1
answer
1k
views
Numpy/Scipy BLAS/LAPACK Linking on macOS (with Apple Accelerate)
Question
I am trying to find out if the latest version of NumPy (2.0.0) is taking advantage of the updated Accelerate BLAS/LAPACK library, including ILP64.
Numpy
Numpy in their 2.0.0 release added ...
0
votes
0
answers
153
views
What is the time-complexity of BLAS level 2 and 3 functions from a vendor which optimized the operations?
BLAS level 2 performs Matrix-vector multiplications, and I know this is O(n^2) in time, when the matrix is shaped (n, n) and the vector is shaped (n, 1).
BLAS level 3 performs Matrix-Matrix ...
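As a rough illustration of the operation counts the question refers to, the sketch below counts multiply-adds in the classical (textbook) loops for an n x n matrix-vector and matrix-matrix product. The function names are hypothetical. The counts describe the classical algorithm only; how a particular vendor library is implemented internally is not something this sketch can show.

```python
def gemv_multiply_adds(n):
    """Count multiply-adds in a classical n x n matrix-vector product."""
    count = 0
    for _row in range(n):
        for _col in range(n):
            count += 1          # one multiply-add per matrix entry
    return count

def gemm_multiply_adds(n):
    """Count multiply-adds in a classical n x n by n x n matrix product."""
    count = 0
    for _row in range(n):
        for _col in range(n):
            for _inner in range(n):
                count += 1      # one multiply-add per (row, col, inner) triple
    return count

for n in (2, 4, 8):
    # Doubling n multiplies the Level 2 count by 4 and the Level 3 count by 8.
    print(n, gemv_multiply_adds(n), gemm_multiply_adds(n))
```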
1
vote
0
answers
619
views
How can I select the AOCL BLIS/Lapack libraries for building Numpy on Windows 10?
I have an AMD Ryzen 7 2700X and I'm trying to compile NumPy in an Anaconda virtual environment using the BLIS/LAPACK libraries of AMD AOCL 4.2, which I installed locally.
I tried to compile through pip in ...
0
votes
0
answers
57
views
Installation of C++ libraries 'Boost' and 'BLAS' for Python project fails on Windows
I'm working on a Python project. After cloning a remote git repository I followed the instructions in the README file, executing multiple pip install commands in my VSCode PowerShell terminal to set ...
1
vote
0
answers
321
views
Change the BLAS version used by R
I am currently having issues with the 'Eigen()' function in R.
It was mentioned that I should try using 'OpenBLAS'. How should I go about installing it and making R use this version of BLAS?
I ...
5
votes
0
answers
523
views
CMake Error: Could NOT Find BLAS Using VCPKG and CMake on Windows
As a follow-up to this question, I am trying to set up a project using CMake with VCPKG on Windows to link the BLAS library. Despite following the instructions from the official VCPKG guide, I'm ...
0
votes
1
answer
1k
views
CMake cannot find BLAS libraries after installing OpenBLAS via Conan
I am working on a project that uses Fortran and requires BLAS libraries. I've decided to use OpenBLAS, which I installed via Conan. However, I'm encountering an issue where CMake cannot find the BLAS ...
2
votes
1
answer
168
views
Why is libopenblas from numpy so big?
We are deploying an open source application based on numpy that includes libopenblas.{cryptic string}.gfortran-win32.dll. It is part of the Python numpy package. This dll is over 27MB in size. I'm ...
0
votes
1
answer
245
views
arithmetic intensity of zgemv versus dgemv/sgemv?
The arithmetic intensity of sgemv (or dgemv) is derived in this set of exercises (https://florian.world/wp-content/uploads/FM-High-Performance-Computing-I-Assignment-1.pdf) to be:
0.5 / (1+c), where c ...
1
vote
1
answer
294
views
How to force Julia to use multiple threads for matrix multiplication?
I want to find powers of a relatively small matrix, but this matrix consists of rational numbers of type Rational{BigInt}. By default, Julia utilizes only a single thread for such computations. I want ...
1
vote
1
answer
193
views
Can I multiply the real parts of two complex matrices using dgemm?
I have two complex matrices A and B, with matching shapes.
Is there a way to cleverly setup the dgemm arguments so as to get the result of the matrix multiplications of the real parts of these ...
2
votes
1
answer
1k
views
In Xcode, how do you set compiler flags for standalone module (framework)?
I was writing my own standalone module and wanted to use cblas_dasum for efficient calculation of the sum of absolute values of a double array. Though a message pops up saying that I have to
specify ...
1
vote
0
answers
125
views
Why is BLAS cblas_sgemm in C slower than np.dot?
I made a simple benchmark between Python NumPy and C OpenBLAS to multiply two 500x500 matrices. It seems that np.dot performs almost 9 times faster than cblas_sgemm. Is there anything I'm doing wrong?
...
0
votes
1
answer
154
views
How to properly link mkl interfaces with fortls
In my project I make heavy use of the BLAS subroutines from the MKL implementation. I have no problems compiling the project thanks to the Intel Advisor, but I can't get fortls to recognize ...
0
votes
1
answer
419
views
Installing scipy on CentOS 6 (OpenBLAS problem)
I'm trying to install SciPy on CentOS 6 with Python 3.9.18 and get the error:
../scipy/meson.build:159:9: ERROR: Dependency "OpenBLAS" not found, tried pkgconfig
The problem is that CentOS 6 ...
0
votes
1
answer
213
views
Fortran with Sparse BLAS not flushing memory
I have a subroutine that builds sparse matrices, and I need to call it several times. However, it seems that if I call this subroutine a lot of times (and/or if the sparse matrices are very large), ...
0
votes
1
answer
120
views
Why is multiplying wide matrices slower than multiplying square matrices?
I have noticed the following while trying to increase the performance of my code:
>>> a, b = torch.randn(1000,1000), torch.randn(1000,1000)
>>> c, d = torch.randn(10000, 100), torch....
0
votes
1
answer
210
views
How do I make np.multiply use more than one core?
The title says it already. I am currently parallelizing my code and a major bottleneck is posed by element-wise multiplication of two three-dimensional ndarrays. My system monitor reveals that only ...
4
votes
1
answer
4k
views
No GPU support while running llama-cpp-python inside a docker container
I'm trying to run llama index with llama cpp by following the installation docs but inside a docker container.
Following this repo for installation of llama_cpp_python==0.2.6.
DOCKERFILE
# Use the ...
1
vote
1
answer
203
views
How Does NumPy Internally Handle Matrix Multiplication with Non-contiguous Slices?
Hello Stack Overflow community,
I'm working with NumPy for matrix operations and I have a question regarding how NumPy handles matrix multiplication, especially when dealing with non-contiguous slices ...
0
votes
1
answer
158
views
Repeated single-precision complex matrix-vector multiplication (speed and accuracy improvement)
I've boiled a long-running function down to a "simple" series of matrix-vector multiplications. The matrix does not change, but there are a lot of vectors. I have put together a test ...
0
votes
0
answers
81
views
Cannot call SASUM by itself, as in x=SASUM, without Fortran 'call'
I compiled a program with 'call segesv()' to solve a system of 3 equations in 3 variables, and that works fine, so I know I'm linked to BLAS and LAPACK. However, 'call SASUM' also compiles (I pass a vector of ...
0
votes
0
answers
171
views
Is BLIS suitable for cross-platform development, including Apple Silicon?
I am currently re-working a scientific C++ project that makes heavy use of matrix-vector operations like multiplying a (skew)-symmetric matrix with a vector, adding or multiplying two vectors or ...
0
votes
0
answers
274
views
Linker errors with BLAS/LAPACK symbols (snrm2_, sdot_, etc) when building Fortran project with gfortran on Windows
I'm trying to build the Elmer finite element software (version 9.0) using gfortran 10.2.0 and OpenBLAS 0.3.15 libraries on Windows 10. I'm running into linker errors when creating the shared libraries,...
3
votes
2
answers
262
views
snrm2 calculation instability for single-precision floats on Accelerate
I'm trying to use snrm2 to perform a single precision float calculation in Rust. I'm linking to the Accelerate framework on OSX and using the blas crate for the C-bridge. Regardless of the randomly ...
0
votes
1
answer
187
views
What is wrong with my sparse matrix-multiple vectors (SpMM) product function for CSR?
I have the following code for the sparse matrix-vector (SpMV) product in C assuming a CSR storage format:
void dcsrmv(SparseMatrixCSR *A, double *x, double *y) {
for (int i=0; i<A->m; i++) {
...
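For readers comparing SpMV with SpMM on CSR data, here is a minimal plain-Python sketch of the sparse matrix times dense multi-vector product. It is not the asker's C code: the array names `row_ptr`, `col_idx`, and `values` are hypothetical stand-ins for the usual CSR fields, and X is assumed to be stored as a list of rows with k columns.

```python
def csr_spmm(m, row_ptr, col_idx, values, X, k):
    """Compute Y = A * X for a CSR matrix A with m rows and dense X with k columns.

    row_ptr, col_idx, values are the usual CSR arrays (hypothetical names).
    X is a list of rows; the result Y is an m x k list of rows.
    """
    Y = [[0.0] * k for _ in range(m)]
    for i in range(m):
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for idx in range(row_ptr[i], row_ptr[i + 1]):
            a = values[idx]
            x_row = X[col_idx[idx]]
            for j in range(k):
                Y[i][j] += a * x_row[j]
    return Y

# A = [[1, 0, 2],
#      [0, 3, 0]] in CSR form
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

Y = csr_spmm(2, row_ptr, col_idx, values, X, 2)
print(Y)  # [[11.0, 14.0], [9.0, 12.0]]
```

With k = 1 the inner loop runs once per nonzero and the routine reduces to the SpMV case.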
1
vote
0
answers
115
views
numpy built with locally built blis does not use multithreading
I'm looking for help with an issue I'm having building Numpy against locally built blis for zen3.
I've configured blis to enable threading using openmp. (it is installed and working on my machine, ...
-1
votes
1
answer
206
views
Why does the magma_dgemm function not use tensor cores on the V100 GPU?
I ran the MAGMA testing_dgemm code on both V100 and H100 GPUs. With Nsight Systems, I found that the code doesn't use tensor cores on the V100, but it does on the H100.
V100 result:
H100 result:
The ...
0
votes
0
answers
219
views
`np.dot` yields a different result when computed in two pieces
import numpy as np
N = 4
m = 2
mat = np.random.rand(N,2*m)
vec = np.random.rand(N)
dot1 = np.dot(vec,mat)
dot2 = np.concatenate([np.dot(vec,mat[:,:m]), np.dot(vec,mat[:,m:])])
print('Max difference:',...
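One possible explanation for small discrepancies of this kind, assuming they are on the order of machine epsilon (the excerpt is truncated before it shows the printed value), is that floating-point addition is not associative. Two code paths that accumulate the same products in a different order can therefore disagree in the last bits. A minimal sketch in plain Python, without NumPy:

```python
# The same three numbers summed with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

difference = abs(left_to_right - right_to_left)

print(left_to_right == right_to_left)  # False
print(difference < 1e-15)              # True: the gap is at rounding level
```

Comparing results with a tolerance rather than with exact equality is the usual way to account for this.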
1
vote
1
answer
509
views
How to see details behind CPU-only Libtorch Matrix-Matrix multiplication routines?
I have downloaded the libtorch CPU-only version from the website and unzipped it.
Inside my .cpp application which uses libtorch, I write (I am using intel-mkl for other parts of the application, and ...
0
votes
1
answer
185
views
"undefined reference to" error during linking process
I am building my application using OpenMPI (built with LLVM) and a few other external libraries including netcdf-fortran, BLAS and LAPACK. The files compile without any problem, but in the last stage ...