14,734 questions
-5
votes
1
answer
46
views
PyCharm and PyTorch - Not able to run CUDA
I have CUDA installed via the regular Windows installer downloaded from the official website, and am trying to use PyTorch in PyCharm with CUDA as the kernel.
PyTorch now works fine, however ...
0
votes
0
answers
47
views
TensorFlow + RTX 5090 + WSL: CUDA 12 Installed in WSL but Windows Driver Uses CUDA 13 [closed]
I’m trying to run TensorFlow with GPU on Windows 11 + WSL2 using an NVIDIA RTX 5090.
The issue
TensorFlow currently supports:
CUDA 12.3
cuDNN 9.x
Requires NVIDIA driver ≥ 560.94
But RTX 50-series GPUs ...
0
votes
0
answers
49
views
Why is my tiled matmul CUDA kernel giving wrong results? [closed]
Can anyone point out what I am doing wrong in the following tiled matmul kernel in CUDA?
It is a tiled matrix multiplication of two matrices a and b, whose dimensions are m×k and k×n respectively. ...
Advice
1
vote
5
replies
94
views
CUDA C: How to keep an entire, somewhat complex calculation on the GPU w/o bringing intermediate results back to host
So I'm trying to learn CUDA C. I had an idea for a simple code that could calculate the simple average of a float array. The idea is that main() will call a host function get_average(), which will ...
0
votes
0
answers
54
views
How to force NCCL build to embed PTX for all kernels (prevent linker from stripping ncclDevKernel PTX)?
I am compiling NCCL 2.27.5-1 (I also tried 2.28.9-1) from source for a V100 GPU (sm_70). My goal is to have libnccl.so contain compute_70 PTX for every kernel.
Despite passing explicit -gencode=arch=...
0
votes
0
answers
61
views
Completion semantics of cudaMemcpyDeviceToDevice with CUDA IPC over NVLink/NVSwitch
I am implementing a GPU-to-GPU benchmark using CUDA IPC on a node where two GPUs are connected via NVLink/NVSwitch.
The workflow is the following:
Each process allocates a device buffer on its local ...
0
votes
0
answers
56
views
JAX script fails with INTERNAL: No BLAS support for stream when running multiple processes in parallel on a shared server
I'm facing a frustrating JAX runtime error on a multi-GPU server. My script works fine for a simple test but fails with a No BLAS support for stream error when I try to run multiple instances of it in ...
5
votes
2
answers
342
views
clangd in CUDA mode treats host-side C++ standard library as unavailable (std::format, chrono, iostream errors)
Problem
I'm trying to use clangd for LSP in Neovim with CUDA .cu files, but it fails to recognize standard C++ library features on the host side. Even simple host functions using std::format, std::...
3
votes
0
answers
94
views
Can I modify host data after cudaMemcpyAsync
Can I modify host data in host_data_ptr after the following?
cudaMemcpyAsync(device_data_ptr,
host_data_ptr,
size,
cudaMemcpyHostToDevice,
...
1
vote
1
answer
302
views
How to correctly install JAX with CUDA on Linux when `jax[cuda12_pip]` consistently falls back to the CPU version?
I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...
-2
votes
0
answers
66
views
Why does jtop say that JetPack is missing despite the CLI returning the packages?
I followed this video from 6:47 onwards, https://www.youtube.com/watch?v=q4fGac-nrTI, to install JetPack.
These are the outputs of dpkg -l and jtop respectively:
https://imgur.com/a/Xwwu0XX
as dpkg -l ...
3
votes
1
answer
119
views
Deleted function compiler errors using thrust::remove in C++
I am currently attempting to use the thrust::remove function on a thrust::device_vector of structs in my main function as shown below:
#include <iostream>
#include <thrust/device_vector.h>...
-1
votes
1
answer
127
views
Why are CUDA threads repeating the same task?
I have coded a simple CUDA ZIP password cracker, but it seems to print the same password a number of times. I couldn't figure out why, and this is weighing down my program.
Here is the full ...
0
votes
1
answer
139
views
Linking fails with: in function `main.cold': undefined reference to `__cxa_call_terminate'
I'm trying to build, using CMake, a program involving C++ and CUDA-C++ code. It used to build fine several months ago, but now I am getting a linker error I'm not familiar with:
in function `main....
3
votes
1
answer
134
views
Unable to run a CUDA program in Google Colab
I am trying to run a basic CUDA program in Google Colab, but it's not giving any kernel output.
Below are the steps I tried:
Changed run type to T4 GPU.
!pip install nvcc4jupyter
%load_ext ...
1
vote
0
answers
89
views
cuda & cpp - compilation and linking using cmake
I want to create a skeleton for a project in which there are multiple cuda and cpp files. They will be compiled individually and then linked together to form a single executable.
Currently I have the ...
1
vote
1
answer
64
views
How to debug CUDA in Visual Studio with "step over"
I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022.
I want to debug code, but I can't debug with step over (F10). The debugger always stops at a location without a breakpoint....
3
votes
1
answer
141
views
Numba CUDA code crashing due to unknown error, fixed with the addition of blank print statement in any thread
I'm writing some Hamiltonian evolution code that relies heavily on matrix multiplication, so I've been trying to learn about developing for a GPU using Python.
However, when I run these lines of code ...
2
votes
0
answers
119
views
Implementing Arbitrary Precision Arithmetic in CubeCL for Infinite Zoom Fractals
Context
I'm implementing a Julia set fractal renderer using CubeCL (a Rust GPU compute framework). I want to achieve "infinite zoom" similar to deep Mandelbrot zoom videos, which requires ...
1
vote
0
answers
185
views
Why does “Command Buffer Full” appear in PyTorch CUDA kernel launches?
I’m using the PyTorch profiler to analyze sglang, and I noticed that in the CUDA timeline, some kernels show “Command Buffer Full”. This causes the cudaLaunchKernel time to become very long, as shown ...
1
vote
1
answer
166
views
memcpy_async does not work with pipeline roles
If I do a memcpy_async on a per thread basis, everything works fine, see the test_memcpy32 below.
This code prefetches data within a single warp.
I want to expand this, so that I can prefetch data in ...
2
votes
1
answer
181
views
Executing a CUDA Graph from a CUDA kernel
I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch).
From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...
1
vote
0
answers
73
views
How can a removal of a boundary check introduce a BSYNC instruction in a following memory action?
I have some CUDA kernel code doing the following:
half * restrict output;
half v;
// ... etc ...
int i = whatever();
#ifdef CHECK_X
if (i >= 0 && i <= SOME_CONSTANT)
#endif
{
output[...
0
votes
1
answer
45
views
Automatic cmake parameter /Zc:__cplusplus interpreted as a file name by nvcc
I am working on a C++ project on Windows, using CUDA 12.0, CMake 3.31.6, vcpkg (updated to recent commit a62ce77).
During configuration CMake tries to launch nvcc with a small test program to get ...
0
votes
0
answers
99
views
TensorRT: enqueueV3 fails when using dynamic shapes and Green Contexts
I am trying to benchmark TensorRT inference using CUDA Green Contexts and splitting SMs. My code runs fine when I generate the .engine with fixed input shapes, but it fails when I build the engine ...
0
votes
1
answer
99
views
CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop
I am trying to implement the producer-consumer problem between GPU and CPU, which is required for another project. The GPU requests some data from the CPU via unified memory. CPU copies that data to a specific location in global ...
1
vote
1
answer
59
views
Dask-CUDA LocalCUDACluster on WSL2: NVML errors despite enable_nvml=False
I’m trying to set up a LocalCUDACluster on WSL2 (Ubuntu 22.04) from Windows 11 for GPU computations. The cluster starts and runs, but performance is ~10× slower than running directly on the GPU, and ...
-3
votes
1
answer
70
views
How to debug CUDA kernels in Python, using VS Code (Linux)
I use CuPy to call CUDA kernels, but I don't know how to debug the CUDA code. Here is my wrapper file:
wrapper.py
import math
from pathlib import Path
import cupy as cp
import numpy as np
with open(Path(...
8
votes
1
answer
624
views
How do I get the GPU clock rate in CUDA 13?
I updated CUDA to version 13.
But it seems that cudaGetDeviceProperties has changed.
Instead of returning the cudaDeviceProp struct with clockRate, it returns a mutilated version thereof with ...
0
votes
1
answer
153
views
How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)
I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing.
My approach so far:
Compute the theoretical ...
2
votes
1
answer
128
views
ILGPU kernel silently not compiling
I am trying to debug a kernel written for ILGPU which does not compile.
My application has 2 big kernels.
The first (that loads and does the right thing):
/// <summary>
/// Unified GPU kernel ...
1
vote
1
answer
159
views
std::complex in cuda kernels
CUDA allows running constexpr member functions when compiling with --expt-relaxed-constexpr. This makes it possible to use std::complex<double> in CUDA kernels. However, while doing this, I get incorrect ...
3
votes
0
answers
102
views
CUDA: Load misaligned float4 vector
I want to load 4 floats per thread. I know they are not 16 byte aligned. What is the best way to do this?
Specific conditions:
I cannot align the array without replicating the data because other ...
1
vote
1
answer
352
views
How are fp6 and fp4 supported on NVIDIA Tensor Core on Blackwell?
I am writing PTX assembly code in CUDA C++ for research. This is my setup:
I downloaded the latest CUDA C++ toolkit (13.0) yesterday on WSL Linux.
The local compilation environment does not ...
1
vote
1
answer
110
views
What is the actual maximum nesting depth of dynamic parallelism in CUDA?
Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase ...
3
votes
1
answer
157
views
calling constructor with different types of parameters in template function
I have a simple function gpu_allocate() that helps allocate memory on the GPU (CUDA):
template <typename T> T *gpu_allocate() {
    T *data;
    cudaMallocManaged(&data, sizeof(T));
    return data;
}
...
2
votes
1
answer
76
views
How to correctly pass float4 vector to kernel using PyCUDA?
I am trying to pass a float4 as argument to my cuda kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the ...
-2
votes
1
answer
71
views
cmake generating a bad command line option for CUDA in MSVC on Windows [closed]
The CMake build is producing this error message:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
when running this command line that it itself generates:...
0
votes
1
answer
105
views
cudaSetValidDevices doesn't seem to work as expected
I'm using an NVIDIA machine which has 2 Tesla V100 GPUs. I used the cudaSetValidDevices API to set only 1 valid device, which is device 1. After that, if I try to set device 0, the API still seems to ...
-1
votes
1
answer
180
views
Parallelising CCITT-CRC16 via CUDA
I have been tasked by my boss to convert a sequential CRC16 algorithm that runs on the CPU into something that can run on the GPU via CUDA (that isn't just running the sequential algorithm in a single ...
2
votes
1
answer
118
views
Strange behaviour of atomicCAS when used as a mutex
I'm trying to learn CUDA programming, and recently I have been working on the lectures in this course: https://people.maths.ox.ac.uk/~gilesm/cuda/lecs/lec3.pdf, where they discussed the atomicCAS ...
2
votes
0
answers
41
views
What do shuffle instructions do on the hardware? [duplicate]
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-shfl-...
1
vote
0
answers
73
views
Why does cusolverDnDsytri not find the inverse of a matrix in Fortran?
I have recently been trying to use CUDA in Fortran for a project which requires finding the inverse of symmetric matrices. The matrices I am working with are not positive definite so I cannot use a ...
0
votes
1
answer
122
views
Why does NVCC not optimize ldexpf with a constexpr power-of-two exponent into a simple fmul?
Consider the following CUDA code:
enum { p = 5 };
__device__ float adjust_mul(float x) { return x * (1 << p); }
__device__ float adjust_ldexpf(float x) { return ldexpf(x, p); }
I would expect ...
2
votes
1
answer
86
views
Degree of bank conflicts in CUDA - Picture not clear from GPU Gems Prefix Sum article
I am trying to understand this article: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
More specifically, bank conflicts are what I am ...
4
votes
1
answer
91
views
Can threads in a warp synchronize with different calls to __shfl_sync?
I'm trying to learn CUDA and I'm having trouble wrapping my head around warps and the lanes in them. A lane is the word I use for threads inside of the same warp. I can exchange data between lanes as ...
1
vote
0
answers
40
views
cusolverDnDgesvdj for the computation of singular values only under Python
I have set up the code below for the computation of the SVD of a matrix. The code uses cuSOLVER's cusolverDnDgesvdj.
import ctypes
import numpy as np
import pycuda.autoinit # implicit ...
0
votes
1
answer
368
views
Distinction between CuTe and NVIDIA Cutlass
I'm confused about what exactly is handled by CuTe and what by Cutlass.
From my understanding Cutlass handles the following:
Gemm computation of CuTe Tensors
Communication between CPU and GPU
Abstract memory ...
1
vote
1
answer
173
views
CUDA `cudaMemcpyBatchAsync` "invalid argument"
I'm consistently encountering an "invalid argument" error when calling cudaMemcpyBatchAsync for host-to-device transfers.
CUDA error at btest.cu:43 - invalid argument
Line 43 is CUDA_CHECK(...
3
votes
1
answer
137
views
Does Clang support dynamic parallelism in CUDA?
Dynamic parallelism means kernels call other kernels. It's possible to compile a CUDA program using clang, but does clang support dynamic parallelism?
I am getting this error when attempting to compile a CUDA ...