Newest 'cuda' Questions

-5 votes

1 answer

46 views

PyCharm and PyTorch - Not able to run CUDA

I have CUDA installed using the regular Windows installer downloaded from the official website, and am trying to use PyTorch in PyCharm using CUDA as kernel. PyTorch now works fine, however ...

alexanderjansma

57

asked yesterday

0 votes

0 answers

47 views

TensorFlow + RTX 5090 + WSL: CUDA 12 Installed in WSL but Windows Driver Uses CUDA 13 [closed]

I’m trying to run TensorFlow with GPU on Windows 11 + WSL2 using an NVIDIA RTX 5090. The issue: TensorFlow currently supports CUDA 12.3 and cuDNN 9.x, and requires NVIDIA driver ≥ 560.94. But RTX 50-series GPUs ...

Mansoor

27

asked Dec 5 at 10:35

0 votes

0 answers

49 views

Why is my tiled matmul CUDA kernel giving wrong results? [closed]

Can anyone point out what I am doing wrong in the following tiled matmul kernel in CUDA? It is a tiled matrix multiplication of two matrices a and b, whose dimensions are m×k and k×n respectively. ...

explorer

1

asked Dec 4 at 17:42

Advice

1 vote

5 replies

94 views

CUDA C: How to keep an entire, somewhat complex calculation on the GPU w/o bringing intermediate results back to host

So I'm trying to learn CUDA C. I had an idea for a simple code that could calculate the simple average of a float array. The idea is that main() will call a host function get_average(), which will ...

bob.sacamento

6,713

asked Dec 4 at 13:59

0 votes

0 answers

54 views

How to force NCCL build to embed PTX for all kernels (prevent linker from stripping ncclDevKernel PTX)?

I am compiling NCCL 2.27.5-1 (I tried also 2.28.9-1) from source for a V100 GPU (sm_70). My goal is to have libnccl.so contain compute_70 PTX for every kernel. Despite passing explicit -gencode=arch=...

CiZ

9

asked Nov 26 at 17:05

0 votes

0 answers

61 views

Completion semantics of cudaMemcpyDeviceToDevice with CUDA IPC over NVLink/NVSwitch

I am implementing a GPU-to-GPU benchmark using CUDA IPC on a node where two GPUs are connected via NVLink/NVSwitch. The workflow is the following: Each process allocates a device buffer on its local ...

Matteo Sperini

1

asked Nov 20 at 18:52

0 votes

0 answers

56 views

JAX script fails with INTERNAL: No BLAS support for stream when running multiple processes in parallel on a shared server

I'm facing a frustrating JAX runtime error on a multi-GPU server. My script works fine for a simple test but fails with a No BLAS support for stream error when I try to run multiple instances of it in ...

PowerPoint Trenton

115

asked Nov 16 at 13:31

5 votes

2 answers

342 views

clangd in CUDA mode treats host-side C++ standard library as unavailable (std::format, chrono, iostream errors)

Problem: I'm trying to use clangd for LSP in Neovim with CUDA .cu files, but it fails to recognize standard C++ library features on the host side. Even simple host functions using std::format, std::...

NeKon

314

asked Nov 15 at 15:58

3 votes

0 answers

94 views

Can I modify host data after cudaMemcpyAsync

Can I modify host data in host_data_ptr after the following? cudaMemcpyAsync(device_data_ptr, host_data_ptr, size, cudaMemcpyHostToDevice, ...

YSF

41

asked Nov 12 at 9:49

1 vote

1 answer

302 views

How to correctly install JAX with CUDA on Linux when `jax[cuda12_pip]` consistently falls back to the CPU version?

I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...

PowerPoint Trenton

115

asked Nov 12 at 9:36

-2 votes

0 answers

66 views

Why does jtop say that jetpack is missing despite the CLI returning the packages?

I had followed 6:47 onwards of this video https://www.youtube.com/watch?v=q4fGac-nrTI to install jetpack. These are the outputs of dpkg -l and jtop respectively: https://imgur.com/a/Xwwu0XX as dpkg -l ...

algo

69

asked Nov 11 at 11:28

3 votes

1 answer

119 views

Deleted function compiler errors using thrust::remove in C++

I am currently attempting to use the thrust::remove function on a thrust::device_vector of structs in my main function as shown below: #include <iostream> #include <thrust/device_vector.h>...

AowynB

33

asked Nov 11 at 8:44

-1 votes

1 answer

127 views

Why are CUDA threads repeating the same task?

I have coded a simple CUDA ZIP password cracker, but it seems to print the same password a number of times and I couldn't figure out why; this is weighing down my program. Here is the full ...

actgroup inc

27

asked Nov 10 at 5:19

0 votes

1 answer

139 views

Linking fails with: in function `main.cold': undefined reference to `__cxa_call_terminate'

I'm trying to build, using CMake, a program involving C++ and CUDA-C++ code. It used to build fine several months ago, but now I am getting a linker error I'm not familiar with: in function `main....

einpoklum

138k

asked Nov 9 at 23:14

3 votes

1 answer

134 views

Unable to run CUDA program in Google Colab

I am trying to run a basic CUDA program in Google Colab, but it's not giving kernel output. Below are the steps I tried: Changed run type to T4 GPU. !pip install nvcc4jupyter %load_ext ...

Digvijay Singh Thakur

3,351

asked Nov 6 at 7:52

1 vote

0 answers

89 views

cuda & cpp - compilation and linking using cmake

I want to create a skeleton for a project in which there are multiple cuda and cpp files. They will be compiled individually and then linked together to form a single executable. Currently I have the ...

ThErOmAnEmPiRe

63

asked Nov 4 at 20:08

1 vote

1 answer

64 views

How to debug cuda in Visual Studio with "step over"

I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022. I want to debug code, but I can't debug with step over (F10). The debugger always stops at a location without a breakpoint....

Imagination Youth

11

asked Oct 31 at 2:36

3 votes

1 answer

141 views

Numba CUDA code crashing due to unknown error, fixed with the addition of blank print statement in any thread

I'm writing some Hamiltonian evolution code that relies heavily on matrix multiplication, so I've been trying to learn about developing for a GPU using python. However, when I run these lines of code ...

user2506833

116

asked Oct 29 at 18:52

2 votes

0 answers

119 views

Implementing Arbitrary Precision Arithmetic in CubeCL for Infinite Zoom Fractals

Context: I'm implementing a Julia set fractal renderer using CubeCL (a Rust GPU compute framework). I want to achieve "infinite zoom" similar to deep Mandelbrot zoom videos, which requires ...

Marco Fanelli

41

asked Oct 27 at 18:52

1 vote

0 answers

185 views

Why does “Command Buffer Full” appear in PyTorch CUDA kernel launches?

I’m using the PyTorch profiler to analyze sglang, and I noticed that in the CUDA timeline, some kernels show “Command Buffer Full”. This causes the cudaLaunchKernel time to become very long, as shown ...

plznobug

143

asked Oct 23 at 12:36

1 vote

1 answer

166 views

memcpy_async does not work with pipeline roles

If I do a memcpy_async on a per thread basis, everything works fine, see the test_memcpy32 below. This code prefetches data within a single warp. I want to expand this, so that I can prefetch data in ...

Johan

77.4k

asked Oct 21 at 6:37

2 votes

1 answer

181 views

Executing a CUDA Graph from a CUDA kernel

I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch). From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...

Mohammad Siavashi

1,292

asked Oct 13 at 11:38

1 vote

0 answers

73 views

How can a removal of a boundary check introduce a BSYNC instruction in a following memory action?

I have some CUDA kernel code doing the following: half * restrict output; half v; // ... etc ... int i = whatever(); #ifdef CHECK_X if (i >= 0 && i <= SOME_CONSTANT) #endif { output[...

einpoklum

138k

asked Oct 8 at 14:27

0 votes

1 answer

45 views

Automatic cmake parameter /Zc:__cplusplus interpreted as a file name by nvcc

I am working on a C++ project on Windows, using CUDA 12.0, cmake 3.31.6, vcpkg (updated to recent commit a62ce77). During configuration CMake tries to launch nvcc with some small test program to get ...

CygnusX1

22.1k

asked Oct 3 at 17:07

0 votes

0 answers

99 views

TensorRT: enqueueV3 fails when using dynamic shapes and Green Contexts

I am trying to benchmark TensorRT inference using CUDA Green Contexts and splitting SMs. My code runs fine when I generate the .engine with fixed input shapes, but it fails when I build the engine ...

Gota_12

23

asked Oct 2 at 14:14

0 votes

1 answer

99 views

CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop

I am trying to implement the producer-consumer problem between GPU and CPU; it is required for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...

Chinmaya Bhat K K

1

asked Sep 30 at 18:38

1 vote

1 answer

59 views

Dask-CUDA LocalCUDACluster on WSL2: NVML errors despite enable_nvml=False

I’m trying to set up a LocalCUDACluster on WSL2 (Ubuntu 22.04) from Windows 11 for GPU computations. The cluster starts and runs, but performance is ~10× slower than running directly on the GPU, and ...

Marek Majoch

21

asked Sep 29 at 9:37

-3 votes

1 answer

70 views

How to debug cuda kernels in python, using vscode (linux)

I use cupy to call CUDA kernels, but I don't know how to debug CUDA code. Here is my wrapper file: wrapper.py import math from pathlib import Path import cupy as cp import numpy as np with open(Path(...

S200331082

1

asked Sep 25 at 13:08

8 votes

1 answer

624 views

How do I get the GPU clock rate in CUDA 13?

I updated CUDA to version 13. But it seems that cudaGetDeviceProperties has changed. Instead of returning the cudaDeviceProp struct with clockRate, it returns a mutilated version thereof with ...

Johan

77.4k

asked Sep 12 at 11:45

0 votes

1 answer

153 views

How to correctly monitor a program’s GPU memory bandwidth utilization and SM utilization? (DCGM DRAM_ACTIVE vs in-program bandwidth differs a lot)

I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing. My approach so far: Compute the theoretical ...

plznobug

143

asked Sep 5 at 10:48

2 votes

1 answer

128 views

ILGPU kernel silently not compiling

I am trying to debug a kernel written for ILGPU which does not compile. My application has 2 big kernels. The first (that loads and does the right thing): /// <summary> /// Unified GPU kernel ...

AlessandroParma

161

asked Aug 29 at 12:23

1 vote

1 answer

159 views

std::complex in cuda kernels

CUDA allows running constexpr member functions when compiling with --expt-relaxed-constexpr. This makes it possible to use std::complex<double> in CUDA kernels. However, while doing this, I get incorrect ...

thetwom

57

asked Aug 27 at 15:00

3 votes

0 answers

102 views

CUDA: Load misaligned float4 vector

I want to load 4 floats per thread. I know they are not 16 byte aligned. What is the best way to do this? Specific conditions: I cannot align the array without replicating the data because other ...

Homer512

15.1k

asked Aug 22 at 12:53

1 vote

1 answer

352 views

How are fp6 and fp4 supported on NVIDIA Tensor Core on Blackwell?

I am writing PTX assembly code in CUDA C++ for research. This is my setup: I downloaded the latest CUDA C++ toolkit (13.0) yesterday on WSL Linux. The local compilation environment does not ...

Junhao Liu

11

asked Aug 14 at 10:03

1 vote

1 answer

110 views

What is the actual maximum nesting depth of dynamic parallelism in CUDA?

Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase ...

StefanoTrv

308

asked Aug 12 at 13:29

3 votes

1 answer

157 views

calling constructor with different types of parameters in template function

I have a simple function gpu_allocate() to help allocate memory on GPU (CUDA): template <typename T> T *gpu_allocate() { T *data; cudaMallocManaged(&data, sizeof(T)); return data; } ...

Rahn

5,565

asked Aug 9 at 0:01

2 votes

1 answer

76 views

How to correctly pass float4 vector to kernel using PyCUDA?

I am trying to pass a float4 as argument to my cuda kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the ...

Dodilei

308

asked Aug 7 at 19:49

-2 votes

1 answer

71 views

cmake generating a bad command line option for CUDA in MSVC on Windows [closed]

The CMake build is producing this error message: nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified. It appears when running this command line, which CMake itself generates:...

alfC

16.8k

asked Aug 2 at 1:18

0 votes

1 answer

105 views

cudaSetValidDevices doesn't seem to work as expected

I'm using an NVIDIA machine which has 2 Tesla V100 GPUs. I used the cudaSetValidDevices API to set only 1 valid device, which is device 1. After that, if I try to set device 0, the API still seems to ...

Satyanvesh D

405

asked Jul 29 at 16:47

-1 votes

1 answer

180 views

Parallelising CCITT-CRC16 via CUDA

I have been tasked by my boss to convert a sequential CRC16 algorithm that runs on the CPU into something that can run on the GPU via CUDA (that isn't just running the sequential algorithm in a single ...

Louis Child

115

asked Jul 29 at 14:03

2 votes

1 answer

118 views

Strange behaviour of atomicCAS when used as a mutex

I'm trying to learn CUDA programming, and recently I have been working on the lectures in this course: https://people.maths.ox.ac.uk/~gilesm/cuda/lecs/lec3.pdf, where they discussed the atomicCAS ...

Dang Manh Truong

710

asked Jul 28 at 9:53

2 votes

0 answers

41 views

What do shuffle instructions do on the hardware? [duplicate]

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-shfl-...

Tom Huntington

3,750

asked Jul 23 at 5:49

1 vote

0 answers

73 views

Why does cusolverDnDsytri not find the inverse of a matrix in Fortran?

I have recently been trying to use CUDA in Fortran for a project which requires finding the inverse of symmetric matrices. The matrices I am working with are not positive definite so I cannot use a ...

Ethan

11

asked Jul 21 at 5:55

0 votes

1 answer

122 views

Why does NVCC not optimize ldexpf with a constexpr power-of-two exponent into a simple fmul?

Consider the following CUDA code: enum { p = 5 }; __device__ float adjust_mul(float x) { return x * (1 << p); } __device__ float adjust_ldexpf(float x) { return ldexpf(x, p); } I would expect ...

einpoklum

138k

asked Jul 20 at 15:09

2 votes

1 answer

86 views

Degree of Bank conflicts in cuda - Picture not clear from GPU GEMS Prefix Sum article

I am trying to understand this article : https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda More specifically bank-conflicts is what I am ...

user8469759

2,948

asked Jul 18 at 14:30

4 votes

1 answer

91 views

Can threads in a warp synchronize with different calls to __shfl_sync?

I'm trying to learn CUDA and I'm having trouble wrapping my head around warps and the lanes in them. A lane is the word I use for threads inside of the same warp. I can exchange data between lanes as ...

SomeoneWithQuestions

1,627

asked Jul 7 at 12:30

1 vote

0 answers

40 views

cusolverDnDgesvdj for the computation of singular values only under Python

I have set up the code below for the computation of the SVD of a matrix. The code uses cuSOLVER's cusolverDnDgesvdj. import ctypes import numpy as np import pycuda.autoinit # implicit ...

Vitality

21.7k

asked Jul 6 at 14:22

0 votes

1 answer

368 views

Distinction between CuTe and NVIDIA Cutlass

I'm confused about what exactly is handled by CuTe and by Cutlass. From my understanding, Cutlass handles the following: Gemm computation of CuTe Tensors; communication between CPU and GPU; abstract memory ...

jonithani123

254

asked Jul 2 at 14:23

1 vote

1 answer

173 views

CUDA `cudaMemcpyBatchAsync` "invalid argument"

I'm consistently encountering an "invalid argument" error when calling cudaMemcpyBatchAsync for host-to-device transfers. CUDA error at btest.cu:43 - invalid argument Line 43 is CUDA_CHECK(...

NDrew

13

asked Jun 30 at 15:04

3 votes

1 answer

137 views

Does Clang support dynamic parallelism in cuda?

Dynamic parallelism means kernels calling kernels. It's possible to compile a CUDA program using Clang, but does Clang support dynamic parallelism? I am getting this error when attempting to compile a CUDA ...

michael101

53

asked Jun 30 at 10:40
