-5 votes
1 answer
46 views

I have CUDA installed via the regular Windows installer from the official website, and am trying to use PyTorch in PyCharm using CUDA as the kernel. PyTorch now works fine, however ...
alexanderjansma
0 votes
0 answers
47 views

I’m trying to run TensorFlow with GPU on Windows 11 + WSL2 using an NVIDIA RTX 5090. The issue: TensorFlow currently supports CUDA 12.3 and cuDNN 9.x, and requires NVIDIA driver ≥ 560.94. But RTX 50-series GPUs ...
Mansoor
  • 27
0 votes
0 answers
49 views

Can anyone point out what I am doing wrong in the following tiled matmul kernel in CUDA? It is a tiled matrix multiplication of two matrices a and b, whose dimensions are m×k and k×n respectively. ...
explorer
Advice
1 vote
5 replies
94 views

So I'm trying to learn CUDA C. I had an idea for a simple code that could calculate the simple average of a float array. The idea is that main() will call a host function get_average(), which will ...
bob.sacamento
0 votes
0 answers
54 views

I am compiling NCCL 2.27.5-1 (I tried also 2.28.9-1) from source for a V100 GPU (sm_70). My goal is to have libnccl.so contain compute_70 PTX for every kernel. Despite passing explicit -gencode=arch=...
CiZ
  • 9
0 votes
0 answers
61 views

I am implementing a GPU-to-GPU benchmark using CUDA IPC on a node where two GPUs are connected via NVLink/NVSwitch. The workflow is the following: Each process allocates a device buffer on its local ...
Matteo Sperini
0 votes
0 answers
56 views

I'm facing a frustrating JAX runtime error on a multi-GPU server. My script works fine for a simple test but fails with a No BLAS support for stream error when I try to run multiple instances of it in ...
PowerPoint Trenton
5 votes
2 answers
342 views

Problem: I'm trying to use clangd for LSP in Neovim with CUDA .cu files, but it fails to recognize standard C++ library features on the host side. Even simple host functions using std::format, std::...
NeKon
  • 314
3 votes
0 answers
94 views

Can I modify host data in host_data_ptr after the following? cudaMemcpyAsync(device_data_ptr, host_data_ptr, size, cudaMemcpyHostToDevice, ...
YSF
  • 41
1 vote
1 answer
302 views

I am trying to install JAX with GPU support on a powerful, dedicated Linux server, but I am stuck in what feels like a Catch-22 where every official installation method fails in a different way, ...
PowerPoint Trenton
-2 votes
0 answers
66 views

I followed this video https://www.youtube.com/watch?v=q4fGac-nrTI from 6:47 onwards to install JetPack. These are the outputs of dpkg -l and jtop respectively: https://imgur.com/a/Xwwu0XX as dpkg -l ...
algo
  • 69
3 votes
1 answer
119 views

I am currently attempting to use the thrust::remove function on a thrust::device_vector of structs in my main function as shown below: #include <iostream> #include <thrust/device_vector.h>...
AowynB
  • 33
-1 votes
1 answer
127 views

I have coded a simple CUDA ZIP password cracker, but it seems to print the same password a number of times. I couldn't figure out why, and this is slowing down my program. Here is the full ...
actgroup inc
0 votes
1 answer
139 views

I'm trying to build, using CMake, a program involving C++ and CUDA-C++ code. It used to build fine several months ago, but now I am getting a linker error I'm not familiar with: in function `main....
einpoklum
  • 138k
3 votes
1 answer
134 views

I am trying to run a basic CUDA program in Google Colab, but it's not giving kernel output. Below are the steps I tried: Changed run type to T4 GPU. !pip install nvcc4jupyter %load_ext ...
Digvijay Singh Thakur
1 vote
0 answers
89 views

I want to create a skeleton for a project in which there are multiple cuda and cpp files. They will be compiled individually and then linked together to form a single executable. Currently I have the ...
ThErOmAnEmPiRe
1 vote
1 answer
64 views

I installed NVIDIA Nsight Visual Studio Edition 2025.01 in Visual Studio 2022. I want to debug code, but I can't debug with Step Over (F10): the debugger always stops at a location without a breakpoint....
Imagination Youth
3 votes
1 answer
141 views

I'm writing some Hamiltonian evolution code that relies heavily on matrix multiplication, so I've been trying to learn about developing for a GPU using Python. However, when I run these lines of code ...
user2506833
2 votes
0 answers
119 views

Context: I'm implementing a Julia set fractal renderer using CubeCL (a Rust GPU compute framework). I want to achieve "infinite zoom" similar to deep Mandelbrot zoom videos, which requires ...
Marco Fanelli
1 vote
0 answers
185 views

I’m using the PyTorch profiler to analyze sglang, and I noticed that in the CUDA timeline, some kernels show “Command Buffer Full”. This causes the cudaLaunchKernel time to become very long, as shown ...
plznobug
  • 143
1 vote
1 answer
166 views

If I do a memcpy_async on a per-thread basis, everything works fine; see test_memcpy32 below. This code prefetches data within a single warp. I want to expand this, so that I can prefetch data in ...
Johan
  • 77.4k
2 votes
1 answer
181 views

I’m trying to launch a captured CUDA Graph from inside a regular CUDA kernel (i.e., device-side graph launch). From the NVIDIA blog on device graph launch, it seems this should be supported on newer ...
Mohammad Siavashi
1 vote
0 answers
73 views

I have some CUDA kernel code doing the following: half * restrict output; half v; // ... etc ... int i = whatever(); #ifdef CHECK_X if (i >= 0 && i <= SOME_CONSTANT) #endif { output[...
einpoklum
  • 138k
0 votes
1 answer
45 views

I am working on a C++ project on Windows, using CUDA 12.0, CMake 3.31.6, and vcpkg (updated to recent commit a62ce77). During configuration CMake tries to launch nvcc with some small test program to get ...
CygnusX1
  • 22.1k
0 votes
0 answers
99 views

I am trying to benchmark TensorRT inference using CUDA Green Contexts and splitting SMs. My code runs fine when I generate the .engine with fixed input shapes, but it fails when I build the engine ...
Gota_12
  • 23
0 votes
1 answer
99 views

I am trying to implement the producer-consumer problem between GPU and CPU; it is required for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...
Chinmaya Bhat K K
1 vote
1 answer
59 views

I’m trying to set up a LocalCUDACluster on WSL2 (Ubuntu 22.04) from Windows 11 for GPU computations. The cluster starts and runs, but performance is ~10× slower than running directly on the GPU, and ...
Marek Majoch
-3 votes
1 answer
70 views

I use CuPy to call CUDA kernels, but I don't know how to debug the CUDA code. Here is my wrapper file: wrapper.py import math from pathlib import Path import cupy as cp import numpy as np with open(Path(...
S200331082
8 votes
1 answer
624 views

I updated CUDA to version 13. But it seems that cudaGetDeviceProperties has changed. Instead of returning the cudaDeviceProp struct with clockRate, it returns a mutilated version thereof with ...
Johan
  • 77.4k
0 votes
1 answer
153 views

I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing. My approach so far: Compute the theoretical ...
plznobug
  • 143
2 votes
1 answer
128 views

I am trying to debug a kernel written for ILGPU which does not compile. My application has 2 big kernels. The first (that loads and does the right thing): /// <summary> /// Unified GPU kernel ...
AlessandroParma
1 vote
1 answer
159 views

CUDA allows running constexpr member functions when compiling with --expt-relaxed-constexpr. This allows using std::complex<double> in CUDA kernels. However, while doing this, I get incorrect ...
thetwom
  • 57
3 votes
0 answers
102 views

I want to load 4 floats per thread. I know they are not 16-byte aligned. What is the best way to do this? Specific conditions: I cannot align the array without replicating the data because other ...
Homer512
  • 15.1k
1 vote
1 answer
352 views

I am writing PTX assembly code in CUDA C++ for research. This is my setup: I downloaded the latest CUDA C++ toolkit (13.0) yesterday on WSL Linux. The local compilation environment does not ...
Junhao Liu
1 vote
1 answer
110 views

Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase ...
StefanoTrv
3 votes
1 answer
157 views

I have a simple function gpu_allocate() that helps allocate memory on the GPU (CUDA): template <typename T> T *gpu_allocate() { T *data; cudaMallocManaged(&data, sizeof(T)); return data; } ...
Rahn
  • 5,565
2 votes
1 answer
76 views

I am trying to pass a float4 as an argument to my CUDA kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the ...
Dodilei
  • 308
-2 votes
1 answer
71 views

The CMake build is producing this error message: nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified, when running this command line that CMake itself generates:...
alfC
  • 16.8k
0 votes
1 answer
105 views

I'm using an NVIDIA machine which has 2 Tesla V100 GPUs. I used the cudaSetValidDevices API to set only 1 valid device, which is device 1. After that, if I try to set device 0, the API still seems to ...
Satyanvesh D
-1 votes
1 answer
180 views

I have been tasked by my boss to convert a sequential CRC16 algorithm that runs on the CPU into something that can run on the GPU via CUDA (that isn't just running the sequential algorithm in a single ...
Louis Child
2 votes
1 answer
118 views

I'm trying to learn CUDA programming, and recently I have been working on the lectures in this course: https://people.maths.ox.ac.uk/~gilesm/cuda/lecs/lec3.pdf, where they discussed the atomicCAS ...
Dang Manh Truong
2 votes
0 answers
41 views

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-shfl-...
Tom Huntington
1 vote
0 answers
73 views

I have recently been trying to use CUDA in Fortran for a project which requires finding the inverse of symmetric matrices. The matrices I am working with are not positive definite so I cannot use a ...
Ethan
  • 11
0 votes
1 answer
122 views

Consider the following CUDA code: enum { p = 5 }; __device__ float adjust_mul(float x) { return x * (1 << p); } __device__ float adjust_ldexpf(float x) { return ldexpf(x, p); } I would expect ...
einpoklum
  • 138k
2 votes
1 answer
86 views

I am trying to understand this article: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda More specifically, bank conflicts are what I am ...
user8469759
  • 2,948
4 votes
1 answer
91 views

I'm trying to learn CUDA and I'm having trouble wrapping my head around warps and the lanes in them. A lane is the word I use for threads inside of the same warp. I can exchange data between lanes as ...
SomeoneWithQuestions
1 vote
0 answers
40 views

I have set up the code below for the computation of the SVD of a matrix. The code uses cuSOLVER's cusolverDnDgesvdj. import ctypes import numpy as np import pycuda.autoinit # implicit ...
Vitality
  • 21.7k
0 votes
1 answer
368 views

I'm confused about what exactly is handled by CuTe and what by CUTLASS. From my understanding, CUTLASS handles the following: GEMM computation of CuTe tensors; communication between CPU and GPU; abstract memory ...
jonithani123
1 vote
1 answer
173 views

I'm consistently encountering an "invalid argument" error when calling cudaMemcpyBatchAsync for host-to-device transfers. CUDA error at btest.cu:43 - invalid argument Line 43 is CUDA_CHECK(...
NDrew
  • 13
3 votes
1 answer
137 views

Dynamic parallelism means kernels call kernels. It's possible to compile a CUDA program using clang, but does clang support dynamic parallelism? I am getting this error when attempting to compile a CUDA ...
michael101
