2 votes
1 answer
86 views

I am trying to understand this article: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda More specifically, bank conflicts are what I am ...
user8469759 • 2,948
0 votes
1 answer
87 views

I learned from this nice SO answer that the CUDA function cudaMallocPitch creates padded memory that helps avoid bank conflicts. I can understand well how the padding helps alignment, as it very much ...
PkDrew • 2,301
1 vote
0 answers
268 views

I am trying to incrementally optimize a matrix transpose operation in CUDA and gain some hands-on experience. I have tried a few things, but the timing measurements that I am getting do not make sense. ...
Saydon • 27
3 votes
0 answers
377 views

I'm working on the render part of Assignment 2 for CMU's 15-418 course, which involves writing a high-performance renderer in CUDA. In my code, each CUDA thread is responsible for computing a single ...
Sunjnn • 51
1 vote
0 answers
151 views

I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication: gemm3 is the baseline shared-memory GEMM, gemm4 uses fewer thread blocks in the x dimension, and gemm5 uses fewer blocks in both ...
Worldbuffer
1 vote
0 answers
106 views

Following the trick described here, I tested the following code and got the corresponding profiling result. Conflicts were notably diminished, but some still persist. // store conflict __global__ void ...
picklesmithy129
0 votes
1 answer
134 views

I was trying to reproduce a bank-conflict scenario (minimal working example here) and decided to run a benchmark in which a warp (32 threads) accesses 32 integers of 32 bits each in the following 2 ...
Ferdinand Mom
0 votes
1 answer
716 views

If all threads in the same block access the same address, i.e. array[0], there is a bank conflict on some older compute capabilities. But does this conflict still exist on the latest compute capabilities (i.e....
cctv • 19
3 votes
1 answer
784 views

I have a little confusion about bank conflicts, avoiding them using memory padding, and coalesced memory access. What I've read so far: coalesced memory access from global memory is optimal. If it isn'...
SimonH • 1,455
2 votes
1 answer
1k views

I would like to be sure that I correctly understand bank conflicts in shared memory. I have 32 segments of data. These segments consist of 128 integers each. [[0, 1, ..., 126, 127], [128, 129, ..., ...
Piotr K.
1 vote
1 answer
310 views

It is possible to use nvprof to access/read bank-conflict counters for a CUDA executable: nvprof --events shared_st_bank_conflict,shared_ld_bank_conflict my_cuda_exe However, it does not work for the code ...
Artyom • 31.5k
-1 votes
1 answer
2k views

I'm in the process of writing some N-body simulation code with short-ranged interactions in CUDA targeted toward Volta and Turing series cards. I plan on using shared memory, but it's not quite clear ...
Ian Graham
0 votes
1 answer
853 views

I have the following code that performs a tiled matrix transpose using shared memory to improve performance. The shared memory is padded with 1 column to avoid bank conflicts for a 32x32 thread block. ...
Moody • 1,417
1 vote
2 answers
1k views

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements ...
einpoklum • 138k
1 vote
1 answer
504 views

I have 5 large arrays A(N*5), B(N*5), C(N*5), D(N*5), E(N*2); the numbers 5 and 2 represent the components of these variables in different planes/axes. That's why I have structured the arrays in this ...
user2415927
1 vote
1 answer
2k views

I have an array like this: data[16] = {10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2} I want to compute the reduction of this array using shared memory on a G80 GPU. The kernel, as cited in the NVIDIA ...
sara idrissi
0 votes
1 answer
269 views

I am working on a kernel that does a vector reduction. It basically adds up all the positions in the vector and stores the result in position 0. I'm following this scheme, with blocks of 512 float ...
ismarlowe • 157
3 votes
1 answer
1k views

I am doing a detailed code analysis for which I want to measure the total number of bank conflicts per warp. The nvvp documentation lists this metric, which was the only one I could find related to ...
Kajal • 611
3 votes
1 answer
442 views

I understand the bank conflict when dealing with 4-byte data types, but I wonder if we get any bank conflict (4-way/8-way?) with the following code __shared__ char shared[]; foo = shared[threadIdx.x]; ...
Karl • 31
3 votes
1 answer
422 views

It is a mystery to me how shared memory on CUDA devices works. I was curious to count the threads having access to the same shared memory. For this I wrote a simple program #include <cuda_runtime.h> ...
yarchik • 367
0 votes
2 answers
331 views

These days I'm trying to program on a mobile GPU (Adreno). The algorithm I use for image processing has 'randomness' in its memory access: it reads some pixels within a 'fixed' range for filtering. But ...
eclipse0922
0 votes
2 answers
249 views

I'm developing a face detection app on the Android platform using OpenCL. The face detection algorithm is based on the Viola-Jones algorithm. I tried to write the kernel code for the cascade classification step, and I set ...
youngwan lee
-1 votes
2 answers
1k views

I've been working on optimizing some code and ran into an issue with the shared memory bank conflict report from the CUDA Nsight performance analysis. I was able to reduce it to a very simple piece ...
Nisrak • 335
3 votes
1 answer
2k views

I'm running into (what I believe are) shared-memory bank conflicts in a CUDA kernel. The code itself is fairly complex, but I reproduced it in the simple example attached below. In this case it is ...
Bart • 10.4k
0 votes
1 answer
228 views

Suppose our hardware has 32 banks of 4 byte width. And we have a 1D kernel of size 32, and a local 1D array of ints. Then, ensuring that each consecutive thread accesses consecutive memory locations ...
Jacko • 13.4k
3 votes
1 answer
1k views

From what I read in the CUDA documentation, shared memory bank conflicts are irrelevant on sm_20 and higher because values are broadcast when they are requested simultaneously, preventing any sort ...
user3800357
10 votes
1 answer
1k views

This blog post explains how memory bank conflicts kill the transpose function's performance. Now I can't help but wonder: does the same happen on a "normal" CPU (in a multithreaded context)? Or is this ...
rubenvb • 77.2k
6 votes
1 answer
1k views

I just learned (from Why only one of the warps is executed by a SM in cuda?) that Kepler GPUs can actually execute instructions from several (apparently 4) warps at once. Can a shared memory bank ...
user3314215
1 vote
1 answer
262 views

I am accessing global memory to load data to shared memory and would like to know if there is a bank conflict. Here is the setup: In global memory: g_array. A 2D matrix of size (256, 64) This is ...
Adjeiinfo • 149
7 votes
1 answer
3k views

Shared memory is "striped" into banks. This leads to the whole issue of bank conflicts, as we all know. Question: But how can you determine how many banks ("stripes") exist in ...
cmo • 4,154
1 vote
1 answer
641 views

In my program I use shared memory to prefetch data. A 2D block of threads, with dimensions 8 by 4 (32 threads), gets 8 * 4 * 8 * sizeof(float4) bytes of shared memory. Each thread copies 8 float4s in a ...
Dorota Kadłubowska
2 votes
1 answer
2k views

On NVIDIA's 2.x architecture, each SM has 64 KB of on-chip memory that is by default partitioned into 48 KB of shared memory and 16 KB of L1 cache (servicing global and constant memory). We all know about the ...
cmo • 4,154
1 vote
1 answer
3k views

I'm a rookie learning CUDA parallel programming, and I'm confused about the device's global memory access. It's about the warp model and coalescing. There are some points: it's said that threads in ...
Han • 407
0 votes
1 answer
231 views

This piece of CUDA code reports lots of bank conflicts when analysed by Nsight. The first snippet contains the constants definition and kernel call: // Front update related constants #define NDEQUES ...
dsilva.vinicius
8 votes
1 answer
175 views

The kernel parameters are stored in on-chip shared memory. Shared memory can have bank conflicts if threads try to access the same bank. So my question is: does that mean that using kernel parameters ...
Netuimeni
7 votes
3 answers
788 views

Let A be a properly aligned array of 32-bit integers in shared memory. If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts? In other words: ...
CygnusX1 • 22.1k
1 vote
1 answer
344 views

What is a bank conflict in devices with compute capability 2.x? As I understand the CUDA C Programming Guide, on 2.x devices, if two threads access the same 32-bit word in the same shared memory bank, it does ...
gmemon • 2,761
1 vote
1 answer
448 views

I am designing a CUDA kernel that will be launched with 16 threads per thread block. I have an array of N ints in shared memory (i.e. per thread block) that I wish to process. If the access pattern ...
twerdster • 5,023
0 votes
2 answers
899 views

I have to use shared memory that is 64 elements in size, twice the number of banks and threads in a warp. How should I address them to yield a bank-conflict-free access?
Behzad Baghapour
0 votes
1 answer
1k views

I am trying to transfer some data from shared memory to global memory. Some consecutive threads will access one bank (but not the same 32 bits), so there are some bank conflicts. (I use the Visual Profiler to ...
papayamomo
1 vote
1 answer
487 views

I apologize in advance for the vagueness of this question. Background: I am attempting to write a morphological image processing function in OpenCL. I have a __local buffer which I use to store ...
Reefpoints
14 votes
2 answers
13k views

I am trying to understand how bank conflicts take place. I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore ...
scatman • 14.6k
21 votes
4 answers
9k views

One thing I haven't figured out, and Google isn't helping me with, is why it is possible to have bank conflicts with shared memory, but not in global memory. Can there be bank conflicts with registers? ...
smuggledPancakes
130 votes
5 answers
70k views

I have been reading the programming guide for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject ...
smuggledPancakes
4 votes
3 answers
3k views

What is the difference between coalescing and bank conflicts when programming with CUDA? Is it only that coalescing happens in global memory while bank conflicts happen in shared memory? Should I worry ...
hero • 41