Array access/write performance differences?

Question

This is probably going to be language dependent, but in general, what is the performance difference between reading from and writing to an array?

For example, suppose I am writing a prime sieve and representing the primes as a boolean array.

Upon finding a prime, I can say

for(int i = 2; n * i < end; i++)
{
    prime[n * i] = false;
}

or

for(int i = 2; n * i < end; i++)
{
    if(prime[n * i])
    {
        prime[n * i] = false;
    }
}

The intent in the latter case is to check the value before writing it to avoid having to rewrite many values that have already been checked. Is there any realistic gain in performance here, or are access and write mostly equivalent in speed?

You have the code. Profile it! We can only provide guesses; you are the only one who can find out the real answer.
– svick, Commented Jul 17, 2011 at 18:54
Fwiw, you might be able to optimize the loop condition n * i < end slightly. n and end stay constant throughout the loop, so before the loop begins you can compute int end_n = (end + n - 1) / n; (rounded up, so that the last multiple below end isn't skipped) and then change the loop condition to i < end_n. That will avoid the repeated multiplication each loop. Again, I say might - I don't know assembly; there could be a compound instruction for testing if 'a*b < c' that makes this 'optimization' worthless.
– Ponkadoodle, Commented Jul 17, 2011 at 19:19
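When hoisting the bound as suggested above, note that plain truncating division end / n would stop one multiple early whenever n does not divide end (for n = 3 and end = 10 it would skip 9), so the hoisted limit should be the ceiling of end / n. A minimal sketch in C, assuming n and end are positive and end + n does not overflow int; the function names are made up for illustration:

```c
#include <stdbool.h>

/* Hypothetical helper: returns a limit such that, for integer i,
   i < limit  is equivalent to  n * i < end  (assumes n > 0, end > 0). */
static int hoisted_limit(int n, int end)
{
    return (end + n - 1) / n;   /* ceiling of end / n */
}

/* Clears every multiple of n below end, starting at 2 * n,
   with the loop bound computed once before the loop. */
static void clear_multiples(bool *prime, int n, int end)
{
    int limit = hoisted_limit(n, end);
    for (int i = 2; i < limit; i++) {
        prime[n * i] = false;
    }
}
```

Whether this is any faster than the original condition is something only a measurement on the target machine can tell; an optimizing compiler may already do something equivalent.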
Some very rudimentary profiling suggests a complex answer: there is a cutoff point where the number of writes saved is actually beneficial and the second method pays off. But unless your ratio of redundant writes to necessary writes is very high, it is unlikely to pay off.
– donnyton, Commented Jul 18, 2011 at 0:55
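For anyone who wants to repeat the kind of rough profiling mentioned in these comments, here is a minimal sketch in C. It wraps the two inner loops from the question in a complete sieve and times each with clock(). The names sieve_write, sieve_check and time_sieve are hypothetical, and the code assumes 1 <= end <= INT_MAX / 2 so that n * i cannot overflow:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Variant 1 from the question: always write. */
static void sieve_write(bool *prime, int end)
{
    for (int n = 2; n < end; n++) {
        if (!prime[n]) continue;            /* only sieve with primes */
        for (int i = 2; n * i < end; i++) {
            prime[n * i] = false;
        }
    }
}

/* Variant 2 from the question: read first, write only if still set. */
static void sieve_check(bool *prime, int end)
{
    for (int n = 2; n < end; n++) {
        if (!prime[n]) continue;
        for (int i = 2; n * i < end; i++) {
            if (prime[n * i]) {
                prime[n * i] = false;
            }
        }
    }
}

/* Runs one variant on a fresh array of `end` flags and returns the CPU
   time in seconds, or -1.0 if the allocation fails. */
static double time_sieve(void (*sieve)(bool *, int), int end)
{
    bool *prime = malloc((size_t)end * sizeof *prime);
    if (prime == NULL) return -1.0;
    for (int k = 0; k < end; k++) prime[k] = true;

    clock_t start = clock();
    sieve(prime, end);
    clock_t stop = clock();

    volatile bool sink = prime[end - 1];    /* keep the result observable */
    (void)sink;
    free(prime);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_sieve(sieve_write, end) and time_sieve(sieve_check, end) several times for a range of end values, with optimizations enabled, gives a first impression. The numbers depend heavily on the compiler, the CPU and the array size, so treat them as a starting point rather than a verdict.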

em_and_m · Accepted Answer · 2011-07-17 18:39:55Z

3

Impossible to answer such a generic question without the specifics of the machine/OS this is running on, but in general the latter is going to be slower because:

1. In the second example you have to get the value from RAM into the L2/L1 cache and read it into a register, make a change to the value, and write it back. In the first case you might very well get away with simply writing a value to the L1/L2 caches. It can be written to RAM from the caches later while your program is doing something else.
2. The second form has much more code to execute per iteration. For a large enough number of iterations, the difference gets big real fast.

2 Comments

Karoly Horvath Over a year ago

2. For a short loop like this, the number of instructions executed per iteration doesn't really matter, as memory access is the bottleneck.

em_and_m Over a year ago

Who said it's a short loop? - I don't know what the value of end is. :-)

bchurchill · Accepted Answer · 2011-07-17 19:54:48Z

In general this depends much more on the machine than on the programming language. The writes will often take a few more clock cycles because, depending on the machine, more cached values need to be updated in memory.

However, your second segment of code will be WAY slower, and it's not just because there's "more code". The big reason is that anytime you use an if-statement on most machines the CPU uses a branch predictor. The CPU literally predicts which way the if-statement will run ahead of time, and if it's wrong it has to backtrack. See http://en.wikipedia.org/wiki/Pipeline_%28computing%29 and http://en.wikipedia.org/wiki/Branch_predictor to understand why.

If you want to do some optimization, I would recommend the following:

Profile! See what's really taking up time.
Multiplication is much harder than addition. Try rewriting the loop so that i += n, and use this for your array index.
The loop condition "should" be totally reevaluated at every iteration unless the compiler optimizes it away. So try avoiding multiplication in there.
Use -O2 or -O3 as a compiler option
You might find that some values of n are faster than others because of cache locality. You might think of some clever ways to rewrite your code to take advantage of this.
Disassemble the code and look at what it's actually doing on your processor
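The suggestion about replacing multiplication with addition can be sketched as follows in C. Instead of computing n * i for both the index and the loop condition, the loop steps the index itself by n. The function name is made up for illustration, and the code assumes n > 0 and that neither 2 * n nor end + n overflows int:

```c
#include <stdbool.h>

/* Clears every multiple of n in [2 * n, end) using addition only:
   m takes the values 2n, 3n, 4n, ... without any multiplication
   inside the loop. */
static void clear_multiples_additive(bool *prime, int n, int end)
{
    for (int m = 2 * n; m < end; m += n) {
        prime[m] = false;
    }
}
```

An optimizing compiler may already perform this transformation (strength reduction) on the original loop, so it is worth checking the disassembly or timing both versions before assuming a gain.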

tomasz · Accepted Answer · 2011-07-17 20:14:37Z

It's a hard question and it heavily depends on your hardware, OS and compiler. But for the sake of theory, you should consider two things: branching and memory access. As branching is generally evil, you want to avoid it. I wouldn't even be surprised if some compiler optimization took place and your second snippet were reduced to the first one (compilers love avoiding branches; they probably consider it a hobby, but they have a reason). So in these terms the first example is much cleaner and easier to deal with.

There are also CPU caches and other memory-related issues. I believe that in both examples you have to actually load the memory into the CPU cache, so that you can either read it or update it. While reading is not a problem, writing has to propagate the changes up. I wouldn't be worried if you use the function in a single thread (as @gby pointed out, the changes can be pushed out a little bit later).

There is only one scenario I can come up with that would make me consider the solution from your second example: if I shared the table between threads to work on it in parallel (without locking) and the CPUs had separate caches. Then, every time you modify a cache line from one thread, the other thread has to update its copy before reading or writing the same memory block. This is known as cache coherence, and it may actually hurt your performance badly; in such a case I could consider conditional writes. But wait, that's probably far away from your question...
