How does array_diff work?

Question

How does array_diff() work? It obviously couldn't work as follows:

function array_diff($arraya, $arrayb)
{
    $diffs = array();
    foreach ($arraya as $keya => $valuea)
    {
        $equaltag = 0;
        foreach ($arrayb as $valueb)     
        {
            if ($valuea == $valueb)
            {
                $equaltag =1;
                break;
            }
        }
        if ($equaltag == 0)
        {
              $diffs[$keya]=$valuea;
        }

    }
    return $diffs;                          
}                                  //couldn't be worse than this

Does anyone know a better solution?

EDIT @animuson:

function array_diff($arraya, $arrayb)
{
    foreach ($arraya as $keya => $valuea)
    {
        if (in_array($valuea, $arrayb))
        {
            unset($arraya[$keya]);
        }
    }
    return $arraya;
}

BurninLeo · Accepted Answer · 2014-11-27 09:44:51Z

33

user187291's suggestion to do it in PHP via hash tables is simply great! In a rush of adrenaline from this fantastic idea, I even found a way to speed it up a little more (PHP 5.3.1):

function leo_array_diff($a, $b) {
    $map = array();
    foreach($a as $val) $map[$val] = 1;
    foreach($b as $val) unset($map[$val]);
    return array_keys($map);
}

With the benchmark taken from user187291's posting:

LEO=0.0322  leo_array_diff()
ME =0.1308  my_array_diff()
YOU=4.5051  your_array_diff()
PHP=45.7114 array_diff()

The array_diff() performance lag is evident even at 100 entries per array.

Note: This solution implies that the elements in the first array are unique (or they will become unique). This is typical for a hash solution.
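To make the uniqueness caveat concrete, here is a small sketch that compares the built-in with the hash-based function (repeated here so the snippet runs on its own). The sample arrays are made up for illustration.

```php
<?php
// Hash-based diff from above, repeated so this snippet is self-contained.
function leo_array_diff($a, $b) {
    $map = array();
    foreach ($a as $val) $map[$val] = 1;      // each value becomes an array key
    foreach ($b as $val) unset($map[$val]);   // remove every value found in $b
    return array_keys($map);
}

$a = array('A', 'B', 'B', 'C');   // illustrative data: 'B' appears twice
$b = array('C');

// The built-in keeps both 'B' entries and their original keys.
var_dump(array_diff($a, $b) === array(0 => 'A', 1 => 'B', 2 => 'B'));  // bool(true)

// The hash version collapses duplicates, because array keys are unique.
var_dump(leo_array_diff($a, $b) === array('A', 'B'));                  // bool(true)
```

If duplicates in the first array matter, the hash map alone is not enough; the values would have to be looked up in a map built from the second array instead.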

Note: The solution does not preserve indices. Assign the original index to $map and finally use array_flip() to preserve keys.

function array_diff_pk($a, $b) {
    $map = array_flip($a);
    foreach($b as $val) unset($map[$val]);
    return array_flip($map);
}
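A short usage sketch for the key-preserving variant, with made-up sample data. It assumes the values in the first array are unique strings or integers, since array_flip() only accepts those types and later duplicates overwrite earlier keys.

```php
<?php
// Key-preserving variant from above, repeated so the snippet runs on its own.
function array_diff_pk($a, $b) {
    $map = array_flip($a);                    // value => original key
    foreach ($b as $val) unset($map[$val]);   // drop every value found in $b
    return array_flip($map);                  // back to key => value
}

// Illustrative data with string keys.
$a = array('x' => 'A', 'y' => 'B', 'z' => 'C');
$b = array('B');

var_dump(array_diff_pk($a, $b) === array('x' => 'A', 'z' => 'C'));  // bool(true)
var_dump(array_diff_pk($a, $b) === array_diff($a, $b));             // bool(true)
```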

PS: I found this while looking into an array_diff() paradox: array_diff() took three times longer for practically the same task when used twice in the script.

edited Nov 27, 2014 at 9:44

answered Jul 14, 2011 at 21:59

BurninLeo


3 Comments

user753531 Over a year ago

Although this is quite an old topic, I only found it today, and I couldn't reproduce what you described for getting an associative array as output.

BurninLeo Over a year ago

Added another short function, array_diff_pk, to preserve keys, including in associative arrays. However, I did not test the performance of array_flip or of the overall function. Please also note that using these replacement functions only makes sense if large arrays are handled that actually cause performance issues with the built-in (and in the meantime optimized) functions.

onin Over a year ago

I really like your solution.

Community · Accepted Answer · 2017-05-23 10:31:20Z

25

UPDATE

see below for faster/better code.
array_diff behaviour is much better in php 5.3.4, but still ~10 times slower than Leo's function.
also it's worth noting that these functions are not strictly equivalent to array_diff since they don't maintain array keys, i.e. my_array_diff(x,y) == array_values(array_diff(x,y)).

/UPDATE

A better solution is to use hash maps

function my_array_diff($a, $b) {
    $map = $out = array();
    foreach($a as $val) $map[$val] = 1;
    foreach($b as $val) if(isset($map[$val])) $map[$val] = 0;
    foreach($map as $val => $ok) if($ok) $out[] = $val;
    return $out;
}

$a = array('A', 'B', 'C', 'D');
$b = array('X', 'C', 'A', 'Y');

print_r(my_array_diff($a, $b)); // B, D
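The point in the update about lost keys can be checked with a short sketch. It reuses the function and sample data from above (repeated so it runs on its own) and assumes unique string values in the first array; note that numeric strings would come back as integers, because PHP converts them when they are used as array keys.

```php
<?php
function my_array_diff($a, $b) {
    $map = $out = array();
    foreach ($a as $val) $map[$val] = 1;
    foreach ($b as $val) if (isset($map[$val])) $map[$val] = 0;
    foreach ($map as $val => $ok) if ($ok) $out[] = $val;
    return $out;
}

$a = array('A', 'B', 'C', 'D');
$b = array('X', 'C', 'A', 'Y');

print_r(array_diff($a, $b));      // original keys kept: [1 => 'B', 3 => 'D']
print_r(my_array_diff($a, $b));   // reindexed:          [0 => 'B', 1 => 'D']

// Same values in the same order once the built-in result is reindexed.
var_dump(my_array_diff($a, $b) === array_values(array_diff($a, $b)));  // bool(true)
```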

benchmark

function your_array_diff($arraya, $arrayb)
{
    foreach ($arraya as $keya => $valuea)
    {
        if (in_array($valuea, $arrayb))
        {
            unset($arraya[$keya]);
        }
    }
    return $arraya;
}

$a = range(1, 10000);
$b = range(5000, 15000);

shuffle($a);
shuffle($b);

$ts = microtime(true);
my_array_diff($a, $b);
printf("ME =%.4f\n", microtime(true) - $ts);

$ts = microtime(true);
your_array_diff($a, $b);
printf("YOU=%.4f\n", microtime(true) - $ts);

result

ME =0.0137
YOU=3.6282

any questions? ;)

and, just for fun,

$ts = microtime(true);
array_diff($a, $b);
printf("PHP=%.4f\n", microtime(true) - $ts);

result

ME =0.0140
YOU=3.6706
PHP=19.5980

that's incredible!

edited May 23, 2017 at 10:31

CommunityBot


answered Mar 19, 2010 at 19:44

user187291


5 Comments

Young Over a year ago

OOPS!! That's really incredible!

goat Over a year ago

+1. I'm surprised that this is even faster, although unlike array_diff, index association is lost: array_keys(array_diff_key(array_fill_keys($a, 1), array_fill_keys($b, 1)))

ColinM Over a year ago

Also if $a contains duplicate values not in $b these will be lost.

Emsal Over a year ago

PHP is much faster now ME =0.0036 YOU=0.1217 PHP=0.0018

OlivierH Over a year ago

Same here with PHP 7.4 ME =0.0066 YOU=0.1145 PHP=0.0014

Amal · Accepted Answer · 2014-02-03 13:37:06Z

7

The best way to know how it works is to take a look at its source code ;-)
(Well, that's one of the powers of open source -- and if you see some possible optimization, you can submit a patch ;-) )

For array_diff, it should be in ext/standard -- which means, for PHP 5.3, it should be there : branches/PHP_5_3/ext/standard

And, then, the array.c file looks like a plausible target ; the php_array_diff function, line 3381, seems to correspond to array_diff.

(Good luck going through the code : it's quite long...)

edited Feb 3, 2014 at 13:37

Amal


answered Mar 19, 2010 at 19:17

Pascal MARTIN


1 Comment

Pascal MARTIN Over a year ago

Yeah, that's the kind of situation in which I think I should not have stopped using C... But, at the same time, I have no regrets ^^

Rune Schjellerup Philosof · Accepted Answer · 2013-10-11 10:25:27Z

It seems you can speed it up a good deal more by using another array instead of unsetting. However, this uses more memory, which might be an issue depending on the use case (I haven't tested actual differences in memory allocation).

<?php
function my_array_diff($a, $b) {
  $map = $out = array();
  foreach($a as $val) $map[$val] = 1;
  foreach($b as $val) if(isset($map[$val])) $map[$val] = 0;
  foreach($map as $val => $ok) if($ok) $out[] = $val;
  return $out;
}
function leo_array_diff($a, $b) {
  $map = $out = array();
  foreach($a as $val) $map[$val] = 1;
  foreach($b as $val) unset($map[$val]);
  return array_keys($map);
}
function flip_array_diff_key($b, $a) {
  $at = array_flip($a);
  $bt = array_flip($b);
  $d = array_diff_key($bt, $at);
  return array_keys($d);
}
function flip_isset_diff($b, $a) {
  $at = array_flip($a);
  $d = array();
  foreach ($b as $i)
    if (!isset($at[$i]))
      $d[] = $i;
  return $d;
}
function large_array_diff($b, $a) {
  $at = array();
  foreach ($a as $i)
    $at[$i] = 1;
  $d = array();
  foreach ($b as $i)
    if (!isset($at[$i]))
      $d[] = $i;
  return $d;
}

$functions = array("flip_array_diff_key", "flip_isset_diff", "large_array_diff", "leo_array_diff", "my_array_diff", "array_diff");
#$functions = array_reverse($functions);
$l = range(1, 1000000);
$l2 = range(1, 1000000, 2);

foreach ($functions as $function) {
  $ts = microtime(true);
  for ($i = 0; $i < 10; $i++) {
    $f = $function($l, $l2);
  }
  $te = microtime(true);
  $timing[$function] = $te - $ts;
}
asort($timing);
print_r($timing);

My timings are (PHP 5.3.27-1~dotdeb.0):

[flip_isset_diff] => 3.7415699958801
[flip_array_diff_key] => 4.2989008426666
[large_array_diff] => 4.7882599830627
[flip_flip_isset_diff] => 5.0816700458527
[leo_array_diff] => 11.086831092834
[my_array_diff] => 14.563184976578
[array_diff] => 99.379411935806

The three new functions were found at http://shiplu.mokadd.im/topics/performance-optimization/

I just tried these vs the built-in version with ~200k rows of real data, and the built-in was so much faster (a few seconds) that these versions didn't even finish before I got bored and cancelled the process. (~5 min?)

binaryLV · Accepted Answer · 2011-07-15 08:31:07Z

As this has been brought up (see @BurninLeo's answer), what about something like this?

function binary_array_diff($a, $b) {
    $result = $a;
    asort($a);
    asort($b);
    list($bKey, $bVal) = each($b);
    foreach ( $a as $aKey => $aVal ) {
        while ( $aVal > $bVal ) {
            list($bKey, $bVal) = each($b);
        }
        if ( $aVal === $bVal ) {
            unset($result[$aKey]);
        }
    }
    return $result;
}

After performing some tests, results seem to be acceptable:

$a = range(1, 10000);
$b = range(5000, 15000);

shuffle($a);
shuffle($b);

$ts = microtime(true);
for ( $n = 0; $n < 10; ++$n ) {
    array_diff($a, $b);
}
printf("PHP    => %.4f\n", microtime(true) - $ts);

$ts = microtime(true);
for ( $n = 0; $n < 10; ++$n ) {
    binary_array_diff($a, $b);
}
printf("binary => %.4f\n", microtime(true) - $ts);

$binaryResult = binary_array_diff($a, $b);
$phpResult    = array_diff($a, $b);
if ( $binaryResult == $phpResult && array_keys($binaryResult) == array_keys($phpResult) ) {
    echo "returned arrays are the same\n";
}

Output:

PHP    => 1.3018
binary => 1.3601
returned arrays are the same

Of course, PHP code cannot perform as well as C code, so it is no wonder that the PHP version is a bit slower.
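Note that each() was deprecated in PHP 7.2 and removed in PHP 8.0, so the function above no longer runs on current versions. The following is a sketch of the same sort-then-merge idea without each(). The name merge_array_diff is made up for illustration, and it assumes all values share one type (all integers or all strings) so that the < and === comparisons agree with the sort order.

```php
<?php
// Sorted-merge diff without each(); a sketch, not the answer's original code.
function merge_array_diff(array $a, array $b): array {
    $result = $a;   // unsorted copy, so original order and keys survive
    asort($a);      // sort by value, keeping keys
    sort($b);       // keys of $b are not needed, so reindex while sorting
    $j = 0;
    $n = count($b);
    foreach ($a as $key => $val) {
        // Skip values in $b that are smaller than the current value of $a.
        while ($j < $n && $b[$j] < $val) {
            $j++;
        }
        // $j is not advanced on a match, so duplicates in $a are removed too.
        if ($j < $n && $b[$j] === $val) {
            unset($result[$key]);
        }
    }
    return $result;
}

$a = array(5, 3, 9, 1, 3);   // illustrative data
$b = array(3, 4, 9, 20);
var_dump(merge_array_diff($a, $b) === array(0 => 5, 3 => 1));  // bool(true)
var_dump(merge_array_diff($a, $b) === array_diff($a, $b));     // bool(true)
```

The explicit bounds check on $j also covers the case where the second array runs out before the first one does.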

animuson · Accepted Answer · 2010-03-19 19:20:14Z

-2

From PHP: "Returns an array containing all the entries from array1 that are not present in any of the other arrays."

So, you just check array1 against all arrayN and any values in array1 that don't appear in any of those arrays will be returned in a new array.

You don't necessarily even need to loop through all of array1's values. For each of the additional arrays, just loop through its values and check whether each value is in array1, i.e. in_array($value, $array1).
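As a rough sketch of that description (not how PHP implements it internally): naive_array_diff is a hypothetical name, and it uses array_keys() with a search value rather than in_array(), so that the matching entries can be removed directly. It compares loosely with ==, whereas the built-in compares string representations.

```php
<?php
// Naive illustration of the documented behavior; quadratic in the array sizes.
function naive_array_diff(array $array1, array ...$others): array {
    foreach ($others as $other) {
        foreach ($other as $value) {
            // array_keys() with a search value returns every key in $array1
            // whose value loosely equals $value.
            foreach (array_keys($array1, $value) as $key) {
                unset($array1[$key]);
            }
        }
    }
    return $array1;
}

// Illustrative data: 'b' and 'd' appear in the additional arrays.
$result = naive_array_diff(array('a', 'b', 'c', 'd'), array('b'), array('d', 'x'));
var_dump($result === array(0 => 'a', 2 => 'c'));  // bool(true)
```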

answered Mar 19, 2010 at 19:20

animuson♦


2 Comments

Byron Whitlock Over a year ago

-1 It is more complex than that. There are advanced algorithms and data structures being used. See pascal's answer.

animuson Over a year ago

It is still a basic idea of what is happening and is a better solution than what he had.
