
I have some C code running on a dev board with an ARM Cortex-A9 that does image processing, and I need to speed it up. The code reads 8 RGB pixels, where each color channel is represented as a uint8_t. These pixels need to be color corrected, so a lookup table is used to look up the color-corrected value of a single channel of a pixel. A color-corrected channel uses a 16-bit type, but the number of bits actually used varies with the output_color_depth parameter.

After this preprocessing step I need to extract each significant bit of each channel of each pixel and store that in an output buffer.

The code below is the function in question:


#include <stdint.h>

struct Pixel {
    uint8_t r;
    uint8_t g;
    uint8_t b;
};


static const uint16_t colorLookup[256] = { ... };

void postProcessImage(const struct Pixel* img, const uint16_t imgWidth, const uint16_t imgHeight,
                      uint8_t** output, const uint8_t output_color_depth)
{
    const uint8_t input_color_depth = 8;

    for (uint16_t y = 0; y < imgHeight; ++y)
    {
        // 32-bit offset: a 16-bit y * imgWidth product would overflow
        // for anything but tiny images
        const uint32_t top_offset = (uint32_t)y * imgWidth;

        for (uint16_t x = 0; x < imgWidth; x += 8)
        {
            const uint32_t offset = top_offset + x;

            // Get 8 pixels to use. This is done since 8 pixels
            // means 24 color channels which can fit exactly into
            // 3 bytes
            const uint16_t r0 = colorLookup[img[offset + 0].r];
            const uint16_t g0 = colorLookup[img[offset + 0].g];
            const uint16_t b0 = colorLookup[img[offset + 0].b];

            const uint16_t r1 = colorLookup[img[offset + 1].r];
            const uint16_t g1 = colorLookup[img[offset + 1].g];
            const uint16_t b1 = colorLookup[img[offset + 1].b];

            const uint16_t r2 = colorLookup[img[offset + 2].r];
            const uint16_t g2 = colorLookup[img[offset + 2].g];
            const uint16_t b2 = colorLookup[img[offset + 2].b];

            const uint16_t r3 = colorLookup[img[offset + 3].r];
            const uint16_t g3 = colorLookup[img[offset + 3].g];
            const uint16_t b3 = colorLookup[img[offset + 3].b];

            const uint16_t r4 = colorLookup[img[offset + 4].r];
            const uint16_t g4 = colorLookup[img[offset + 4].g];
            const uint16_t b4 = colorLookup[img[offset + 4].b];

            const uint16_t r5 = colorLookup[img[offset + 5].r];
            const uint16_t g5 = colorLookup[img[offset + 5].g];
            const uint16_t b5 = colorLookup[img[offset + 5].b];

            const uint16_t r6 = colorLookup[img[offset + 6].r];
            const uint16_t g6 = colorLookup[img[offset + 6].g];
            const uint16_t b6 = colorLookup[img[offset + 6].b];

            const uint16_t r7 = colorLookup[img[offset + 7].r];
            const uint16_t g7 = colorLookup[img[offset + 7].g];
            const uint16_t b7 = colorLookup[img[offset + 7].b];

            for (uint8_t c = 0; c < output_color_depth; ++c)
            {
                // For each significant bit we create the resulting byte
                // and store it into the output buffer.
                output[c][offset + 0] = (((g2 >> c) & 1) << 7) | (((r2 >> c) & 1) << 6)
                                      | (((b1 >> c) & 1) << 5) | (((g1 >> c) & 1) << 4)
                                      | (((r1 >> c) & 1) << 3) | (((b0 >> c) & 1) << 2)
                                      | (((g0 >> c) & 1) << 1) |  ((r0 >> c) & 1);

                output[c][offset + 1] = (((r5 >> c) & 1) << 7) | (((b4 >> c) & 1) << 6)
                                      | (((g4 >> c) & 1) << 5) | (((r4 >> c) & 1) << 4)
                                      | (((b3 >> c) & 1) << 3) | (((g3 >> c) & 1) << 2)
                                      | (((r3 >> c) & 1) << 1) |  ((b2 >> c) & 1);

                output[c][offset + 2] = (((b7 >> c) & 1) << 7) | (((g7 >> c) & 1) << 6)
                                      | (((r7 >> c) & 1) << 5) | (((b6 >> c) & 1) << 4)
                                      | (((g6 >> c) & 1) << 3) | (((r6 >> c) & 1) << 2)
                                      | (((b5 >> c) & 1) << 1) |  ((g5 >> c) & 1);
            }
        }
    }
}

Now this function performs too slowly and I need to optimize it. I'm looking at using NEON instructions. It's hard for me to find examples online, but intuitively this seems like something that should be vectorizable. Could someone give me some pointers on how to achieve that? I'm also open to other suggestions for optimizing this code!

Any help is appreciated.

  • Mandatory question: are you compiling your code with compiler optimizations enabled? If not, do that as step 1. Commented Sep 6, 2019 at 19:15
  • Optimization #1: Use a single array, not an array of pointers to arrays, indexed as img[y*imgWidth+x]. Because you run through the whole array, you can simply index as img[index], with index running from 0 to imgWidth*imgHeight. Not only is the indexing cheaper, but you'll have less memory fragmentation and better cache locality. Commented Sep 6, 2019 at 19:18
  • Why are you assigning to output[c][offset] three times consecutively? Commented Sep 6, 2019 at 19:19
  • Please show the compile/link command you use and also the GCC version. Commented Sep 6, 2019 at 19:22
  • Try adding -march=native Commented Sep 6, 2019 at 19:30

2 Answers


I can give hints as to what you could improve. The biggest improvement would be to change the problem and avoid the strange bit fiddling altogether (it looks like you are converting an RGB image to something like a bit-plane image?).

One approach would probably be to do everything in one step (I believe this would be something like the answer above by Pete D.).

If you stay with your two-part solution (first the lookup, then the bit reassignment): the first part can't really be optimized much, because you need a load for each input byte. What you can do is change the lookup table to 32-bit entries, as the Cortex-A9 is a strange CPU in this respect: it takes 1 cycle for 32-bit loads, but 2 cycles for anything else (8-bit/16-bit). Also, if you can, rearrange the loading of the RGB image so that it uses 32-bit loads too (4 bytes at a time) and extract the four 8-bit values one after the other. In short: the limiting factor in the lookup part is loads, so you have to reduce them, preferably by using as many aligned 32-bit loads as possible.
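
A minimal sketch of that idea, assuming a little-endian CPU, 4-byte-aligned input, and hypothetical names (colorLookup32, lookup4):

#include <stdint.h>

/* Same 256 values as colorLookup, zero-extended to 32 bits, so every
   table access is a single 1-cycle 32-bit load. */
static const uint32_t colorLookup32[256] = { 0 /* fill from colorLookup */ };

/* Fetch 4 channel bytes with one aligned 32-bit load, then peel them
   apart in registers instead of doing four 1-byte loads. */
static void lookup4(const uint32_t *src32, uint32_t out[4])
{
    const uint32_t quad = src32[0];
    out[0] = colorLookup32[ quad        & 0xFF];
    out[1] = colorLookup32[(quad >>  8) & 0xFF];
    out[2] = colorLookup32[(quad >> 16) & 0xFF];
    out[3] = colorLookup32[ quad >> 24        ];
}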

If your lookup table values can be calculated with only a few arithmetic operations, it would be faster to avoid the lookup and do a vectorized calculation instead (8-bit input --> 16-bit output).
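
For instance, a sketch assuming (purely hypothetically) that the correction curve is a simple linear gain; the helper name correct8 is mine, and the constant 18 and shift of 4 are made up:

#include <arm_neon.h>
#include <stdint.h>

/* Correct 8 channel bytes at once: out = (in * 18) >> 4. */
static void correct8(const uint8_t *src, uint16_t *dst)
{
    uint8x8_t  in  = vld1_u8(src);                /* load 8 input bytes   */
    uint16x8_t out = vmull_u8(in, vdup_n_u8(18)); /* widen to 16 bits     */
    out = vshrq_n_u16(out, 4);                    /* rescale              */
    vst1q_u16(dst, out);                          /* 8 corrected values   */
}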

For the second part: if you go deep into NEON, you could probably arrange the 16-bit values from the lookup into NEON vectors and do the bit fiddling on multiple values in parallel, maybe with a parallelization factor of 3 (for the three output bytes in the loop). Or rearrange the lookup values so that a lot of the bit fiddling can be avoided (store more in a 32-bit lookup table, as 32-bit loads suit the Cortex-A9 better anyway).
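
As a sketch of what the in-register bit fiddling could look like (packBitPlane is a hypothetical helper; it assumes the eight looked-up channel values are already in a vector in the output bit order, which the plain loads won't give you for free):

#include <arm_neon.h>
#include <stdint.h>

/* Pack bit c of eight 16-bit channel values into one output byte.
   Lane 0 ends up in bit 0 of the result, matching the scalar code. */
static uint8_t packBitPlane(uint16x8_t v, int c)
{
    static const uint16_t weights[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };
    /* vshl with a negative per-lane count is a variable right shift */
    uint16x8_t bits = vshlq_u16(v, vdupq_n_s16((int16_t)-c));
    bits = vandq_u16(bits, vdupq_n_u16(1));       /* isolate bit c      */
    bits = vmulq_u16(bits, vld1q_u16(weights));   /* weight by position */
    uint16x4_t s = vpadd_u16(vget_low_u16(bits), vget_high_u16(bits));
    s = vpadd_u16(s, s);                          /* horizontal sum     */
    s = vpadd_u16(s, s);
    return (uint8_t)vget_lane_u16(s, 0);
}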

In general, avoid 8-bit or 16-bit loop counters. Use the native size (32 bits); this sometimes avoids overhead when the compiler is not clever enough to remove the narrow-type wrap-around handling (which it usually isn't).
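
For example (walkImage is just a hypothetical skeleton with the body elided):

#include <stdint.h>

/* Same traversal with 32-bit counters, so the compiler never has to
   emit extra code to model uint16_t wrap-around. */
void walkImage(uint32_t imgWidth, uint32_t imgHeight)
{
    for (uint32_t y = 0; y < imgHeight; ++y) {
        const uint32_t top_offset = y * imgWidth;
        for (uint32_t x = 0; x < imgWidth; x += 8) {
            const uint32_t offset = top_offset + x;
            (void)offset;  /* process 8 pixels at img[offset] here */
        }
    }
}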


I don't know NEON, but this could be optimized in standard C++ by using a 64-bit unsigned int as the equivalent of 8 bytes, accessing its individual bytes through a union with 8 chars. I think NEON has 64- and 128-bit vectors. The following is pseudo code giving the gist of the solution; the OP can work out the details.

The OP's code can be simplified by observing that it does the same thing 3 times. Regard the input as a stream of width * height * 3 bytes, and chunk it into blocks of eight bytes.

Instead of the lookup table (LUT) returning 16 bits, return a 64-bit vector of 8 * 8 bits. The problem takes the first bit of each byte of a block of eight, then the second bit of each byte from the same block, and so on, for up to 8 colour depths. The depth can't be more than 8 because the input is only 8 bits, so the LUT can return all 8 bytes required for the 8 depths at once.

If colorLookup[ImageByte] == 11111111 then LUT[ImageByte] ==

1000000010000000100000001000000010000000100000001000000010000000

If colorLookup[ImageByte] == 01111110 then LUT[ImageByte] ==

0000000010000000100000001000000010000000100000001000000000000000
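
Building such a table from the existing 16-bit one could look like this (buildLUT is hypothetical; little-endian byte order is assumed so that Byte[b] in the code below addresses depth plane b):

#include <stdint.h>

static uint64_t LUT[256];

/* Spread the 8 significant bits of each corrected value out so that
   bit b of colorLookup[v] becomes the top bit of byte b of LUT[v]. */
static void buildLUT(void)
{
    for (int v = 0; v < 256; ++v) {
        uint64_t spread = 0;
        for (int b = 0; b < 8; ++b)
            if ((colorLookup[v] >> b) & 1)
                spread |= (uint64_t)0x80 << (8 * b);
        LUT[v] = spread;
    }
}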

So for each block of 8 bytes, loop over the block, OR-ing and shifting (or unroll):

union {
  unsigned long long   Sum;     // 64 bit accumulator
  unsigned char        Byte[8]; // Need to address each byte of Sum later
} u = { 0 };

for ( int I = 0; I < 8; ++I ) {
  u.Sum |= LUT[Stream[S]] >> I;
  ++S;
}

That calculates 8 depths in parallel. Now just peel off the required number of depth bytes from the union (or unroll), or use NEON instructions to address the 8 bytes of the vector register directly.

for ( int c = 0; c < depth; ++c ) {
  output[?+c][?] = u.Byte[c];
}

The original code does approx. 24 LUT lookups per 24 input bytes, plus 24 <<, 24 >> and 24 | per 24 bytes for each depth.

The above does approx. 24 LUT lookups, 24 | and 24 >> per 24 bytes, with all depths calculated in parallel. There might be less register spilling as well.

You could also add another LUT, where LUT1 == LUT >> 1; then the "unroll" might be:

  Sum0 = LUT[Stream[S  ]] | LUT1[Stream[S+1]];  // covers shifts 0 and 1
  Sum1 = LUT[Stream[S+2]] | LUT1[Stream[S+3]];
  SumA = Sum0 | Sum1 >> 2;                      // covers shifts 0..3
  Sum0 = LUT[Stream[S+4]] | LUT1[Stream[S+5]];
  Sum1 = LUT[Stream[S+6]] | LUT1[Stream[S+7]];
  SumB = Sum0 | Sum1 >> 2;
  Sum  = SumA | SumB >> 4;                      // SumB lands on shifts 4..7

You could add LUT2 and LUT3, but an ARM only has 8K or 16K of L1 cache (as far as I know), and these LUTs are 2K each.

2 Comments

@MichaelNastenko Yes. See the last sentence. The LUT is 2K; ARM has 8K of L1 cache. It depends how big the OP's images are. Images might be megabytes, so a 2K LUT or two is nothing in comparison, considering the OP already has a 512-byte table anyway.
@MichaelNastenko If it's image-input or colorLookup bound, then no. If compute bound, it will be output_color_depth times faster. If the former, it's a hardware or specification constraint.
