I have some C code running on a dev board with an ARM Cortex-A9 that does image processing, and I need to speed it up. The code reads 8 RGB pixels, where each color channel is represented as a uint8_t. These pixels need to be color corrected, so a lookup table is used to look up the color-corrected value of each channel. A color-corrected channel uses a 16-bit type, but the number of bits actually used depends on the output_color_depth parameter.
After this preprocessing step I need to extract each significant bit of each channel of each pixel and store it in an output buffer.
The code below is the function in question:
struct Pixel {
    uint8_t r;
    uint8_t g;
    uint8_t b;
};
static const uint16_t colorLookup[256] = { ... };
void postProcessImage(const struct Pixel* img, const uint16_t imgWidth, const uint16_t imgHeight,
                      uint8_t** output, const uint8_t output_color_depth)
{
    const uint8_t input_color_depth = 8;
    for (uint16_t y = 0; y < imgHeight; ++y)
    {
        const uint16_t top_offset = y * imgWidth;
        for (uint16_t x = 0; x < imgWidth; x += 8)
        {
            const uint16_t offset = top_offset + x;
            // Get 8 pixels to use. This is done since 8 pixels
            // means 24 color channels which can fit exactly into
            // 3 bytes
            const uint16_t r0 = colorLookup[img[offset + 0].r];
            const uint16_t g0 = colorLookup[img[offset + 0].g];
            const uint16_t b0 = colorLookup[img[offset + 0].b];
            const uint16_t r1 = colorLookup[img[offset + 1].r];
            const uint16_t g1 = colorLookup[img[offset + 1].g];
            const uint16_t b1 = colorLookup[img[offset + 1].b];
            const uint16_t r2 = colorLookup[img[offset + 2].r];
            const uint16_t g2 = colorLookup[img[offset + 2].g];
            const uint16_t b2 = colorLookup[img[offset + 2].b];
            const uint16_t r3 = colorLookup[img[offset + 3].r];
            const uint16_t g3 = colorLookup[img[offset + 3].g];
            const uint16_t b3 = colorLookup[img[offset + 3].b];
            const uint16_t r4 = colorLookup[img[offset + 4].r];
            const uint16_t g4 = colorLookup[img[offset + 4].g];
            const uint16_t b4 = colorLookup[img[offset + 4].b];
            const uint16_t r5 = colorLookup[img[offset + 5].r];
            const uint16_t g5 = colorLookup[img[offset + 5].g];
            const uint16_t b5 = colorLookup[img[offset + 5].b];
            const uint16_t r6 = colorLookup[img[offset + 6].r];
            const uint16_t g6 = colorLookup[img[offset + 6].g];
            const uint16_t b6 = colorLookup[img[offset + 6].b];
            const uint16_t r7 = colorLookup[img[offset + 7].r];
            const uint16_t g7 = colorLookup[img[offset + 7].g];
            const uint16_t b7 = colorLookup[img[offset + 7].b];
            for (uint8_t c = 0; c < output_color_depth; ++c)
            {
                // For each significant bit we create the resulting byte
                // and store it into the output buffer.
                output[c][offset + 0] = (((g2 >> c) & 1) << 7) | (((r2 >> c) & 1) << 6)
                                      | (((b1 >> c) & 1) << 5) | (((g1 >> c) & 1) << 4)
                                      | (((r1 >> c) & 1) << 3) | (((b0 >> c) & 1) << 2)
                                      | (((g0 >> c) & 1) << 1) | ((r0 >> c) & 1);
                output[c][offset + 1] = (((r5 >> c) & 1) << 7) | (((b4 >> c) & 1) << 6)
                                      | (((g4 >> c) & 1) << 5) | (((r4 >> c) & 1) << 4)
                                      | (((b3 >> c) & 1) << 3) | (((g3 >> c) & 1) << 2)
                                      | (((r3 >> c) & 1) << 1) | ((b2 >> c) & 1);
                output[c][offset + 2] = (((b7 >> c) & 1) << 7) | (((g7 >> c) & 1) << 6)
                                      | (((r7 >> c) & 1) << 5) | (((b6 >> c) & 1) << 4)
                                      | (((g6 >> c) & 1) << 3) | (((r6 >> c) & 1) << 2)
                                      | (((b5 >> c) & 1) << 1) | ((g5 >> c) & 1);
            }
        }
    }
}
Now this function performs too slowly and I need to optimize it. I'm looking at using NEON instructions, but it's hard for me to find examples online. Intuitively, this seems like something that should be vectorizable. Could someone give me some pointers on how to achieve that? I'm also open to other suggestions for optimizing this code!
Any help is appreciated.
Comments:
- You index as img[y*imgWidth+x]. Because you run through the whole array, you can simply index as img[index], with index running from 0 to imgWidth*imgHeight. Not only is the indexing cheaper, but you'll have less memory fragmentation and better cache locality.
- Why do you write to output[c][offset] three times consecutively?
- Have you tried -march=native?