
This document describes how to use NEON on the SP7350 to accelerate data processing. It covers which scenarios are suitable for NEON, NEON learning materials, and examples of how to use NEON acceleration.

Table of contents

When to use NEON

When performing algorithmic calculations, the following questions need to be considered:

  1. Does the input of the next calculation depend on the output of the previous calculation, so that the computation is inherently single-instruction, single-data (SISD)? If so, the algorithm may not be suitable for SIMD (NEON on the C3V).

  2. However, some SISD-form algorithms can be transformed or approached from a different perspective, potentially allowing them to be converted into SIMD form while still meeting the required precision loss tolerance.

When considering algorithm optimization, do we necessarily need NEON (SIMD)?

If you're not familiar with NEON, are there still ways to optimize algorithms? Here is a possible approach:

  1. Are there any loops in the algorithm? If not, can we make some improvements to introduce them?

  2. Within the loop body, do some inputs of the subsequent iteration depend on the outputs of the previous iteration? If so, can we eliminate this dependency?

  3. Can we extract the algorithm-related code into a separate source file and compile just that file with GCC's O3 optimization flag (useful when O3 is not suitable for compiling all source files in the program)?

Suitable scenarios:

  1. Data in memory is stored in a regular interleaved format, such as RGBRGB...

  2. Multiple data can be loaded into one or more variables regularly in a single operation.

  3. One or more variables can be stored in memory regularly in a single operation.

  4. The data can be treated as 8-bit, 16-bit, 32-bit, or 64-bit signed or unsigned integers, or floating-point values.

  5. The corresponding calculation or algorithm can take multiple inputs and produce multiple outputs in a single operation.

NEON Study Materials

Official Reference Documents

  1. Programming Guide: DEN0018A_neon_programmers_guide.pdf
     Official website link: https://developer.arm.com/documentation/den0018/latest

  2. Function Quick Reference Manual: https://developer.arm.com/architectures/instruction-sets/intrinsics

  3. Quick Start: https://developer.arm.com/documentation/102159/latest

Official Documentation Learning Path

  1. Programming Guide

  2. Quick Start

  3. During programming, you may refer to the Function Quick Reference Manual to look up the functions you should use.

NEON Examples

1. YUV to BGR Conversion

YUV format

There are many YUV formats, but the format used in the example code is limited to YUYV, which is laid out as follows: YUYV YUYV YUYV ... YUYV. It has the following characteristics:

Each pair of Y components shares one U and one V component, so every two pixels occupy 4 bytes. A YUYV image of width W and height H therefore occupies W * H * 2 bytes.
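The layout above can be sketched in plain scalar C (the helper names here are illustrative, not part of the sample code):

```c
#include <assert.h>
#include <stdint.h>

/* A W x H YUYV frame stores 2 bytes per pixel: each pixel has its own Y,
 * and each pair of pixels shares one U and one V. */
static uint32_t yuyv_frame_size(uint32_t w, uint32_t h)
{
    return w * h * 2;
}

/* Byte offsets of Y0, U, Y1, V for the k-th pixel pair in a YUYV buffer. */
static void yuyv_pair_offsets(uint32_t k, uint32_t off[4])
{
    off[0] = 4 * k + 0;  /* Y0 */
    off[1] = 4 * k + 1;  /* U  */
    off[2] = 4 * k + 2;  /* Y1 */
    off[3] = 4 * k + 3;  /* V  */
}
```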

RGB format

Thanks to NEON's interleaved (structure) load/store instructions, the choice of RGB layout (RGB, BGR, RGBA, BGRA, etc.) has essentially no impact on performance.

Analysis

We need to extract batches of Y, U, and V data from the buffer for calculation, processing as many Y components per iteration as possible without causing any overflow. Since the inputs are 8-bit unsigned integers, 16-bit floating-point (F16) intermediates are sufficient for this conversion algorithm. Here's how we proceed:

  1. When using Q registers, we can process 128 / 16 = 8 F16-type data at a time.

  2. Before the calculation, we need to load the data as follows:

     ○ Load 8 Ys into a float16x8_t variable. The content will be: Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8

     ○ Load the 4 Us shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be: U1 U1 U2 U2 U3 U3 U4 U4

     ○ Load the 4 Vs shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be: V1 V1 V2 V2 V3 V3 V4 V4

  3. To load the data in this layout, we need a specific NEON instruction: vqtbl1_u8. This instruction extracts 8 bytes from specified positions in a uint8x16_t variable (the size of a Q register) and forms a uint8x8_t variable (the size of a D register). Here's an example code snippet:

static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6,  8, 10, 12, 14};
static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5,  9,  9, 13, 13};
static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};

... 
uint8x16_t A = vld1q_u8(yuvBuffer + i);
uint8x8_t y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
uint8x8_t u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
uint8x8_t v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));

... 
// Then widen uint8x8_t to float16x8_t and apply the actual conversion.
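As a sanity check, the byte-table lookup that vqtbl1_u8 performs can be modeled in scalar C (an illustrative model, not the intrinsic itself; out-of-range indices yield 0, as on real hardware):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vqtbl1_u8: pick 8 bytes from a 16-byte source according
 * to an 8-entry index table; indices >= 16 produce 0. */
static void tbl1_u8(const uint8_t src[16], const uint8_t idx[8], uint8_t dst[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (idx[i] < 16) ? src[idx[i]] : 0;
}
```

Applied with yuyv_y_indices to one 16-byte YUYV block, dst receives the 8 Y bytes.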

Additional Notes

This is just an example NEON program. If a certain level of precision loss is acceptable, there are better methods and optimization directions for converting YUV to RGB; you can search online for articles on YUV-to-RGB conversion. Alternative conversion methods include:

  1. Lookup Table Approach

  2. Using Integer Data in Calculations
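A minimal sketch of the lookup-table idea, assuming the same coefficients as the sample code (the table and function names here are hypothetical): the per-pixel multiply is replaced by a 256-entry table computed once. Only the R channel is shown for brevity.

```c
#include <assert.h>
#include <stdint.h>

static int16_t v_to_r[256];  /* precomputed 1.402 * (v - 128) contributions */

static void init_yuv_tables(void)
{
    for (int v = 0; v < 256; v++)
        v_to_r[v] = (int16_t)(1.402f * (float)(v - 128));
}

static uint8_t clamp_u8(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* R = clamp(Y + 1.402 * (V - 128)), with the multiply replaced by a lookup. */
static uint8_t yv_to_r(uint8_t y, uint8_t v)
{
    return clamp_u8((int)y + v_to_r[v]);
}
```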

Conversion Performance

  1. Using the gcc (O3) optimization option

convert norm_bgr succeed, takes 27
convert neon_bgr succeed, takes 13
  • NEON conversion takes 13ms

  • Traditional conversion takes 27ms

  2. Without the gcc (O3) optimization option

convert norm_bgr succeed, takes 125
convert neon_bgr succeed, takes 55
  • NEON conversion takes 55ms

  • Traditional conversion takes 125ms

Sample Code for the General Conversion Method

static inline uint8_t limit_value_to_0_255(float value)
{
    if (value < 0) {
        value = 0;
    }
    else if (value > 255) {
        value = 255;
    }
    return (uint8_t)value;
}

static inline void yuvtoRgbByNormx2(
    uint8_t* ru0, uint8_t* gu0, uint8_t* bu0,
    uint8_t* ru1, uint8_t* gu1, uint8_t* bu1,
    float y1, float y2, float u0, float v0)
{
    /* calc Y0 */
    float r0  = y1 + 1.402 * v0;
    float g0  = y1 - 0.34414 * u0 - 0.71414 * v0;
    float b0  = y1 + 1.772 * u0;

    *ru0 = limit_value_to_0_255(r0);
    *gu0 = limit_value_to_0_255(g0);
    *bu0 = limit_value_to_0_255(b0);

    /* calc Y1 */
    float r1  = y2 + 1.402 * v0;
    float g1  = y2 - 0.34414 * u0 - 0.71414 * v0;
    float b1  = y2 + 1.772 * u0;

    *ru1 = limit_value_to_0_255(r1);
    *gu1 = limit_value_to_0_255(g1);
    *bu1 = limit_value_to_0_255(b1);
}

YuvBool yuvToBgrByNorm(
    int yuvFormat, uint32_t width, uint32_t height,
    uint8_t* yuvBuffer, uint8_t* rgbBuffer, uint8_t* rgbBufferInterleaved)
{
    float y1, y2, u0, v0;

    uint8_t ru0, gu0, bu0;
    uint8_t ru1, gu1, bu1;
    uint32_t rgbPosition = 0;
    uint32_t rgbInterleavedPosition = 0;

    uint8_t* rgbBuffer0 = rgbBuffer;
    uint8_t* rgbBuffer1 = rgbBuffer + width * height;
    uint8_t* rgbBuffer2 = rgbBuffer + width * height * 2;

    /* size for yuyv/uyvy */
    uint32_t yuvFrameSize = width * height * 2;
    for (uint32_t i = 0; i < yuvFrameSize; i += 4)
    {
        /* extract Y1, Y2, U0, V0 */
        switch(yuvFormat)
        {
        case YUV_FMT_UYUV:
            y1 = yuvBuffer[i + 1];
            y2 = yuvBuffer[i + 3];
            u0 = yuvBuffer[i] - 128;
            v0 = yuvBuffer[i + 2] - 128;
            break;
        case YUV_FMT_YUYV:
            y1 = yuvBuffer[i];
            y2 = yuvBuffer[i + 2];
            u0 = yuvBuffer[i + 1] - 128;
            v0 = yuvBuffer[i + 3] - 128;
            break;
        default:
            return YUV_FALSE;
        }

        /* convert */
        yuvtoRgbByNormx2(&ru0, &gu0, &bu0, &ru1, &gu1, &bu1, y1, y2, u0, v0);

        /* store */
        if (rgbBuffer != NULL)
        {
            (rgbBuffer0 + rgbPosition)[0] = bu0;
            (rgbBuffer1 + rgbPosition)[0] = gu0;
            (rgbBuffer2 + rgbPosition)[0] = ru0;
            (rgbBuffer0 + rgbPosition)[1] = bu1;
            (rgbBuffer1 + rgbPosition)[1] = gu1;
            (rgbBuffer2 + rgbPosition)[1] = ru1;
            rgbPosition += 2;
        }
        if (rgbBufferInterleaved != NULL)
        {
            rgbBufferInterleaved[rgbInterleavedPosition + 0] = bu0;
            rgbBufferInterleaved[rgbInterleavedPosition + 1] = gu0;
            rgbBufferInterleaved[rgbInterleavedPosition + 2] = ru0;
            rgbBufferInterleaved[rgbInterleavedPosition + 3] = bu1;
            rgbBufferInterleaved[rgbInterleavedPosition + 4] = gu1;
            rgbBufferInterleaved[rgbInterleavedPosition + 5] = ru1;
            rgbInterleavedPosition += 6;
        }
    }
    return YUV_TRUE;
}

Sample Code for the Neon Conversion Method

/* uyvy */
static const uint8_t uyvy_y_indices[8] = {1, 3, 5, 7,  9, 11, 13, 15};
static const uint8_t uyvy_u_indices[8] = {0, 0, 4, 4,  8,  8, 12, 12};
static const uint8_t uyvy_v_indices[8] = {2, 2, 6, 6, 10, 10, 14, 14};
/* yuyv */
static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6,  8, 10, 12, 14};
static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5,  9,  9, 13, 13};
static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};

YuvBool yuvToBgrByNeon(
    int yuvFormat, uint32_t width, uint32_t height,
    uint8_t* yuvBuffer, uint8_t* rgbBuffer, uint8_t* rgbBufferInterleaved)
{
    uint8_t* rgbBuffer0 = rgbBuffer;
    uint8_t* rgbBuffer1 = rgbBuffer + width * height;
    uint8_t* rgbBuffer2 = rgbBuffer + width * height * 2;

    uint8x8_t y8, u8, v8;
    uint32_t rgbPosition = 0;
    uint32_t rgbInterleavedPosition = 0;
    uint32_t yuvFrameSize = width * height * 2;

    for (uint32_t i = 0; i < yuvFrameSize; i += 16)
    {
        /* load data */
        uint8x16_t A = vld1q_u8(yuvBuffer + i);
        switch(yuvFormat)
        {
        case YUV_FMT_UYUV:
            y8 = vqtbl1_u8(A, vld1_u8(uyvy_y_indices));
            u8 = vqtbl1_u8(A, vld1_u8(uyvy_u_indices));
            v8 = vqtbl1_u8(A, vld1_u8(uyvy_v_indices));
            break;
        case YUV_FMT_YUYV:
            y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
            u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
            v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));
            break;
        default:
            return YUV_FALSE;
        }

        /* u8 to f16 */
        float16x8_t y = vcvtq_f16_u16(vmovl_u8(y8));
        float16x8_t u = vcvtq_f16_s16(vmovl_s8(vsub_s8(vreinterpret_s8_u8(u8), vdup_n_s8(128))));
        float16x8_t v = vcvtq_f16_s16(vmovl_s8(vsub_s8(vreinterpret_s8_u8(v8), vdup_n_s8(128))));

        /* calc yuyv to rgb */
        float16x8_t r = vaddq_f16(y, vmulq_n_f16(v, 1.402));
        float16x8_t g = vsubq_f16(vsubq_f16(y, vmulq_n_f16(v, 0.71414)), vmulq_n_f16(u, 0.34414));
        float16x8_t b = vaddq_f16(y, vmulq_n_f16(u, 1.772));

        /* limit rgb to 0-255 */
        r = vmaxq_f16(vminq_f16(r, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));
        g = vmaxq_f16(vminq_f16(g, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));
        b = vmaxq_f16(vminq_f16(b, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));

        /* convert f16 to u8 */
        uint8x8_t r8 = vmovn_u16(vcvtq_u16_f16(r));
        uint8x8_t g8 = vmovn_u16(vcvtq_u16_f16(g));
        uint8x8_t b8 = vmovn_u16(vcvtq_u16_f16(b));

        /* store u8 to rgb buffer */
        if (rgbBuffer != NULL)
        {
            vst1_u8(rgbBuffer0 + rgbPosition, b8);
            vst1_u8(rgbBuffer1 + rgbPosition, g8);
            vst1_u8(rgbBuffer2 + rgbPosition, r8);
            rgbPosition += 8;
        }
        if (rgbBufferInterleaved != NULL)
        {
            uint8x8x3_t rgbu8 = {{b8, g8, r8}};
            vst3_u8(rgbBufferInterleaved + rgbInterleavedPosition, rgbu8);
            rgbInterleavedPosition += 24;
        }
    }
    return YUV_TRUE;
}

2. Matrix Multiplication

"A(m x n) x B(n x j) = C(m x j)" refers to the multiplication of an m-by-n matrix A with an n-by-j matrix B, resulting in an m-by-j matrix C.

The optimization approach here differs from our customary programming practices, so a naive implementation is prone to negative optimization (pessimization).

The transition from matrix multiplication to vector multiplication

transition from matrix multiplication.png

  1. Neon does not natively support matrix multiplication.

  2. As shown in the picture above, the first two columns of the output matrix have been calculated using the traditional approach, and the following patterns/rules are identified:

  • The i-th column of the output matrix is obtained by summing, over x, the product of the x-th column of matrix A with the x-th element of the i-th column of matrix B, where 0 <= i < 4 and 0 <= x < 4.

  • Construct vectors in column units, a0(a11, a21, a31, a41)..., b0(b11, b21, b31, b41).... The calculation can be written as follows:

  • a0 x b0[0] + a1 x b0[1] + a2 x b0[2] + a3 x b0[3] -> the first column of the resulting matrix

  • a0 x b1[0] + a1 x b1[1] + a2 x b1[2] + a3 x b1[3] -> the second column of the resulting matrix

  • ....

  3. In summary, we can draw the following conclusions:

  • If the matrix is stored in memory in a column-major order, it is suitable for SIMD (Single Instruction, Multiple Data) computation.

  • If the matrix is stored in row-major order, NEON also provides interleaved load/store instructions that make it easy to convert the data into column-major vectors.
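The column rule above can be checked against a scalar reference in plain C (column-major storage; the helper name is illustrative):

```c
#include <assert.h>

/* Scalar 4x4 reference with column-major storage: column i of C is the sum
 * over x of (column x of A) times element x of column i of B, i.e.
 * C[i*4 + r] = sum_x A[x*4 + r] * B[i*4 + x]. */
static void matmul4_colmajor(const float *A, const float *B, float *C)
{
    for (int i = 0; i < 4; i++) {         /* output column */
        for (int r = 0; r < 4; r++) {     /* row within that column */
            float acc = 0.0f;
            for (int x = 0; x < 4; x++)
                acc += A[x * 4 + r] * B[i * 4 + x];
            C[i * 4 + r] = acc;
        }
    }
}
```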

If the original data is not in column-major order

NEON provides interleaved (structure) load/store instructions that make it easy to convert the data into column-major vectors.

This can be achieved using instructions such as vld4q_u32 and vst4q_u32 (with variants for other data types), as shown in the figure.

vector.png

Sample Code:

int main(int argc, const char * argv[]) 
{
    uint32_t memoryA[16] = {'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'};

    uint32x4x4_t columnMajorMatrix = vld4q_u32(memoryA);
    for (int i = 0; i < 4; i++)
    {
        printf("The matrix.val/vector[%d]:", i);
        for (int j = 0; j < 4; j++) {
            printf("%c ", columnMajorMatrix.val[i][j]);
        }
        printf("\n");
    }
    return 0;
}

Output Result:

The matrix.val/vector[0]:A A A A 
The matrix.val/vector[1]:B B B B 
The matrix.val/vector[2]:C C C C 
The matrix.val/vector[3]:D D D D 
Program ended with exit code: 0

As you can see: each vector has been saved with the corresponding column data in column-major order.
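In scalar terms, vld4q_u32 performs a 4-way de-interleave; a plain-C equivalent (illustrative only) is:

```c
#include <assert.h>
#include <stdint.h>

/* Split 16 interleaved values (ABCD ABCD ABCD ABCD) into four 4-element
 * planes, one per channel -- the scalar equivalent of vld4q_u32. */
static void deinterleave4_u32(const uint32_t src[16], uint32_t planes[4][4])
{
    for (int g = 0; g < 4; g++)          /* group of four */
        for (int c = 0; c < 4; c++)      /* channel within the group */
            planes[c][g] = src[g * 4 + c];
}
```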

The first version of multiplication

Assuming A, B, and C are in column-major order, with C being the output matrix:

float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {0};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the first column of matrix C and store it
float32x4_t bx = vld1q_f32(B + 0);
float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 0, cx);

// Compute the second column of matrix C and store it
bx = vld1q_f32(B + 4);
cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 4, cx);

// Compute the third column of matrix C and store it
bx  = vld1q_f32(B + 8);
cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 8, cx);

// Compute the fourth column of matrix C and store it
bx  = vld1q_f32(B + 12);
cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 12, cx);

Issues here:

The bx and cx vectors are reused in every iteration, so the (n+1)-th column cannot be computed until the n-th column is completely calculated. NEON, however, provides execution pipelines that allow independent instructions to execute in parallel.

The second version of multiplication

float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {0};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the xth column of matrix C and store it.
float32x4_t bx, cx;
for (int i = 0; i <= 12; i+=4)
{
    bx = vld1q_f32(B + i);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + i, cx);
}

Notes:

Similar to the previous version, but the added loop reduces code volume and improves readability. The column-to-column dependency on bx and cx remains.

The third version of multiplication

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the xth column of matrix C and store it.
for (int i = 0; i <= 12; i+=4)
{
    float32x4_t bx = vld1q_f32(B + i);
    float32x4_t cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + i, cx);
}

Features:

Compared to the second version, the variables inside the loop body of this version depend only on data produced within the same iteration. For loop processing, we recommend:

  1. Minimize dependence on data from outside the current iteration as much as possible.

  2. If this cannot be avoided, reduce the dependence by splitting the loop, for example into nested loops, so that the inner loop that carries most of the work has no dependencies.
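One concrete way to follow recommendation 1 (an illustrative scalar sketch, not from the sample code): split a single loop-carried accumulator into several independent ones, giving the compiler or the NEON pipelines independent chains to schedule.

```c
#include <assert.h>

/* Single accumulator: every addition must wait for the previous one. */
static float sum_one_acc(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: the four chains within each iteration are independent.
 * Assumes n is a multiple of 4 for brevity. */
static float sum_four_acc(const float *v, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```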

The fourth version of multiplication

float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {0};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

float32x4_t b1, b2, b3, b4;
float32x4_t c1, c2, c3, c4;

// Compute the first column of matrix C and store it
b1 = vld1q_f32(B + 0);
c1 = vmulq_lane_f32(a0, vget_low_f32(b1), 0);
c1 = vmlaq_lane_f32(c1, a1, vget_low_f32(b1), 1);
c1 = vmlaq_lane_f32(c1, a2, vget_high_f32(b1), 0);
c1 = vmlaq_lane_f32(c1, a3, vget_high_f32(b1), 1);
vst1q_f32(C + 0, c1);

// Compute the second column of matrix C and store it
b2 = vld1q_f32(B + 4);
c2  = vmulq_lane_f32(a0, vget_low_f32(b2), 0);
c2 = vmlaq_lane_f32(c2, a1, vget_low_f32(b2), 1);
c2 = vmlaq_lane_f32(c2, a2, vget_high_f32(b2), 0);
c2 = vmlaq_lane_f32(c2, a3, vget_high_f32(b2), 1);
vst1q_f32(C + 4, c2);

// Compute the third column of matrix C and store it
b3  = vld1q_f32(B + 8);
c3  = vmulq_lane_f32(a0, vget_low_f32(b3), 0);
c3 = vmlaq_lane_f32(c3, a1, vget_low_f32(b3), 1);
c3 = vmlaq_lane_f32(c3, a2, vget_high_f32(b3), 0);
c3 = vmlaq_lane_f32(c3, a3, vget_high_f32(b3), 1);
vst1q_f32(C + 8, c3);

// Compute the fourth column of matrix C and store it
b4  = vld1q_f32(B + 12);
c4  = vmulq_lane_f32(a0, vget_low_f32(b4), 0);
c4 = vmlaq_lane_f32(c4, a1, vget_low_f32(b4), 1);
c4 = vmlaq_lane_f32(c4, a2, vget_high_f32(b4), 0);
c4 = vmlaq_lane_f32(c4, a3, vget_high_f32(b4), 1);
vst1q_f32(C + 12, c4);

Features:

  1. bx is divided into b1, b2, b3, b4.

  2. cx is divided into c1, c2, c3, c4.

  3. As a result, the calculation of each column is independent and can be performed separately. Multiple NEON instructions can then be scheduled for simultaneous execution, as in the following sequence. (Why? We do not know the exact reasons, but according to the official documentation, although there is no concept of kernel threads as in CUDA, NEON has functional computation pipelines that allow memory loads/stores and computations to execute concurrently.)

    • r0 = vmlaq_lane_f32(r0, a1, vget_low_f32(b0), 1);

    • r1 = vmlaq_lane_f32(r1, a1, vget_low_f32(b1), 1);

    • r2 = vmlaq_lane_f32(r2, a1, vget_low_f32(b2), 1);

    • r3 = vmlaq_lane_f32(r3, a1, vget_low_f32(b3), 1);

The main instructions used in the code

  1. vmulq_lane_f32(a0, a1, lane_idx)

    • Vector-by-scalar multiplication, where the scalar value comes from lane lane_idx of the 64-bit vector a1.

    • Pseudocode:

      • result[0] = a0[0] x a1[lane_idx]

      • result[1] = a0[1] x a1[lane_idx]

      • result[2] = a0[2] x a1[lane_idx]

      • result[3] = a0[3] x a1[lane_idx]

  2. vmlaq_lane_f32(a0, a1, a2, lane_idx)

    1. Multiply-add instruction. Pseudocode: a0 + a1 x a2[lane_idx]

    2. Mind the data types; the compiler will report an error on a mismatch.

    3. This instruction could be replaced with vaddq_f32(a0, vmulq_lane_f32(a1, a2, lane_idx)),

      1. but that splits the work into two instructions with a strict dependency between them.

      2. Therefore, the fused multiply-add instruction achieves better performance.

  3. vget_low_f32: obtain the first two elements of the vector

a1, a2, a3, a4 --> a1, a2

  4. vget_high_f32: obtain the last two elements of the vector

a1, a2, a3, a4 --> a3, a4
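The semantics of these instructions can be made concrete with a scalar C model (the struct and function names stand in for the NEON types and intrinsics; this is illustrative code, not arm_neon.h):

```c
#include <assert.h>

typedef struct { float v[4]; } f32x4;   /* stands in for float32x4_t */
typedef struct { float v[2]; } f32x2;   /* stands in for float32x2_t */

/* vget_low_f32 / vget_high_f32: first two / last two elements. */
static f32x2 get_low(f32x4 a)  { f32x2 r = {{a.v[0], a.v[1]}}; return r; }
static f32x2 get_high(f32x4 a) { f32x2 r = {{a.v[2], a.v[3]}}; return r; }

/* vmulq_lane_f32: multiply every element of a by lane lane_idx of b. */
static f32x4 mulq_lane(f32x4 a, f32x2 b, int lane_idx)
{
    f32x4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] * b.v[lane_idx];
    return r;
}

/* vmlaq_lane_f32: acc + a * b[lane_idx] in one step. */
static f32x4 mlaq_lane(f32x4 acc, f32x4 a, f32x2 b, int lane_idx)
{
    f32x4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = acc.v[i] + a.v[i] * b.v[lane_idx];
    return r;
}
```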

Complete Sample Code

void matrix_mul_version1(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    // Compute the first column of matrix C and store it
    float32x4_t bx = vld1q_f32(B + 0);
    float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 0, cx);

    // Compute the second column of matrix C and store it
    bx = vld1q_f32(B + 4);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 4, cx);

    // Compute the third column of matrix C and store it
    bx  = vld1q_f32(B + 8);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 8, cx);

    // Compute the fourth column of matrix C and store it
    bx  = vld1q_f32(B + 12);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 12, cx);
}

void matrix_mul_version2(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);
    
    // Compute the x column of matrix C and store it
    float32x4_t bx, cx;
    for (int i = 0; i <= 12; i+=4)
    {
        bx = vld1q_f32(B + i);
        cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + i, cx);
    }
}

void matrix_mul_version3(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);
    
    // Compute the x column of matrix C and store it
    for (int i = 0; i <= 12; i+=4)
    {
        float32x4_t bx = vld1q_f32(B + i);
        float32x4_t cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + i, cx);
    }
}

void matrix_mul_version4(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    float32x4_t b1, b2, b3, b4;
    float32x4_t c1, c2, c3, c4;

    // Compute the first column of matrix C and store it
    b1 = vld1q_f32(B + 0);
    c1 = vmulq_lane_f32(a0, vget_low_f32(b1), 0);
    c1 = vmlaq_lane_f32(c1, a1, vget_low_f32(b1), 1);
    c1 = vmlaq_lane_f32(c1, a2, vget_high_f32(b1), 0);
    c1 = vmlaq_lane_f32(c1, a3, vget_high_f32(b1), 1);
    vst1q_f32(C + 0, c1);

    // Compute the second column of matrix C and store it
    b2 = vld1q_f32(B + 4);
    c2  = vmulq_lane_f32(a0, vget_low_f32(b2), 0);
    c2 = vmlaq_lane_f32(c2, a1, vget_low_f32(b2), 1);
    c2 = vmlaq_lane_f32(c2, a2, vget_high_f32(b2), 0);
    c2 = vmlaq_lane_f32(c2, a3, vget_high_f32(b2), 1);
    vst1q_f32(C + 4, c2);

    // Compute the third column of matrix C and store it
    b3  = vld1q_f32(B + 8);
    c3  = vmulq_lane_f32(a0, vget_low_f32(b3), 0);
    c3 = vmlaq_lane_f32(c3, a1, vget_low_f32(b3), 1);
    c3 = vmlaq_lane_f32(c3, a2, vget_high_f32(b3), 0);
    c3 = vmlaq_lane_f32(c3, a3, vget_high_f32(b3), 1);
    vst1q_f32(C + 8, c3);

    // Compute the fourth column of matrix C and store it
    b4  = vld1q_f32(B + 12);
    c4  = vmulq_lane_f32(a0, vget_low_f32(b4), 0);
    c4 = vmlaq_lane_f32(c4, a1, vget_low_f32(b4), 1);
    c4 = vmlaq_lane_f32(c4, a2, vget_high_f32(b4), 0);
    c4 = vmlaq_lane_f32(c4, a3, vget_high_f32(b4), 1);
    vst1q_f32(C + 12, c4);
}

static void print_matrix_memory(const char* tag, float32_t* matrix)
{
    printf("output matrix content for %s: \n", tag);
    printf("%.4f %.4f %.4f %.4f\n",   matrix[0],  matrix[1],  matrix[2],  matrix[3]);
    printf("%.4f %.4f %.4f %.4f\n",   matrix[4],  matrix[5],  matrix[6],  matrix[7]);
    printf("%.4f %.4f %.4f %.4f\n",   matrix[8],  matrix[9],  matrix[10], matrix[11]);
    printf("%.4f %.4f %.4f %.4f\n\n", matrix[12], matrix[13], matrix[14], matrix[15]);
}

int main(int argc, const char * argv[])
{
    float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
    float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
    float32_t C[16] = {0};
    
    matrix_mul_version1(A, B, C);
    print_matrix_memory("version1 C", C);
    matrix_mul_version2(A, B, C);
    print_matrix_memory("version2 C", C);
    matrix_mul_version3(A, B, C);
    print_matrix_memory("version3 C", C);
    matrix_mul_version4(A, B, C);
    print_matrix_memory("version4 C", C);

    return 0;
}