
This document describes how to use NEON on the SP7350 to accelerate data processing. It covers which scenarios are suitable for NEON, NEON learning materials, and examples of how to use NEON acceleration.

Table of contents

When to use NEON

When performing algorithmic calculations, the following questions need to be considered:

  1. Does the input of the next calculation depend on the output of the previous calculation, so that the computation is inherently single-instruction, single-data (SISD)? If so, the algorithm may not be suitable for SIMD (NEON on the C3V).

  2. However, some SISD-form algorithms can be transformed or approached from a different perspective, potentially allowing them to be converted into SIMD form while still meeting the required precision loss tolerance.

When considering algorithm optimization, do we necessarily need NEON (SIMD)?

If you're not familiar with NEON, are there still ways to optimize algorithms? Here is a possible approach:

  1. Are there any loops in the algorithm? If not, can we make some improvements to introduce them?

  2. Within the loop body, do some inputs of the subsequent iteration depend on the outputs of the previous iteration? If so, can we eliminate this dependency?

  3. Can we extract the algorithm-related code into a separate source file and compile just that file with GCC's O3 optimization flag (useful when O3 is not suitable for compiling all source files in the program)?

Suitable scenarios:

  1. Data in memory is stored in a regular interleaved format, such as RGBRGB...

  2. Multiple data can be loaded into one or more variables regularly in a single operation.

  3. One or more variables can be stored in memory regularly in a single operation.

  4. The data can be treated as 8-bit, 16-bit, 32-bit, or 64-bit signed or unsigned integers, or floating-point values.

  5. The corresponding calculation or algorithm can take multiple inputs and produce multiple outputs in a single operation.

NEON Study Materials

Official Reference Documents

  1. Programming Guide: DEN0018A_neon_programmers_guide.pdf
     Official website link: https://developer.arm.com/documentation/den0018/latest

  2. Function Quick Reference Manual: https://developer.arm.com/architectures/instruction-sets/intrinsics

  3. Quick Start: https://developer.arm.com/documentation/102159/latest

Official Documentation Learning Path

  1. Programming Guide

  2. Quick Start

  3. During programming, you may refer to the Function Quick Reference Manual to look up the functions you should use.

NEON Examples

1. YUV to BGR Conversion

YUV format

There are many YUV formats, but the format used in the example code is limited to YUYV, which is laid out as follows: YUYV YUYV YUYV ... YUYV. It has the following characteristics:

Each pair of Y components shares one U and one V component, so every two pixels occupy 4 bytes. A YUYV image of width W and height H therefore occupies W * H * 2 bytes.
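The layout above can be sketched in plain scalar C (the helper names here are illustrative, not part of the sample code):

```c
#include <assert.h>
#include <stdint.h>

/* A W x H YUYV frame stores 2 bytes per pixel: each pixel has its own Y,
 * and each pair of pixels shares one U and one V. */
static uint32_t yuyv_frame_size(uint32_t w, uint32_t h)
{
    return w * h * 2;
}

/* Byte offsets of Y0, U, Y1, V for the k-th pixel pair in a YUYV buffer. */
static void yuyv_pair_offsets(uint32_t k, uint32_t off[4])
{
    off[0] = 4 * k + 0;  /* Y0 */
    off[1] = 4 * k + 1;  /* U  */
    off[2] = 4 * k + 2;  /* Y1 */
    off[3] = 4 * k + 3;  /* V  */
}
```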

RGB format

Thanks to NEON's interleaved (structure) load/store instructions, the choice of RGB layout (RGB, BGR, RGBA, BGRA, etc.) has essentially no impact on performance.

Analysis

We need to extract batches of Y, U, and V data from the buffer for calculation, processing as many Y components per iteration as possible without causing any overflow. Since the inputs are 8-bit unsigned integers, 16-bit floating-point (F16) intermediates are sufficient for this conversion algorithm. Here's how we proceed:

  1. When using Q registers, we can process 128 / 16 = 8 F16-type data at a time.

  2. Before the calculation, we need to load the data as follows:

     ○ Load 8 Ys into a float16x8_t variable. The content will be: Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8

     ○ Load the 4 Us shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be: U1 U1 U2 U2 U3 U3 U4 U4

     ○ Load the 4 Vs shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be: V1 V1 V2 V2 V3 V3 V4 V4

  3. To load the data in this layout, we need a specific NEON instruction: vqtbl1_u8. This instruction extracts 8 bytes from specified positions in a uint8x16_t variable (the size of a Q register) and forms a uint8x8_t variable (the size of a D register). Here's an example code snippet:

static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6,  8, 10, 12, 14};
static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5,  9,  9, 13, 13};
static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};

... 
uint8x16_t A = vld1q_u8(yuvBuffer + i);
uint8x8_t y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
uint8x8_t u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
uint8x8_t v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));

... 
// Then widen uint8x8_t to float16x8_t and apply the actual conversion.
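As a sanity check, the byte-table lookup that vqtbl1_u8 performs can be modeled in scalar C (an illustrative model, not the intrinsic itself; out-of-range indices yield 0, as on real hardware):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vqtbl1_u8: pick 8 bytes from a 16-byte source according
 * to an 8-entry index table; indices >= 16 produce 0. */
static void tbl1_u8(const uint8_t src[16], const uint8_t idx[8], uint8_t dst[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (idx[i] < 16) ? src[idx[i]] : 0;
}
```

Applied with yuyv_y_indices to one 16-byte YUYV block, dst receives the 8 Y bytes.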

Additional Notes

This is just an example NEON program. If a certain level of precision loss is acceptable, there are better methods and optimization directions for converting YUV to RGB; you can search online for articles on YUV-to-RGB conversion. Alternative conversion methods include:

  1. Lookup Table Approach

  2. Using Integer Data in Calculations
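A minimal sketch of the lookup-table idea, assuming the same coefficients as the sample code (the table and function names here are hypothetical): the per-pixel multiply is replaced by a 256-entry table computed once. Only the R channel is shown for brevity.

```c
#include <assert.h>
#include <stdint.h>

static int16_t v_to_r[256];  /* precomputed 1.402 * (v - 128) contributions */

static void init_yuv_tables(void)
{
    for (int v = 0; v < 256; v++)
        v_to_r[v] = (int16_t)(1.402f * (float)(v - 128));
}

static uint8_t clamp_u8(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* R = clamp(Y + 1.402 * (V - 128)), with the multiply replaced by a lookup. */
static uint8_t yv_to_r(uint8_t y, uint8_t v)
{
    return clamp_u8((int)y + v_to_r[v]);
}
```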

Conversion Performance

  1. Using the gcc (O3) optimization option

convert norm_bgr succeed, takes 27
convert neon_bgr succeed, takes 13
  • NEON conversion takes 13ms

  • Traditional conversion takes 27ms

  2. Without the gcc (O3) optimization option

convert norm_bgr succeed, takes 125
convert neon_bgr succeed, takes 55
  • NEON conversion takes 55ms

  • Traditional conversion takes 125ms

Sample Code for the General Conversion Method

static inline uint8_t limit_value_to_0_255(float value)
{
    if (value < 0) {
        value = 0;
    }
    else if (value > 255) {
        value = 255;
    }
    return (uint8_t)value;
}

static inline void yuvtoRgbByNormx2(
    uint8_t* ru0, uint8_t* gu0, uint8_t* bu0,
    uint8_t* ru1, uint8_t* gu1, uint8_t* bu1,
    float y1, float y2, float u0, float v0)
{
    /* calc Y0 */
    float r0  = y1 + 1.402 * v0;
    float g0  = y1 - 0.34414 * u0 - 0.71414 * v0;
    float b0  = y1 + 1.772 * u0;

    *ru0 = limit_value_to_0_255(r0);
    *gu0 = limit_value_to_0_255(g0);
    *bu0 = limit_value_to_0_255(b0);

    /* calc Y1 */
    float r1  = y2 + 1.402 * v0;
    float g1  = y2 - 0.34414 * u0 - 0.71414 * v0;
    float b1  = y2 + 1.772 * u0;

    *ru1 = limit_value_to_0_255(r1);
    *gu1 = limit_value_to_0_255(g1);
    *bu1 = limit_value_to_0_255(b1);
}

YuvBool yuvToBgrByNorm(
    int yuvFormat, uint32_t width, uint32_t height,
    uint8_t* yuvBuffer, uint8_t* rgbBuffer, uint8_t* rgbBufferInterleaved)
{
    float y1, y2, u0, v0;

    uint8_t ru0, gu0, bu0;
    uint8_t ru1, gu1, bu1;
    uint32_t rgbPosition = 0;
    uint32_t rgbInterleavedPosition = 0;

    uint8_t* rgbBuffer0 = rgbBuffer;
    uint8_t* rgbBuffer1 = rgbBuffer + width * height;
    uint8_t* rgbBuffer2 = rgbBuffer + width * height * 2;

    /* size for yuyv/uyvy */
    uint32_t yuvFrameSize = width * height * 2;
    for (uint32_t i = 0; i < yuvFrameSize; i += 4)
    {
        /* extract Y1, Y2, U0, V0 */
        switch(yuvFormat)
        {
        case YUV_FMT_UYUV:
            y1 = yuvBuffer[i + 1];
            y2 = yuvBuffer[i + 3];
            u0 = yuvBuffer[i] - 128;
            v0 = yuvBuffer[i + 2] - 128;
            break;
        case YUV_FMT_YUYV:
            y1 = yuvBuffer[i];
            y2 = yuvBuffer[i + 2];
            u0 = yuvBuffer[i + 1] - 128;
            v0 = yuvBuffer[i + 3] - 128;
            break;
        default:
            return YUV_FALSE;
        }

        /* convert */
        yuvtoRgbByNormx2(&ru0, &gu0, &bu0, &ru1, &gu1, &bu1, y1, y2, u0, v0);

        /* store */
        if (rgbBuffer != NULL)
        {
            (rgbBuffer0 + rgbPosition)[0] = bu0;
            (rgbBuffer1 + rgbPosition)[0] = gu0;
            (rgbBuffer2 + rgbPosition)[0] = ru0;
            (rgbBuffer0 + rgbPosition)[1] = bu1;
            (rgbBuffer1 + rgbPosition)[1] = gu1;
            (rgbBuffer2 + rgbPosition)[1] = ru1;
            rgbPosition += 2;
        }
        if (rgbBufferInterleaved != NULL)
        {
            rgbBufferInterleaved[rgbInterleavedPosition + 0] = bu0;
            rgbBufferInterleaved[rgbInterleavedPosition + 1] = gu0;
            rgbBufferInterleaved[rgbInterleavedPosition + 2] = ru0;
            rgbBufferInterleaved[rgbInterleavedPosition + 3] = bu1;
            rgbBufferInterleaved[rgbInterleavedPosition + 4] = gu1;
            rgbBufferInterleaved[rgbInterleavedPosition + 5] = ru1;
            rgbInterleavedPosition += 6;
        }
    }
    return YUV_TRUE;
}

Sample Code for the Neon Conversion Method

/* uyvy */
static const uint8_t uyvy_y_indices[8] = {1, 3, 5, 7,  9, 11, 13, 15};
static const uint8_t uyvy_u_indices[8] = {0, 0, 4, 4,  8,  8, 12, 12};
static const uint8_t uyvy_v_indices[8] = {2, 2, 6, 6, 10, 10, 14, 14};
/* yuyv */
static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6,  8, 10, 12, 14};
static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5,  9,  9, 13, 13};
static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};

YuvBool yuvToBgrByNeon(
    int yuvFormat, uint32_t width, uint32_t height,
    uint8_t* yuvBuffer, uint8_t* rgbBuffer, uint8_t* rgbBufferInterleaved)
{
    uint8_t* rgbBuffer0 = rgbBuffer;
    uint8_t* rgbBuffer1 = rgbBuffer + width * height;
    uint8_t* rgbBuffer2 = rgbBuffer + width * height * 2;

    uint8x8_t y8, u8, v8;
    uint32_t rgbPosition = 0;
    uint32_t rgbInterleavedPosition = 0;
    uint32_t yuvFrameSize = width * height * 2;

    for (uint32_t i = 0; i < yuvFrameSize; i += 16)
    {
        /* load data */
        uint8x16_t A = vld1q_u8(yuvBuffer + i);
        switch(yuvFormat)
        {
        case YUV_FMT_UYUV:
            y8 = vqtbl1_u8(A, vld1_u8(uyvy_y_indices));
            u8 = vqtbl1_u8(A, vld1_u8(uyvy_u_indices));
            v8 = vqtbl1_u8(A, vld1_u8(uyvy_v_indices));
            break;
        case YUV_FMT_YUYV:
            y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
            u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
            v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));
            break;
        default:
            return YUV_FALSE;
        }

        /* u8 to f16 */
        float16x8_t y = vcvtq_f16_u16(vmovl_u8(y8));
        float16x8_t u = vcvtq_f16_s16(vmovl_s8(vsub_s8(vreinterpret_s8_u8(u8), vdup_n_s8(128))));
        float16x8_t v = vcvtq_f16_s16(vmovl_s8(vsub_s8(vreinterpret_s8_u8(v8), vdup_n_s8(128))));

        /* calc yuyv to rgb */
        float16x8_t r = vaddq_f16(y, vmulq_n_f16(v, 1.402));
        float16x8_t g = vsubq_f16(vsubq_f16(y, vmulq_n_f16(v, 0.71414)), vmulq_n_f16(u, 0.34414));
        float16x8_t b = vaddq_f16(y, vmulq_n_f16(u, 1.772));

        /* limit rgb to 0-255 */
        r = vmaxq_f16(vminq_f16(r, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));
        g = vmaxq_f16(vminq_f16(g, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));
        b = vmaxq_f16(vminq_f16(b, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));

        /* convert f16 to u8 */
        uint8x8_t r8 = vmovn_u16(vcvtq_u16_f16(r));
        uint8x8_t g8 = vmovn_u16(vcvtq_u16_f16(g));
        uint8x8_t b8 = vmovn_u16(vcvtq_u16_f16(b));

        /* store u8 to rgb buffer */
        if (rgbBuffer != NULL)
        {
            vst1_u8(rgbBuffer0 + rgbPosition, b8);
            vst1_u8(rgbBuffer1 + rgbPosition, g8);
            vst1_u8(rgbBuffer2 + rgbPosition, r8);
            rgbPosition += 8;
        }
        if (rgbBufferInterleaved != NULL)
        {
            uint8x8x3_t rgbu8 = {{b8, g8, r8}};
            vst3_u8(rgbBufferInterleaved + rgbInterleavedPosition, rgbu8);
            rgbInterleavedPosition += 24;
        }
    }
    return YUV_TRUE;
}

2. Matrix Multiplication

"A(m x n) x B(n x j) = C(m x j)" refers to the multiplication of an m-by-n matrix A with an n-by-j matrix B, resulting in an m-by-j matrix C.

The optimization approach here differs from our customary programming practices, so a naive implementation is prone to negative optimization (pessimization).

The transition from matrix multiplication to vector multiplication

transition from matrix multiplication.png

  1. Neon does not natively support matrix multiplication.

  2. As shown in the picture above, the first two columns of the output matrix have been calculated using the traditional approach, and the following patterns/rules are identified:

  • The i-th column of the output matrix is obtained by summing, over x, the product of the x-th column of matrix A with the x-th element of the i-th column of matrix B, where 0 <= i < 4 and 0 <= x < 4.

  • Construct vectors in column units, a0(a11, a21, a31, a41)..., b0(b11, b21, b31, b41).... The calculation can be written as follows:

  • a0 x b0[0] + a1 x b0[1] + a2 x b0[2] + a3 x b0[3] -> the first column of the resulting matrix

  • a0 x b1[0] + a1 x b1[1] + a2 x b1[2] + a3 x b1[3] -> the second column of the resulting matrix

  • ....

  3. In summary, we can draw the following conclusions:

  • If the matrix is stored in memory in a column-major order, it is suitable for SIMD (Single Instruction, Multiple Data) computation.

  • If the matrix is stored in row-major order, NEON also provides interleaved load/store instructions that make it easy to convert the data into column-major vectors.
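The column rule above can be checked against a scalar reference in plain C (column-major storage; the helper name is illustrative):

```c
#include <assert.h>

/* Scalar 4x4 reference with column-major storage: column i of C is the sum
 * over x of (column x of A) times element x of column i of B, i.e.
 * C[i*4 + r] = sum_x A[x*4 + r] * B[i*4 + x]. */
static void matmul4_colmajor(const float *A, const float *B, float *C)
{
    for (int i = 0; i < 4; i++) {         /* output column */
        for (int r = 0; r < 4; r++) {     /* row within that column */
            float acc = 0.0f;
            for (int x = 0; x < 4; x++)
                acc += A[x * 4 + r] * B[i * 4 + x];
            C[i * 4 + r] = acc;
        }
    }
}
```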

If the original data is not in column-major order

NEON provides interleaved (structure) load/store instructions that make it easy to convert the data into column-major vectors.

This can be achieved using instructions such as vld4q_u32 and vst4q_u32 (with variants for other data types), as shown in the figure.

vector.png

Sample Code:

int main(int argc, const char * argv[]) 
{
    uint32_t memoryA[16] = {'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'};

    uint32x4x4_t columnMajorMatrix = vld4q_u32(memoryA);
    for (int i = 0; i < 4; i++)
    {
        printf("The matrix.val/vector[%d]:", i);
        for (int j = 0; j < 4; j++) {
            printf("%c ", columnMajorMatrix.val[i][j]);
        }
        printf("\n");
    }
    return 0;
}

Output Result:

The matrix.val/vector[0]:A A A A 
The matrix.val/vector[1]:B B B B 
The matrix.val/vector[2]:C C C C 
The matrix.val/vector[3]:D D D D 
Program ended with exit code: 0

As you can see: each vector has been saved with the corresponding column data in column-major order.
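In scalar terms, vld4q_u32 performs a 4-way de-interleave; a plain-C equivalent (illustrative only) is:

```c
#include <assert.h>
#include <stdint.h>

/* Split 16 interleaved values (ABCD ABCD ABCD ABCD) into four 4-element
 * planes, one per channel -- the scalar equivalent of vld4q_u32. */
static void deinterleave4_u32(const uint32_t src[16], uint32_t planes[4][4])
{
    for (int g = 0; g < 4; g++)          /* group of four */
        for (int c = 0; c < 4; c++)      /* channel within the group */
            planes[c][g] = src[g * 4 + c];
}
```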

The first version of multiplication

Assuming A, B, and C are in column-major order, with C being the output matrix:

float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {0};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the first column of matrix C and store it
float32x4_t bx = vld1q_f32(B + 0);
float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 0, cx);

// Compute the second column of matrix C and store it
bx = vld1q_f32(B + 4);
cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 4, cx);

// Compute the third column of matrix C and store it
bx  = vld1q_f32(B + 8);
cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 8, cx);

// Compute the fourth column of matrix C and store it
bx  = vld1q_f32(B + 12);
cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
vst1q_f32(C + 12, cx);

Issues here:

The bx and cx vectors are reused in every iteration, so the (n+1)-th column cannot be computed until the n-th column is completely calculated. NEON, however, provides execution pipelines that allow independent instructions to execute in parallel.

The second version of multiplication

float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {0};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the xth column of matrix C and store it.
float32x4_t bx, cx;
for (int i = 0; i <= 12; i+=4)
{
    bx = vld1q_f32(B + i);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + i, cx);
}

Notes:

Similar to the previous version, but the added loop reduces code volume and improves readability. The column-to-column dependency on bx and cx remains.

The third version of multiplication

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the xth column of matrix C and store it.
for (int i = 0; i <= 12; i+=4)
{
    float32x4_t bx = vld1q_f32(B + i);
    float32x4_t cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + i, cx);
}

Features:

Compared to the second version, the variables inside the loop body of this version depend only on data produced within the same iteration. For loop processing, we recommend:

  1. Minimize dependence on data from outside the current iteration as much as possible.

  2. If this cannot be avoided, reduce the dependence by splitting the loop, for example into nested loops, so that the inner loop that carries most of the work has no dependencies.
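One concrete way to follow recommendation 1 (an illustrative scalar sketch, not from the sample code): split a single loop-carried accumulator into several independent ones, giving the compiler or the NEON pipelines independent chains to schedule.

```c
#include <assert.h>

/* Single accumulator: every addition must wait for the previous one. */
static float sum_one_acc(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: the four chains within each iteration are independent.
 * Assumes n is a multiple of 4 for brevity. */
static float sum_four_acc(const float *v, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```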

The fourth version of multiplication

float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {0};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

float32x4_t b1, b2, b3, b4;
float32x4_t c1, c2, c3, c4;

// Compute the first column of matrix C and store it
b1 = vld1q_f32(B + 0);
c1 = vmulq_lane_f32(a0, vget_low_f32(b1), 0);
c1 = vmlaq_lane_f32(c1, a1, vget_low_f32(b1), 1);
c1 = vmlaq_lane_f32(c1, a2, vget_high_f32(b1), 0);
c1 = vmlaq_lane_f32(c1, a3, vget_high_f32(b1), 1);
vst1q_f32(C + 0, c1);

// Compute the second column of matrix C and store it
b2 = vld1q_f32(B + 4);
c2  = vmulq_lane_f32(a0, vget_low_f32(b2), 0);
c2 = vmlaq_lane_f32(c2, a1, vget_low_f32(b2), 1);
c2 = vmlaq_lane_f32(c2, a2, vget_high_f32(b2), 0);
c2 = vmlaq_lane_f32(c2, a3, vget_high_f32(b2), 1);
vst1q_f32(C + 4, c2);

// Compute the third column of matrix C and store it
b3  = vld1q_f32(B + 8);
c3  = vmulq_lane_f32(a0, vget_low_f32(b3), 0);
c3 = vmlaq_lane_f32(c3, a1, vget_low_f32(b3), 1);
c3 = vmlaq_lane_f32(c3, a2, vget_high_f32(b3), 0);
c3 = vmlaq_lane_f32(c3, a3, vget_high_f32(b3), 1);
vst1q_f32(C + 8, c3);

// Compute the fourth column of matrix C and store it
b4  = vld1q_f32(B + 12);
c4  = vmulq_lane_f32(a0, vget_low_f32(b4), 0);
c4 = vmlaq_lane_f32(c4, a1, vget_low_f32(b4), 1);
c4 = vmlaq_lane_f32(c4, a2, vget_high_f32(b4), 0);
c4 = vmlaq_lane_f32(c4, a3, vget_high_f32(b4), 1);
vst1q_f32(C + 12, c4);

Features:

  1. bx is divided into b1, b2, b3, b4.

  2. cx is divided into c1, c2, c3, c4.

  3. As a result, the calculation of each column is independent and can be performed separately. Multiple NEON instructions can then be scheduled for simultaneous execution, as in the following sequence. (Why? We do not know the exact reasons, but according to the official documentation, although there is no concept of kernel threads as in CUDA, NEON has functional computation pipelines that allow memory loads/stores and computations to execute concurrently.)

    • r0 = vmlaq_lane_f32(r0, a1, vget_low_f32(b0), 1);

    • r1 = vmlaq_lane_f32(r1, a1, vget_low_f32(b1), 1);

    • r2 = vmlaq_lane_f32(r2, a1, vget_low_f32(b2), 1);

    • r3 = vmlaq_lane_f32(r3, a1, vget_low_f32(b3), 1);

The main instructions used in the code

  1. vmulq_lane_f32(a0, a1, lane_idx)

    • Vector-by-scalar multiplication, where the scalar value comes from lane lane_idx of the 64-bit vector a1.

    • Pseudocode:

      • result[0] = a0[0] x a1[lane_idx]

      • result[1] = a0[1] x a1[lane_idx]

      • result[2] = a0[2] x a1[lane_idx]

      • result[3] = a0[3] x a1[lane_idx]

  2. vmlaq_lane_f32(a0, a1, a2, lane_idx)

    1. Multiply-add instruction. Pseudocode: a0 + a1 x a2[lane_idx]

    2. Mind the data types; the compiler will report an error on a mismatch.

    3. This instruction could be replaced with vaddq_f32(a0, vmulq_lane_f32(a1, a2, lane_idx)),

      1. but that splits the work into two instructions with a strict dependency between them.

      2. Therefore, the fused multiply-add instruction achieves better performance.

  3. vget_low_f32: obtain the first two elements of the vector

a1, a2, a3, a4 --> a1, a2

  4. vget_high_f32: obtain the last two elements of the vector

a1, a2, a3, a4 --> a3, a4
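The semantics of these instructions can be made concrete with a scalar C model (the struct and function names stand in for the NEON types and intrinsics; this is illustrative code, not arm_neon.h):

```c
#include <assert.h>

typedef struct { float v[4]; } f32x4;   /* stands in for float32x4_t */
typedef struct { float v[2]; } f32x2;   /* stands in for float32x2_t */

/* vget_low_f32 / vget_high_f32: first two / last two elements. */
static f32x2 get_low(f32x4 a)  { f32x2 r = {{a.v[0], a.v[1]}}; return r; }
static f32x2 get_high(f32x4 a) { f32x2 r = {{a.v[2], a.v[3]}}; return r; }

/* vmulq_lane_f32: multiply every element of a by lane lane_idx of b. */
static f32x4 mulq_lane(f32x4 a, f32x2 b, int lane_idx)
{
    f32x4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] * b.v[lane_idx];
    return r;
}

/* vmlaq_lane_f32: acc + a * b[lane_idx] in one step. */
static f32x4 mlaq_lane(f32x4 acc, f32x4 a, f32x2 b, int lane_idx)
{
    f32x4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = acc.v[i] + a.v[i] * b.v[lane_idx];
    return r;
}
```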

Complete Sample Code

void matrix_mul_version1(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    // Compute the first column of matrix C and store it
    float32x4_t bx = vld1q_f32(B + 0);
    float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 0, cx);

    // Compute the second column of matrix C and store it
    bx = vld1q_f32(B + 4);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 4, cx);

    // Compute the third column of matrix C and store it
    bx  = vld1q_f32(B + 8);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 8, cx);

    // Compute the fourth column of matrix C and store it
    bx  = vld1q_f32(B + 12);
    cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 12, cx);
}

void matrix_mul_version2(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);
    
    // Compute the x column of matrix C and store it
    float32x4_t bx, cx;
    for (int i = 0; i <= 12; i+=4)
    {
        bx = vld1q_f32(B + i);
        cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + i, cx);
    }
}

void matrix_mul_version3(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);
    
    // Compute the x column of matrix C and store it
    for (int i = 0; i <= 12; i+=4)
    {
        float32x4_t bx = vld1q_f32(B + i);
        float32x4_t cx  = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + i, cx);
    }
}

void matrix_mul_version4(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    float32x4_t b1, b2, b3, b4;
    float32x4_t c1, c2, c3, c4;

    // Compute the first column of matrix C and store it
    b1 = vld1q_f32(B + 0);
    c1 = vmulq_lane_f32(a0, vget_low_f32(b1), 0);
    c1 = vmlaq_lane_f32(c1, a1, vget_low_f32(b1), 1);
    c1 = vmlaq_lane_f32(c1, a2, vget_high_f32(b1), 0);
    c1 = vmlaq_lane_f32(c1, a3, vget_high_f32(b1), 1);
    vst1q_f32(C + 0, c1);

    // Compute the second column of matrix C and store it
    b2 = vld1q_f32(B + 4);
    c2  = vmulq_lane_f32(a0, vget_low_f32(b2), 0);
    c2 = vmlaq_lane_f32(c2, a1, vget_low_f32(b2), 1);
    c2 = vmlaq_lane_f32(c2, a2, vget_high_f32(b2), 0);
    c2 = vmlaq_lane_f32(c2, a3, vget_high_f32(b2), 1);
    vst1q_f32(C + 4, c2);

    // Compute the third column of matrix C and store it
    b3  = vld1q_f32(B + 8);
    c3  = vmulq_lane_f32(a0, vget_low_f32(b3), 0);
    c3 = vmlaq_lane_f32(c3, a1, vget_low_f32(b3), 1);
    c3 = vmlaq_lane_f32(c3, a2, vget_high_f32(b3), 0);
    c3 = vmlaq_lane_f32(c3, a3, vget_high_f32(b3), 1);
    vst1q_f32(C + 8, c3);

    // Compute the fourth column of matrix C and store it
    b4  = vld1q_f32(B + 12);
    c4  = vmulq_lane_f32(a0, vget_low_f32(b4), 0);
    c4 = vmlaq_lane_f32(c4, a1, vget_low_f32(b4), 1);
    c4 = vmlaq_lane_f32(c4, a2, vget_high_f32(b4), 0);
    c4 = vmlaq_lane_f32(c4, a3, vget_high_f32(b4), 1);
    vst1q_f32(C + 12, c4);
}

static void print_matrix_memory(const char* tag, float32_t* matrix)
{
    printf("output matrix content for %s: \n", tag);
    printf("%.4f %.4f %.4f %.4f\n",   matrix[0],  matrix[1],  matrix[2],  matrix[3]);
    printf("%.4f %.4f %.4f %.4f\n",   matrix[4],  matrix[5],  matrix[6],  matrix[7]);
    printf("%.4f %.4f %.4f %.4f\n",   matrix[8],  matrix[9],  matrix[10], matrix[11]);
    printf("%.4f %.4f %.4f %.4f\n\n", matrix[12], matrix[13], matrix[14], matrix[15]);
}

int main(int argc, const char * argv[])
{
    float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
    float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
    float32_t C[16] = {0};
    
    matrix_mul_version1(A, B, C);
    print_matrix_memory("version1 C", C);
    matrix_mul_version2(A, B, C);
    print_matrix_memory("version2 C", C);
    matrix_mul_version3(A, B, C);
    print_matrix_memory("version3 C", C);
    matrix_mul_version4(A, B, C);
    print_matrix_memory("version4 C", C);

    return 0;
}