Using ARM NEON Instructions to Accelerate Data Processing

This document describes how to use NEON on the SP7350 to accelerate data processing. It covers the scenarios where NEON is a good fit, learning materials for NEON, and worked examples of NEON acceleration.

When to use NEON

When performing algorithmic calculations, the following questions need to be considered:

  1. Does the input of the next calculation depend on the output of the previous one, so that the computation is inherently single-instruction, single-data (SISD)? If so, the algorithm may not be suitable for SIMD (NEON on the C3V).

  2. However, some algorithms in SISD form can be transformed or approached from a different perspective, potentially allowing them to be converted into SIMD form while still staying within the acceptable precision loss.

When considering algorithm optimization, do we necessarily need NEON (SIMD)?

If you're not familiar with NEON, are there still ways to optimize algorithms? Here is a possible approach:

  1. Are there any loops in the algorithm? If not, can we make some improvements to introduce them?

  2. Within the loop body, do some inputs of the subsequent iteration depend on the outputs of the previous iteration? If so, can we eliminate this dependency?

  3. Can we extract the algorithm-related code into a separate source file and compile it with the -O3 optimization flag in gcc (if -O3 is not suitable for all source files in the program)? One possible technique is shown below.
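
If -O3 cannot be enabled globally, one GCC-specific alternative to a per-file Makefile rule is the optimize function attribute. A minimal sketch (the function name and loop are illustrative, not from the original program):

    /* Force -O3 on a single hot function; the rest of the translation unit
     * keeps the project's default optimization level. GCC-specific. */
    __attribute__((optimize("O3")))
    void scale_buffer(float *dst, const float *src, int n, float k)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;   /* independent iterations: easy to auto-vectorize */
    }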

Suitable scenarios:

  1. Data in memory is stored in a regular format, such as interleaved RGB (RGBRGB...).

  2. Multiple data items can be loaded into one or more vector variables in a single, regular operation.

  3. One or more vector variables can be stored back to memory in a single, regular operation.

  4. The data can be treated as 8-bit, 16-bit, 32-bit, or 64-bit signed or unsigned integers, or as floating-point values.

  5. The corresponding calculation or algorithm can consume multiple inputs and produce multiple outputs in a single computation.
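
As a concrete illustration of scenarios 1-3, here is a minimal sketch (the function name and the brightness offset are illustrative): 48 interleaved RGB bytes are loaded in a single operation, each channel is processed as a whole vector, and the result is stored back in a single operation.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Brighten 16 interleaved RGB pixels (48 bytes) in place. */
    void brighten_rgb_16px(uint8_t *rgb)
    {
        uint8x16x3_t px = vld3q_u8(rgb);   /* de-interleave: val[0]=R, val[1]=G, val[2]=B */
        px.val[0] = vqaddq_u8(px.val[0], vdupq_n_u8(10));  /* saturating add on all 16 R values */
        px.val[1] = vqaddq_u8(px.val[1], vdupq_n_u8(10));
        px.val[2] = vqaddq_u8(px.val[2], vdupq_n_u8(10));
        vst3q_u8(rgb, px);                 /* re-interleave and store back */
    }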

NEON Study Materials

Official Reference Documents

  1. Programming Guide: DEN0018A_neon_programmers_guide.pdf (https://developer.arm.com/documentation/den0018/latest)

  2. Function Quick Reference Manual: https://developer.arm.com/architectures/instruction-sets/intrinsics

  3. Quick Start: https://developer.arm.com/documentation/102159/latest

Official Documentation Learning Path

  1. Programming Guide

  2. Quick Start

  3. During programming, you may refer to the Function Quick Reference Manual to look up the functions you should use.

NEON Examples

1. YUV to BGR Conversion

YUV format

There are many YUV formats, but the example code is limited to YUYV, whose byte layout is: YUYV YUYV YUYV ... YUYV. It has the following characteristics:

Each pair of Y components shares one U and one V component, so every two pixels occupy 4 bytes. A YUYV image of width W and height H therefore occupies W * H * 2 bytes.

RGB format

Thanks to NEON's interleaving (de-interleaving) load/store instructions, performance is essentially unaffected by which RGB layout is used for storage: RGB, BGR, RGBA, BGRA, etc.

Analysis

We need to extract a certain amount of Y, a certain amount of V, and a certain amount of U data from the buffer for calculations. It is important to maximize the number of Y components participating in the calculations without causing any overflow. Given that we are dealing with 8-bit unsigned integers and considering the conversion algorithm, 16-bit float data should be sufficient. Here's how we proceed:

  1. When using Q registers, we can process 128 / 16 = 8 F16-type data at a time.

  2. Before the calculation, we need to load the data as follows:

    • Load 8 Ys into a float16x8_t variable. The content will be: Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8

    • Load the 4 Us shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be: U1 U1 U2 U2 U3 U3 U4 U4

    • Load the 4 Vs shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be: V1 V1 V2 V2 V3 V3 V4 V4

  3. To load the data in this format, we need a specific NEON instruction: vqtbl1_u8. This instruction extracts 8 bytes from specified positions in a uint8x16_t variable (the size of a Q register) and forms a uint8x8_t variable (the size of a D register). Here's an example code snippet:

    static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6, 8, 10, 12, 14};
    static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5, 9, 9, 13, 13};
    static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};
    ...
    uint8x16_t A = vld1q_u8(yuvBuffer + i);
    uint8x8_t y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
    uint8x8_t u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
    uint8x8_t v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));
    ...
    // Then widen each uint8x8_t to float16x8_t and apply the real conversion.

Additional Notes

This is just an example NEON program. If a certain level of precision loss is acceptable, there are better methods and optimization directions for converting YUV to RGB; you can search online for articles on the topic. Alternative conversion methods include:

  1. Lookup Table Approach

  2. Using Integer Data in Calculations

Conversion Performance

  1. With the gcc -O3 optimization option

    convert norm_bgr succeed, takes 27
    convert neon_bgr succeed, takes 13

  • NEON conversion takes 13 ms

  • Traditional conversion takes 27 ms

  2. Without the gcc -O3 optimization option

    convert norm_bgr succeed, takes 125
    convert neon_bgr succeed, takes 55

  • NEON conversion takes 55 ms

  • Traditional conversion takes 125 ms

Sample Code for the General Conversion Method
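
The original sample is not reproduced here. Below is a minimal scalar sketch of the general method, assuming the YUYV layout described above and integer BT.601 full-range coefficients (the function name and fixed-point scaling are illustrative):

    #include <stdint.h>

    static inline uint8_t clamp_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

    /* Scalar YUYV -> BGR: every 4 input bytes (Y0 U Y1 V) yield two BGR pixels. */
    void yuyv_to_bgr_norm(const uint8_t *yuyv, uint8_t *bgr, int width, int height)
    {
        int pairs = width * height / 2;
        for (int i = 0; i < pairs; i++, yuyv += 4) {
            int u = yuyv[1] - 128, v = yuyv[3] - 128;
            for (int k = 0; k < 2; k++) {
                int y = yuyv[2 * k];                                    /* Y0 then Y1 */
                *bgr++ = clamp_u8(y + ((116130 * u) >> 16));            /* B = Y + 1.772 U */
                *bgr++ = clamp_u8(y - ((22554 * u + 46802 * v) >> 16)); /* G = Y - 0.344 U - 0.714 V */
                *bgr++ = clamp_u8(y + ((91881 * v) >> 16));             /* R = Y + 1.402 V */
            }
        }
    }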

Sample Code for the Neon Conversion Method
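
The original sample is not reproduced here. Below is a minimal NEON sketch following the analysis above: 8 pixels per iteration, gathered with vqtbl1_u8 and computed in float16. It assumes a toolchain and core with FP16 vector arithmetic (e.g. -march=armv8.2-a+fp16), width * height divisible by 8, and BT.601 full-range coefficients:

    #include <arm_neon.h>
    #include <stdint.h>

    void yuyv_to_bgr_neon(const uint8_t *yuyv, uint8_t *bgr, int width, int height)
    {
        static const uint8_t yi[8] = {0, 2, 4, 6, 8, 10, 12, 14};
        static const uint8_t ui[8] = {1, 1, 5, 5, 9, 9, 13, 13};
        static const uint8_t vi[8] = {3, 3, 7, 7, 11, 11, 15, 15};
        int blocks = width * height / 8;            /* 8 pixels = 16 input bytes */
        for (int i = 0; i < blocks; i++, yuyv += 16, bgr += 24) {
            uint8x16_t blk = vld1q_u8(yuyv);
            /* Gather Y1..Y8, U1 U1 U2 U2 ..., V1 V1 V2 V2 ... and widen to f16. */
            float16x8_t y = vcvtq_f16_u16(vmovl_u8(vqtbl1_u8(blk, vld1_u8(yi))));
            float16x8_t u = vcvtq_f16_u16(vmovl_u8(vqtbl1_u8(blk, vld1_u8(ui))));
            float16x8_t v = vcvtq_f16_u16(vmovl_u8(vqtbl1_u8(blk, vld1_u8(vi))));
            u = vsubq_f16(u, vdupq_n_f16(128.0f));
            v = vsubq_f16(v, vdupq_n_f16(128.0f));
            float16x8_t b = vfmaq_f16(y, u, vdupq_n_f16(1.772f));
            float16x8_t g = vfmaq_f16(vfmaq_f16(y, u, vdupq_n_f16(-0.344f)),
                                      v, vdupq_n_f16(-0.714f));
            float16x8_t r = vfmaq_f16(y, v, vdupq_n_f16(1.402f));
            /* Convert back (negative values clamp to 0, vqmovn saturates at 255)
             * and store the three channels interleaved as B G R B G R ... */
            uint8x8x3_t out;
            out.val[0] = vqmovn_u16(vcvtq_u16_f16(b));
            out.val[1] = vqmovn_u16(vcvtq_u16_f16(g));
            out.val[2] = vqmovn_u16(vcvtq_u16_f16(r));
            vst3_u8(bgr, out);
        }
    }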

2. Matrix Multiplication

"Amxn X Bnxj = Cmj" refers to the multiplication of an m-by-n matrix A with an n-by-j matrix B, resulting in an m-by-j matrix C.

The optimization approach in this context differs from our customary programming practices, so code written in the usual style is prone to negative optimization.

The transition from matrix multiplication to vector multiplication

 

[Figure: transition from matrix multiplication]

  1. Neon does not natively support matrix multiplication.

  2. As shown in the picture above, the first two columns of the output matrix have been calculated using the traditional approach, and the following patterns/rules can be identified:

    • The i-th column of the output matrix is obtained by summing, over x, the products of the x-th column of matrix A with the x-th element of the i-th column of matrix B, where 0 <= i < 4 and 0 <= x < 4.

    • Construct vectors in column units, a0(a11, a21, a31, a41)..., b0(b11, b21, b31, b41).... The calculation can be written as follows:

    • a0 x b0[0] + a1 x b0[1] + a2 x b0[2] + a3 x b0[3] -> the first column of the resulting matrix

    • a0 x b1[0] + a1 x b1[1] + a2 x b1[2] + a3 x b1[3] -> the second column of the resulting matrix

    • ....

  3. In summary, we can draw the following conclusions:

    • If the matrix is stored in memory in a column-major order, it is suitable for SIMD (Single Instruction, Multiple Data) computation.

    • If the matrix is stored in row-major order, NEON also provides scatter/gather-style (interleaved) load/store instructions, making it easy to convert the data to column-major vectors.

If the original data is not in column-major order

NEON provides scatter/gather-style (interleaved) load/store instructions that make it easy to convert the data into column-major vectors.

This can be achieved using instructions such as vld4q_datatype and vst4q_datatype, as shown in the figure.

 

[Figure: vector]

Sample Code:
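
The original sample is not reproduced here. A minimal sketch with illustrative values, showing vld4q_f32 de-interleaving a row-major 4x4 matrix into four column vectors:

    #include <arm_neon.h>
    #include <stdio.h>

    int main(void)
    {
        float m[16] = { 1,  2,  3,  4,     /* row-major 4x4 matrix */
                        5,  6,  7,  8,
                        9, 10, 11, 12,
                       13, 14, 15, 16};
        float32x4x4_t cols = vld4q_f32(m); /* cols.val[i] receives column i */
        for (int i = 0; i < 4; i++) {
            float out[4];
            vst1q_f32(out, cols.val[i]);
            printf("col%d: %g %g %g %g\n", i, out[0], out[1], out[2], out[3]);
        }
        return 0;
    }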

Output Result:
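
For the sketch above, the printed result is:

    col0: 1 5 9 13
    col1: 2 6 10 14
    col2: 3 7 11 15
    col3: 4 8 12 16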

As you can see, each vector now holds the data of the corresponding column, i.e. the matrix is available in column-major vector form.

The first version of multiplication

Assuming A, B, and C are in column-major order, with C being the output matrix.
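
The original code is not reproduced here. A minimal sketch of this version, with only the first two output columns written out for brevity:

    #include <arm_neon.h>

    void matmul4x4_v1(const float *A, const float *B, float *C)
    {
        float32x4_t a0 = vld1q_f32(A +  0);   /* the four columns of A */
        float32x4_t a1 = vld1q_f32(A +  4);
        float32x4_t a2 = vld1q_f32(A +  8);
        float32x4_t a3 = vld1q_f32(A + 12);
        float32x4_t bx, cx;                   /* reused for every column */

        bx = vld1q_f32(B + 0);                /* column 0 of B */
        cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + 0, cx);                 /* column 0 of C */

        bx = vld1q_f32(B + 4);                /* column 1 of B */
        cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + 4, cx);                 /* column 1 of C */

        /* ... columns 2 and 3 repeat the same six lines with B/C + 8 and + 12. */
    }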

Issues here:

The same bx and cx vectors are reused for every column, so the computation of the (n+1)-th column must wait until the n-th column has been completely calculated. However, NEON actually provides execution channels that allow independent operations to proceed in parallel.

The second version of multiplication
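
The original code is not reproduced here. A minimal sketch, under the same column-major assumption, with the per-column code folded into a loop:

    #include <arm_neon.h>

    void matmul4x4_v2(const float *A, const float *B, float *C)
    {
        float32x4_t a0 = vld1q_f32(A +  0);
        float32x4_t a1 = vld1q_f32(A +  4);
        float32x4_t a2 = vld1q_f32(A +  8);
        float32x4_t a3 = vld1q_f32(A + 12);
        float32x4_t bx, cx;                    /* still shared by all iterations */

        for (int i = 0; i < 4; i++) {
            bx = vld1q_f32(B + 4 * i);         /* column i of B */
            cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
            cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
            cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
            cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
            vst1q_f32(C + 4 * i, cx);          /* column i of C */
        }
    }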

Issues here:

The same dependency as in the first version remains, but the loop reduces code volume and improves readability.

The third version of multiplication
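
The original code is not reproduced here. A minimal sketch, under the same column-major assumption; the only change from version two is that bx and cx are defined inside the loop body:

    #include <arm_neon.h>

    void matmul4x4_v3(const float *A, const float *B, float *C)
    {
        float32x4_t a0 = vld1q_f32(A +  0);
        float32x4_t a1 = vld1q_f32(A +  4);
        float32x4_t a2 = vld1q_f32(A +  8);
        float32x4_t a3 = vld1q_f32(A + 12);

        for (int i = 0; i < 4; i++) {
            /* bx and cx are local to the iteration, so the body depends only
             * on data produced within the same iteration. */
            float32x4_t bx = vld1q_f32(B + 4 * i);
            float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
            cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
            cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
            cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
            vst1q_f32(C + 4 * i, cx);
        }
    }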

Features:

Compared to the second version, the variables inside the loop body depend only on data produced within the same iteration. For loop processing in general, we recommend:

  1. Minimize dependence on data from outside the iteration as much as possible.

  2. If that cannot be avoided, reduce the dependence by splitting the loop, for example into nested loops, so that the inner loop that carries most of the work has no cross-iteration dependencies.

The fourth version of multiplication
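
The original code is not reproduced here. A minimal sketch, under the same column-major assumption, using the b0..b3 and r0..r3 names that appear in the bullets below:

    #include <arm_neon.h>

    void matmul4x4_v4(const float *A, const float *B, float *C)
    {
        float32x4_t a0 = vld1q_f32(A +  0), a1 = vld1q_f32(A +  4);
        float32x4_t a2 = vld1q_f32(A +  8), a3 = vld1q_f32(A + 12);
        float32x4_t b0 = vld1q_f32(B +  0), b1 = vld1q_f32(B +  4);
        float32x4_t b2 = vld1q_f32(B +  8), b3 = vld1q_f32(B + 12);

        /* All four output columns are kept in flight; the four instructions
         * in each group below are independent of one another. */
        float32x4_t r0 = vmulq_lane_f32(a0, vget_low_f32(b0), 0);
        float32x4_t r1 = vmulq_lane_f32(a0, vget_low_f32(b1), 0);
        float32x4_t r2 = vmulq_lane_f32(a0, vget_low_f32(b2), 0);
        float32x4_t r3 = vmulq_lane_f32(a0, vget_low_f32(b3), 0);

        r0 = vmlaq_lane_f32(r0, a1, vget_low_f32(b0), 1);
        r1 = vmlaq_lane_f32(r1, a1, vget_low_f32(b1), 1);
        r2 = vmlaq_lane_f32(r2, a1, vget_low_f32(b2), 1);
        r3 = vmlaq_lane_f32(r3, a1, vget_low_f32(b3), 1);

        r0 = vmlaq_lane_f32(r0, a2, vget_high_f32(b0), 0);
        r1 = vmlaq_lane_f32(r1, a2, vget_high_f32(b1), 0);
        r2 = vmlaq_lane_f32(r2, a2, vget_high_f32(b2), 0);
        r3 = vmlaq_lane_f32(r3, a2, vget_high_f32(b3), 0);

        r0 = vmlaq_lane_f32(r0, a3, vget_high_f32(b0), 1);
        r1 = vmlaq_lane_f32(r1, a3, vget_high_f32(b1), 1);
        r2 = vmlaq_lane_f32(r2, a3, vget_high_f32(b2), 1);
        r3 = vmlaq_lane_f32(r3, a3, vget_high_f32(b3), 1);

        vst1q_f32(C +  0, r0);
        vst1q_f32(C +  4, r1);
        vst1q_f32(C +  8, r2);
        vst1q_f32(C + 12, r3);
    }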

Features:

  1. bx is split into four per-column vectors (b0, b1, b2, b3 in the sketch above).

  2. cx is split into four per-column accumulators (r0, r1, r2, r3 in the sketch above).

  3. As a result, the calculation of each column is independent and can be performed separately, and multiple instructions can be scheduled for simultaneous execution, such as the following. (Why? We do not know the exact reasons, but according to the official documentation, although NEON has no concept of kernel threads as CUDA does, it appears to have similar functional computation channels that allow memory loads/stores and computations to execute concurrently.)

    • r0 = vmlaq_lane_f32(r0, a1, vget_low_f32(b0), 1);

    • r1 = vmlaq_lane_f32(r1, a1, vget_low_f32(b1), 1);

    • r2 = vmlaq_lane_f32(r2, a1, vget_low_f32(b2), 1);

    • r3 = vmlaq_lane_f32(r3, a1, vget_low_f32(b3), 1);

The main instructions used in the code

  1. vmulq_lane_f32(a0, a1, lane_idx)

    • Vector-scalar multiplication, where the scalar value comes from the lane_idx channel of a1.

    • Pseudocode:

      • result[0] = a0[0] x a1[lane_idx]

      • result[1] = a0[1] x a1[lane_idx]

      • result[2] = a0[2] x a1[lane_idx]

      • result[3] = a0[3] x a1[lane_idx]

  2. vmlaq_lane_f32(a0, a1, a2, lane_idx)

    1. Multiply-Add Instruction: Pseudocode a0 + a1 x a2[lane_idx]

    2. Note the data types; if they do not match, the compiler will report an error.

    3. This instruction could be replaced with vaddq_f32(a0, vmulq_lane_f32(a1, a2, lane_idx)).

      1. That, however, splits the work into two instructions with a strict dependency between them.

      2. Therefore, the fused multiply-add instruction achieves better performance.

  3. vget_low_f32: obtains the first two elements of the vector:

a1, a2, a3, a4 --> a1, a2

  4. vget_high_f32: obtains the last two elements of the vector:

a1, a2, a3, a4 --> a3, a4

Complete Sample Code
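
The complete sample is not reproduced here. A minimal self-contained harness, assuming the matmul4x4_v4 sketch above, that checks the NEON result against a scalar reference:

    #include <arm_neon.h>
    #include <math.h>
    #include <stdio.h>

    void matmul4x4_v4(const float *A, const float *B, float *C);  /* sketch above */

    /* Scalar reference. Column-major: element (row r, col c) lives at [4*c + r]. */
    static void matmul4x4_ref(const float *A, const float *B, float *C)
    {
        for (int c = 0; c < 4; c++)
            for (int r = 0; r < 4; r++) {
                float s = 0.0f;
                for (int k = 0; k < 4; k++)
                    s += A[4 * k + r] * B[4 * c + k];
                C[4 * c + r] = s;
            }
    }

    int main(void)
    {
        float A[16], B[16], C[16], R[16];
        for (int i = 0; i < 16; i++) { A[i] = (float)i; B[i] = (float)(i * i); }

        matmul4x4_v4(A, B, C);
        matmul4x4_ref(A, B, R);

        for (int i = 0; i < 16; i++)
            if (fabsf(C[i] - R[i]) > 1e-3f) { printf("mismatch at %d\n", i); return 1; }
        printf("neon result matches reference\n");
        return 0;
    }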