This document describes how to use NEON on the SP7350 to accelerate data processing.We will describe what scenarios are suitable for using NEON, NEON learning materials, and examples of how to use NEON acceleration.

Table of contents

1 When to use NEON
2 NEON Study Materials
- 2.1 Official Reference Documents
- 2.2 Official Documentation Learning Path
3 NEON Examples
- 3.1 1. YUV to BGR Conversion
  - 3.1.1 YUV format
  - 3.1.2 RGB format
  - 3.1.3 Analysis
  - 3.1.4 Additional Notes
  - 3.1.5 Conversion Performance
  - 3.1.6 Sample Code for the General Conversion Method
  - 3.1.7 Sample Code for the Neon Conversion Method
- 3.2 2. Matrix Multiplication

When to use NEON

When performing algorithmic calculations, the following questions need to be considered:

Whether the input of the next calculation depends on the output of the previous calculation，and the calculation is single-instruction, single-data (SISD). if so, the algorithm may not be suitable for SIMD(neon on C3V).
However, some SISD-form algorithms may be transformed or approached from a different perspective, potentially allowing them to be converted into SIMD form while still meeting the required precision loss tolerance.

When considering algorithm optimization, do we necessarily need NEON (SIMD)?

If you're not familiar with NEON, are there still ways to optimize algorithms? Here is a possible approach:

Are there any loops in the algorithm? If not, can we make some improvements to introduce them?
Within the loop body, do some inputs of the subsequent iteration depend on the outputs of the previous iteration? If so, can we eliminate this dependency?
Can we extract the algorithm-related code into a separate source file and compile it with the O3 optimization flag in gcc (if O3 is not suitable for compiling all source files in the entire program)?

Suitable scenarios:

When data in memory is stored in a certain format, such as (RGBRGB)
Multiple data can be loaded into one or more variables regularly in a single operation.
One or more variables can be stored in memory regularly in a single operation.
Data types can be understood as 8-bit, 16-bit, 32-bit, and 64-bit signed, unsigned integers, or floating-point types.
Corresponding calculations or algorithms can have multiple inputs and outputs in a single computation.

NEON Study Materials

Official Reference Documents

Programming Guide: DEN0018A_neon_programmers_guide.pdf

Official Website Link: https://developer.arm.com/documentation/den0018/latest

Function Quick Reference Manual: https://developer.arm.com/architectures/instruction-sets/intrinsics
Quick Start: https://developer.arm.com/documentation/102159/latest

Official Documentation Learning Path

Programming Guide
Quick Start
During programming, you may refer to the Function Quick Reference Manual to look up the functions you should use.

NEON Examples

1. YUV to BGR Conversion

YUV format

There are many formats for YUV, but the format used in the example code is limited to YUYV, which has the following format: YUYV YUYV YUYV ... YUYV. It has the following characteristics：

Each pair of Y components shares a pair of UV components, so every two pixels occupy 4 bytes. For a YUYV image with a width of W and a height of H, it occupies W * H bytes.

RGB format

Due to the instructions for decentralized storage loading in neon, there is basically no impact on performance depending on which type of RGB is stored, whether it is RGB, BGA, RGBA, BGRA, etc.

Analysis

We need to extract a certain amount of Y, a certain amount of V, and a certain amount of U data from the buffer for calculations. It is important to maximize the number of Y components participating in the calculations without causing any overflow. Given that we are dealing with 8-bit unsigned integers and considering the conversion algorithm, 16-bit float data should be sufficient. Here's how we proceed:

When using Q registers, we can process 128 / 16 = 8 F16-type data at a time.
Before the calculation, we need to load the data as follows:○ Load 8 Ys into a float16x8_t variable. The content will be:Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8○ Load the 4 Us shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be:U1 U1 U2 U2 U3 U3 U4 U4○ Load the 4 Vs shared by these 8 Ys into a float16x8_t variable. For convenient calculation, the content will be:V1 V1 V2 V2 V3 V3 V4 V4
To load the data in this format, we need a specific NEON instruction: vqtbl1_u8. This instruction extracts 8 bytes from specified positions in a uint8x16 variable (the size of a Q register) and forms a uint8x8 variable (the size of a D register). Here's an example code snippet:

static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6,  8, 10, 12, 14};
static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5,  9,  9, 13, 13};
static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};

... 
uint8x16_t A = vld1q_u8(yuvBuffer + i);
uint8x8_t y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
uint8x8_t u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
uint8x8_t v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));

... 
// Then convert uint8x8_t to float8x16_t and then apply the real convertion.

Additional Notes

This is just an example NEON program. Under the condition of accepting a certain level of precision loss, there are better methods or optimization directions for converting YUV to RGB. You can search online for articles on YUV to RGB conversion, Alternative conversion methods include:

Lookup Table Approach
Using Integer Data in Calculations

Conversion Performance

Using the gcc (O3) optimization option

convert norm_bgr succeed, takes 27
convert neon_bgr succeed, takes 13

NEON conversion takes 13ms
Traditional conversion takes 27ms

Without the gcc (O3) optimization option

convert norm_bgr succeed, takes 125
convert neon_bgr succeed, takes 55

NEON conversion takes 55ms
Traditional conversion takes 125ms

Sample Code for the General Conversion Method

Sample Code for the Neon Conversion Method

2. Matrix Multiplication

"Amxn X Bnxj = Cmj" refers to the multiplication of an m-by-n matrix A with an n-by-j matrix B, resulting in an m-by-j matrix C.

The optimization approach in this context differs from our customary programming practices, making it susceptible to negative optimization in the code.

The transition from matrix multiplication to vector multiplication

transition from matrix multiplication.png

Neon does not natively support matrix multiplication.
As shown in the above picture, the first two columns of the output matrix have been calculated using the traditional approach. And he following patterns/rules are identified:
- The i-th column of the output matrix is obtained by multiplying the x-th column of matrix A with the x-th element of the i-th column of matrix B, where 0 <= i < 4 and 0 <= x < 4.
- Construct vectors in column units, a0(a11, a21, a31, a41)..., b0(b11, b21, b31, b41).... The calculation can be written as follows:
- a0 x b0[0] + a1 x b0[1] + a2 x b0[2] + a3 x b0[3] ->The first column of the resulting matrix
- a0 x b1[0] + a1 x b1[1] + a2 x b1[2] + a3 x b1[3] ->The second column of the resulting matrix
- ....

In summary, we can get the following conclusion:
- If the matrix is stored in memory in a column-major order, it is suitable for SIMD (Single Instruction, Multiple Data) computation.
- If the matrix is stored in row-first order, neon also provides instructions for scatter load/store, making it easier to convert to column-first vector

If the original data is not in column-major order

neon provides scatter load/store instructions that make it easier to convert the data into column-major vectors.

This can be achieved using instructions such as vld4q_datatype and vst4q_datatype, as shown in the figure.

Sample Code:

Output Result:

As you can see: each vector has been saved with the corresponding column data in column-major order.

The first version of multiplication

assuming A, B, and C are in column-major order, with C being the output matrix

issues here：

The bx and cx vectors are used in each iteration, and to compute the n+1th column, we have to wait until the nth column is completely calculated. However, neon actually provides some channels that allow for parallel execution.

The second version of multiplication