This document describes how to use NEON on the SP7350 to accelerate data processing. It covers which scenarios are suitable for NEON, NEON learning materials, and worked examples of NEON acceleration.
...
When to use NEON
When performing algorithmic calculations, the following questions need to be considered:
Does the input of the next calculation depend on the output of the previous calculation, making the computation single-instruction, single-data (SISD)? If so, the algorithm may not be suitable for SIMD (NEON on C3V).
However, some SISD-form algorithms can be transformed, or approached from a different perspective, and thereby converted into SIMD form while still staying within the required precision-loss tolerance.
When considering algorithm optimization, do we necessarily need NEON (SIMD)?
If you're not familiar with NEON, are there still ways to optimize algorithms? Here is a possible approach:
Are there any loops in the algorithm? If not, can we make some improvements to introduce them?
Within the loop body, do inputs of one iteration depend on outputs of the previous iteration? If so, can we eliminate this dependency (see the sketch after this list)?
Can we extract the algorithm-related code into a separate source file and compile just that file with GCC's O3 optimization flag (useful when O3 is not suitable for all source files in the program)?
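To illustrate the dependency point above, here is a minimal sketch (hypothetical code, not from the original example set) of breaking a loop-carried dependency in a summation by keeping independent partial sums, which gives the compiler's auto-vectorizer, or a later NEON rewrite, room to work:

```c
#include <stddef.h>

/* Dependent form: each iteration needs the previous sum. */
float sum_dependent(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];                    /* s depends on the previous s */
    return s;
}

/* Reworked form: four independent accumulators per iteration. */
float sum_independent(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* the four adds are independent */
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)                /* scalar tail */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```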
Suitable scenarios:
Data in memory is stored in a regular format, such as interleaved (RGBRGB...) pixels (see the sketch after this list).
Multiple data elements can be loaded into one or more vector variables in a single, regular operation.
One or more vector variables can be stored back to memory in a single, regular operation.
The data can be treated as 8-bit, 16-bit, 32-bit, or 64-bit signed integers, unsigned integers, or floating-point values.
The corresponding calculation or algorithm can consume multiple inputs and produce multiple outputs per operation.
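As a small illustration of the first three points, the sketch below (a hypothetical helper, assuming an AArch64 target with <arm_neon.h>) loads interleaved (RGBRGB...) data, de-interleaves it into three vectors in one operation, and stores each plane back with single operations:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Hypothetical helper: de-interleave 16 RGB pixels per iteration.
   vld3q_u8 loads 48 bytes (RGBRGB...) into three 16-lane registers. */
void split_rgb_planes(const uint8_t *rgb,
                      uint8_t *r, uint8_t *g, uint8_t *b,
                      uint32_t pixels)
{
    uint32_t i = 0;
    for (; i + 16 <= pixels; i += 16) {
        uint8x16x3_t px = vld3q_u8(rgb + i * 3); /* val[0]=R, val[1]=G, val[2]=B */
        vst1q_u8(r + i, px.val[0]);
        vst1q_u8(g + i, px.val[1]);
        vst1q_u8(b + i, px.val[2]);
    }
    for (; i < pixels; i++) {                    /* scalar tail */
        r[i] = rgb[i * 3 + 0];
        g[i] = rgb[i * 3 + 1];
        b[i] = rgb[i * 3 + 2];
    }
}
```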
NEON Study Materials
Official Reference Documents
Programming Guide: DEN0018A_neon_programmers_guide.pdf
...
Function Quick Reference Manual: https://developer.arm.com/architectures/instruction-sets/intrinsics
Quick Start: https://developer.arm.com/documentation/102159/latest
Official Documentation Learning Path
Programming Guide
Quick Start
During programming, refer to the Function Quick Reference Manual to look up the intrinsics you need.
NEON Examples
1. YUV to BGR Conversion
YUV format
YUV has many variants, but the format used in the example code is limited to YUYV, which is laid out as: YUYV YUYV YUYV ... YUYV. It has the following characteristics:
Each pair of Y components shares one U/V pair, so every two pixels occupy 4 bytes. A YUYV image of width W and height H therefore occupies W × H × 2 bytes (for example, a 1920 × 1080 frame takes 1920 × 1080 × 2 = 4,147,200 bytes).
RGB format
Thanks to NEON's interleaved (scatter/gather) load and store instructions, performance is essentially the same regardless of which RGB layout is produced: RGB, BGR, RGBA, BGRA, and so on.
Analysis
We need to extract a batch of Y, U, and V data from the buffer for calculation, maximizing the number of Y components processed per iteration without causing overflow. Given that the inputs are 8-bit unsigned integers, and considering the conversion algorithm, 16-bit floating-point data is sufficient. Here's how we proceed:
...
```c
static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6, 8, 10, 12, 14};
static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5, 9, 9, 13, 13};
static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};
...
uint8x16_t A = vld1q_u8(yuvBuffer + i);
uint8x8_t y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
uint8x8_t u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
uint8x8_t v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));
...
// Then widen uint8x8_t to float16x8_t and apply the actual conversion.
```
Additional Notes
This is just an example NEON program. If a certain level of precision loss is acceptable, there are better methods and optimization directions for converting YUV to RGB; you can search online for articles on YUV-to-RGB conversion. Alternative conversion methods include:
Lookup Table Approach
Using Integer Data in Calculations (see the sketch after this list)
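As a taste of the integer approach, here is a minimal scalar sketch (the fixed-point coefficients scaled by 2^16 are our assumption, not the original code) that replaces the floating-point multiplies with integer multiplies and shifts:

```c
#include <stdint.h>

static inline uint8_t clamp_u8(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* u and v are chroma values with 128 already subtracted. Coefficients are
   scaled by 2^16: 1.402->91881, 0.34414->22554, 0.71414->46802, 1.772->116130. */
static inline void yuv_to_rgb_fixed(int32_t y, int32_t u, int32_t v,
                                    uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = clamp_u8(y + ((91881 * v) >> 16));
    *g = clamp_u8(y - ((22554 * u + 46802 * v) >> 16));
    *b = clamp_u8(y + ((116130 * u) >> 16));
}
```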
Conversion Performance
Measured with the gcc O3 optimization option:
...
NEON conversion takes 55ms
Traditional conversion takes 125ms
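For reference, a timing harness along these lines (the measurement method is an assumption, not the original test setup) could be used to reproduce such numbers:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Returns elapsed wall-clock time in milliseconds. */
static double elapsed_ms(struct timespec s, struct timespec e)
{
    return (e.tv_sec - s.tv_sec) * 1e3 + (e.tv_nsec - s.tv_nsec) / 1e6;
}

static void time_conversions(uint32_t w, uint32_t h, uint8_t *yuv, uint8_t *bgr)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    yuvToBgrByNeon(YUV_FMT_YUYV, w, h, yuv, bgr, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("neon conversion: %.2f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    yuvToBgrByNorm(YUV_FMT_YUYV, w, h, yuv, bgr, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("traditional conversion: %.2f ms\n", elapsed_ms(t0, t1));
}
```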
Sample Code for the General Conversion Method
```c
static inline uint8_t limit_value_to_0_255(float value)
{
    if (value < 0) {
        value = 0;
    } else if (value > 255) {
        value = 255;
    }
    return (uint8_t)value;
}

static inline void yuvtoRgbByNormx2(
    uint8_t* ru0, uint8_t* gu0, uint8_t* bu0,
    uint8_t* ru1, uint8_t* gu1, uint8_t* bu1,
    float y1, float y2, float u0, float v0)
{
    /* calc Y0 */
    float r0 = y1 + 1.402 * v0;
    float g0 = y1 - 0.34414 * u0 - 0.71414 * v0;
    float b0 = y1 + 1.772 * u0;
    *ru0 = limit_value_to_0_255(r0);
    *gu0 = limit_value_to_0_255(g0);
    *bu0 = limit_value_to_0_255(b0);

    /* calc Y1 */
    float r1 = y2 + 1.402 * v0;
    float g1 = y2 - 0.34414 * u0 - 0.71414 * v0;
    float b1 = y2 + 1.772 * u0;
    *ru1 = limit_value_to_0_255(r1);
    *gu1 = limit_value_to_0_255(g1);
    *bu1 = limit_value_to_0_255(b1);
}

YuvBool yuvToBgrByNorm(
    int yuvFormat, uint32_t width, uint32_t height,
    uint8_t* yuvBuffer, uint8_t* rgbBuffer, uint8_t* rgbBufferInterleaved)
{
    float y1, y2, u0, v0;
    uint8_t ru0, gu0, bu0;
    uint8_t ru1, gu1, bu1;
    uint32_t rgbPosition = 0;
    uint32_t rgbInterleavedPosition = 0;
    uint8_t* rgbBuffer0 = rgbBuffer;
    uint8_t* rgbBuffer1 = rgbBuffer + width * height;
    uint8_t* rgbBuffer2 = rgbBuffer + width * height * 2;

    /* size for yuyv/uyvy */
    uint32_t yuvFrameSize = width * height * 2;

    for (uint32_t i = 0; i < yuvFrameSize; i += 4) {
        /* extract Y1, Y2, U0, V0 */
        switch (yuvFormat) {
        case YUV_FMT_UYUV:
            y1 = yuvBuffer[i + 1];
            y2 = yuvBuffer[i + 3];
            u0 = yuvBuffer[i] - 128;
            v0 = yuvBuffer[i + 2] - 128;
            break;
        case YUV_FMT_YUYV:
            y1 = yuvBuffer[i];
            y2 = yuvBuffer[i + 2];
            u0 = yuvBuffer[i + 1] - 128;
            v0 = yuvBuffer[i + 3] - 128;
            break;
        default:
            return YUV_FALSE;
        }

        /* convert */
        yuvtoRgbByNormx2(&ru0, &gu0, &bu0, &ru1, &gu1, &bu1, y1, y2, u0, v0);

        /* store */
        if (rgbBuffer != NULL) {
            (rgbBuffer0 + rgbPosition)[0] = bu0;
            (rgbBuffer1 + rgbPosition)[0] = gu0;
            (rgbBuffer2 + rgbPosition)[0] = ru0;
            (rgbBuffer0 + rgbPosition)[1] = bu1;
            (rgbBuffer1 + rgbPosition)[1] = gu1;
            (rgbBuffer2 + rgbPosition)[1] = ru1;
            rgbPosition += 2;
        }
        if (rgbBufferInterleaved != NULL) {
            rgbBufferInterleaved[rgbInterleavedPosition + 0] = bu0;
            rgbBufferInterleaved[rgbInterleavedPosition + 1] = gu0;
            rgbBufferInterleaved[rgbInterleavedPosition + 2] = ru0;
            rgbBufferInterleaved[rgbInterleavedPosition + 3] = bu1;
            rgbBufferInterleaved[rgbInterleavedPosition + 4] = gu1;
            rgbBufferInterleaved[rgbInterleavedPosition + 5] = ru1;
            rgbInterleavedPosition += 6;
        }
    }
    return YUV_TRUE;
}
```
Sample Code for the Neon Conversion Method
```c
/* uyvy */
static const uint8_t uyvy_y_indices[8] = {1, 3, 5, 7, 9, 11, 13, 15};
static const uint8_t uyvy_u_indices[8] = {0, 0, 4, 4, 8, 8, 12, 12};
static const uint8_t uyvy_v_indices[8] = {2, 2, 6, 6, 10, 10, 14, 14};
/* yuyv */
static const uint8_t yuyv_y_indices[8] = {0, 2, 4, 6, 8, 10, 12, 14};
static const uint8_t yuyv_u_indices[8] = {1, 1, 5, 5, 9, 9, 13, 13};
static const uint8_t yuyv_v_indices[8] = {3, 3, 7, 7, 11, 11, 15, 15};

YuvBool yuvToBgrByNeon(
    int yuvFormat, uint32_t width, uint32_t height,
    uint8_t* yuvBuffer, uint8_t* rgbBuffer, uint8_t* rgbBufferInterleaved)
{
    uint8_t* rgbBuffer0 = rgbBuffer;
    uint8_t* rgbBuffer1 = rgbBuffer + width * height;
    uint8_t* rgbBuffer2 = rgbBuffer + width * height * 2;
    uint8x8_t y8, u8, v8;
    uint32_t rgbPosition = 0;
    uint32_t rgbInterleavedPosition = 0;
    uint32_t yuvFrameSize = width * height * 2;

    for (uint32_t i = 0; i < yuvFrameSize; i += 16) {
        /* load data */
        uint8x16_t A = vld1q_u8(yuvBuffer + i);
        switch (yuvFormat) {
        case YUV_FMT_UYUV:
            y8 = vqtbl1_u8(A, vld1_u8(uyvy_y_indices));
            u8 = vqtbl1_u8(A, vld1_u8(uyvy_u_indices));
            v8 = vqtbl1_u8(A, vld1_u8(uyvy_v_indices));
            break;
        case YUV_FMT_YUYV:
            y8 = vqtbl1_u8(A, vld1_u8(yuyv_y_indices));
            u8 = vqtbl1_u8(A, vld1_u8(yuyv_u_indices));
            v8 = vqtbl1_u8(A, vld1_u8(yuyv_v_indices));
            break;
        default:
            return YUV_FALSE;
        }

        /* u8 to f16 (the 128 subtraction works in modular int8 arithmetic) */
        float16x8_t y = vcvtq_f16_u16(vmovl_u8(y8));
        float16x8_t u = vcvtq_f16_s16(vmovl_s8(vsub_s8(vreinterpret_s8_u8(u8), vdup_n_s8(128))));
        float16x8_t v = vcvtq_f16_s16(vmovl_s8(vsub_s8(vreinterpret_s8_u8(v8), vdup_n_s8(128))));

        /* calc yuyv to rgb */
        float16x8_t r = vaddq_f16(y, vmulq_n_f16(v, 1.402));
        float16x8_t g = vsubq_f16(vsubq_f16(y, vmulq_n_f16(v, 0.71414)), vmulq_n_f16(u, 0.34414));
        float16x8_t b = vaddq_f16(y, vmulq_n_f16(u, 1.772));

        /* limit rgb to 0-255 */
        r = vmaxq_f16(vminq_f16(r, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));
        g = vmaxq_f16(vminq_f16(g, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));
        b = vmaxq_f16(vminq_f16(b, vdupq_n_f16(255.0f)), vdupq_n_f16(0.0f));

        /* convert f16 to u8 */
        uint8x8_t r8 = vmovn_u16(vcvtq_u16_f16(r));
        uint8x8_t g8 = vmovn_u16(vcvtq_u16_f16(g));
        uint8x8_t b8 = vmovn_u16(vcvtq_u16_f16(b));

        /* store u8 to rgb buffer */
        if (rgbBuffer != NULL) {
            vst1_u8(rgbBuffer0 + rgbPosition, b8);
            vst1_u8(rgbBuffer1 + rgbPosition, g8);
            vst1_u8(rgbBuffer2 + rgbPosition, r8);
            rgbPosition += 8;
        }
        if (rgbBufferInterleaved != NULL) {
            uint8x8x3_t rgbu8 = {{b8, g8, r8}};
            vst3_u8(rgbBufferInterleaved + rgbInterleavedPosition, rgbu8);
            rgbInterleavedPosition += 24;
        }
    }
    return YUV_TRUE;
}
```
2. Matrix Multiplication
"Amxn X Bnxj = Cmj" refers to the multiplication of an m-by-n matrix A with an n-by-j matrix B, resulting in an m-by-j matrix C.
The optimization approach in this context differs from our customary programming practices, making it susceptible to negative optimization in the code.
The transition from matrix multiplication to vector multiplication
Neon does not natively support matrix multiplication.
As shown in the figure above, the first two columns of the output matrix have been calculated using the traditional approach, and the following pattern can be identified:
The i-th column of the output matrix is obtained by multiplying the x-th column of matrix A by the x-th element of the i-th column of matrix B and summing the results, where 0 <= i < 4 and 0 <= x < 4.
Construct vectors in column units: a0 = (a11, a21, a31, a41), ..., b0 = (b11, b21, b31, b41), .... The calculation can then be written as follows (a scalar reference implementation follows the equations):
a0 × b0[0] + a1 × b0[1] + a2 × b0[2] + a3 × b0[3] → the first column of the resulting matrix
a0 × b1[0] + a1 × b1[1] + a2 × b1[2] + a3 × b1[3] → the second column of the resulting matrix
....
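For reference, the rule above can be written as a plain scalar loop (a hypothetical checking helper, assuming column-major float[16] storage as in the NEON versions below):

```c
/* C[:,i] = sum over x of A[:,x] * B[x][i], all matrices column-major. */
void matrix_mul_reference(const float* A, const float* B, float* C)
{
    for (int i = 0; i < 4; i++)          /* column i of C         */
        for (int r = 0; r < 4; r++) {    /* row r within column i */
            float acc = 0.0f;
            for (int x = 0; x < 4; x++)  /* sum over columns of A */
                acc += A[x * 4 + r] * B[i * 4 + x];
            C[i * 4 + r] = acc;
        }
}
```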
In summary, we can draw the following conclusions:
If the matrix is stored in memory in column-major order, it is directly suitable for SIMD (Single Instruction, Multiple Data) computation.
If the original data is stored in row-major order, NEON's interleaved (scatter/gather) load/store instructions make it easy to convert the data into column-major vectors.
...
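A minimal sketch of such a load (assuming a row-major float[16] matrix M on an AArch64 target):

```c
/* vld4q_f32 loads 16 consecutive floats and de-interleaves with stride 4,
   so cols.val[k] gathers element k of every row of the row-major matrix M,
   i.e. column k of the matrix. */
float32x4x4_t cols = vld4q_f32(M);
/* cols.val[0] = (m00, m10, m20, m30) = column 0
   cols.val[1] = (m01, m11, m21, m31) = column 1, and so on. */
```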
As you can see, each vector ends up holding the data of the corresponding column, i.e. the matrix is now available in column-major vector form.
The first version of multiplication
Assume A, B, and C are in column-major order, with C being the output matrix.
...
The bx and cx vectors are reused in every iteration, so computing column n+1 must wait until column n is completely finished. However, NEON actually provides execution resources that allow some instructions to run in parallel.
The second version of multiplication
```c
float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f,
                   8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f,
                   8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the x-th column of matrix C and store it.
float32x4_t bx, cx;
for (int i = 0; i <= 12; i += 4) {
    bx = vld1q_f32(B + i);
    cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + i, cx);
}
```
...
Similar to the previous version, but with the loop introduced: less code and better readability.
The third version of multiplication
```c
float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);

// Compute the x-th column of matrix C and store it.
for (int i = 0; i <= 12; i += 4) {
    float32x4_t bx = vld1q_f32(B + i);
    float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + i, cx);
}
```
...
It declares bx and cx inside the loop body, minimizing dependence on variables defined outside the loop.
If such dependence cannot be avoided entirely, it can be reduced by splitting the work, for example with nested loops, so that the inner loop carrying most of the work has no cross-iteration dependencies.
The fourth version of multiplication
```c
float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f,
                   8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f,
                   8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
float32_t C[16] = {};

float32x4_t a0 = vld1q_f32(A + 0);
float32x4_t a1 = vld1q_f32(A + 4);
float32x4_t a2 = vld1q_f32(A + 8);
float32x4_t a3 = vld1q_f32(A + 12);
float32x4_t b1, b2, b3, b4;
float32x4_t c1, c2, c3, c4;

// Compute the first column of matrix C and store it
b1 = vld1q_f32(B + 0);
c1 = vmulq_lane_f32(a0, vget_low_f32(b1), 0);
c1 = vmlaq_lane_f32(c1, a1, vget_low_f32(b1), 1);
c1 = vmlaq_lane_f32(c1, a2, vget_high_f32(b1), 0);
c1 = vmlaq_lane_f32(c1, a3, vget_high_f32(b1), 1);
vst1q_f32(C + 0, c1);

// Compute the second column of matrix C and store it
b2 = vld1q_f32(B + 4);
c2 = vmulq_lane_f32(a0, vget_low_f32(b2), 0);
c2 = vmlaq_lane_f32(c2, a1, vget_low_f32(b2), 1);
c2 = vmlaq_lane_f32(c2, a2, vget_high_f32(b2), 0);
c2 = vmlaq_lane_f32(c2, a3, vget_high_f32(b2), 1);
vst1q_f32(C + 4, c2);

// Compute the third column of matrix C and store it
b3 = vld1q_f32(B + 8);
c3 = vmulq_lane_f32(a0, vget_low_f32(b3), 0);
c3 = vmlaq_lane_f32(c3, a1, vget_low_f32(b3), 1);
c3 = vmlaq_lane_f32(c3, a2, vget_high_f32(b3), 0);
c3 = vmlaq_lane_f32(c3, a3, vget_high_f32(b3), 1);
vst1q_f32(C + 8, c3);

// Compute the fourth column of matrix C and store it
b4 = vld1q_f32(B + 12);
c4 = vmulq_lane_f32(a0, vget_low_f32(b4), 0);
c4 = vmlaq_lane_f32(c4, a1, vget_low_f32(b4), 1);
c4 = vmlaq_lane_f32(c4, a2, vget_high_f32(b4), 0);
c4 = vmlaq_lane_f32(c4, a3, vget_high_f32(b4), 1);
vst1q_f32(C + 12, c4);
```
...
bx is split into b1, b2, b3, b4.
cx is split into c1, c2, c3, c4.
As a result, the calculation of each column is independent and can proceed separately. NEON can then schedule several such instructions for simultaneous execution, as in the lines below. (Why? The exact microarchitectural reason is not documented here; but according to the official documentation, although there is no concept of kernel threads as in CUDA, NEON has functional execution pipelines that allow memory loads/stores and computation to proceed concurrently.)
```c
r0 = vmlaq_lane_f32(r0, a1, vget_low_f32(b0), 1);
r1 = vmlaq_lane_f32(r1, a1, vget_low_f32(b1), 1);
r2 = vmlaq_lane_f32(r2, a1, vget_low_f32(b2), 1);
r3 = vmlaq_lane_f32(r3, a1, vget_low_f32(b3), 1);
```
The main instructions used in the code
vmulq_lane_f32(a0, a1, lane_idx)
Vector-by-scalar multiplication, where the scalar value comes from lane lane_idx of a1.
Pseudocode:
result[0] = a0[0] x a1[lane_idx]
result[1] = a0[1] x a1[lane_idx]
result[2] = a0[2] x a1[lane_idx]
result[3] = a0[3] x a1[lane_idx]
vmlaq_lane_f32(a0, a1, a2, lane_idx)
Multiply-accumulate instruction. Pseudocode:
a0 + a1 x a2[lane_idx]
Note the data types; if they are wrong, the compiler will report an error.
This instruction could be replaced with
vaddq_f32(a0, vmulq_lane_f32(a1, a2, lane_idx))
but that turns it into two instructions with a strict dependency between them.
The fused multiply-accumulate instruction therefore achieves better performance.
vget_low_f32
Obtain the first two lanes of the vector:
(a1, a2, a3, a4) --> (a1, a2)
vget_high_f32
Obtain the last two lanes of the vector:
(a1, a2, a3, a4) --> (a3, a4)
Complete Sample Code
```c
#include <stdio.h>
#include <arm_neon.h>

void matrix_mul_version1(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    // Compute the first column of matrix C and store it
    float32x4_t bx = vld1q_f32(B + 0);
    float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 0, cx);

    // Compute the second column of matrix C and store it
    bx = vld1q_f32(B + 4);
    cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 4, cx);

    // Compute the third column of matrix C and store it
    bx = vld1q_f32(B + 8);
    cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 8, cx);

    // Compute the fourth column of matrix C and store it
    bx = vld1q_f32(B + 12);
    cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
    cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
    cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
    vst1q_f32(C + 12, cx);
}

void matrix_mul_version2(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    // Compute the x-th column of matrix C and store it
    float32x4_t bx, cx;
    for (int i = 0; i <= 12; i += 4) {
        bx = vld1q_f32(B + i);
        cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + i, cx);
    }
}

void matrix_mul_version3(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);

    // Compute the x-th column of matrix C and store it
    for (int i = 0; i <= 12; i += 4) {
        float32x4_t bx = vld1q_f32(B + i);
        float32x4_t cx = vmulq_lane_f32(a0, vget_low_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a1, vget_low_f32(bx), 1);
        cx = vmlaq_lane_f32(cx, a2, vget_high_f32(bx), 0);
        cx = vmlaq_lane_f32(cx, a3, vget_high_f32(bx), 1);
        vst1q_f32(C + i, cx);
    }
}

void matrix_mul_version4(float32_t* A, float32_t* B, float32_t* C)
{
    float32x4_t a0 = vld1q_f32(A + 0);
    float32x4_t a1 = vld1q_f32(A + 4);
    float32x4_t a2 = vld1q_f32(A + 8);
    float32x4_t a3 = vld1q_f32(A + 12);
    float32x4_t b1, b2, b3, b4;
    float32x4_t c1, c2, c3, c4;

    // Compute the first column of matrix C and store it
    b1 = vld1q_f32(B + 0);
    c1 = vmulq_lane_f32(a0, vget_low_f32(b1), 0);
    c1 = vmlaq_lane_f32(c1, a1, vget_low_f32(b1), 1);
    c1 = vmlaq_lane_f32(c1, a2, vget_high_f32(b1), 0);
    c1 = vmlaq_lane_f32(c1, a3, vget_high_f32(b1), 1);
    vst1q_f32(C + 0, c1);

    // Compute the second column of matrix C and store it
    b2 = vld1q_f32(B + 4);
    c2 = vmulq_lane_f32(a0, vget_low_f32(b2), 0);
    c2 = vmlaq_lane_f32(c2, a1, vget_low_f32(b2), 1);
    c2 = vmlaq_lane_f32(c2, a2, vget_high_f32(b2), 0);
    c2 = vmlaq_lane_f32(c2, a3, vget_high_f32(b2), 1);
    vst1q_f32(C + 4, c2);

    // Compute the third column of matrix C and store it
    b3 = vld1q_f32(B + 8);
    c3 = vmulq_lane_f32(a0, vget_low_f32(b3), 0);
    c3 = vmlaq_lane_f32(c3, a1, vget_low_f32(b3), 1);
    c3 = vmlaq_lane_f32(c3, a2, vget_high_f32(b3), 0);
    c3 = vmlaq_lane_f32(c3, a3, vget_high_f32(b3), 1);
    vst1q_f32(C + 8, c3);

    // Compute the fourth column of matrix C and store it
    b4 = vld1q_f32(B + 12);
    c4 = vmulq_lane_f32(a0, vget_low_f32(b4), 0);
    c4 = vmlaq_lane_f32(c4, a1, vget_low_f32(b4), 1);
    c4 = vmlaq_lane_f32(c4, a2, vget_high_f32(b4), 0);
    c4 = vmlaq_lane_f32(c4, a3, vget_high_f32(b4), 1);
    vst1q_f32(C + 12, c4);
}

static void print_matrix_memory(const char* tag, float32_t* matrix)
{
    printf("output matrix content for %s:\n", tag);
    printf("%.4f %.4f %.4f %.4f\n", matrix[0], matrix[1], matrix[2], matrix[3]);
    printf("%.4f %.4f %.4f %.4f\n", matrix[4], matrix[5], matrix[6], matrix[7]);
    printf("%.4f %.4f %.4f %.4f\n", matrix[8], matrix[9], matrix[10], matrix[11]);
    printf("%.4f %.4f %.4f %.4f\n\n", matrix[12], matrix[13], matrix[14], matrix[15]);
}

int main(int argc, const char* argv[])
{
    float32_t A[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f,
                       8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
    float32_t B[16] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f,
                       8.0f, 9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
    float32_t C[16] = {};

    matrix_mul_version1(A, B, C);
    print_matrix_memory("version1 C", C);

    matrix_mul_version2(A, B, C);
    print_matrix_memory("version2 C", C);

    matrix_mul_version3(A, B, C);
    print_matrix_memory("version3 C", C);

    matrix_mul_version4(A, B, C);
    print_matrix_memory("version4 C", C);

    return 0;
}
```
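All four versions compute the same product, so the four printed matrices should be identical; the versions differ only in how much instruction-level parallelism they expose. Building this sample is assumed to use an AArch64 toolchain with something like gcc -O3; treat the exact flags as an assumption rather than part of the original document.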