NN Models' Performance with NPU

The NPU runs at a clock rate of 900 MHz and delivers up to 4.1 TOPS (trillion operations per second) of computing performance. It is optimized for AI models based on convolutional neural networks and includes a Parallel Processing Unit (PPU) with 32-bit floating-point pipelining and threading.

The typical models listed below were run on the NPU of the C3V platform. Their basic performance data is provided for your reference.

Here is our test environment:

  • C3V CPU uses Quad-CA55@1.5GHz with 4GB DRAM.

  • C3V NPU uses VIP9000@900MHz with 128MB reserved memory.

  • NN Tools: ACUITY v6.30.x

  • NPU Kernel driver: v6.4.18.5

  • NN model quantization type: int8

Here are the test results:

| Model | Model size (MB) | Input shape [n, c, h, w] | Total DDR read BW (MBytes) | Total DDR write BW (MBytes) | Average inference time (ms) | Frame rate without other latency (fps) |
|---|---|---|---|---|---|---|
| AlexNet (.onnx) | 233 | [1, 3, 224, 224] | 47.03 | 1.31 | 8.41 | 118.91 |
| Inception-v1 (.onnx) | 27 | [1, 3, 224, 224] | 16.7 | 5.24 | 3.97 | 251.89 |
| Inception-v2 (.onnx) | 43 | [1, 3, 224, 224] | 14.47 | 1.84 | 7.68 | 130.21 |
| MobileNet-v2 (.onnx) | 14 | [1, 3, 224, 224] | 5.25 | 1.24 | 1.94 | 515.46 |
| EfficientNet-Lite4 (.onnx) | 50 | [1, 3, 224, 224] | 15.69 | 4.68 | 5.00 | 200.00 |
| ResNet-50 (.onnx) | 98 | [1, 3, 224, 224] | 39.61 | 13.28 | 16.29 | 61.39 |
| SqueezeNet (.onnx) | 4.8 | [1, 3, 224, 224] | 2.33 | 0.37 | 1.29 | 775.19 |
| VGG-16 (.onnx) | 528 | [1, 3, 224, 224] | 121.06 | 6.97 | 22.26 | 44.92 |
| DenseNet-121 (.onnx) | 32 | [1, 3, 224, 224] | 26.55 | 8.86 | 21.12 | 47.35 |
| GoogleNet (.onnx) | 27 | [1, 3, 224, 224] | 15.02 | 4.89 | 3.64 | 274.73 |
| CaffeNet (.onnx) | 233 | [1, 3, 224, 224] | 46.13 | 0.37 | 7.09 | 141.04 |
| ShuffleNet-v2 (.onnx) | 8.8 | [1, 3, 224, 224] | 4.14 | 1.93 | 2.09 | 478.47 |
| SSD-MobilenetV1 (.tflite) | 26.2 | [1, 320, 320, 3] | 11.34 | 5.21 | 5.97 | 167.50 |
| SSD-MobilenetV2 (.tflite) | 17.1 | [1, 320, 320, 3] | 12.21 | 6.04 | 5.17 | 193.42 |
| YOLO-v2 (.onnx) | 203.9 | [1, 3, 416, 416] | 47.16 | 6.70 | 11.50 | 86.96 |
| YOLO-v5s (.onnx) | 27.9 | [1, 3, 640, 640] | 87.91 | 46.65 | 43.64 | 22.91 |
| YOLO-v5s-seg (.onnx) | 29.4 | [1, 3, 640, 640] | 130.79 | 78.22 | 58.46 | 17.11 |
| YOLO-v8s-seg (.onnx) | 45 | [1, 3, 640, 640] | 163.19 | 101.29 | 64.45 | 15.52 |
| ArcFace (.onnx) | 248.9 | [1, 3, 112, 112] | 46.19 | 5.32 | 17.37 | 57.57 |
| DeepLab-v3p (.onnx) | 22.1 | [1, 3, 640, 640] | 385.65 | 129.15 | 107.76 | 9.28 |
| 3DDFA (.onnx) | 12.4 | [1, 3, 120, 120] | 2.03 | 0.35 | 0.55 | 1818.18 |
| YOLO-v10n (.onnx) | 9.39 | [1, 3, 640, 640] | 3204.12 | 3186.14 | 6477.36 | 0.15 |
| YOLO-v10s (.onnx) | 29.2 | [1, 3, 640, 640] | 3258.21 | 3219.47 | 6513.48 | 0.15 |
| YOLO-v10n - postprocess | 9.39 | [1, 3, 640, 640] | 46.92 | 33.88 | 36.31 | 27.54 |
| YOLO-v10s - postprocess | 29.2 | [1, 3, 640, 640] | 102.23 | 68.58 | 68.81 | 14.53 |

Note:

  1. A row labeled “xxx - postprocess” reports the model with its post-processing stage removed (--outputs set to '/model.23/Transpose_output_0').

  2. For more detailed performance data on YOLOv8, please refer here.
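
The "Frame rate without other latency" figures in the test results appear to be the reciprocal of the average inference time, that is, 1000 divided by the time in milliseconds. The short Python sketch below illustrates that relationship for a few rows copied from the results. The function name `frames_per_second` and the `samples` dictionary are illustrative only and are not part of any NPU tool or SDK.

```python
def frames_per_second(inference_ms: float) -> float:
    """Convert an average inference time in milliseconds to frames per second.

    Assumes one frame per inference and no pre-/post-processing or I/O latency.
    """
    if inference_ms <= 0:
        raise ValueError("inference time must be positive")
    return 1000.0 / inference_ms


# (average inference time in ms, reported fps) pairs copied from the test results.
samples = {
    "MobileNet-v2": (1.94, 515.46),
    "ResNet-50": (16.29, 61.39),
    "YOLO-v5s": (43.64, 22.91),
}

for name, (ms, reported) in samples.items():
    print(f"{name}: {frames_per_second(ms):.2f} fps (reported: {reported})")
```

In a real pipeline, image capture, pre-processing, and post-processing add latency, so the end-to-end frame rate will be lower than these values.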