The NPU operates at a clock rate of 900 MHz, delivering computing performance of up to 4.1 TOPS (Trillion Operations Per Second). Optimized for AI models based on convolutional neural networks, it includes a Parallel Processing Unit (PPU) with 32-bit floating-point pipelining and threading.
Listed below are typical models run on the NPU of the C3V platform, together with basic performance data measured for reference.
Here is our test environment:
- C3V CPU: quad-core Cortex-A55 @ 1.5 GHz with 4 GB DRAM
- C3V NPU: VIP9000 @ 900 MHz with 128 MB reserved memory
- NN tools: ACUITY v6.30.x
- NPU kernel driver: v6.4.18.5
- NN model quantization type: int8
Here are our test results:
| Model | Model size | Input shape | Total DDR read (MBytes) | Total DDR write (MBytes) | Average inference time (ms) | Frame rate without other latency (fps) |
| --- | --- | --- | --- | --- | --- | --- |
| AlexNet (.onnx) | 233 MB | [1, 3, 224, 224] | 47.03 | 1.31 | 8.41 | 118.91 |
| Inception-v1 (.onnx) | 27 MB | [1, 3, 224, 224] | 16.7 | 5.24 | 3.97 | 251.89 |
| Inception-v2 (.onnx) | 43 MB | [1, 3, 224, 224] | 14.47 | 1.84 | 7.68 | 130.21 |
| MobileNet-v2 (.onnx) | 14 MB | [1, 3, 224, 224] | 5.25 | 1.24 | 1.94 | 515.46 |
| EfficientNet-Lite4 (.onnx) | 50 MB | [1, 3, 224, 224] | 15.69 | 4.68 | 5.00 | 200.00 |
| ResNet-50 (.onnx) | 98 MB | [1, 3, 224, 224] | 39.61 | 13.28 | 16.29 | 61.39 |
| SqueezeNet (.onnx) | 4.8 MB | [1, 3, 224, 224] | 2.33 | 0.37 | 1.29 | 775.19 |
| VGG-16 (.onnx) | 528 MB | [1, 3, 224, 224] | 121.06 | 6.97 | 22.26 | 44.92 |
| DenseNet-121 (.onnx) | 32 MB | [1, 3, 224, 224] | 26.55 | 8.86 | 21.12 | 47.35 |
| GoogleNet (.onnx) | 27 MB | [1, 3, 224, 224] | 15.02 | 4.89 | 3.64 | 274.73 |
| CaffeNet (.onnx) | 233 MB | [1, 3, 224, 224] | 46.13 | 0.37 | 7.09 | 141.04 |
| ShuffleNet-v2 (.onnx) | 8.8 MB | [1, 3, 224, 224] | 4.14 | 1.93 | 2.09 | 478.47 |
| SSD-MobilenetV1 (.tflite) | 26.2 MB | [1, 320, 320, 3] | 11.34 | 5.21 | 5.97 | 167.50 |
| SSD-MobilenetV2 (.tflite) | 17.1 MB | [1, 320, 320, 3] | 12.21 | 6.04 | 5.17 | 193.42 |
| YOLO-v2 (.onnx) | 203.9 MB | [1, 3, 416, 416] | 47.16 | 6.70 | 11.50 | 86.96 |
| YOLO-v5s (.onnx) | 27.9 MB | [1, 3, 640, 640] | 87.91 | 46.65 | 43.64 | 22.91 |
| YOLO-v5s-seg (.onnx) | 29.4 MB | [1, 3, 640, 640] | 130.79 | 78.22 | 58.46 | 17.11 |
| YOLO-v8s-seg (.onnx) | 45 MB | [1, 3, 640, 640] | 163.19 | 101.29 | 64.45 | 15.52 |
| ArcFace (.onnx) | 248.9 MB | [1, 3, 112, 112] | 46.19 | 5.32 | 17.37 | 57.57 |
| DeepLab-v3p (.onnx) | 22.1 MB | [1, 3, 640, 640] | 385.65 | 129.15 | 107.76 | 9.28 |
| 3DDFA (.onnx) | 12.4 MB | [1, 3, 120, 120] | 2.03 | 0.35 | 0.55 | 1818.18 |
| YOLO-v10n (.onnx) | 9.39 MB | [1, 3, 640, 640] | 3204.12 | 3186.14 | 6477.36 | 0.15 |
| YOLO-v10s (.onnx) | 29.2 MB | [1, 3, 640, 640] | 3258.21 | 3219.47 | 6513.48 | 0.15 |
| YOLO-v10n - postprocess (.onnx) | 9.39 MB | [1, 3, 640, 640] | 46.92 | 33.88 | 36.31 | 27.54 |
| YOLO-v10s - postprocess (.onnx) | 29.2 MB | [1, 3, 640, 640] | 102.23 | 68.58 | 68.81 | 14.53 |
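The last column is derived directly from the average inference time: it is the theoretical throughput if NPU inference were the only stage in the pipeline. A minimal Python sketch of that conversion (the helper name is ours, not part of any toolkit):

```python
def fps_from_inference_time(avg_inference_ms: float) -> float:
    """Theoretical frame rate assuming NPU inference is the only latency
    (no pre/post-processing or data-transfer overhead)."""
    return 1000.0 / avg_inference_ms

# Example: MobileNet-v2 averages 1.94 ms per inference on the C3V NPU.
print(round(fps_from_inference_time(1.94), 2))  # -> 515.46 (fps)
```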
Note:
"xxx - postprocess" denotes the model with its post-processing stage removed, i.e. the graph was truncated by setting --outputs to '/model.23/Transpose_output_0' (see the sketch after this note).
For more detailed performance data about YOLOv8, please refer here.
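The --outputs option above belongs to the model-conversion tooling used in our tests. For illustration, the same truncation can also be done on the ONNX file itself before conversion, using the standard onnx Python API. A minimal sketch, assuming a YOLOv10 export whose input tensor is named "images" (the file names are placeholders, not files shipped with the platform):

```python
import onnx.utils

# Extract the sub-graph that ends at the tensor cited in the note above,
# effectively dropping the post-processing nodes that follow it.
onnx.utils.extract_model(
    "yolov10n.onnx",                                 # placeholder input model
    "yolov10n_no_postprocess.onnx",                  # placeholder output model
    input_names=["images"],                          # assumed input tensor name
    output_names=["/model.23/Transpose_output_0"],   # cut point from the note above
)
```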