Microsoft on Friday released version 1.14 of its ONNX Runtime, a cross-platform, high-performance machine learning inferencing and training accelerator.
ONNX Runtime 1.14 adds support for ONNX 1.13. ONNX is the open standard for machine learning interoperability, and version 1.13 was released in December with Python 3.11 support, Apple Silicon M1/M2 processor support, new operators, extensions to existing operators, and other enhancements.
ONNX Runtime 1.14 also brings threading improvements, including making the ORT thread pool NUMA-aware, along with a refactoring of the multi-stream execution provider, NVIDIA CUDA EP performance improvements, and performance improvements to various operators.
On the performance front, one of the most exciting changes is support for quantization with Advanced Matrix Extensions (AMX) on the new 4th Gen Intel Xeon Scalable “Sapphire Rapids” processors. ONNX Runtime can already be built against Intel’s oneDNN library, which supports AMX (using the --use_dnnl build argument for ONNX Runtime). With ONNX Runtime 1.14, the quantization code can also make use of AMX instructions directly. This commit shows the big AMX speed-ups in the quantized GEMM code compared to just using AVX-512 VNNI.
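
For readers wondering whether a given Linux machine would benefit from the AMX path at all, here is a minimal sketch that inspects the CPU feature flags the kernel reports in /proc/cpuinfo. The helper names `cpu_flags` and `int8_gemm_path` are hypothetical and made up for this illustration. This is not how ONNX Runtime itself selects its kernels; it only checks for the `amx_tile`, `amx_int8`, and `avx512_vnni` flag names that Linux exposes.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def int8_gemm_path(flags: set[str]) -> str:
    """Hypothetical helper: name the best int8 instruction set the flags advertise."""
    # AMX int8 matrix multiplies need both the tile registers and the int8 ops.
    if {"amx_tile", "amx_int8"} <= flags:
        return "amx"
    if "avx512_vnni" in flags:
        return "avx512_vnni"
    return "other"


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    # On non-Linux systems the file is absent, so fall back to an empty flag set.
    text = path.read_text() if path.exists() else ""
    print(int8_gemm_path(cpu_flags(text)))
```

A Sapphire Rapids system running a kernel with AMX enabled should report the AMX flags, while older Xeons with only AVX-512 VNNI would fall into the second branch.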
AMX instructions accelerate quantized GEMM significantly:
Prepacked B perf numbers (latency in ns)
GEMM Config | AVX512Vnni | AMX
--- | --: | --:
M:384/N:1024/K:1024/Batch:1/Threads:4 | 1057511 | 285393
M:384/N:1024/K:3072/Batch:1/Threads:4 | 2643929 | 700397
M:384/N:1024/K:4096/Batch:1/Threads:4 | 3784750 | 890701
M:384/N:4096/K:1024/Batch:1/Threads:4 | 2378139 | 887251
M:384/N:1024/K:1024/Batch:1/Threads:16 | 307137 | 138481
M:384/N:1024/K:3072/Batch:1/Threads:16 | 855730 | 295027
M:384/N:1024/K:4096/Batch:1/Threads:16 | 1126878 | 317395
M:384/N:4096/K:1024/Batch:1/Threads:16 | 781963 | 237014
M:1536/N:1024/K:1024/Batch:1/Threads:16 | 538864 | 181459
M:1536/N:1024/K:3072/Batch:1/Threads:16 | 1681002 | 561600
M:1536/N:1024/K:4096/Batch:1/Threads:16 | 2158127 | 717470
M:1536/N:4096/K:1024/Batch:1/Threads:16 | 2428622 | 896140
M:3072/N:1024/K:1024/Batch:1/Threads:16 | 1058029 | 357031
M:3072/N:1024/K:3072/Batch:1/Threads:16 | 3138504 | 1095857
M:3072/N:1024/K:4096/Batch:1/Threads:16 | 4155640 | 1386183
M:3072/N:4096/K:1024/Batch:1/Threads:16 | 4679030 | 1778624
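
The raw latencies are easier to compare as ratios. The sketch below is my own arithmetic rather than anything from the commit: it divides the AVX512-VNNI latency by the AMX latency for each row of the table above. The `ROWS` list and the `speedups` function are names introduced only for this example.

```python
from statistics import geometric_mean

# (M, N, K, threads, AVX512-VNNI latency in ns, AMX latency in ns),
# copied from the table above. Batch is 1 in every row.
ROWS = [
    (384, 1024, 1024, 4, 1057511, 285393),
    (384, 1024, 3072, 4, 2643929, 700397),
    (384, 1024, 4096, 4, 3784750, 890701),
    (384, 4096, 1024, 4, 2378139, 887251),
    (384, 1024, 1024, 16, 307137, 138481),
    (384, 1024, 3072, 16, 855730, 295027),
    (384, 1024, 4096, 16, 1126878, 317395),
    (384, 4096, 1024, 16, 781963, 237014),
    (1536, 1024, 1024, 16, 538864, 181459),
    (1536, 1024, 3072, 16, 1681002, 561600),
    (1536, 1024, 4096, 16, 2158127, 717470),
    (1536, 4096, 1024, 16, 2428622, 896140),
    (3072, 1024, 1024, 16, 1058029, 357031),
    (3072, 1024, 3072, 16, 3138504, 1095857),
    (3072, 1024, 4096, 16, 4155640, 1386183),
    (3072, 4096, 1024, 16, 4679030, 1778624),
]


def speedups(rows):
    """Return AVX512-VNNI latency divided by AMX latency for each row."""
    return [vnni_ns / amx_ns for *_, vnni_ns, amx_ns in rows]


if __name__ == "__main__":
    ratios = speedups(ROWS)
    for (m, n, k, threads, _, _), ratio in zip(ROWS, ratios):
        print(f"M={m} N={n} K={k} threads={threads}: {ratio:.2f}x")
    print(
        f"min {min(ratios):.2f}x, max {max(ratios):.2f}x, "
        f"geometric mean {geometric_mean(ratios):.2f}x"
    )
```

On these figures the AMX path comes out roughly 2.2x to 4.2x faster than AVX512-VNNI, with the largest gain at M:384/N:1024/K:4096 on four threads.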
In case you missed my Sapphire Rapids AMX benchmark article a few weeks ago, see Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids. I’ll be working on some ONNX Runtime 1.14 benchmarks soon.
ONNX Runtime 1.14 also adds AMD ROCm 5.4 support, OpenVINO 2022.3 support, DirectML 1.10.1 support, and a variety of other changes.
More details on Microsoft’s ONNX Runtime 1.14 open-source release are available via GitHub.