December 09, 2024

SiFive Accelerates RISC-V Vector Integration in XNNPACK for Optimized AI Inference

In this blog, we’ll begin by introducing XNNPACK and exploring its current status on RISC-V platforms. Next, to bridge the technical gap for contributors, we will provide a step-by-step guide on integrating RVV-optimized microkernels into XNNPACK, using F32-GEMM (single-precision floating-point general matrix multiplication) as a practical example. Finally, we will highlight the performance improvements achieved through these optimizations.

XNNPACK and the status of the RVV backend

XNNPACK is a crucial library-based solution for neural network inference on ARM, x86, and RISC-V platforms. It serves as a low-level acceleration backend for machine learning frameworks, such as TensorFlow Lite, PyTorch, ONNX Runtime, and MediaPipe. XNNPACK enhances performance by decomposing operations into microkernels and applying target-specific optimizations for each architecture.

Historically, XNNPACK offered limited support for the RISC-V Vector (RVV) extension, providing only a small number of RVV-optimized microkernels. As a result, RISC-V users often had to fall back on generic C implementations. To make use of vector hardware, they could rely only on compiler auto-vectorization, or on translation headers and tools that adapt intrinsics written for other platforms.

To enhance AI inference performance on RISC-V, SiFive has contributed several RVV-optimized floating-point microkernels to XNNPACK. They are listed in Table 1. These contributions significantly improve performance, making RISC-V a more viable platform for neural network inference. We welcome everyone to join us in XNNPACK RVV backend contributions.

| Microkernel Name | Description |
| --- | --- |
| F32-GEMM | Float 32-bit general matrix multiplication microkernel |
| F32-IGEMM | Float 32-bit indirect general matrix multiplication microkernel, used for convolution |
| X32-PackW | Float or integer 32-bit weight packing |
| F32-rmin | Float 32-bit reduced min |
| F32-rmax | Float 32-bit reduced max |
| F32-rminmax | Float 32-bit reduced min and max |
| F32-vadd | Float 32-bit binary addition |
| F32-vsub | Float 32-bit binary subtract |
| F32-vrsub | Float 32-bit binary reverse-subtract |
| F32-vmul | Float 32-bit binary multiplication |
| F32-vdiv | Float 32-bit binary divide |
| F32-vrdiv | Float 32-bit binary reverse-divide |
| F32-vmax | Float 32-bit binary max |
| F32-vmin | Float 32-bit binary min |
| F32-vsqrdiff | Float 32-bit binary squared-of-difference |
| X32_transpose | Float or integer 32-bit transpose |
| F32_raddstoreexpminusmax | Float 32-bit exponent for softmax operation |
Table 1. List of RVV-optimized microkernels contributed by SiFive.

How to contribute an RVV microkernel to XNNPACK

XNNPACK offers hundreds of microkernels to support neural network inference, with many neural network operations relying on one or more of these microkernels. For example, a fully_connected operator uses both weight-packing and GEMM (General Matrix Multiplication) microkernels. XNNPACK allows developers to implement target-specific microkernels, such as RVV (RISC-V Vector) versions of GEMM. If RVV and matrix extensions are available, alternative versions of GEMM microkernels can be developed.

Steps to contribute RVV microkernels:

  1. Identify the Operation and Relevant Microkernels
    Start by determining which neural network operation and data type to optimize. Then, identify the relevant microkernels for that operation.
  2. Analyze the Microkernel Requirements
    Understand the data layout assumptions and structure of the microkernel. Reviewing the scalar version can reveal important details. Note that certain data layouts are adjustable using microkernel parameters.
  3. Plan Vectorization Strategy and Select microkernel parameters
    Develop a vectorization strategy and decide the suitable microkernel parameters to maximize performance.
  4. Implement the Target-Specific Microkernel Template
    Rather than implementing each microkernel directly in C/C++, XNNPACK is designed around target-specific templates, which increases flexibility. For example, a single template can support different LMUL settings or register-tiling configurations. Once the template is complete, use a Python-based generator to produce the C/C++ microkernels.
  5. Test and Benchmark the Microkernels
    Validate each generated microkernel with unit tests and benchmark to find the most efficient version.
  6. Enable the Optimal Microkernel in Configuration
    After identifying the best-performing microkernel, enable it in the configuration to ensure the operation uses the optimal microkernel at runtime.

Example: XNNPACK F32-GEMM optimization with RVV.

In this section, let’s take a floating-point fully_connected operation as an example. It is widely used in large language models and is often the performance bottleneck.
We follow the steps from the previous section:

  1. Identify the Operation and Relevant Microkernels

    Figure 1 The flow chart of floating-point fully_connected operation.

    Based on Figure 1, there are two kinds of microkernels used in floating-point fully_connected operation. The first one is F32-GEMM. The other one is X32-PACKW, which stands for 32-bit weight packing. To simplify the article, we’ll focus on F32-GEMM in this blog.

  2. Analyze the Microkernel Requirements
    The GEMM microkernel has two input data layouts that must be understood for optimal implementation. Additionally, there are four key microkernel parameters—mr, nr, kr, and sr—that define its operation.

  • mr and nr determine the maximum output size for each iteration through the k dimension.

  • The Left-Hand Side (LHS) input, A', has a data layout of [mr, k]. This means the F32-GEMM microkernel processes only a partial segment of A, with the m dimension of the input always being less than or equal to mr.

  • The Right-Hand Side (RHS) input, packed_weight, has a layout influenced by nr, kr, and sr. Each packed_weight tile begins with nr bias values, followed by [k, nr] weight data, with its layout determined by kr and sr.

  • Figure 2 illustrates the data layout of packed_weight when nr=8 and kr=sr=1. In this configuration, there are ceil(n / nr) packed_weight tiles, that is, round_up(n, nr) / nr. Each tile begins with nr bias values, followed by [k, nr] weights arranged in a row-major layout.

    Figure 2 The data layout of packed_weight when nr=8 and kr=sr=1, consisting of ceil(n / nr) tiles.
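The layout described above can be sketched in a few lines of Python. This is only an illustrative model for kr = sr = 1, not XNNPACK's X32-PACKW code: the function name `pack_weights`, the [k][n] orientation of the unpacked weights, and the zero padding of the last tile are assumptions made for this sketch.

```python
def pack_weights(bias, weights, nr):
    """Pack bias [n] and weights [k][n] into tiles for kr = sr = 1.

    Each tile holds nr bias values followed by k rows of nr weights.
    Columns past n in the last tile are zero-padded (an assumption of
    this sketch).
    """
    n = len(bias)
    num_tiles = (n + nr - 1) // nr  # ceil(n / nr)
    packed = []
    for t in range(num_tiles):
        cols = range(t * nr, (t + 1) * nr)
        # nr bias values open the tile.
        packed.extend(bias[c] if c < n else 0.0 for c in cols)
        # Then k rows of nr weights, row-major.
        for row in weights:
            packed.extend(row[c] if c < n else 0.0 for c in cols)
    return packed


# Tiny example: k = 2, n = 3, nr = 2 gives two tiles of (1 + k) * nr = 6 values.
bias = [10.0, 20.0, 30.0]
weights = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
packed = pack_weights(bias, weights, nr=2)
print(packed)
```

Each tile is self-contained, so a microkernel can walk through one tile with purely sequential loads: bias first, then one row of nr weights per step along k.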

  3. Plan Vectorization Strategy and Select microkernel parameters
    Here, we aim to implement an outer-product GEMM, setting kr = sr = 1. Our strategy vectorizes along the n dimension of the packed weight and output matrices. We use mr vector register groups as accumulators, along with one vector register group to load the packed weights. The application vector length (AVL) is nr, which is determined by the vector length (VLEN) and the vector length multiplier (LMUL), with 32 being the element width in bits:
nr = VLEN * LMUL / 32.

The values of `mr` and `LMUL` are defined by the input arguments of the template.
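As a quick check of the formula, the snippet below evaluates nr for a few VLEN values at LMUL = 4. The helper name `nr_for` is hypothetical, and the sketch only covers integer LMUL values.

```python
def nr_for(vlen_bits, lmul, sew_bits=32):
    """Number of sew_bits-wide elements in one vector register group."""
    return vlen_bits * lmul // sew_bits


for vlen in (128, 256, 512, 1024):
    print(f"VLEN={vlen}, LMUL=4 -> nr={nr_for(vlen, 4)}")
```

Because nr scales with VLEN, the same template yields wider tiles on hardware with longer vectors.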

Let’s explain the algorithm with an example: assume mr = 7, nr = 64 (with LMUL = 4, VLEN = 512), and k = 3. After iterating through all values of k, we aim to produce mr x nr results.

Step 1 (Figure 3):

The first step is to load the nr bias values into the vector register group. These bias values are then duplicated across the mr accumulator vector register groups.

Figure 3 RVV F32-GEMM outer product algorithm (part 1).

Step 2 (Figure 4):

In each iteration over k, we load the mr scalar values into scalar registers (a0 to a6). Then, we perform a vector load of the weights (v_w_k), followed by mr vector floating-point multiply-accumulate (vfmacc) instructions to compute the partial sums of the output.

Figure 4 RVV F32-GEMM outer product algorithm (part 2).

Step 3 (Figure 5):

After iterating through all values of k, we obtain output results with dimensions [mr, nr]. These results are held in mr vector register groups, with each register group containing nr elements. Finally, the results are written to the output.

Figure 5 RVV F32-GEMM outer product algorithm (part 3).

Steps 1 to 3 describe the procedure for calculating [mr, nr] results. This process needs to be repeated for all the tiles in packed_weight to obtain the entire output [mr, n].

Note: If n is not evenly divisible by nr, the AVL must be changed from nr to n mod nr when processing the last output tile.
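Steps 1 to 3, together with the remainder handling in the note, can be modelled in plain Python. This is a scalar sketch of the data flow only, not the RVV intrinsics implementation: each Python list of length avl stands in for one vector register group, the innermost loop stands in for a single vfmacc, and the function name `gemm_outer_product` is an assumption of this example. The packed input follows the tile layout described earlier (nr bias values, then k rows of nr weights).

```python
def gemm_outer_product(a, packed, k, n, nr):
    """Compute out[mr][n] = bias[n] + a[mr][k] @ w[k][n] from packed weights."""
    mr = len(a)
    out = [[0.0] * n for _ in range(mr)]
    tile_size = (1 + k) * nr
    n_done = 0
    tile = 0
    while n_done < n:
        avl = min(nr, n - n_done)  # shorter on the last tile if n % nr != 0
        base = tile * tile_size
        # Step 1: load the bias and copy it into the mr accumulators.
        v_bias = packed[base:base + avl]
        acc = [list(v_bias) for _ in range(mr)]
        # Step 2: per step along k, one weight load and mr multiply-accumulates.
        for kk in range(k):
            w_off = base + nr + kk * nr
            v_w = packed[w_off:w_off + avl]
            for m in range(mr):
                a_mk = a[m][kk]  # scalar operand held in a scalar register
                for j in range(avl):  # models one vfmacc over avl lanes
                    acc[m][j] += a_mk * v_w[j]
        # Step 3: store the mr accumulators to the output.
        for m in range(mr):
            out[m][n_done:n_done + avl] = acc[m]
        n_done += avl
        tile += 1
    return out


# k = 2, n = 3, nr = 2: bias = [10, 20, 30], w = [[1, 2, 3], [4, 5, 6]].
packed = [10.0, 20.0, 1.0, 2.0, 4.0, 5.0,   # tile 0
          30.0, 0.0, 3.0, 0.0, 6.0, 0.0]    # tile 1 (zero-padded)
a = [[1.0, 2.0],
     [0.0, 1.0]]
out = gemm_outer_product(a, packed, k=2, n=3, nr=2)
print(out)  # [[19.0, 32.0, 45.0], [14.0, 25.0, 36.0]]
```

Note that the weight vector is loaded once per step along k and reused by all mr accumulators, which is why a larger mr improves the ratio of arithmetic to loads.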

  4. Implement the Target-Specific Microkernel Template
    We developed an RVV F32-GEMM template that uses mr and LMUL as configurable parameters.
  5. Test and Benchmark the Microkernels
    By generating code with various combinations of mr and LMUL, we could comprehensively test and benchmark performance across multiple configurations.
  6. Enable the Optimal Microkernel in Configuration
    Based on benchmark results, we chose to enable in gemm-config an implementation with mr=7 and LMUL=4, which maximizes vector register utilization and delivers optimal performance. However, gemm-config is relatively complicated: to enable RVV F32-GEMM there, we also need to implement the F32-IGEMM (indirect GEMM) and corresponding X32-PACKW (weight packing) microkernels.
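One way to see why mr=7 with LMUL=4 fills the register file is a simple budget check. RVV provides 32 vector registers, and a register group at a given LMUL occupies LMUL of them. The sketch below assumes the kernel needs exactly mr accumulator groups plus one weight group and no other vector temporaries; the helper names are hypothetical.

```python
NUM_VECTOR_REGISTERS = 32  # RVV architectural registers v0..v31


def registers_used(mr, lmul):
    """Registers taken by mr accumulator groups plus one weight group."""
    return (mr + 1) * lmul


def max_mr(lmul):
    """Largest mr whose accumulators and weight group fit in the register file."""
    return NUM_VECTOR_REGISTERS // lmul - 1


for lmul in (1, 2, 4, 8):
    mr = max_mr(lmul)
    print(f"LMUL={lmul}: max mr={mr}, registers used={registers_used(mr, lmul)}")
```

Under this model several (mr, LMUL) pairs use all 32 registers, so the budget alone does not pick a winner; the choice among them still comes from benchmarking.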

Performance Results

This section presents two types of experiments. The first involves microkernel-level benchmarks using the XNNPACK benchmark framework, while the second focuses on end-to-end model-level benchmarks through TensorFlow Lite. Both sets of experiments are conducted on the SiFive Intelligence X390, which supports the RISC-V Vector extension with a vector length (VLEN) of 1024. Three settings are evaluated:

  • Setting 1: Pure scalar microkernels without auto-vectorization
  • Setting 2: Pure scalar microkernels with auto-vectorization
  • Setting 3: RVV intrinsic microkernels

OP-level


Figure 6. XNNPACK microkernel benchmark: performance speedup under different configurations on the SiFive Intelligence X390.

From the results in Figure 6, we observe that auto-vectorization provides only limited performance improvement. The primary reason is that most scalar code unrolls its inner loop by a small factor to emulate the SIMD behavior commonly seen on x86 or ARM platforms. For example, in one of the reduced max microkernels, xnn_f32_rmax_ukernel__scalar_u4_acc4, the inner loop performs 4 binary max operations in parallel. This results in the auto-vectorizer generating code with a very small application vector length (AVL) of 4, which underutilizes the vector unit when the vector length is large. To improve auto-vectorization performance, the scalar source code would need to be rewritten. In contrast, the handwritten RVV-optimized code achieves better vector utilization and a significant speedup over the other approaches.
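A rough back-of-the-envelope model makes the utilization gap concrete. The sketch below is not a performance simulator: it assumes 32-bit elements, LMUL = 1, and that each pass handles min(remaining, VLMAX) elements, and both function names are hypothetical.

```python
def lane_utilization(avl, vlen_bits, lmul=1, sew_bits=32):
    """Fraction of a vector register group's lanes that an AVL keeps busy."""
    vlmax = vlen_bits * lmul // sew_bits
    return min(avl, vlmax) / vlmax


def reduce_max_passes(values, width):
    """Reduce-max that consumes up to `width` elements per pass."""
    best = float("-inf")
    passes = 0
    i = 0
    while i < len(values):
        vl = min(width, len(values) - i)  # elements handled in this pass
        best = max(best, max(values[i:i + vl]))
        i += vl
        passes += 1
    return best, passes


data = [float(i) for i in range(1024)]
print(lane_utilization(4, 1024))    # 0.125: an AVL of 4 fills 4 of 32 lanes
print(reduce_max_passes(data, 4))   # (1023.0, 256)
print(reduce_max_passes(data, 32))  # (1023.0, 32)
```

With VLEN = 1024, a loop fixed at 4 elements per pass needs eight times as many passes as one that fills a full register, before even considering larger LMUL settings.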

End-to-end benchmark using TFLite


Figure 7. Performance speedup of using TFLite with XNNPACK across different ML models.

Figure 7 shows end-to-end speedups that are consistent with those observed in the XNNPACK microkernel benchmark. Specifically, moving from scalar to RVV achieves a speedup of over 45x. This matches expectations, because the tested models are dominated by either F32-IGEMM or F32-GEMM, both of which show a speedup of approximately 40x in the microkernel benchmark. Based on these results, we conclude that optimizing the XNNPACK RVV backend is a promising path.

Conclusion and Future Work

SiFive has provided optimizations for most of the critical single precision floating-point microkernels. The performance results are promising. However, XNNPACK isn't just for single precision floating-point neural network inference. There are still lots of microkernels that need to be optimized, such as the int8 version of GEMM, IGEMM and so on. We hope that RISC-V developers can join us in XNNPACK RVV backend contributions.

Reference

XNNPACK
https://github.com/google/XNNPACK
XNNPACK F32-GEMM pull request 1
https://github.com/google/XNNPACK/pull/5893
XNNPACK F32-GEMM pull request 2
https://github.com/google/XNNPACK/pull/6411
XNNPACK F32-GEMM pull request 3
https://github.com/google/XNNPACK/pull/7035
RISC-V Vector Extension specification
https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc#vector-byte-length-vlenb
