SiFive Blog

December 11, 2017

All Aboard, Part 9: Paging and the MMU in the RISC-V Linux Kernel

This entry will cover the RISC-V port of Linux's memory management subsystem. Since the vast majority of the memory management code in Linux is architecture-independent, most of our port's memory management code deals with interfacing with our MMU, defining our page table format, and interfacing with drivers that have memory allocation constraints.

I will refrain from discussing the RISC-V memory model in this blog, both because it isn't yet finished and because it's complicated enough to warrant its own series of blog posts.

Also, as a side note: for those of you not following the internet, we've gotten our core architecture port into Linus' tree and are slated to release as part of 4.15. This won't be a fully bootable RISC-V system, but it's at least a good first step!

Privilege Levels in RISC-V Systems

The RISC-V ISA defines a stack of execution environments. While each environment is designed to be classically virtualizable, in standard systems each level in the stack is designed to provide the next level's execution environment. Starting from the least-privileged level in the stack, the execution environments are:

  • User-mode software executes in an AEE (Application Execution Environment). On Linux systems, the AEE is also known as the user ABI: the set of system calls supported by the kernel. The AEE also includes the entire user ISA, since user-mode programs are expected to be able to execute instructions other than just scall.
  • Supervisor-mode software executes in an SEE (Supervisor Execution Environment). This environment consists of the supervisor-mode instructions and CSRs defined by the privileged ISA document, along with the SBI. Supervisor-mode software is expected to provide multiple AEEs: Linux, for example, provides one AEE to each process.
  • Hypervisor-mode software executes in an HEE (Hypervisor Execution Environment), and is expected to provide multiple SEEs. The hypervisor-mode section of the privileged ISA document is still being written, so we'll ignore this for now.
  • Machine-mode software executes in an MEE (Machine Execution Environment), and is expected to provide one higher-level execution context. Since the privileged mode ISA makes implementing the U, S, and H extensions optional (thus allowing for M, M+U, M+S+U, and M+H+S+U systems), it's expected that different machine-mode software implementations will provide either an HEE, SEE, or AEE.

While it's possible to provide execution environment stacks that do not match this hierarchy, the software executing in each environment can't tell the difference. That's not to say that user-mode software is entirely portable: for example, the Linux AEE is different than the FreeBSD AEE because they provide different system calls. The intent is simply that programs written to execute in the Linux AEE can't tell if they're executing on Linux on hardware, in Linux running in Spike, or in QEMU's user-mode emulation. None of this is a new concept; it's just a bit more explicitly stated in the RISC-V ISA specification than it is on many other architectures.

Since RISC-V is classically virtualizable at every level of the privilege stack, no explicit hardware support is necessary to provide any execution environment: for example, QEMU's user-mode emulation provides an AEE on systems that have no hardware support for any RISC-V ISA. While these systems can be made reasonably performant, the main purpose of RISC-V is to enable hardware implementations of the ISA -- for example, even though a Xeon running QEMU will probably be the fastest implementation of the RISC-V Linux AEE for the foreseeable future, it would be more appropriate to run a hardware implementation of RISC-V on my wristwatch.

The RISC-V ISA documents are designed to allow the software implementations at various levels of the privilege stack to use the execution environment they're written against in order to provide the execution environment above them. Some of these are so obvious you probably haven't noticed that we've been talking about them for a while: for example, it's assumed that the hardware handles executing an addi instruction in userspace without the kernel's intervention because it would be silly not to.

We've been able to put off the discussion of privilege levels until now because the vast majority of the design is obvious: all of the user instructions are handled by the hardware without supervisor intervention except for scall, which just transfers control to the kernel's single trap entry point -- see my previous blog post All Aboard, Part 7: Entering and Exiting the Linux Kernel on RISC-V for details. Since application execution environments provide the illusion that user-space programs have access to a big flat address space, we've been able to more or less ignore memory when discussing user applications. As is common with computing systems, memory is the hard part -- thus, we only really need to discuss RISC-V privilege modes when talking about paging.

For this blog post, we'll be focusing on supervisor code running on reasonable systems -- thus we won't do things like emulating unsupported instructions from userspace. The focus of the blog will be on how to provide a RISC-V application execution environment given a supervisor execution environment.

The RISC-V Application-Class Supervisor Execution Environments

Supervisor programs, like Linux, execute on a supervisor execution environment. Much like how the user-level ISA leaves many of the specifics of the AEE to be implemented in different ways on different platforms (system calls on Linux vs BSD, for example), the privileged ISA doesn't specify all the details of the SEE that application-class supervisors (like Linux or BSD) can expect -- those will be specified as part of the platform specification.

In this blog, I'll quickly go over a few key aspects of the RISC-V supervisor execution environment for application-class supervisors. This environment is designed to support UNIX-style operating systems running in supervisor mode, emulating POSIX-compliant application execution environments. The highlights of the proposed requirements for the application-class SEE (with the caveat that I'm not in the platform specification working group, so this is all just my guess) are:

  • Either the RV32I or RV64I base ISAs, along with the M, A, and C extensions. The F and D extensions are optional but paired together, leaving the possible standard ISAs for application-class SEEs as RV32IMAC, RV32IMAFDC (RV32GC), RV64IMAC, and RV64IMAFDC (RV64GC).
  • On RV32I-based systems, support for Sv32 page-based virtual memory.
  • On RV64I-based systems, support for at least Sv39 page-based virtual memory.
  • Upon entering the SEE, the PMAs are set such that memory accesses are point-to-point strongly ordered between harts and devices.
  • An SBI that implements various fences, timers, and a console.
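
As a rough illustration of the ISA requirement above, here is a minimal sketch that checks whether an ISA string names one of the four standard application-class ISAs listed in the first bullet. The function names parse_isa and meets_application_class are made up for this example, and the parser only understands single-letter extensions plus the G shorthand (which stands for IMAFD); it is not how Linux or any real tool parses ISA strings.

```python
import re

# The two extension sets the list above allows, for either XLEN.
ALLOWED_EXTENSION_SETS = (frozenset("imac"), frozenset("imafdc"))


def parse_isa(isa):
    """Split a string such as 'rv64gc' into (xlen, set of extension letters).

    Assumption for this sketch: only single-letter extensions are handled,
    so multi-letter extension names are rejected as unrecognized.
    """
    match = re.fullmatch(r"rv(32|64)([a-z]+)", isa.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized ISA string: {isa!r}")
    xlen = int(match.group(1))
    extensions = set(match.group(2))
    if "g" in extensions:
        # 'G' is shorthand for the IMAFD combination.
        extensions.discard("g")
        extensions |= set("imafd")
    return xlen, extensions


def meets_application_class(isa):
    """True if the ISA string is RV32/RV64 IMAC or IMAFDC (F and D paired)."""
    _xlen, extensions = parse_isa(isa)
    return frozenset(extensions) in ALLOWED_EXTENSION_SETS


if __name__ == "__main__":
    for candidate in ("rv64gc", "rv32imac", "rv64imafc", "rv64ima"):
        print(candidate, meets_application_class(candidate))
```

The check is deliberately strict: an ISA string with F but not D, or one missing the C extension, is rejected because it falls outside the four combinations named above.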

The application-class SEE, as specified by the upcoming RISC-V platform specification, is the contract between standard Linux distributions and hardware vendors -- of course these restrictions don't apply to the embedded space, where many of them would be onerous. In practice: if you expect users to be able to swap out the boot media on your platform, then you should meet the requirements of the application-class SEE.

The RISC-V Linux Application Execution Environments

Supervisor-mode software on RISC-V uses a supervisor execution environment in order to provide one or multiple application execution environments. Fundamentally, an AEE (like any execution environment) is simply the definition of the next state of the machine upon every instruction's execution. On RISC-V systems, the AEE depends on:

  • The ISA string, which determines what the vast majority of instructions do as well as which registers constitute the machine's current state.
  • The supervisor's user-visible ABI, which determines what the scall instruction does. This is different than the C compiler's ABI, which defines the interface between different components of the application.
  • The contents of the entire memory address space.

In an idealized world, each process consists of its own independent AEE, with Linux multiplexing these on top of a single SEE. Of course, there are all sorts of problems with this model in practice, but none of them are specific to RISC-V systems. The concept of a self-contained and well-defined AEE is still useful from a standards standpoint, and we hope to make progress on properly specifying the RISC-V Linux AEE family (as well as AEEs for other POSIX-like systems) as our ports move forward.

Paging on RISC-V Systems

After that lengthy digression into the definition of RISC-V's privileged modes, we can finally get to the whole point of this blog post: paging on RISC-V systems. Paging is the main mechanism used to provide user mode with the illusion of having an AEE -- like most things in computer architecture, it turns out that memory is the tricky part.

One of the nice things about designing an ISA at the time RISC-V was designed is that so many different solutions to difficult problems have been tried that we pretty much know what to do now. Thus, we arrived at a pretty standard page-based virtual memory system when designing RISC-V's supervisor virtual memory interface. The exact page table formats and such are listed in the relevant RISC-V ISA manuals so I won't go through them here, but there are a few highlights:

  • Pages are 4KiB at the leaf node, and it's possible to map large contiguous regions with every level of the page table.
  • RV32I-based systems can have up to 34-bit physical addresses with a two-level page table.
  • RV64I-based systems can have multiple virtual address widths, starting with 39-bit and extending up to 64-bit in increments of 9 bits.
  • Mappings must be synchronized via the sfence.vma instruction.
  • There are bits for global mappings, supervisor-only, read/write/execute, and accessed/dirty.
  • There is a single valid bit, which allows storing XLEN-1 bits of flags in an otherwise unused page table entry. Additionally, there are two bits of software flags in mapped pages.
  • Address space identifiers are 9 bits on RV32I and 16 bits on RV64I, and they're a hint, so a valid implementation is to ignore them.
  • The accessed and dirty bits are strongly ordered with respect to accesses from the same hart, but are optional (with a trap-based mechanism when unsupported).
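
To make a few of these highlights concrete, here is a small sketch that splits a virtual address into its per-level page table indices and decodes the flag bits of a page table entry. The bit positions (V, R, W, X, U, G, A, D in bits 0 through 7, two software bits in bits 8 and 9, and the physical page number from bit 10 upward) and the index widths (10 bits per level on Sv32, 9 bits per level on Sv39 and Sv48) are my reading of the privileged ISA manual rather than anything stated in this post, so check them against the manual before relying on them. All of the names here are invented for the example.

```python
PAGE_SHIFT = 12  # 4 KiB leaf pages

# mode -> (number of page table levels, index bits per level)
MODES = {"Sv32": (2, 10), "Sv39": (3, 9), "Sv48": (4, 9)}

# Flag names in bit order, starting at bit 0 (assumed layout, see lead-in).
PTE_FLAGS = ("V", "R", "W", "X", "U", "G", "A", "D")


def split_vaddr(vaddr, mode):
    """Return ([index at root level, ..., index at leaf level], page offset)."""
    levels, bits = MODES[mode]
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = [
        (vaddr >> (PAGE_SHIFT + level * bits)) & ((1 << bits) - 1)
        for level in range(levels)
    ]
    return indices[::-1], offset


def decode_pte(pte):
    """Return (flag dictionary, software bits, physical page number)."""
    flags = {name: bool((pte >> bit) & 1) for bit, name in enumerate(PTE_FLAGS)}
    software_bits = (pte >> 8) & 0x3
    ppn = pte >> 10
    return flags, software_bits, ppn


def is_leaf(pte):
    """A valid entry with any of R/W/X set maps memory instead of pointing
    at another level of the page table."""
    flags, _, _ = decode_pte(pte)
    return flags["V"] and (flags["R"] or flags["W"] or flags["X"])


def leaf_size(mode, level):
    """Bytes mapped by a leaf entry at the given level (0 is the last level)."""
    _, bits = MODES[mode]
    return 1 << (PAGE_SHIFT + level * bits)


if __name__ == "__main__":
    print(split_vaddr(0x12345678, "Sv32"))
    print(decode_pte((0x1234 << 10) | 0xCF))
    print(leaf_size("Sv39", 1))
```

The leaf_size helper is where the "large contiguous regions" highlight shows up: a leaf entry one level above the bottom of an Sv39 table covers 2 MiB, and one at the level above that covers 1 GiB.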

The Linux implementation of paging is functional but not complete: we're missing support for ASIDs, for example. Like many things in our port, these extra features will come with time.

Handling Device DMA

RISC-V does not currently define an IOMMU, so device accesses are performed in a single linear address space provided by the SEE (aka, physical memory). Combined with the lack of a mechanism to modify PMAs, this makes device IO on RISC-V very simple: we essentially just don't do anything specific to our ISA.

Handling 32-bit DMA Regions

Some devices only support 32-bit addressing even when attached to a system with longer physical addresses. Since RISC-V lacks an IOMMU, we handle these devices by using kernel bounce buffers. This is correct but slow: while it may be fine for SoC-style systems where the set of devices is well known at elaboration time, as more complicated RISC-V systems become available we will eventually need to standardize a mechanism for virtualizing device addressing.

Our bounce buffer mechanism simply uses the standard mechanisms provided by Linux, so there isn't anything RISC-V specific about it. We provide a 32-bit ZONE_DMA, allow allocating from that, and use bounce buffers to handle ioremap() for already-allocated pages outside the legal region.
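
For readers who haven't run into bounce buffers before, here is a toy model of the idea. It is not the kernel's implementation and uses none of its APIs: ToyMemory, BouncePool, and the method names are all made up for this sketch, and the pool is a trivial bump allocator that never frees. The only thing it tries to show is the copy-in and copy-out that make this approach correct but slow.

```python
DMA32_LIMIT = 1 << 32  # first address a 32-bit-only device cannot reach


class ToyMemory:
    """Sparse, byte-addressable stand-in for physical memory."""

    def __init__(self):
        self._bytes = {}

    def write(self, addr, data):
        for i, value in enumerate(data):
            self._bytes[addr + i] = value

    def read(self, addr, length):
        return bytes(self._bytes.get(addr + i, 0) for i in range(length))


def needs_bounce(paddr, length, limit=DMA32_LIMIT):
    """True if any byte of the buffer lies at or above the device's limit."""
    return paddr + length > limit


class BouncePool:
    """Hands out scratch space below 4 GiB (hypothetical bump allocator)."""

    def __init__(self, mem, base, size):
        assert base + size <= DMA32_LIMIT
        self.mem = mem
        self.next = base
        self.end = base + size

    def map_to_device(self, paddr, length):
        """Return the address the device should use for this buffer."""
        if not needs_bounce(paddr, length):
            return paddr  # the device can address the buffer directly
        if self.next + length > self.end:
            raise MemoryError("bounce pool exhausted")
        bounce = self.next
        self.next += length
        # Copy in, so the device sees the data the driver prepared.
        self.mem.write(bounce, self.mem.read(paddr, length))
        return bounce

    def unmap_from_device(self, paddr, dev_addr, length):
        """Copy back anything the device wrote into the bounce buffer."""
        if dev_addr != paddr:
            self.mem.write(paddr, self.mem.read(dev_addr, length))


if __name__ == "__main__":
    mem = ToyMemory()
    pool = BouncePool(mem, base=0x8000_0000, size=0x1000)
    high = 0x1_0000_0000  # a buffer just above the 32-bit boundary
    mem.write(high, b"hello")
    dev_addr = pool.map_to_device(high, 5)
    print(hex(dev_addr), mem.read(dev_addr, 5))
```

Every transfer that touches memory above the limit pays for an extra copy in each direction, which is the cost that an IOMMU-style mechanism for virtualizing device addressing would avoid.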


Palmer Dabbelt

Palmer Dabbelt is an Engineer for Meta, having prior experience as an Engineer at Rivos and Software Engineer at Google. Prior to that he worked at SiFive as Director of Software Engineering and as an Engineer.
