Welcome to hvisor!

hvisor is a lightweight Type-1 virtual machine monitor written in Rust, offering efficient resource management and low-overhead virtualization performance.

Features

  1. Cross-platform support: Supports multiple architectures including AARCH64, RISC-V, and LoongArch.
  2. Lightweight: Focuses on core virtualization features, avoiding unnecessary complexity found in traditional virtualization solutions, suitable for resource-constrained environments.
  3. Efficient: Runs directly on hardware without going through an OS layer, providing near-native performance.
  4. Security: Rust is known for its memory safety and concurrent programming model, helping to reduce common system-level programming errors such as memory leaks and data races.
  5. Fast startup: Designed to be simple with a short startup time, suitable for scenarios that require rapid deployment of virtualization.

Main Functions

  1. Virtual Machine Management: Provides basic management functions for creating, starting, stopping, and deleting virtual machines.
  2. Resource Allocation and Isolation: Supports efficient allocation and management of CPU, memory, and I/O devices, using virtualization technology to ensure isolation between different virtual machines, enhancing system security and stability.

Use Cases

  1. Edge Computing: Suitable for running on edge devices, providing virtualization support for IoT and edge computing scenarios.
  2. Development and Testing: Developers can quickly create and destroy virtual machine environments for software development and testing.
  3. Security Research: Provides an isolated environment for security research and malware analysis.

Hardware Platforms Supported by hvisor

2025.3.18

aarch64

  • QEMU virt aarch64
  • NXP i.MX8MP
  • Xilinx Ultrascale+ MPSoC ZCU102
  • Rockchip RK3588
  • Rockchip RK3568
  • Forlinx OK6254-C

riscv64

  • QEMU virt riscv64
  • FPGA Xiangshan (Kunming Lake) on S2C Prodigy S7-19PS-2
  • FPGA RocketChip on Xilinx Ultrascale+ MPSoC ZCU102

loongarch64

  • Loongson 3A5000+7A2000
  • Loongson 3A6000

hvisor Hardware Adaptation Development Manual 🧑🏻‍💻

wheatfox (wheatfox17@icloud.com) 2025.3.17

Design Principles

  1. Code and board configuration separation: No platform_xxx related cfg should appear inside src of hvisor itself.
  2. Platform independence: Adopt the earlier hvisor-deploy architecture and store information about the various architectures and boards in an orderly way under the platform directory.
  3. Board directory index:
    • Uniformly use platform/$ARCH/$BOARD as the dedicated directory for the board.
    • Each board's unique BID (Board ID) adopts the ARCH/BOARD format, such as aarch64/qemu-gicv3.
  4. Simplified compilation: Support using BID=xxx/xxx to directly specify the board, while also compatible with ARCH=xxx BOARD=xxx style.
  5. Structured configuration: Each board directory contains the following files:
    • linker.ld - Link script
    • platform.mk - QEMU startup Makefile and hvisor.bin handling
    • board.rs - Board definition Rust code
    • configs/ - JSON configurations for hvisor-tool startup zones
    • cargo/
      • features - Specific cargo features corresponding to the board, including drivers, functions, etc.
      • config.template.toml - Template for .cargo/config, maintained by each board
    • test/ - (Optional) QEMU related test code, including unit tests, system tests, etc.
    • image/ - Startup file directory, containing multiple subdirectories:
      • bootloader/ - (Optional) Used for local QEMU operation and unittest/systemtest testing
      • dts/ - (Optional) Device tree source files for zones 0, 1, 2, …
      • its/ - (Optional) Used for generating U-Boot FIT image (hvisor aarch64 zcu102)
      • acpi/ - (Optional) ACPI device tree source code for x86 platform (hvisor x86_64)
      • kernel/ - (Optional) Kernel Image suitable for the target platform
      • virtdisk/ - (Optional) Virtual disk files, such as rootfs, etc.

Code Implementation Details

Auto-generate .cargo/config.toml

  • Generated by tools/gen_cargo_config.sh, ensuring dynamic updates to the linker.ld configuration.
  • config.template.toml uses placeholders like __ARCH__, __BOARD__, which gen_cargo_config.sh replaces to generate .cargo/config.toml (see the sketch after this list).
  • build.rs is responsible for symlinking platform/$ARCH/$BOARD/board.rs to src/platform/__board.rs.
  • Avoids Makefile handling, triggered only when env variables change, reducing unnecessary full recompilations.
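
The substitution step can be pictured as follows. This is only an illustrative sketch of what tools/gen_cargo_config.sh does (the real script may differ); the ARCH/BOARD values are example inputs:

# Illustrative sketch of the placeholder substitution, not the actual script
ARCH=aarch64
BOARD=qemu-gicv3
sed -e "s|__ARCH__|${ARCH}|g" \
    -e "s|__BOARD__|${BOARD}|g" \
    platform/${ARCH}/${BOARD}/cargo/config.template.toml > .cargo/config.toml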

Select drivers through Cargo features

  • Avoid platform_xxx directly appearing in src/, switch to configuration based on features.
  • cargo/features uniformly stores configurations of board drivers, functions, etc.

Overview of features Corresponding to Each Board

BOARD ID                FEATURES
aarch64/qemu-gicv3      gicv3 pl011 iommu pci pt_layout_qemu
aarch64/qemu-gicv2      gicv2 pl011 iommu pci pt_layout_qemu
aarch64/imx8mp          gicv3 imx_uart
aarch64/zcu102          gicv2 xuartps
riscv64/qemu-plic       plic
riscv64/qemu-aia        aia
loongarch64/ls3a5000    loongson_chip_7a2000 loongson_uart loongson_cpu_3a5000
loongarch64/ls3a6000    loongson_chip_7a2000 loongson_uart loongson_cpu_3a6000
aarch64/rk3588          gicv3 uart_16550 uart_addr_rk3588 pt_layout_rk
aarch64/rk3568          gicv3 uart_16550 uart_addr_rk3568 pt_layout_rk
x86_64/qemu

Development and Compilation Guide

Compile Different Boards

make ARCH=aarch64 BOARD=qemu-gicv3
make BID=aarch64/qemu-gicv3  # Use BID shorthand
make BID=aarch64/imx8mp
make BID=loongarch64/ls3a5000
make BID=x86_64/qemu

Adapt New Boards

  1. Determine features: Refer to existing features for classification, add required drivers and configurations.
  2. Create platform/$ARCH/$BOARD directory:
    • Add linker.ld, board.rs, features, etc.
  3. Compile Test:
make BID=xxx/new_board

features Design Principles

  • Minimize hierarchy:
    • For example, cpu-a72 instead of board_xxx, to facilitate reuse across multiple boards.
  • Clear driver/function classification:
    • irqchip (gicv3, plic, ...)
    • uart (pl011, imx_uart, ...)
    • iommu, pci, pt_layout_xxx, ...

Running hvisor on QEMU

1. Install Cross Compiler aarch64-none-linux-gnu-10.3

URL: https://developer.arm.com/downloads/-/gnu-a

Tool selection: AArch64 GNU/Linux target (aarch64-none-linux-gnu)

Download link: https://developer.arm.com/-/media/Files/downloads/gnu-a/10.3-2021.07/binrel/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz?rev=1cb9c51b94f54940bdcccd791451cec3&hash=B380A59EA3DC5FDC0448CA6472BF6B512706F8EC

wget https://armkeil.blob.core.windows.net/developer/Files/downloads/gnu-a/10.3-2021.07/binrel/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz
tar xvf gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz
ls gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin/

Installation complete, remember the path, for example: /home/tools/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-, this path will be used later.
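
Later build steps pass this prefix via CROSS_COMPILE. For convenience you could also export it once; the directory below is just the example path above, so adjust it to where you actually extracted the toolchain:

# Example only; adjust the directory to your actual extraction path
export CROSS_COMPILE=/home/tools/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-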

2. Compile and Install QEMU 9.0.1

Note, QEMU needs to be switched from 7.2.12 to 9.0.1 for proper use of PCI virtualization

# Install required dependencies for compilation
sudo apt install autoconf automake autotools-dev curl libmpc-dev libmpfr-dev libgmp-dev \
              gawk build-essential bison flex texinfo gperf libtool patchutils bc \
              zlib1g-dev libexpat-dev pkg-config  libglib2.0-dev libpixman-1-dev libsdl2-dev \
              git tmux python3 python3-pip ninja-build
# Download source code
wget https://download.qemu.org/qemu-9.0.1.tar.xz
# Extract
tar xvJf qemu-9.0.1.tar.xz
cd qemu-9.0.1
# Generate configuration file
./configure --enable-kvm --enable-slirp --enable-debug --target-list=aarch64-softmmu,x86_64-softmmu
# Compile
make -j$(nproc)

Then edit the ~/.bashrc file, add a few lines at the end of the file:

# Please note, the parent directory of qemu-9.0.1 can be flexibly adjusted according to your actual installation location. Also, it needs to be placed at the beginning of the $PATH variable.
export PATH=/path/to/qemu-9.0.1/build:$PATH

Afterward, update the PATH in the current terminal with source ~/.bashrc, or simply open a new terminal. Then confirm the QEMU version; if it reports 9.0.1, the installation was successful:

qemu-system-aarch64 --version   # Check version

Note, the above dependency packages may not be complete, for example:

  • If ERROR: pkg-config binary 'pkg-config' not found appears, you can install the pkg-config package;
  • If ERROR: glib-2.48 gthread-2.0 is required to compile QEMU appears, you can install the libglib2.0-dev package;
  • If ERROR: pixman >= 0.21.8 not present appears, you can install the libpixman-1-dev package.

If you encounter the error ERROR: Dependency "slirp" not found, tried pkgconfig while generating the configuration file:

Download the https://gitlab.freedesktop.org/slirp/libslirp package and install it according to the readme.

3. Compile Linux Kernel 5.4

Before compiling the root linux image, set the CONFIG_IPV6 and CONFIG_BRIDGE options in the .config file to y so that bridges and tap devices can be created in root linux. The specific steps are as follows:

git clone https://github.com/torvalds/linux -b v5.4 --depth=1
cd linux
git checkout v5.4
# Modify the CROSS_COMPILE path according to the path of the cross compiler installed in the first step
make ARCH=arm64 CROSS_COMPILE=/root/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu- defconfig
# Add a line in .config
CONFIG_BLK_DEV_RAM=y
# Modify two CONFIG parameters in .config
CONFIG_IPV6=y
CONFIG_BRIDGE=y
# Compile, modify the CROSS_COMPILE path according to the path of the cross compiler installed in the first step
make ARCH=arm64 CROSS_COMPILE=/root/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu- Image -j$(nproc)

If you encounter an error during the compilation of linux:

/usr/bin/ld: scripts/dtc/dtc-parser.tab.o:(.bss+0x20): multiple definition of `yylloc'; scripts/dtc/dtc-lexer.lex.o:(.bss+0x0): first defined here

Then edit scripts/dtc/dtc-lexer.lex.c in the linux folder and add extern before YYLTYPE yylloc;, then recompile. If you then encounter the error openssl/bio.h: No such file or directory, execute sudo apt install libssl-dev.
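
If you prefer to apply the change from the command line, a one-liner like the following works (it simply prefixes the yylloc definition with extern, the same edit described above; verify the file afterwards):

# Prefix the yylloc definition with extern
sed -i 's/^YYLTYPE yylloc;/extern YYLTYPE yylloc;/' scripts/dtc/dtc-lexer.lex.c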

After compilation, the kernel file is located at arch/arm64/boot/Image. Remember the path of the entire linux folder, for example /home/korwylee/lgw/hypervisor/linux; we will use this path again in step 7.

4. Build File System Based on Ubuntu 22.04 Arm64 Base

You can skip this part and directly download the ready-made disk image for use. https://blog.syswonder.org/#/2024/20240415_Virtio_devices_tutorial

We use ubuntu 22.04 to build the root file system.

Ubuntu 20.04 can also be used, but it will report a glibc-version-too-low error at runtime; refer to the solution in the comment section of ARM64-qemu-jailhouse.

wget http://cdimage.ubuntu.com/ubuntu-base/releases/22.04/release/ubuntu-base-22.04.5-base-arm64.tar.gz

mkdir rootfs
# Create a 1G size ubuntu.img, you can modify the count to change the img size
dd if=/dev/zero of=rootfs1.img bs=1M count=1024 oflag=direct
mkfs.ext4 rootfs1.img
# Put ubuntu.tar.gz into the ubuntu.img that has been mounted to rootfs
sudo mount -t ext4 rootfs1.img rootfs/
sudo tar -xzf ubuntu-base-22.04.5-base-arm64.tar.gz -C rootfs/

# Let rootfs bind and get some information and hardware from the physical machine
# qemu-path is your qemu path
sudo cp qemu-path/build/qemu-system-aarch64 rootfs/usr/bin/
sudo cp /etc/resolv.conf rootfs/etc/resolv.conf
sudo mount -t proc /proc rootfs/proc
sudo mount -t sysfs /sys rootfs/sys
sudo mount -o bind /dev rootfs/dev
sudo mount -o bind /dev/pts rootfs/dev/pts

# Executing this command may report an error, please refer to the solution below
sudo chroot rootfs
apt-get update
apt-get install git sudo vim bash-completion \
		kmod net-tools iputils-ping resolvconf ntpdate screen

# The following content surrounded by # can be done or not done
###################
adduser arm64
adduser arm64 sudo
echo "kernel-5_4" >/etc/hostname
echo "127.0.0.1 localhost" >/etc/hosts
echo "127.0.0.1 kernel-5_4">>/etc/hosts
dpkg-reconfigure resolvconf
dpkg-reconfigure tzdata
###################
exit

sudo umount rootfs/proc
sudo umount rootfs/sys
sudo umount rootfs/dev/pts
sudo umount rootfs/dev
sudo umount rootfs

Finally, unmount the mounts to complete the production of the root file system.

When executing sudo chroot rootfs, if the error chroot: failed to run command ‘/bin/bash’: Exec format error occurs, you can execute:

sudo apt-get install qemu-user-static
sudo update-binfmts --enable qemu-aarch64

5. Rust Environment Configuration

Please refer to: Rust Language Bible

6. Compile and Run hvisor

First, pull the hvisor code repository to the local machine and switch to the dev branch. Then, in the hvisor/images/aarch64 folder, put the previously compiled root file system and Linux kernel image into the virtdisk and kernel directories respectively, renaming them rootfs1.ext4 and Image.

Second, prepare the configuration files. Taking the virtio-blk&console example, the directory contains six files; handle them as follows:

  • linux1.dts: Root Linux's device tree, hvisor will use it when starting.
  • linux2.dts: Zone1 Linux's device tree, needed by hvisor-tool when starting zone1. Replace the files of the same name in the devicetree directory with linux1.dts and linux2.dts, then execute make all to obtain linux1.dtb and linux2.dtb.
  • qemu_aarch64.rs, qemu-aarch64.mk: directly replace the files of the same name in the hvisor repository.

Then, in the hvisor directory, execute:

make ARCH=aarch64 LOG=info BOARD=qemu-gicv3 run # or use BOARD=qemu-gicv2

Afterward, you will enter the uboot startup interface, under this interface execute:

bootm 0x40400000 - 0x40000000

This command boots hvisor from physical address 0x40400000; 0x40000000 is essentially no longer used but is retained for historical reasons. When hvisor starts, it automatically boots root linux (zone0, which takes on management duties) and enters the root linux shell.

If you are prompted that dtc is missing, you can install it with:

sudo apt install device-tree-compiler

7. Start zone1-linux Using hvisor-tool

First, complete the compilation of the latest version of hvisor-tool. For specific instructions, please refer to the README of hvisor-tool. For example, if you want to compile a command-line tool for arm64, and the source code of the Linux image in the Hvisor environment is located at ~/linux, you can execute

make all ARCH=arm64 LOG=LOG_WARN KDIR=~/linux

Please make sure that the Root Linux image in Hvisor is compiled from the Linux source directory specified in the options when compiling hvisor-tool.

After compilation, copy driver/hvisor.ko and tools/hvisor into the directory from which zone1 linux will be started inside the image/virtdisk/rootfs1.ext4 root file system (for example, /same_path/). Then put the zone1 kernel image (if it is the same Linux as zone0, just copy image/aarch64/kernel/Image) and the device tree (image/aarch64/linux2.dtb) into that same directory (/same_path/), naming them Image and linux2.dtb.

Then create a root file system for zone1 linux. You can copy rootfs1.ext4 in image/aarch64/virtdisk, or repeat step 4 (preferably with a smaller image size), and rename the result rootfs2.ext4. Then put rootfs2.ext4 in the same directory as rootfs1.ext4 (/same_path/).
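
A possible command sequence for the copies described above, assuming the hvisor-tool build outputs (driver/hvisor.ko, tools/hvisor), the hvisor image/ directory, and a temporary mount point /mnt are all reachable from the current directory (adjust every path to your own layout):

# Mount the root linux file system and copy the zone1 startup files into it
sudo mount image/aarch64/virtdisk/rootfs1.ext4 /mnt
sudo mkdir -p /mnt/same_path
sudo cp driver/hvisor.ko tools/hvisor /mnt/same_path/
sudo cp image/aarch64/kernel/Image /mnt/same_path/Image
sudo cp image/aarch64/linux2.dtb /mnt/same_path/linux2.dtb
sudo cp image/aarch64/virtdisk/rootfs2.ext4 /mnt/same_path/rootfs2.ext4
sudo umount /mnt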

If the capacity of rootfs1.ext4 is not enough, you can refer to img expansion to expand rootfs1.ext4.
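
One common way to grow an ext4 image offline is sketched below (the added size is an example; keep a backup of the image first):

# Append 512 MiB of zeros, then check and grow the file system to fill the file
dd if=/dev/zero bs=1M count=512 >> rootfs1.ext4
e2fsck -f rootfs1.ext4
resize2fs rootfs1.ext4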

Then on QEMU, you can start zone1-linux through root linux-zone0.

For detailed steps to start zone1-linux, refer to the README of hvisor-tool and the startup example

Install qemu

Install QEMU 9.0.2:

wget https://download.qemu.org/qemu-9.0.2.tar.xz
# Unzip
tar xvJf qemu-9.0.2.tar.xz
cd qemu-9.0.2
# Configure Riscv support
./configure --target-list=riscv64-softmmu,riscv64-linux-user 
make -j$(nproc)
# Add to environment variable
export PATH=$PATH:/path/to/qemu-9.0.2/build
# Test if installation was successful
qemu-system-riscv64 --version

Install cross-compiler

The Riscv cross-compiler needs to be obtained and compiled from riscv-gnu-toolchain.

# Install necessary tools
sudo apt-get install autoconf automake autotools-dev curl python3 python3-pip libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev ninja-build git cmake libglib2.0-dev libslirp-dev

git clone https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
git rm qemu 
git submodule update --init --recursive
# The above operation will occupy more than 5GB of disk space
# If git reports a network error, you can execute:
git config --global http.postbuffer 524288000

Then start compiling the toolchain:

cd riscv-gnu-toolchain
mkdir build
cd build
../configure --prefix=/opt/riscv64
sudo make linux -j $(nproc)
# After compilation, add the toolchain to the environment variable
echo 'export PATH=/opt/riscv64/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

This will get the riscv64-unknown-linux-gnu toolchain.

Compile Linux

git clone https://github.com/torvalds/linux -b v6.2 --depth=1
cd linux
git checkout v6.2
make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- defconfig
make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- modules -j$(nproc)
# Start compiling
make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- Image -j$(nproc)

Make Ubuntu root filesystem

wget http://cdimage.ubuntu.com/ubuntu-base/releases/20.04/release/ubuntu-base-20.04.2-base-riscv64.tar.gz
mkdir rootfs
dd if=/dev/zero of=riscv_rootfs.img bs=1M count=1024 oflag=direct
mkfs.ext4 riscv_rootfs.img
sudo mount -t ext4 riscv_rootfs.img rootfs/
sudo tar -xzf ubuntu-base-20.04.2-base-riscv64.tar.gz -C rootfs/

sudo cp /path-to-qemu/build/qemu-system-riscv64 rootfs/usr/bin/
sudo cp /etc/resolv.conf rootfs/etc/resolv.conf
sudo mount -t proc /proc rootfs/proc
sudo mount -t sysfs /sys rootfs/sys
sudo mount -o bind /dev rootfs/dev
sudo mount -o bind /dev/pts rootfs/dev/pts
sudo chroot rootfs 
# After entering chroot, install necessary packages:
apt-get update
apt-get install git sudo vim bash-completion \
    kmod net-tools iputils-ping resolvconf ntpdate
exit

sudo umount rootfs/proc
sudo umount rootfs/sys
sudo umount rootfs/dev/pts
sudo umount rootfs/dev
sudo umount rootfs

Run hvisor

Place the prepared root filesystem and Linux kernel image in the specified locations in the hvisor directory, then execute make run ARCH=riscv64 in the hvisor root directory.

By default hvisor uses the PLIC; execute make run ARCH=riscv64 IRQ=aia to enable the AIA specification instead.
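
For reference, the two invocations side by side:

make run ARCH=riscv64           # boots with the PLIC (default)
make run ARCH=riscv64 IRQ=aia   # boots with the AIA specification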

Start non-root linux

Use hvisor-tool to generate the hvisor.ko file, then you can start zone1-linux through root linux-zone0 on QEMU.

After starting root linux, execute in the /home directory

sudo insmod hvisor.ko
rm nohup.out
mkdir -p /dev/pts
mount -t devpts devpts /dev/pts
nohup ./hvisor zone start linux2-aia.json && cat nohup.out | grep "char device" && script /dev/null

Booting hvisor on NXP-IMX8MP

Date: 2024/2/25

Updated: 2025/3/7

Authors: Yang Junyi, Chen Xingyu, Li Guowei, Chen Linkun

1. Download the Linux source code provided by the manufacturer

https://pan.baidu.com/s/1XimrhPBQIG5edY4tPN9_pw?pwd=kdtk Extraction code: kdtk

Enter the Linux/sources/ directory, download the three compressed files OK8MP-linux-sdk.tar.bz2.0*, and after downloading, execute:

cd Linux/sources

# Merge split compressed files
cat OK8MP-linux-sdk.tar.bz2.0* > OK8MP-linux-sdk.tar.bz2

# Unzip the merged compressed file
tar -xvjf OK8MP-linux-sdk.tar.bz2

After unzipping, the OK8MP-linux-kernel directory is the Linux source code directory.

2. Compile Linux source code

Install cross-compilation tools

  1. Download the cross-compilation toolchain:

    wget https://armkeil.blob.core.windows.net/developer/Files/downloads/gnu-a/10.3-2021.07/binrel/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz
    
  2. Unzip the toolchain:

    tar xvf gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu.tar.xz
    
  3. Add the path so that aarch64-none-linux-gnu-* can be used directly, modify the ~/.bashrc file:

    echo 'export PATH=$PWD/gcc-arm-10.3-2021.07-x86_64-aarch64-none-linux-gnu/bin:$PATH' >> ~/.bashrc
    source ~/.bashrc
    

Compile Linux

  1. Switch to the Linux kernel source code directory:

    cd Linux/sources/OK8MP-linux-sdk
    
  2. Execute the compilation command:

    # Set Linux kernel configuration
    make OK8MP-C_defconfig ARCH=arm64 CROSS_COMPILE=aarch64-none-linux-gnu-
    
    # Compile the Linux kernel
    make ARCH=arm64 CROSS_COMPILE=aarch64-none-linux-gnu- Image -j$(nproc)
    
    # Copy the compiled image to the tftp directory
    cp arch/arm64/boot/Image ~/tftp/
    

Create a tftp directory here for later image organization and for using tftp to transfer images as mentioned in the appendix.

3. Prepare the SD card

  1. Insert the SD card into the card reader and connect it to the host.

  2. Switch to the Linux/Images directory.

  3. Execute the following commands for partitioning:

    fdisk <$DRIVE>
    d  # Delete all partitions
    n  # Create a new partition
    p  # Choose primary partition
    1  # Partition number 1
    16384  # Starting sector
    t  # Change partition type
    83  # Select Linux filesystem (ext4)
    w  # Save and exit
    
  4. Write the boot file to the SD card boot disk:

    dd if=imx-boot_4G.bin of=<$DRIVE> bs=1K seek=32 conv=fsync
    
  5. Format the first partition of the SD card boot disk as ext4:

    mkfs.ext4 <$DRIVE>1
    
  6. Unplug the card reader and reconnect it. Extract the root file system rootfs.tar into the first partition of the SD card. rootfs.tar can be made by referring to qemu-aarch64 or by using the image below.

    tar -xvf rootfs.tar -C <path/to/mounted/SD/card/partition>
    

rootfs.tar download address:

https://disk.pku.edu.cn/link/AADFFFE8F568DE4E73BE24F5AED54B00EB
Filename: rootfs.tar
  7. After completion, eject the SD card.

4. Compile hvisor

  1. Organize the configuration files

Place the configuration files where they belong. Sample configuration files can be referred to here.

  2. Compile hvisor

Enter the hvisor directory, switch to the main branch or dev branch, and execute the compilation command:

make ARCH=aarch64 FEATURES=platform_imx8mp,gicv3 LOG=info all

# Put the compiled hvisor image into tftp
make cp

5. Boot hvisor and root linux

Before booting the NXP board, transfer the files from the tftp directory to the SD card, such as to the /home/arm64 directory on the SD card. The files in the tftp directory include:

  • Image: root linux image, can also be used as non-root linux image
  • linux1.dtb, linux2.dtb: device trees for root linux and non-root linux
  • hvisor.bin: hvisor image
  • OK8MP-C.dtb: used only for some checks during U-Boot boot and otherwise not really needed; it can be obtained from OK8MP-C.dts

Boot the NXP board:

  1. Adjust the dip switches to enable SD card boot mode: (1,2,3,4) = (ON,ON,OFF,OFF).
  2. Insert the SD card into the SD slot.
  3. Connect the development board to the host using a serial cable.
  4. Open the serial port with terminal software

After the NXP board powers on, there should be output on the serial port. Restart the development board and immediately hold down the spacebar so that U-Boot drops into its command-line terminal, then execute the following command:

setenv loadaddr 0x40400000; setenv fdt_addr 0x40000000; setenv zone0_kernel_addr 0xa0400000; setenv zone0_fdt_addr 0xa0000000; ext4load mmc 1:1 ${loadaddr} /home/arm64/hvisor.bin; ext4load mmc 1:1 ${fdt_addr} /home/arm64/OK8MP-C.dtb; ext4load mmc 1:1 ${zone0_kernel_addr} /home/arm64/Image; ext4load mmc 1:1 ${zone0_fdt_addr} /home/arm64/linux1.dtb; bootm ${loadaddr} - ${fdt_addr};

After execution, hvisor should boot and automatically enter root linux.

6. Boot non-root linux

Booting non-root linux requires hvisor-tool. For details, please refer to the README of hvisor-tool.

Appendix. Convenient image transfer using tftp

Tftp facilitates data transfer between the development board and the host without the need to plug and unplug the SD card each time. The specific steps are as follows:

For Ubuntu systems

If you are using Ubuntu, execute the following steps in sequence:

  1. Install TFTP server software package

    sudo apt-get update
    sudo apt-get install tftpd-hpa tftp-hpa
    
  2. Configure TFTP server

    Create TFTP root directory and set permissions:

    mkdir -p ~/tftp
    sudo chown -R $USER:$USER ~/tftp
    sudo chmod -R 755 ~/tftp
    

    Edit the tftpd-hpa configuration file:

    sudo nano /etc/default/tftpd-hpa
    

    Modify as follows:

    # /etc/default/tftpd-hpa
    
    TFTP_USERNAME="tftp"
    TFTP_DIRECTORY="/home/<your-username>/tftp"
    TFTP_ADDRESS=":69"
    TFTP_OPTIONS="-l -c -s"
    

    Replace <your-username> with your actual username.

  3. Start/restart TFTP service

    sudo systemctl restart tftpd-hpa
    
  4. Verify TFTP server

    echo "TFTP Server Test" > ~/tftp/testfile.txt
    
    tftp localhost
    tftp> get testfile.txt
    tftp> quit
    cat testfile.txt
    

    If "TFTP Server Test" is displayed, the TFTP server is working properly.

  5. Configure to start on boot:

    sudo systemctl enable tftpd-hpa
    
  6. Connect the development board's network port (there are two; use the lower one) to the host with a network cable, and configure the host's wired network interface with IP 192.169.137.2 and netmask 255.255.255.0, as in the example below.
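
For example, on a Linux host the wired interface could be configured like this (the interface name eth0 is a placeholder; substitute your actual device):

# Assign the TFTP server address used in the U-Boot commands below
sudo ip addr add 192.169.137.2/24 dev eth0
sudo ip link set eth0 up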

After booting the development board, enter the uboot command line, and the command becomes:

setenv serverip 192.169.137.2; setenv ipaddr 192.169.137.3; setenv loadaddr 0x40400000; setenv fdt_addr 0x40000000; setenv zone0_kernel_addr 0xa0400000; setenv zone0_fdt_addr 0xa0000000; tftp ${loadaddr} ${serverip}:hvisor.bin; tftp ${fdt_addr} ${serverip}:OK8MP-C.dtb; tftp ${zone0_kernel_addr} ${serverip}:Image; tftp ${zone0_fdt_addr} ${serverip}:linux1.dtb; bootm ${loadaddr} - ${fdt_addr};

Explanation:

  • setenv serverip 192.169.137.2: Set the IP address of the tftp server.
  • setenv ipaddr 192.169.137.3: Set the IP address of the development board.
  • setenv loadaddr 0x40400000: Set the load address for the hvisor image.
  • setenv fdt_addr 0x40000000: Set the load address for the device tree file.
  • setenv zone0_kernel_addr 0xa0400000: Set the load address for the guest Linux image.
  • setenv zone0_fdt_addr 0xa0000000: Set the load address for the root Linux device tree file.
  • tftp ${loadaddr} ${serverip}:hvisor.bin: Download the hvisor image from the tftp server to the hvisor load address.
  • tftp ${fdt_addr} ${serverip}:OK8MP-C.dtb: Download the device tree file from the tftp server to the device tree file load address.
  • tftp ${zone0_kernel_addr} ${serverip}:Image: Download the guest Linux image from the tftp server to the guest Linux image load address.
  • tftp ${zone0_fdt_addr} ${serverip}:linux1.dtb: Download the root Linux device tree file from the tftp server to the root Linux device tree file load address.
  • bootm ${loadaddr} - ${fdt_addr}: Boot hvisor, load the hvisor image and device tree file.

For Windows systems

You can refer to this article: https://blog.csdn.net/qq_52192220/article/details/142693036

FPGA zcu102

Author: Yang Junyi (Jerry), github.com/comet959

# Before, Install vivado 2022.2 software
# Ubuntu 20.04 can work fine
sudo apt update

git clone https://github.com/U-interrupt/uintr-rocket-chip.git
cd uintr-rocket-chip
git submodule update --init --recursive
export RISCV=/opt/riscv64
git checkout 98e9e41
vim digilent-vivado-script/config.ini # Env Config

make checkout
make clean
make build

# Use Vivado to open the project, change the top file, run synthesis, run implementation, and generate the bitstream.
# Connect the ZCU102's JTAG and UART ports to your PC.
# Use the dd command to flash the image, including the boot and rootfs partitions.
# Set the boot mode switches to (On Off Off Off).
# Power on the board.

sudo screen /dev/ttyUSB0 115200 # Aarch64 Core Uart
sudo screen /dev/ttyUSB2 115200 # Riscv Core Uart

# On /dev/ttyUSB0
cd uintr-rocket-chip
./load-and-reset.sh

# Watch ttyUSB2; you will see the RISC-V Linux boot messages.

Enable H extension in RocketChip

vim path/to/repo/common/src/main/scala/Configs.scala
// change
class UintrConfig extends Config(
  new WithNBigCores(4) ++
    new WithNExtTopInterrupts(6) ++
    new WithTimebase((BigInt(10000000))) ++ // 10 MHz
    new WithDTS("freechips.rocketchip-unknown", Nil) ++
    new WithUIPI ++
    new WithCustomBootROM(0x10000, "../common/boot/bootrom/bootrom.img") ++
    new WithDefaultMemPort ++
    new WithDefaultMMIOPort ++
    new WithDefaultSlavePort ++
    new WithoutTLMonitors ++
    new WithCoherentBusTopology ++
    new BaseSubsystemConfig
)

// to

class UintrConfig extends Config(
  new WithHypervisor ++
  new WithNBigCores(4) ++
    new WithNExtTopInterrupts(6) ++
    new WithTimebase((BigInt(10000000))) ++ // 10 MHz
    new WithDTS("freechips.rocketchip-unknown", Nil) ++
    new WithUIPI ++
    new WithCustomBootROM(0x10000, "../common/boot/bootrom/bootrom.img") ++
    new WithDefaultMemPort ++
    new WithDefaultMMIOPort ++
    new WithDefaultSlavePort ++
    new WithoutTLMonitors ++
    new WithCoherentBusTopology ++
    new BaseSubsystemConfig
)

Booting hvisor on Loongson 3A5000 motherboard (7A2000)

Han Yulu wheatfox17@icloud.com

Updated: 2025.3.24

Step 1: Obtain hvisor source code and compile

First, install the loongarch64-unknown-linux-gnu- toolchain. Download and extract it from https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/download/8.0/loongarch64-clfs-8.0-cross-tools-gcc-full.tar.xz, then add the cross-tools/bin directory to your PATH environment variable so that tools such as loongarch64-unknown-linux-gnu-gcc can be invoked directly from the shell.

Then clone the code locally:

git clone -b dev https://github.com/syswonder/hvisor
make BID=loongarch64/ls3a5000

After compiling, you can find the stripped hvisor.bin file in the target directory.

Step 2 (without compiling buildroot/linux, etc.): Obtain rootfs/kernel image

Please download the latest released hvisor default Loongson Linux image from https://github.com/enkerewpo/linux-hvisor-loongarch64/releases (it contains the root linux kernel, root linux dtb, and root linux rootfs; the root linux rootfs in turn contains the non-root linux kernel, non-root dtb, and non-root rootfs). The rootfs already ships with the non-root startup JSON, hvisor-tool, kernel modules, and so on.

Step 2 (compiling buildroot/linux, etc. yourself): Fully compile rootfs/kernel image

If you need to compile it yourself, this process will be more complex, and the details are as follows:

1. Prepare the environment

Create a working directory (optional):

mkdir workspace && cd workspace

git clone -b dev https://github.com/syswonder/hvisor
git clone https://github.com/enkerewpo/buildroot-loongarch64
git clone https://github.com/enkerewpo/linux-hvisor-loongarch64 hvisor-la64-linux
git clone https://github.com/enkerewpo/hvisor-tool
git clone https://github.com/enkerewpo/hvisor_uefi_packer

2. Prepare the buildroot environment

Since buildroot will download source code packages from various places when it cannot find the package to compile, I have prepared a pre-downloaded image:

https://pan.baidu.com/s/1sVPRt0JiExUxFm2QiCL_nA?pwd=la64

After downloading, place the dl directory in the root directory of buildroot-loongarch64, or you can let buildroot download it automatically (which may be very slow). If you still need to download packages after extracting the dl directory, that is normal.

3. Compile buildroot

cd buildroot-loongarch64
make loongson3a5000_hvisor_defconfig

make menuconfig # Please set Toolchain/Toolchain path prefix to your local loongarch64 toolchain path and prefix
# Then select save in the bottom right corner to save to the .config file

make -j$(nproc)

Please note

This process may take several hours, depending on your machine performance and network environment.

4. Compile linux for the first time (to prepare for subsequent make world)

cd hvisor-la64-linux # Currently using linux 6.13.7 by default
./build def # Generate the default root linux defconfig
# ./build nonroot_def # Generate the default nonroot linux defconfig

# ./build menuconfig # If you want to customize the kernel configuration, you can use this command
# (It will modify the .config file in the current directory, please be aware of whether you are modifying the root linux or nonroot linux configuration,
# You can check the content of the .flag file in the root directory is ROOT or NONROOT)

./build kernel # Compile the kernel corresponding to the current .config (may be root linux
# or nonroot linux, depending on ./build def and ./build nonroot_def)

Please note

This process may take several tens of minutes, depending on your machine performance.

5. Execute the make world process through hvisor uefi packer

First, you need to modify the Makefile.1 file in the hvisor_uefi_packer directory, changing variables like HVISOR_LA64_LINUX_DIR to the actual paths:

HVISOR_LA64_LINUX_DIR = ../hvisor-la64-linux
BUILDROOT_DIR = ../buildroot-loongarch64
HVISOR_TOOL_DIR = ../hvisor-tool

Then run:

cd hvisor_uefi_packer
./make_world

A brief introduction to the make_world script process, for specific commands please refer to the Makefile.1 file:

  1. Compile hvisor-tool. Because the hvisor-tool kernel module must match the root linux kernel version, root linux has to be compiled manually once first; only then can make world build hvisor-tool successfully.
  2. Copy the relevant hvisor-tool files into the buildroot rootfs overlay, located at $(BUILDROOT_DIR)/board/loongson/ls3a5000/rootfs_ramdisk_overlay.
  3. Compile nonroot linux (nonroot currently uses a simple busybox rootfs rather than buildroot). Note that the generated vmlinux embeds the nonroot dtb and the busybox rootfs (initramfs); move vmlinux.bin into the buildroot rootfs overlay. Remember the entry address of this nonroot linux vmlinux; you can later write it into the linux2.json file in the buildroot overlay.
  4. Compile the buildroot rootfs; this time the rootfs includes the previously compiled nonroot linux vmlinux as well as the hvisor-tool files.
  5. Compile root linux. The generated vmlinux embeds the root linux dtb and the buildroot rootfs (initramfs); record this root linux vmlinux entry address and file path, which will be used later by hvisor and the hvisor uefi packer.
  6. Done. What we ultimately need is this root linux vmlinux.bin.

Step 3: Compile UEFI image

Since motherboards for the 3A5000 and later 3-series CPUs boot via UEFI, hvisor can only be booted as an EFI image.

Continuing from the previous step, in the hvisor uefi packer directory, first modify the ./make_image script's HVISOR_SRC_DIR to the actual path where you saved the hvisor source code, then run the compilation script:

make menuconfig # Configure for your local loongarch64 gcc toolchain prefix, hvisor.bin path, vmlinux.bin path

# 1. Modify make_image's HVISOR_SRC_DIR=../hvisor to your actual saved hvisor source code path, then run the script
# 2. Modify BOARD=ls3a5000/ls3a6000 (choose according to your actual board model), the BOARD in the env mentioned later is the same

# ./make_world # See the previous step's description, this step can be skipped if you do not need to recompile buildroot/linux

ARCH=loongarch64 BOARD=ls3a5000 ./make_image
# make_image only compiles hvisor and BOOTLOONGARCH64.EFI

At this point, BOOTLOONGARCH64.EFI will be generated in the hvisor_uefi_packer directory, place it in the /EFI/BOOT/BOOTLOONGARCH64.EFI location of the first FAT32 partition of the USB drive.
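
A possible way to place the file, assuming the USB drive's first FAT32 partition shows up as /dev/sdX1 (substitute your actual device):

sudo mount /dev/sdX1 /mnt
sudo mkdir -p /mnt/EFI/BOOT
sudo cp BOOTLOONGARCH64.EFI /mnt/EFI/BOOT/BOOTLOONGARCH64.EFI
sudo umount /mnt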

Please note

When you compile root and nonroot linux yourself, please manually use readelf to obtain the entry addresses of the two vmlinux files, and write them correspondingly in board.rs and linux2.json, otherwise it will definitely fail to boot.
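
For example, using the cross toolchain installed in Step 1 (the vmlinux path is a placeholder):

# Print the ELF header and pick out the entry point address
loongarch64-unknown-linux-gnu-readelf -h path/to/vmlinux | grep "Entry point"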

Step 4: Boot on board

Power on the motherboard, press F12 to enter the UEFI Boot Menu, select to boot from the USB drive, and you will enter hvisor, then automatically enter root linux.

Start nonroot

If you are using the related images provided in the release, after booting, enter in the bash of root linux:

./daemon.sh
./start.sh # Start nonroot, then please manually run screen /dev/pts/0
./start.sh -s # Start nonroot and automatically enter screen

Afterward, nonroot will start automatically (the related configuration files are located in the /tool directory of root linux, including the nonroot zone configuration JSON passed to hvisor-tool and the virtio configuration JSON files). A screen session connected to the nonroot linux virtio-console then opens automatically, and you will see a bash prompt labeled nonroot. During screen you can press CTRL+A D to detach (note the displayed screen session name / ID); this returns you to root linux. To go back to nonroot linux, run

screen -r {the full name of the session just now or just enter the ID at the front}

Afterward, you will return to the bash of nonroot linux.

This section mainly covers ZCU102 and introduces the following:

  1. How to use Qemu to simulate Xilinx ZynqMP ZCU102
  2. How to boot hvisor root linux and nonroot linux on Qemu ZCU102 and ZCU102 physical development board.

Qemu ZCU102 hvisor Startup

Install Petalinux

  1. Install Petalinux 2024.1. This article uses 2024.1 as the example; other versions may also work, but they have not been verified, and testing has shown that Petalinux depends strongly on the host operating system. Install the Petalinux version suited to your own OS.
  2. Place the downloaded petalinux.run file in the directory where you want to install it, add execution permissions, and then directly run the installer with ./petalinux.run.
  3. The installer will automatically detect the required environment, and if it does not meet the requirements, it will prompt for the missing environment, just apt install them one by one.
  4. After installation, you need to enter the installation directory and manually source settings.sh to add environment variables before using Petalinux each time. If it is too troublesome, you can add this command to ~/.bashrc.

Install ZCU102 BSP

  1. Download the BSP corresponding to the Petalinux version, in the example it is ZCU102 BSP 2024.1
  2. Activate the Petalinux environment, i.e., in the Petalinux installation directory, source settings.sh.
  3. Create a Petalinux Project based on the BSP: petalinux-create -t project -s xilinx-zcu102-v2024.1-05230256.bsp
  4. This will create a xilinx-zcu102-2024.1 folder, which contains the parameters required for QEMU to simulate ZCU102 (device tree), as well as precompiled Linux images, device trees, Uboot, etc., that can be directly loaded onto the board.

Compile Hvisor

Refer to "Running Hvisor on Qemu" to configure the environment required for compiling Hvisor, then in the hvisor directory, execute:

make ARCH=aarch64 LOG=info BOARD=zcu102 cp

to perform the compilation. The required hvisor image will be at target/aarch64-unknown-none/debug/hvisor (the exact target directory may vary).

Prepare Device Tree

Use Existing Device Tree

In the Hvisor image/devicetree directory, there is a zcu102-root-aarch64.dts, which is a device tree file that has been tested for booting RootLinux. Compile it as follows:

dtc -I dts -O dtb -o zcu102-root-aarch64.dtb zcu102-root-aarch64.dts

If the dtc command is invalid, install device-tree-compiler.

sudo apt-get install device-tree-compiler

Prepare Device Tree Yourself

If you have custom requirements for the device, it is recommended to prepare the device tree yourself. You can decompile the pre-built/linux/images/system.dtb in the ZCU102 BSP to get the complete device tree, based on zcu102-root-aarch64.dts for additions and deletions.
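
Decompiling can be done with dtc, for example:

dtc -I dtb -O dts -o system.dts pre-built/linux/images/system.dtb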

Prepare Image

Use Existing Image

It is recommended to directly use the pre-built/linux/images/Image from the ZCU102 BSP as the Linux kernel to boot on ZCU102, as its driver configuration is complete.

Compile Yourself

Through testing, it has been found that the support for ZYNQMP in the Linux source code before version 5.15 is not comprehensive, so it is not recommended to use versions before this for compilation. When compiling with later versions, you can follow the general compilation process as the basic support for ZYNQMP in the source code is enabled by default. Specific compilation operations are as follows:

  1. Visit the linux-xlnx official website to download the Linux source code, it is best to download zynqmp-soc-for-v6.3.
  2. tar -xvf zynqmp-soc-for-v6.3 to extract the source code.
  3. Enter the extracted directory, execute the following command using the default configuration, make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
  4. Compile: make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- Image -j$(nproc)
  5. After compilation, the directory arch/arm64/boot/Image will contain the required image.

Enable QEMU Simulation

  1. Activate the Petalinux environment, i.e., in the Petalinux installation directory, source settings.sh.
  2. Enter the xilinx-zcu102-2024.1 folder and use the following command to start hvisor on the QEMU-simulated ZCU102; modify the file paths according to your actual situation.
# QEMU parameter passing
petalinux-boot --qemu --prebuilt 2 --qemu-args '-device loader,file=hvisor,addr=0x40400000,force-raw=on -device loader,file=zcu102-root-aarch64.dtb,addr=0x40000000,force-raw=on -device loader,file=zcu102-root-aarch64.dtb,addr=0x04000000,force-raw=on -device loader,file=/home/hangqi-ren/Image,addr=0x00200000,force-raw=on -drive if=sd,format=raw,index=1,file=rootfs.ext4'
# Start hvisor
bootm 0x40400000 - 0x40000000

ZCU102 Board hvisor Multi-mode Boot

Booting Hvisor on ZCU102 Development Board in SD mode

Prepare SD Card

  1. Prepare a standard SD card, partition it into a Boot partition (FAT32) and the rest as file system partitions (EXT4). For partitioning in Windows, you can use DiskGenius, and for Linux, you can use fdisk or mkfs.
  2. Prepare a file system and copy its contents into any file system partition. You can refer to "NXPIMX8" for creating an Ubuntu file system or directly use the file system from the ZCU102 BSP.
  3. Copy zcu102-root-aarch64.dtb, Image, and hvisor to the Boot partition.
  4. In SD mode, ATF and U-Boot must be provided from the SD card; therefore, copy pre-built/linux/images/boot.scr and BOOT.BIN from the ZCU102 BSP to the BOOT partition.

Booting ZCU102

  1. Set the ZCU102 to SD mode, insert the SD card, connect the serial port, and power on.
  2. Press any key to interrupt the Uboot auto script execution and run the following commands to boot hvisor and root linux:
fatload mmc 0:1 0x40400000 hvisor;fatload mmc 0:1 0x40000000 zcu102-root-aarch64.dtb
fatload mmc 0:1 0x04000000 zcu102-root-aarch64.dtb;fatload mmc 0:1 0x00200000 Image;bootm 0x40400000 - 0x40000000
  3. If successfully booted, you will see hvisor and linux information on the serial port and eventually enter the file system.

Booting Hvisor on ZCU102 Development Board in Jtag mode

First, connect the two cables that come with the board to the JTAG and UART interfaces of the board, and the other end to the PC via USB.

Then, open a petalinux project in the command line, ensure the project has been compiled and has generated the corresponding boot files (vmlinux, BOOT.BIN, etc.), and then run from the project root directory:

petalinux-boot --jtag --prebuilt 2

Where prebuilt represents the boot level:

  • Level 1: Only download the FPGA bitstream, boot FSBL and PMUFW
  • Level 2: Download FPGA bitstream and boot UBOOT, and start FSBL, PMUFW, and TF-A (Trusted Firmware-A)
  • Level 3: Download and boot linux, and load or boot FPGA bitstream, FSBL, PMUFW, TF-A, UBOOT

Afterwards, JTAG downloads the corresponding files to the board (saving them to the designated memory addresses) and boots the corresponding bootloader. For the official default U-Boot script, refer to the boot.scr file in the project image directory.

Since hvisor requires a separate UBOOT command and a custom-made fitImage to boot, please refer to UBOOT FIT Image Creation, Loading, and Booting.

After creating the fitImage, replace the files in the petalinux images generation directory (Image.ub), so that JTAG loads our custom-made fitImage to the default FIT image load address configured in the petalinux project. This way, when JTAG boots, our fitImage will be loaded through the JTAG line to the corresponding address in the board memory, then extracted and booted through the uboot command line.

Another UART cable can be used to observe the output from the ZCU102 board (including FSBL, UBOOT, linux, etc.), which can be viewed through serial port tools such as screen, gtkterm, termius, or minicom.

Please Note

Since petalinux fixes certain memory addresses, such as the default load addresses for the Linux kernel, fitImage, and DTB (configurable when the petalinux project is built), and since we load and boot a custom-made fitImage, one known problem is that if the root linux dtb's load address in the ITS matches the petalinux-configured load address, the dtb will be overwritten by the default petalinux dtb, so root linux receives an incorrect dtb and fails to boot. Therefore, specify a load address different from the petalinux default dtb/fitImage load addresses at build time to prevent such issues.

References

[1] PetaLinux Tools Documentation: Reference Guide (UG1144). https://docs.amd.com/r/2023.1-English/ug1144-petalinux-tools-reference-guide/Booting-a-PetaLinux-Image-on-Hardware-with-JTAG
[2] Trusted Firmware-A Documentation. https://trustedfirmware-a.readthedocs.io/en/latest/

ZCU102 NonRoot Boot

  1. Use the Linux kernel source code used during the Root boot to compile hvisor-tool, and the detailed compilation process can be found in Readme.
  2. Prepare the virtio_cfg.json and zone1_linux.json needed to boot NonRoot. You can directly use the example/zcu102-aarch64 in the hvisor-tool directory, which has been verified to ensure it can boot.
  3. Prepare the Linux kernel Image, file system rootfs, and device tree linux1.dtb needed for NonRoot. The kernel and file system can be the same as Root's, and linux1.dtb can be configured as needed, or you can use images/aarch64/devicetree/zcu102-nonroot-aarch64.dts from the hvisor directory.
  4. Copy hvisor.ko, hvisor, virtio_cfg.json, zone1_linux.json, linux1.dtb, Image, and rootfs.ext4 into the file system used by Root Linux.
  5. Enter the following commands in RootLinux to start NonRoot:
# Load the kernel module
insmod hvisor.ko
# Create virtio device
nohup ./hvisor virtio start virtio_cfg.json &
# Start NonRoot based on the json configuration file
./hvisor zone start zone1_linux.json 
# View the output of NonRoot and interact.
screen /dev/pts/0

For more operation details, refer to hvisor-tool Readme

UBOOT FIT Image Creation, Loading, and Booting

wheatfox (wheatfox17@icloud.com)

This article introduces the basic knowledge related to FIT images, as well as how to create, load, and boot FIT images.

ITS Source File

ITS (Image Tree Source) is the source format used by U-Boot to generate FIT images (Flattened Image Tree); it uses Device Tree Source (DTS) syntax. FIT images are generated with the mkimage tool provided by U-Boot. In the ZCU102 port of hvisor, a FIT image is used to package hvisor, root linux, the root dtb, etc. into a single fitImage, which makes booting on QEMU and real hardware easier. The ITS file for the ZCU102 platform is located at scripts/zcu102-aarch64-fit.its:

/dts-v1/;
/ {
    description = "FIT image for HVISOR with Linux kernel, root filesystem, and DTB";
    images {
        root_linux {
            description = "Linux kernel";
            data = /incbin/("__ROOT_LINUX_IMAGE__");
            type = "kernel";
            arch = "arm64";
            os = "linux";
            ...
        };
        ...
        root_dtb {
            description = "Device Tree Blob";
            data = /incbin/("__ROOT_LINUX_DTB__");
            type = "flat_dt";
            ...
        };
        hvisor {
            description = "Hypervisor";
            data = /incbin/("__HVISOR_TMP_PATH__");
            type = "kernel";
            arch = "arm64";
            ...
        };
    };

    configurations {
        default = "config@1";
        config@1 {
            description = "default";
            kernel = "hvisor";
            fdt = "root_dtb";
        };
    };
};

Here, __ROOT_LINUX_IMAGE__, __ROOT_LINUX_DTB__, and __HVISOR_TMP_PATH__ are replaced with actual paths by the sed command in the Makefile. The ITS source is mainly divided into images and configurations sections: the images section defines the files to be packaged, and the configurations section defines how these files are combined. At U-Boot boot time, the files specified by the default configuration are automatically loaded to their designated addresses, and multiple configurations can be defined to support loading different image combinations at boot.

Makefile mkimage corresponding command:

.PHONY: gen-fit
gen-fit: $(hvisor_bin) dtb
	@if [ ! -f scripts/zcu102-aarch64-fit.its ]; then \
		echo "Error: ITS file scripts/zcu102-aarch64-fit.its not found."; \
		exit 1; \
	fi
	$(OBJCOPY) $(hvisor_elf) --strip-all -O binary $(HVISOR_TMP_PATH)
# now we need to create the vmlinux.bin
	$(GCC_OBJCOPY) $(ROOT_LINUX_IMAGE) --strip-all -O binary $(ROOT_LINUX_IMAGE_BIN)
	@sed \
		-e "s|__ROOT_LINUX_IMAGE__|$(ROOT_LINUX_IMAGE_BIN)|g" \
		-e "s|__ROOT_LINUX_ROOTFS__|$(ROOT_LINUX_ROOTFS)|g" \
		-e "s|__ROOT_LINUX_DTB__|$(ROOT_LINUX_DTB)|g" \
		-e "s|__HVISOR_TMP_PATH__|$(HVISOR_TMP_PATH)|g" \
		scripts/zcu102-aarch64-fit.its > temp-fit.its
	@mkimage -f temp-fit.its $(TARGET_FIT_IMAGE)
	@echo "Generated FIT image: $(TARGET_FIT_IMAGE)"

Booting hvisor and root linux through FIT image in petalinux qemu

Since the fitImage contains all the necessary files, QEMU only needs to load this single file to an appropriate address in memory through the loader device.

Then, once QEMU starts and enters U-Boot, you can boot with the following commands (adjust the addresses to your actual situation; in practice you can put everything on one line and paste it into U-Boot, or save it to the bootcmd environment variable, which requires U-Boot to have a persistent flash for environment storage):

setenv fit_addr 0x10000000; setenv root_linux_load 0x200000;
imxtract ${fit_addr} root_linux ${root_linux_load}; bootm ${fit_addr};

References

[1] Flat Image Tree (FIT). https://docs.u-boot.org/en/stable/usage/fit/

How to Compile

Compile using Docker

1. Install Docker

sudo snap install docker

You can also refer to the Docker Official Documentation to install Docker.

2. Build the Image

make build_docker

This step builds a Docker image, automatically compiling all required dependencies.

3. Run the Container

make docker

This step starts a container, mounts the current directory into the container, and enters the container's shell.

4. Compile

Execute the following command in the container to compile.

make all

Compile using the local environment

1. Install RustUp and Cargo

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | \
    sh -s -- -y --no-modify-path --profile minimal --default-toolchain nightly

2. Install the Toolchain

The toolchain currently used by the project includes the Rust toolchain specified in rust-toolchain.toml (with its components), the cross-compilation target for the target platform (e.g., aarch64-unknown-none), toml-cli, and cargo-binutils.

You can check whether these tools are already installed, or use the following commands to install them:

(1) Install toml-cli and cargo-binutils

cargo install toml-cli cargo-binutils

(2) Install the cross-compilation toolchain for the target platform

rustup target add aarch64-unknown-none

(3) Parse rust-toolchain.toml to install the Rust toolchain

RUST_VERSION=$(toml get -r rust-toolchain.toml toolchain.channel) && \
Components=$(toml get -r rust-toolchain.toml toolchain.components | jq -r 'join(" ")') && \
rustup install $RUST_VERSION && \
rustup component add --toolchain $RUST_VERSION $Components

(4) Compile

make all

How to Start Root Linux

QEMU

Install Dependencies

1. Install Dependencies

apt-get install -y jq wget build-essential \
 libglib2.0-0 libfdt1 libpixman-1-0 zlib1g \
 libfdt-dev libpixman-1-dev libglib2.0-dev \
 zlib1g-dev ninja-build

2. Download and Extract QEMU

wget https://download.qemu.org/qemu-7.0.0.tar.xz
tar -xvf qemu-7.0.0.tar.xz

3. Compile and Install QEMU

Here we only compile QEMU for emulating aarch64, if you need QEMU for other architectures, refer to QEMU Official Documentation.

cd qemu-7.0.0 && \
./configure --target-list=aarch64-softmmu,aarch64-linux-user && \
make -j$(nproc) && \
make install

4. Test if QEMU is Successfully Installed

qemu-system-aarch64 --version

Start Root Linux

1. Prepare Root File System and Kernel Image

Place the image file in hvisor/images/aarch64/kernel/, named Image.

Place the Root file system in hvisor/images/aarch64/virtdisk/, named rootfs1.ext4.
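
For example, run from the hvisor directory (the source paths are placeholders for wherever your compiled Image and rootfs1.ext4 live):

mkdir -p images/aarch64/kernel images/aarch64/virtdisk
cp /path/to/Image images/aarch64/kernel/Image
cp /path/to/rootfs1.ext4 images/aarch64/virtdisk/rootfs1.ext4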

2. Start QEMU

Execute the following command in the hvisor directory:

make run

3. Enter QEMU

QEMU will automatically load U-Boot; wait for it to finish loading, then enter bootm 0x40400000 - 0x40000000 to boot into Root Linux.

How to Start NonRoot Linux

Hvisor has properly handled the startup of NonRoot, making it relatively simple, as follows:

  1. Prepare the kernel image, device tree, and file system for NonRoot Linux. Place the kernel and device tree in the file system of Root Linux.

  2. Specify the serial port used by this NonRoot Linux and the file system to be mounted in the device tree file for NonRoot Linux, as shown in the example below:

	chosen {
		bootargs = "clk_ignore_unused console=ttymxc3,115200 earlycon=ec_imx6q3,0x30a60000,115200 root=/dev/mmcblk3p2 rootwait rw";
		stdout-path = "/soc@0/bus@30800000/serial@30a60000";
	};
  3. Compile the kernel module and command line tools for Hvisor and place them in the file system of Root Linux.

  4. Start Hvisor's Root Linux and inject the kernel module that was just compiled:

insmod hvisor.ko
  5. Use the command line tool, here assumed to be named hvisor, to start NonRoot Linux.
./hvisor zone start --kernel <kernel image>,addr=0x70000000 --dtb <device tree file>,addr=0x91000000 --id <virtual machine number (starting from 1)>
  6. After NonRoot Linux has started, open the specified serial port to use it.

Configuration and Management of Zones

The hvisor project, as a lightweight hypervisor, uses a Type-1 architecture that allows multiple virtual machines (zones) to run directly on top of hardware. Below is a detailed explanation of the key points for zone configuration and management:

Resource Allocation

Resources such as CPU, memory, devices, and interrupts are statically allocated to each zone, meaning that once allocated, these resources are not dynamically scheduled between zones.

Root Zone Configuration

The configuration of the root zone is hardcoded within hvisor, written in Rust, and represented as a C-style structure HvZoneConfig. This structure contains key information such as zone ID, number of CPUs, memory regions, interrupt information, physical addresses and sizes of the kernel and device tree binary (DTB).

Non-root Zones Configuration

The configuration of non-root zones is stored in the root Linux file system, usually represented in JSON format. For example:

    {
        "arch": "arm64",
        "zone_id": 1,
        "cpus": [2, 3],
        "memory_regions": [
            {
                "type": "ram",
                "physical_start": "0x50000000",
                "virtual_start":  "0x50000000",
                "size": "0x30000000"
            },
            {
                "type": "io",
                "physical_start": "0x30a60000",
                "virtual_start":  "0x30a60000",
                "size": "0x1000"
            },
            {
                "type": "virtio",
                "physical_start": "0xa003c00",
                "virtual_start":  "0xa003c00",
                "size": "0x200"
            }
        ],
        "interrupts": [61, 75, 76, 78],
        "kernel_filepath": "./Image",
        "dtb_filepath": "./linux2.dtb",
        "kernel_load_paddr": "0x50400000",
        "dtb_load_paddr":   "0x50000000",
        "entry_point":      "0x50400000"
    }
  • The arch field specifies the target architecture (e.g., arm64).
  • cpus is a list that indicates the CPU core IDs allocated to the zone.
  • memory_regions describe different types of memory regions and their physical and virtual start addresses and sizes.
  • interrupts list the interrupt numbers allocated to the zone.
  • kernel_filepath and dtb_filepath indicate the paths of the kernel and device tree binary files, respectively.
  • kernel_load_paddr and dtb_load_paddr are the physical memory load addresses for the kernel and device tree binary.
  • entry_point specifies the kernel's entry point address.

The management tool of root Linux is responsible for reading the JSON configuration file and converting it into a C-style structure, which is then passed to hvisor to start the non-root zones.

Command Line Tool

The command line tool is an auxiliary management tool for hvisor, used to create and shut down other virtual machines on the managed virtual machine Root Linux, and is responsible for starting the Virtio daemon, providing Virtio device emulation. The repository is located at hvisor-tool. For specific usage, please see the README.

Using VirtIO Devices

For specific usage tutorials, please refer to: hvisor-tool-README

hvisor Overall Architecture

  • CPU Virtualization

    • Architecture Compatibility: Supports architectures such as aarch64, riscv64, and loongarch, with dedicated CPU virtualization components for each architecture.
    • CPU Allocation: Uses static allocation method, pre-determining the CPU resources for each virtual machine.
  • Memory Virtualization

    • Two-stage Page Table: Utilizes two-stage page table technology to optimize the memory virtualization process.
  • Interrupt Virtualization

    • Interrupt Controller Virtualization: Supports virtualization of different architecture's interrupt controllers like ARM GIC and RISC-V PLIC.
    • Interrupt Handling: Manages the transmission and processing flow of interrupt signals.
  • I/O Virtualization

    • IOMMU Integration: Supports IOMMU to enhance the efficiency and security of DMA virtualization.
    • VirtIO Standard: Follows the VirtIO specification, providing high-performance virtual devices.
    • PCI Virtualization: Implements PCI virtualization, ensuring virtual machines can access physical or virtual I/O devices.

Initialization Process of hvisor

Abstract: This article introduces the relevant knowledge involved in running hvisor on qemu and the initialization process of hvisor. Starting from the launch of qemu, the entire process is tracked, and after reading this article, you will have a general understanding of the initialization process of hvisor.

Boot Process of qemu

The boot process of the computer emulated by qemu: after the necessary files are loaded into memory, the PC register is initialized to 0x1000, a few instructions are executed from there, and then execution jumps to 0x80000000 to run the bootloader (the aarch64 part of hvisor uses Uboot). After executing a few more instructions, Uboot jumps to the kernel start address it recognizes.

Generate the executable file of hvisor

rust-objcopy --binary-architecture=aarch64 target/aarch64-unknown-none/debug/hvisor --strip-all -O binary target/aarch64-unknown-none/debug/hvisor.bin.tmp

Strip the hvisor ELF executable and convert it into a raw binary, saved as hvisor.bin.tmp.

Generate an image file recognizable by uboot

Uboot is a bootloader whose main task is to jump to the first instruction of the hvisor image and start execution, so it is necessary to ensure that the generated hvisor image is recognizable by uboot. Here, the mkimage tool is needed.

mkimage -n hvisor_img -A arm64 -O linux -C none -T kernel -a 0x40400000 -e 0x40400000 -d target/aarch64-unknown-none/debug/hvisor.bin.tmp target/aarch64-unknown-none/debug/hvisor.bin
  • -n hvisor_img: Specify the name of the kernel image.
  • -A arm64: Specify the architecture as ARM64.
  • -O linux: Specify the operating system as Linux.
  • -C none: Do not use compression algorithms.
  • -T kernel: Specify the type as kernel.
  • -a 0x40400000: Specify the loading address as 0x40400000.
  • -e 0x40400000: Specify the entry address as 0x40400000.
  • -d target/aarch64-unknown-none/debug/hvisor.bin.tmp: Specify the input file as the previously generated temporary binary file.
  • The last parameter is the output file name, i.e., the final kernel image file hvisor.bin.

Initialization Process

To understand how hvisor is executed, we first look at the link script aarch64.ld, which gives us a general understanding of the execution process of hvisor.

ENTRY(arch_entry)
BASE_ADDRESS = 0x40400000;

The first line sets the program entry arch_entry, which can be found in arch/aarch64/entry.rs, introduced later.

.text : {
        *(.text.entry)
        *(.text .text.*)
    }

We make the .text segment the first segment, and place the .text.entry containing the first instruction of the entry at the beginning of the .text segment, ensuring that hvisor indeed starts execution from the 0x40400000 location agreed with qemu.

Here we also need to remember a symbol called __core_end, which marks the end address of the hvisor image in the link script; its role will become clear during the startup process.

arch_entry

With the above prerequisites, we can step into the first instruction of hvisor, which is arch_entry().

// src/arch/aarch64/entry.rs

pub unsafe extern "C" fn arch_entry() -> i32 {
    unsafe {
        core::arch::asm!(
            "
            // x0 = dtbaddr
            mov x1, x0
            mrs x0, mpidr_el1
            and x0, x0, #0xff
            ldr x2, =__core_end          // x2 = &__core_end
            mov x3, {per_cpu_size}      // x3 = per_cpu_size
            madd x4, x0, x3, x3       // x4 = cpuid * per_cpu_size + per_cpu_size
            add x5, x2, x4
            mov sp, x5           // sp = &__core_end + (cpuid + 1) * per_cpu_size
            b {rust_main}             // x0 = cpuid, x1 = dtbaddr
            ",
            options(noreturn),
            per_cpu_size=const PER_CPU_SIZE,
            rust_main = sym crate::rust_main,
        );
    }
}

First, look at the inline assembly. The first instruction, mov x1, x0, copies the value of the x0 register into x1; x0 holds the address of the device tree. The ARM machine that qemu emulates has various devices (serial ports, displays, storage, and so on), and each of them is accessed through specific addresses. The device tree records these access addresses, and the hypervisor, as the manager of all software running on the machine, naturally needs this information. By convention, Uboot places the device tree address in x0 before jumping into the kernel.

In mrs x0, mpidr_el1, mrs is an instruction to access system-level registers, which means to send the contents of the system register mpidr_el1 to x0. mpidr_el1 contains information about which CPU we are currently dealing with (the computer supports multi-core CPUs), and there will be a lot of cooperation work with the CPU later, so we need to know which CPU is currently in use. This register contains a lot of information about the CPU, and we currently need to use the lower 8 bits to extract the corresponding CPU id, which is what the instruction and x0, x0, #0xff is doing.

ldr x2, = __core_end, at the end of the link script, we set a symbol __core_end as the end address of the entire hvisor program space, and put this address into x2.

mov x3, {per_cpu_size} puts the size of each CPU's region into x3. The {xxx} syntax substitutes a value defined outside the assembly into the code: the parameter per_cpu_size = const PER_CPU_SIZE binds an external constant to the name used in the assembly, while the sym parameter indicates a symbol that is defined elsewhere.

per_cpu_size: within this per-CPU region, the CPU's registers can be saved and restored, and it also contains the CPU's stack.

madd x4, x0, x3, x3 is a multiply-add instruction: x4 = cpu_id * per_cpu_size + per_cpu_size. At this point x4 holds the offset from __core_end to the top of the current CPU's per-CPU region (CPU IDs start from 0, hence the extra per_cpu_size).

add x5, x2, x4 adds the end address of hvisor and this offset, putting the result in x5.

mov sp, x5 sets sp to the top of the current CPU's stack.

b {rust_main} jumps to rust_main to start execution; this assembly never returns, which corresponds to options(noreturn).
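
Putting the last few instructions together, the stack-top computation can be written out as plain Rust. This is only a sketch: PER_CPU_SIZE below is a placeholder value, the real constant is defined in hvisor's sources.

// Sketch of the sp computation performed by the assembly above.
const PER_CPU_SIZE: usize = 512 * 1024; // placeholder value

fn stack_top(core_end: usize, cpuid: usize) -> usize {
    // sp = &__core_end + (cpuid + 1) * PER_CPU_SIZE
    core_end + (cpuid + 1) * PER_CPU_SIZE
}

For example, CPU 2 owns the region [core_end + 2 * PER_CPU_SIZE, core_end + 3 * PER_CPU_SIZE), and its sp starts at the top of that region.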

Enter rust_main()

fn rust_main(cpuid: usize, host_dtb: usize)

Entering rust_main requires two parameters, which are passed through x0 and x1. Remember that in the previous entry, our x0 stored the cpu_id and x1 stored the device tree information.

install_trap_vector()

When the processor encounters an exception or interrupt, it needs to jump to the corresponding location for processing. Here, the corresponding jump addresses are set (which can be considered as setting a table) for handling exceptions at the Hypervisor level. Each privilege level has its own corresponding exception vector table, except for EL0, the application privilege level, which must jump to other privilege levels to handle exceptions. The VBAR_ELn register is used to store the base address of the exception vector table for the ELn privilege level.

extern "C" {
    fn _hyp_trap_vector();
}

pub fn install_trap_vector() {
    // Set the trap vector.
    VBAR_EL2.set(_hyp_trap_vector as _)
}

VBAR_EL2.set() sets the address of _hyp_trap_vector() as the base address of the exception vector table for the EL2 privilege level.

_hyp_trap_vector() This assembly code constructs the exception vector table.

Simple Introduction to the Exception Vector Table Format

Based on the level of the exception and whether the level of handling the exception remains the same, it is divided into two categories. If the level does not change, it is divided into two groups based on whether the current level's SP is used. If the exception level changes, it is divided into two groups based on whether the execution mode is 64-bit/32-bit. Thus, the exception vector table is divided into 4 groups. In each group, each table entry represents an entry point for handling a specific type of exception.

Main CPU

static MASTER_CPU: AtomicI32 = AtomicI32::new(-1);

let mut is_primary = false;
    if MASTER_CPU.load(Ordering::Acquire) == -1 {
        MASTER_CPU.store(cpuid as i32, Ordering::Release);
        is_primary = true;
        println!("Hello, HVISOR!");
        #[cfg(target_arch = "riscv64")]
        clear_bss();
    }

static MASTER_CPU: AtomicI32 In this, AtomicI32 indicates that this is an atomic type, meaning its operations are either successful or fail without any intermediate state, ensuring safe access in a multi-threaded environment. In short, it is a very safe i32 type.

MASTER_CPU.load() performs a read. The Ordering::Acquire parameter means that any writes made before this read (released by other cores) must be visible first; in short, it ensures the data has been correctly updated before it is read.

If it reads -1, the initial value from the definition, then no main CPU has been chosen yet, so the current cpu_id is stored as the main CPU. Correspondingly, Ordering::Release guarantees that all earlier writes are completed before this store becomes visible to other cores.

Common Data Structure for CPUs: PerCpu

hvisor supports different architectures, and a reasonable system design should allow different architectures to use a unified interface for easy description of each part's work. PerCpu is such a general CPU description.

pub struct PerCpu {
    pub id: usize,
    pub cpu_on_entry: usize,
    pub arch_cpu: ArchCpu,
    pub zone: Option<Arc<RwLock<Zone>>>,
    pub ctrl_lock: Mutex<()>,
    pub boot_cpu: bool,
    // percpu stack
}

For each field of PerCpu:

  • id: CPU sequence number
  • cpu_on_entry: The address of the first instruction when the CPU enters EL1, also known as the guest. Only when this CPU is the boot CPU will it be set to a valid value. Initially, we set it to an inaccessible address.
  • arch_cpu: CPU description related to the architecture. The behavior is initiated by PerCpu, and the specific executor is arch_cpu.
    • cpu_id
    • psci_on: Whether the cpu is started
  • zone: zone actually represents a guestOS. For the same guestOS, multiple CPUs may serve it.
  • ctrl_lock: Set for concurrent safety.
  • boot_cpu: For a guestOS, it distinguishes the main/secondary cores serving it. boot_cpu indicates whether the current CPU is the main core for a guest.

Main Core Wakes Up Other Cores

if is_primary {
        wakeup_secondary_cpus(cpu.id, host_dtb);
}

fn wakeup_secondary_cpus(this_id: usize, host_dtb: usize) {
    for cpu_id in 0..MAX_CPU_NUM {
        if cpu_id == this_id {
            continue;
        }
        cpu_start(cpu_id, arch_entry as _, host_dtb);
    }
}

pub fn cpu_start(cpuid: usize, start_addr: usize, opaque: usize) {
    psci::cpu_on(cpuid as u64 | 0x80000000, start_addr as _, opaque as _).unwrap_or_else(|err| {
        if let psci::error::Error::AlreadyOn = err {
        } else {
            panic!("can't wake up cpu {}", cpuid);
        }
    });
}

If the current CPU is the main CPU, it will wake up other secondary cores, and the secondary cores execute cpu_start. In cpu_start, cpu_on actually calls the SMC instruction in call64, falling into EL3 to perform the action of waking up the CPU.

From the declaration of cpu_on, we can roughly guess its function: to wake up a CPU, which will start executing from the location arch_entry. This is because multi-core processors communicate and cooperate with each other, so CPU consistency must be ensured. Therefore, the same entry should be used to start execution to maintain synchronization. This can be verified by the following few lines of code.

    ENTERED_CPUS.fetch_add(1, Ordering::SeqCst);
    wait_for(|| PerCpu::entered_cpus() < MAX_CPU_NUM as _);
    assert_eq!(PerCpu::entered_cpus(), MAX_CPU_NUM as _);

Among them, ENTERED_CPUS.fetch_add(1, Ordering::SeqCst) represents increasing the value of ENTERED_CPUS in sequence consistency. After each CPU executes once, this assert_eq macro should pass smoothly.

Things the Main Core Still Needs to Do: primary_init_early()

Initialize Logging

  1. Creation of a global log recorder
  2. Setting of the log level filter, whose main purpose is to decide which log messages should be recorded and output.

Initialize Heap Space and Page Tables

  1. A space in the .bss segment is allocated as heap space, and the allocator is set up.
  2. Set up the page frame allocator.

Parse Device Tree Information

Parse the device tree information based on the device tree address in the rust_main parameter.

Create a GIC Instance

Instantiate a global static variable GIC, an instance of the Generic Interrupt Controller.

Initialize hvisor's Page Table

This page table only handles the hypervisor's own VA-to-PA translation (think of hvisor as the kernel here, with the guests playing the role of applications).

Create a zone for each VM

zone_create(zone_id, TENANTS[zone_id] as _, DTB_IPA);

zone_create(vmid: usize, dtb_ptr: *const u8, dtb_ipa: usize) -> Arc<RwLock<Zone>>

zone actually represents a guestVM, containing various information that a guestVM might use. Observing the function parameters, dtb_ptr is the address of the device information that the hypervisor wants this guest to see, which can be seen in images/aarch64/devicetree. The role of dtb_ipa is that each guest will obtain this address from the CPU's x0 register to find the device tree information, so it is necessary to ensure that this IPA will map to the guest's dtb address during the construction of the stage2 page table. In this way, the guest is informed about the type of machine it is running on, the starting address of the physical memory, the number of CPUs, etc.

let guest_fdt = unsafe { fdt::Fdt::from_ptr(dtb_ptr) }.unwrap();
    let guest_entry = guest_fdt
        .memory()
        .regions()
        .next()
        .unwrap()
        .starting_address as usize;

The above content, by parsing the device tree information, obtained guest_entry, which is the starting address of the physical address that this guest can see. In the qemu startup parameters, we can also see where a guest image is loaded into memory, and these two values are equal.

Next, the stage-2 page table, MMIO mapping, and IRQ bitmap for this guest will be constructed based on the dtb information.

guest_fdt.cpus().for_each(|cpu| {
        let cpu_id = cpu.ids().all().next().unwrap();
        zone.cpu_set.set_bit(cpu_id as usize);
});

pub fn set_bit(&mut self, id: usize) {
    assert!(id <= self.max_cpu_id);
    self.bitmap |= 1 << id;
}

The above code records the id of the CPU allocated to this zone in the bitmap according to the CPU information given in the dtb.

let new_zone_pointer = Arc::new(RwLock::new(zone));
    {
        cpu_set.iter().for_each(|cpuid| {
            let cpu_data = get_cpu_data(cpuid);
            cpu_data.zone = Some(new_zone_pointer.clone());
            //chose boot cpu
            if cpuid == cpu_set.first_cpu().unwrap() {
                cpu_data.boot_cpu = true;
            }
            cpu_data.cpu_on_entry = guest_entry;
        });
    }
  

The task completed by the above code is: Traverse the CPUs allocated to this zone, obtain the mutable reference of the PerCpu of that CPU, modify its zone member variable, and mark the first CPU allocated to this zone as boot_cpu. Also, set the address of the first instruction after this zone's main CPU enters the guest as guest_entry.

The main core's early initialization work stops here, which it marks with INIT_EARLY_OK.store(1, Ordering::Release); the other CPUs must wait, via wait_for_counter(&INIT_EARLY_OK, 1), until the main core has finished. A minimal sketch of this wait pattern is shown below.
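
The sketch assumes an AtomicU32 counter and mirrors the idea rather than hvisor's exact wait_for_counter implementation:

use core::sync::atomic::{AtomicU32, Ordering};

pub static INIT_EARLY_OK: AtomicU32 = AtomicU32::new(0);

// Secondary cores spin until the primary core stores the expected value.
pub fn wait_for_counter(counter: &AtomicU32, expect: u32) {
    while counter.load(Ordering::Acquire) < expect {
        core::hint::spin_loop();
    }
}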

Address Space Initialization

The previous section mentioned IPA and PA, which are actually part of the address space. Specific content will be provided in the memory management document, and here is a brief introduction.

If Hypervisor is not considered, guestVM, as a kernel, will perform memory management work, which is the process from the application program's virtual address VA to the kernel's PA. In this case, the PA is the actual physical memory address.

When the Hypervisor is considered, the Hypervisor also plays the role of a kernel and performs memory management work, only this time its "applications" are the guestVMs, and a guestVM is not aware of the Hypervisor's existence (otherwise the guestVM itself would have to be changed, which goes against our goal of keeping performance close to native). We call the PA seen by a guestVM the IPA (Intermediate Physical Address) or GPA (Guest Physical Address), because it is not the real physical address: it still has to go through the stage-2 translation controlled by the Hypervisor before reaching the actual PA.

PerCPU Structure

In the architecture of hvisor, the PerCpu structure plays a core role, used to implement local state management for each CPU core and support CPU virtualization. Below is a detailed introduction to the PerCpu structure and related functions:

PerCpu Structure Definition

The PerCpu structure is designed as a container for each CPU core to store its specific data and state. Its layout is as follows:

#[repr(C)]
pub struct PerCpu {
    pub id: usize,
    pub cpu_on_entry: usize,
    pub dtb_ipa: usize,
    pub arch_cpu: ArchCpu,
    pub zone: Option<Arc<RwLock<Zone>>>,
    pub ctrl_lock: Mutex<()>,
    pub boot_cpu: bool,
    // percpu stack
}

The field definitions are as follows:

    id: Identifier of the CPU core.
    cpu_on_entry: An address used to track the CPU's entry state, initialized to INVALID_ADDRESS, indicating an invalid address.
    dtb_ipa: Physical address of the device tree binary, also initialized to INVALID_ADDRESS.
    arch_cpu: A reference to the ArchCpu type, which contains architecture-specific CPU information and functions.
    zone: An optional Arc<RwLock<Zone>> type, representing the virtual machine (zone) currently running on the CPU core.
    ctrl_lock: A mutex used to control access and synchronize PerCpu data.
    boot_cpu: A boolean value indicating whether it is the boot CPU.

Construction and Operation of PerCpu

    PerCpu::new: This function creates and initializes the PerCpu structure. It first calculates the virtual address of the structure, then safely writes the initialization data. For the RISC-V architecture, it also updates the CSR_SSCRATCH register to store the pointer to ArchCpu.
    run_vm: When this method is called, if the current CPU is not the boot CPU, it will first be put into an idle state, then run the virtual machine.
    entered_cpus: Returns the number of CPU cores that have entered the virtual machine running state.
    activate_gpm: Activates the GPM (Guest Page Management) of the associated zone.

Obtaining PerCpu Instances

    get_cpu_data: Provides a method to obtain a PerCpu instance based on CPU ID.
    this_cpu_data: Returns the PerCpu instance of the currently executing CPU.
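
Based on the per-CPU region layout described in the arch_entry section, a hedged sketch of how such lookups could work is shown below; the assumption that each PerCpu sits at the base of its per-CPU region, and the helper this_cpu_id(), are for illustration only and do not reproduce hvisor's exact code.

extern "C" {
    fn __core_end(); // end of the hvisor image, start of the per-CPU regions
}

// Hypothetical helper: compute the base of a CPU's region and reinterpret it as PerCpu.
fn get_cpu_data<'a>(cpuid: usize) -> &'a mut PerCpu {
    let base = __core_end as usize + cpuid * PER_CPU_SIZE;
    unsafe { &mut *(base as *mut PerCpu) }
}

fn this_cpu_data<'a>() -> &'a mut PerCpu {
    // this_cpu_id() would read the current CPU id (e.g. from MPIDR_EL1 or mhartid).
    get_cpu_data(this_cpu_id())
}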

CPU Virtualization on AArch64

CPU Boot Mechanism

Under the AArch64 architecture, hvisor uses the psci::cpu_on() function to wake up a specified CPU core, bringing it from a shutdown state to a running state. This function takes the CPU's ID, boot address, and an opaque parameter as input. If an error occurs, such as the CPU already being awake, the function will handle the error appropriately to avoid redundant wake-ups.

CPU Virtualization Initialization and Operation

The ArchCpu structure encapsulates architecture-specific CPU information and functionalities, and its reset() method is responsible for setting the CPU to the initial state of virtualization mode. This includes:

  • Setting the ELR_EL2 register to the specified entry point
  • Configuring the SPSR_EL2 register
  • Clearing the general registers
  • Resetting the virtual machine-related registers
  • activate_vmm(), activating the Virtual Memory Manager (VMM)

The activate_vmm() method is used to configure the VTCR_EL2 and HCR_EL2 registers, enabling the virtualization environment.

The ArchCpu's run() and idle() methods are used to start and idle the CPU, respectively. Upon starting, it activates the zone's GPM (Guest Page Management), resets to the specified entry point and device tree binary (DTB) address, and then jumps to the EL2 entry point through the vmreturn macro. In idle mode, the CPU is reset to a waiting state (WFI) and prepares a parking instruction page for use during idle periods.

Switching Between EL1 and EL2

hvisor uses EL2 as the hypervisor mode and EL1 for the guest OS in the AArch64 architecture. The handle_vmexit macro handles the context switch from EL1 to EL2 (VMEXIT event), saves the user mode register context, calls an external function to handle the exit reason, and then returns to continue executing the hypervisor code segment. The vmreturn function is used to return from EL2 mode to EL1 mode (VMENTRY event), restores the user mode register context, and then returns to the guest OS code segment through the eret instruction.

MMU Configuration and Enabling

To support virtualization, the enable_mmu() function configures MMU mapping in EL2 mode, including setting the MAIR_EL2, TCR_EL2, and SCTLR_EL2 registers, enabling instruction and data caching capabilities, and ensuring the virtual range covers the entire 48-bit address space.

Through these mechanisms, hvisor achieves efficient CPU virtualization on the AArch64 architecture, allowing multiple independent zones to operate under statically allocated resources while maintaining system stability and performance.

CPU Virtualization under RISCV

Abstract: Introduce the CPU virtualization work under the RISCV architecture around the ArchCpu structure.

Two Data Structures Involved

Hvisor supports multiple architectures, and the work required for CPU virtualization in each architecture is different, but a unified interface should be provided in a system. Therefore, we split the CPU into two data structures: PerCpu and ArchCpu.

PerCpu

This is a general description of the CPU, which has already been introduced in the PerCpu documentation.

ArchCpu

ArchCpu is a CPU structure for specific architectures (RISCV architecture is introduced in this article). This structure undertakes the specific behavior of the CPU.

In the ARM architecture, there is also a corresponding ArchCpu, which has a slightly different structure from the ArchCpu introduced in this section, but they have the same interface (i.e., they both have behaviors such as initialization).

The fields included are as follows:

pub struct ArchCpu {
    pub x: [usize; 32], //x0~x31
    pub hstatus: usize,
    pub sstatus: usize,
    pub sepc: usize,
    pub stack_top: usize,
    pub cpuid: usize,
    pub power_on: bool,
    pub init: bool,
    pub sstc: bool,
}

The explanation of each field is as follows:

  • x: values of general-purpose registers
  • hstatus: stores the value of the Hypervisor status register
  • sstatus: stores the value of the Supervisor status register, managing S-mode state information, such as interrupt enable flags, etc.
  • sepc: the return address at the end of exception handling
  • stack_top: the stack top of the corresponding CPU stack
  • cpuid: the ID of this CPU
  • power_on: whether the CPU is powered on
  • init: whether the CPU has been initialized
  • sstc: whether the sstc extension (supervisor timer compare) is enabled

This part explains the methods involved.

ArchCpu::init

This method mainly initializes the CPU, sets the context when first entering the VM, and some CSR initialization.

ArchCpu::idle

By executing the wfi instruction, set non-primary CPUs to a low-power idle state.

A special memory page is also set up, containing a parking instruction sequence, so that CPUs with no tasks assigned to them can be parked in a low-power waiting state until an interrupt occurs.

ArchCpu::run

The main content of this method is some initialization, setting the correct CPU execution entry, and modifying the flag that the CPU has been initialized.

vcpu_arch_entry / VM_ENTRY

This is a piece of assembly code describing the work that needs to be handled when entering the VM from hvisor. First, it gets the context information in the original ArchCpu through the sscratch register, then sets hstatus, sstatus, and sepc to the values we previously saved, ensuring that when returning to the VM, it is in VS mode and starts executing from the correct position. Finally, restore the values of the general-purpose registers and return to the VM using sret.

VM_EXIT

When exiting the VM and entering hvisor, it is also necessary to save the relevant state at the time of VM exit.

First, get the address of ArchCpu through the sscratch register, but here we will swap the information of sscratch and x31, rather than directly overwriting x31. Then save the values of the general-purpose registers except x31. Now the information of x31 is in sscratch, so first save the value of x31 to sp, then swap x31 and sscratch, and store the information of x31 through sp to the corresponding position in ArchCpu.

Then save hstatus, sstatus, and sepc. When we finish the work in hvisor and need to return to the VM, we need to use the VM_ENTRY code to restore the values of these three registers to the state before the VM entered hvisor, so we should save them here.

ld sp, 35*8(sp) puts the top of the kernel stack saved by ArchCpu into sp for use, facilitating the use of the kernel stack in hvisor.

csrr a0, sscratch puts the value of sscratch into the a0 register. When we have saved the context and jump to the exception handling function, the parameters will be passed through a0, allowing access to the saved context during exception handling, such as the exit code, etc.

LoongArch Processor Virtualization

The LoongArch instruction set is an independent RISC instruction set released by China's Loongson Zhongke Company in 2020, which includes five modules: the basic instruction set, binary translation extension (LBT), vector extension (LSX), advanced vector extension (LASX), and virtualization extension (LVZ).

This article will provide a brief introduction to the CPU virtualization design of the LoongArch instruction set, with related explanations from the currently publicly available KVM source code and code comments.

Introduction to LoongArch Registers

Conventions for General Registers Usage

Name | Alias | Usage | Preserved across calls
$r0 | $zero | Constant 0 | (constant)
$r1 | $ra | Return address | No
$r2 | $tp | Thread pointer | (not allocatable)
$r3 | $sp | Stack pointer | Yes
$r4 - $r5 | $a0 - $a1 | Argument/return value registers | No
$r6 - $r11 | $a2 - $a7 | Argument registers | No
$r12 - $r20 | $t0 - $t8 | Temporary registers | No
$r21 | - | Reserved | (not allocatable)
$r22 | $fp / $s9 | Frame pointer / static register | Yes
$r23 - $r31 | $s0 - $s8 | Static registers | Yes

Conventions for Floating Point Registers Usage

Name | Alias | Usage | Preserved across calls
$f0 - $f1 | $fa0 - $fa1 | Argument/return value registers | No
$f2 - $f7 | $fa2 - $fa7 | Argument registers | No
$f8 - $f23 | $ft0 - $ft15 | Temporary registers | No
$f24 - $f31 | $fs0 - $fs7 | Static registers | Yes

Temporary registers are also known as caller-saved registers. Static registers are also known as callee-saved registers.

CSR Registers

Control and Status Register (CSR) is a special class of registers in the LoongArch architecture used to control the processor's operational state. List of CSR registers (excluding new CSRs in the LVZ virtualization extension):

Number | Name | Number | Name | Number | Name
0x0 | Current mode information CRMD | 0x1 | Exception prior mode information PRMD | 0x2 | Extension part enable EUEN
0x3 | Miscellaneous control MISC | 0x4 | Exception configuration ECFG | 0x5 | Exception status ESTAT
0x6 | Exception return address ERA | 0x7 | Error virtual address BADV | 0x8 | Error instruction BADI
0xc | Exception entry address EENTRY | 0x10 | TLB index TLBIDX | 0x11 | TLB entry high TLBEHI
0x12 | TLB entry low 0 TLBELO0 | 0x13 | TLB entry low 1 TLBELO1 | 0x18 | Address space identifier ASID
0x19 | Low half address space global directory base PGDL | 0x1A | High half address space global directory base PGDH | 0x1B | Global directory base PGD
0x1C | Page table traversal control low half PWCL | 0x1D | Page table traversal control high half PWCH | 0x1E | STLB page size STLBPS
0x1F | Reduced virtual address configuration RVACFG | 0x20 | Processor number CPUID | 0x21 | Privilege resource configuration info 1 PRCFG1
0x22 | Privilege resource configuration info 2 PRCFG2 | 0x23 | Privilege resource configuration info 3 PRCFG3 | 0x30+n (0≤n≤15) | Data save SAVEn
0x40 | Timer number TID | 0x41 | Timer configuration TCFG | 0x42 | Timer value TVAL
0x43 | Timer compensation CNTC | 0x44 | Timer interrupt clear TICLR | 0x60 | LLBit control LLBCTL
0x80 | Implementation related control 1 IMPCTL1 | 0x81 | Implementation related control 2 IMPCTL2 | 0x88 | TLB refill exception entry address TLBRENTRY
0x89 | TLB refill exception error virtual address TLBRBADV | 0x8A | TLB refill exception return address TLBRERA | 0x8B | TLB refill exception data save TLBRSAVE
0x8C | TLB refill exception entry low 0 TLBRELO0 | 0x8D | TLB refill exception entry low 1 TLBRELO1 | 0x8E | TLB refill exception entry high TLBREHI
0x8F | TLB refill exception prior mode information TLBRPRMD | 0x90 | Machine error control MERRCTL | 0x91 | Machine error information 1 MERRINFO1
0x92 | Machine error information 2 MERRINFO2 | 0x93 | Machine error exception entry address MERRENTRY | 0x94 | Machine error exception return address MERRERA
0x95 | Machine error exception data save MERRSAVE | 0x98 | Cache tag CTAG | 0x180+n (0≤n≤3) | Direct mapping configuration window n DMWn
0x200+2n (0≤n≤31) | Performance monitoring configuration n PMCFGn | 0x201+2n (0≤n≤31) | Performance monitoring counter n PMCNTn | 0x300 | Load/store monitor point overall control MWPC
0x301 | Load/store monitor point overall status MWPS | 0x310+8n (0≤n≤7) | Load/store monitor point n configuration 1 MWPnCFG1 | 0x311+8n (0≤n≤7) | Load/store monitor point n configuration 2 MWPnCFG2
0x312+8n (0≤n≤7) | Load/store monitor point n configuration 3 MWPnCFG3 | 0x313+8n (0≤n≤7) | Load/store monitor point n configuration 4 MWPnCFG4 | 0x380 | Instruction fetch monitor point overall control FWPC
0x381 | Instruction fetch monitor point overall status FWPS | 0x390+8n (0≤n≤7) | Instruction fetch monitor point n configuration 1 FWPnCFG1 | 0x391+8n (0≤n≤7) | Instruction fetch monitor point n configuration 2 FWPnCFG2
0x392+8n (0≤n≤7) | Instruction fetch monitor point n configuration 3 FWPnCFG3 | 0x393+8n (0≤n≤7) | Instruction fetch monitor point n configuration 4 FWPnCFG4 | 0x500 | Debug register DBG
0x501 | Debug exception return address DERA | 0x502 | Debug data save DSAVE

For processors that have implemented the LVZ virtualization extension, there is an additional set of CSR registers for controlling virtualization.

Number | Name
0x15 | Guest TLB control GTLBC
0x16 | TLBRD read Guest item TRGP
0x50 | Guest status GSTAT
0x51 | Guest control GCTL
0x52 | Guest interrupt control GINTC
0x53 | Guest counter compensation GCNTC

GCSR Register Group

In LoongArch processors that implement virtualization, there is an additional group of GCSR (Guest Control and Status Register) registers.

Process of Entering Guest Mode (from Linux KVM source code)

  1. switch_to_guest:
  2. Clear the CSR.ECFG.VS field (set to 0, i.e., all exceptions share one entry address)
  3. Read the guest eentry saved in the Hypervisor (guest OS interrupt vector address) -> GEENTRY
    1. Then write GEENTRY to CSR.EENTRY
  4. Read the guest era saved in the Hypervisor (guest OS exception return address) -> GPC
    1. Then write GPC to CSR.ERA
  5. Read the CSR.PGDL global page table address, save it in the Hypervisor
  6. Load the guest pgdl from the Hypervisor into CSR.PGDL
  7. Read CSR.GSTAT.GID and CSR.GTLBC.TGID, write to CSR.GTLBC
  8. Set CSR.PRMD.PIE to 1, turn on Hypervisor-level global interrupts
  9. Set CSR.GSTAT.PGM to 1, the purpose is to make the ertn instruction enter guest mode
  10. The Hypervisor restores the guest's general registers (GPRS) saved by itself to the hardware registers (restore the scene)
  11. Execute the ertn instruction, enter guest mode

Code | Subcode | Abbreviation | Introduction
22 | - | GSPR | Guest-sensitive privileged resource exception, triggered by the cpucfg, idle, and cacop instructions, and when the virtual machine accesses non-existent GCSR and IOCSR registers, forcing a trap into the Hypervisor for processing (such as software simulation)
23 | - | HVC | Exception triggered by the hvcl supercall instruction
24 | 0 | GCM | Guest GCSR software modification exception
24 | 1 | GCHC | Guest GCSR hardware modification exception

Process of Handling Exceptions Under Guest Mode (from Linux KVM source code)

  1. kvm_exc_entry:

  2. The Hypervisor first saves the guest's general registers (GPRS), protecting the scene.

  3. The Hypervisor saves CSR.ESTAT -> host ESTAT

  4. The Hypervisor saves CSR.ERA -> GPC

  5. The Hypervisor saves CSR.BADV -> host BADV, i.e., when an address error exception is triggered, the erroneous virtual address is recorded

  6. The Hypervisor saves CSR.BADI -> host BADI, this register is used to record the instruction code of the instruction that triggered the synchronous class exception, synchronous class exceptions refer to all exceptions except for interrupts (INT), guest CSR hardware modification exceptions (GCHC), and machine error exceptions (MERR).

  7. Read the host ECFG saved by the Hypervisor, write to CSR.ECFG (i.e., switch to the host's exception configuration)

  8. Read the host EENTRY saved by the Hypervisor, write to CSR.EENTRY

  9. Read the host PGD saved by the Hypervisor, write to CSR.PGDL (restore the host page table global directory base, low half space)

  10. Set CSR.GSTAT.PGM off

  11. Clear the GTLBC.TGID field

  12. Restore kvm per CPU registers

    1. The kvm assembly involves KVM_ARCH_HTP, KVM_ARCH_HSP, KVM_ARCH_HPERCPU
  13. Jump to KVM_ARCH_HANDLE_EXIT to handle the exception

  14. Determine if the return of the function just now is <=0

    1. If <=0, continue running the host
    2. Otherwise, continue running the guest, save percpu registers, as it may switch to a different CPU to continue running the guest. Save host percpu registers to CSR.KSAVE register
  15. Jump to switch_to_guest

vCPU Context Registers to be Saved

According to the LoongArch function call specification, if you need to manually switch the CPU function running context, the registers to be saved are as follows (excluding floating point registers): $s0-$s9, $sp, $ra
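
A hedged sketch of a context-save area following this rule; the layout is illustrative and is not hvisor's actual definition:

#[repr(C)]
struct LoongArchHostContext {
    s: [usize; 10], // $s0 - $s9, callee-saved static registers
    sp: usize,      // stack pointer
    ra: usize,      // return address
}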

References

[1] Loongson Zhongke Technology Co., Ltd. Loongson Architecture ELF psABI Specification. Version 2.01.

[2] Loongson Zhongke Technology Co., Ltd. Loongson Architecture Reference Manual. Volume One: Basic Architecture.

[3] https://github.com/torvalds/linux/blob/master/arch/loongarch/kvm/switch.S.

Memory Management

Memory Allocation on Heap

Allocator Initialization

When using programming languages, dynamic memory allocation is often encountered, such as allocating a block of memory through malloc or new in C, or Vec, String, etc. in Rust, which are allocated on the heap.

To allocate memory on the heap, we need to do the following:

  • Provide a large block of memory space during initialization
  • Provide interfaces for allocation and release
  • Manage free blocks

In short, we need to allocate a large space and set up an allocator to manage this space, and tell Rust that we now have an allocator, asking it to use it, allowing us to use variables like Vec, String that allocate memory on the heap. This is what the following lines do.

use buddy_system_allocator::LockedHeap;

use crate::consts::HV_HEAP_SIZE;

#[cfg_attr(not(test), global_allocator)]
static HEAP_ALLOCATOR: LockedHeap<32> = LockedHeap::<32>::new();

/// Initialize the global heap allocator.
pub fn init() {
    const MACHINE_ALIGN: usize = core::mem::size_of::<usize>();
    const HEAP_BLOCK: usize = HV_HEAP_SIZE / MACHINE_ALIGN;
    static mut HEAP: [usize; HEAP_BLOCK] = [0; HEAP_BLOCK];
    let heap_start = unsafe { HEAP.as_ptr() as usize };
    unsafe {
        HEAP_ALLOCATOR
            .lock()
            .init(heap_start, HEAP_BLOCK * MACHINE_ALIGN);
    }
    info!(
        "Heap allocator initialization finished: {:#x?}",
        heap_start..heap_start + HV_HEAP_SIZE
    );
}

#[cfg_attr(not(test), global_allocator)] is a conditional compilation attribute, which sets the HEAP_ALLOCATOR defined in the next line as Rust's global memory allocator when not in a test environment. Now Rust knows we can do dynamic allocation.

HEAP_ALLOCATOR.lock().init(heap_start, HEAP_BLOCK * MACHINE_ALIGN) hands over the large space we applied for to the allocator for management.

Testing

pub fn test() {
    use alloc::boxed::Box;
    use alloc::vec::Vec;
    extern "C" {
        fn sbss();
        fn ebss();
    }
    let bss_range = sbss as usize..ebss as usize;
    let a = Box::new(5);
    assert_eq!(*a, 5);
    assert!(bss_range.contains(&(a.as_ref() as *const _ as usize)));
    drop(a);
    let mut v: Vec<usize> = Vec::new();
    for i in 0..500 {
        v.push(i);
    }
    for (i, val) in v.iter().take(500).enumerate() {
        assert_eq!(*val, i);
    }
    assert!(bss_range.contains(&(v.as_ptr() as usize)));
    drop(v);
    info!("heap_test passed!");
}

This test uses Box and Vec to verify that the memory we allocate falls inside the bss segment.

The large block of memory we just handed over to the allocator is an uninitialized global variable, which will be placed in the bss segment. We only need to test whether the address of the variable we obtained is within this range.

Memory Management Knowledge for Armv8

Addressing

The address bus is by default 48 bits, while the addressing request issued is 64 bits, so the virtual address can be divided into two spaces based on the high 16 bits:

  • High 16 bits are 1: Kernel space
  • High 16 bits are 0: User space

From the perspective of guestVM, when converting a virtual address to a physical address, the CPU will select the TTBR register based on the value of the 63rd bit of the virtual address. TTBR registers store the base address of the first-level page table. If it is user space, select TTBR0; if it is kernel space, select TTBR1.

Four-Level Page Table Mapping (using a page size of 4K as an example)

In addition to the high 16 bits used to determine which page table base address register to use, the following 36 bits are used as indexes for each level of the page table, with the lower 12 bits as the page offset, as shown in the diagram below.

(Figure: Level4_PageTable)
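
For a 48-bit virtual address with a 4 KiB granule, this works out to four 9-bit indexes plus a 12-bit page offset. A minimal Rust sketch of the index extraction (mirroring the p4_index/p3_index/... helpers that appear later in this document):

/// Split a 48-bit VA into the four table indexes and the page offset (4 KiB granule).
fn table_indices(vaddr: usize) -> (usize, usize, usize, usize, usize) {
    let offset = vaddr & 0xfff;     // bits [11:0]  page offset
    let l3 = (vaddr >> 12) & 0x1ff; // bits [20:12] last-level index
    let l2 = (vaddr >> 21) & 0x1ff; // bits [29:21]
    let l1 = (vaddr >> 30) & 0x1ff; // bits [38:30]
    let l0 = (vaddr >> 39) & 0x1ff; // bits [47:39] root-table index
    (l0, l1, l2, l3, offset)
}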

Stage-2 Page Table Mechanism

In a virtualized environment, there are two types of address mapping processes in the system:

  • guestVM uses Stage-1 address conversion, using TTBR0_EL1 or TTBR1_EL1, to convert the accessed VA to IPA, then through Stage-2 address conversion, using VTTBR0_EL2 to convert IPA to PA.
  • Hypervisor may run its own applications, and the VA to PA conversion for these applications only requires one conversion, using the TTBR0_EL2 register.

(Figure: Nested_Address_Translation)

Memory Management in hvisor

Management of Physical Page Frames

Similar to the heap construction mentioned above, page frame allocation also requires an allocator, then we hand over a block of memory we use for allocation to the allocator for management.

Bitmap-based Allocator

use bitmap_allocator::BitAlloc;
type FrameAlloc = bitmap_allocator::BitAlloc1M;

struct FrameAllocator {
    base: PhysAddr,
    inner: FrameAlloc,
}

BitAlloc1M is a bitmap-based allocator, which manages page numbers, providing information on which pages are free and which are occupied.

Then, the bitmap allocator and the starting address used for page frame allocation are encapsulated into a page frame allocator.

So we see the initialization function as follows:

fn init(&mut self, base: PhysAddr, size: usize) {
        self.base = align_up(base);
        let page_count = align_up(size) / PAGE_SIZE;
        self.inner.insert(0..page_count);
    }

The starting address of the page frame allocation area and the size of the available space are passed in, the number of page frames available for allocation in this space, page_count, is calculated, and then all page frame numbers are handed to the bitmap allocator through the insert function.

Structure of Page Frames

pub struct Frame {
    start_paddr: PhysAddr,
    frame_count: usize,
}

The structure of the page frame includes the starting address of this page frame and the number of page frames corresponding to this frame instance, which may be 0, 1, or more than 1.

Why are there cases where the number of page frames is 0?

When hvisor wants to access the contents of the page frame through Frame, it needs a temporary instance that does not involve page frame allocation and page frame recycling, so 0 is used as a flag.

Why are there cases where the number of page frames is greater than 1?

In some cases, we are required to allocate continuous memory, and the size exceeds one page, which means allocating multiple continuous page frames.

Allocation alloc

Now we know that the page frame allocator can hand out the number of a free page frame; turning that number into a Frame instance completes the allocation of the page frame, as shown below for a single-frame allocation:

impl FrameAllocator {
    fn init(&mut self, base: PhysAddr, size: usize) {
        self.base = align_up(base);
        let page_count = align_up(size) / PAGE_SIZE;
        self.inner.insert(0..page_count);
    }
}

impl Frame {
    /// Allocate one physical frame.
    pub fn new() -> HvResult<Self> {
        unsafe {
            FRAME_ALLOCATOR
                .lock()
                .alloc()
                .map(|start_paddr| Self {
                    start_paddr,
                    frame_count: 1,
                })
                .ok_or(hv_err!(ENOMEM))
        }
    }
}

As you can see, the frame allocator helps us allocate a page frame and returns the starting physical address, then creates a Frame instance.
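
For the multi-frame case mentioned earlier, a contiguous-allocation constructor could look roughly like the sketch below; the helper name new_contiguous and the wrapper's alloc_contiguous call are assumptions for illustration, mirroring the style of Frame::new above:

impl Frame {
    /// Sketch: allocate `frame_count` physically contiguous frames.
    pub fn new_contiguous(frame_count: usize, align_log2: usize) -> HvResult<Self> {
        unsafe {
            FRAME_ALLOCATOR
                .lock()
                .alloc_contiguous(frame_count, align_log2)
                .map(|start_paddr| Self {
                    start_paddr,
                    frame_count,
                })
                .ok_or(hv_err!(ENOMEM))
        }
    }
}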

Recycling of Page Frames

The Frame structure is tied to an actual physical page and follows the RAII idiom, so when a Frame goes out of scope, the corresponding memory area must be returned to hvisor. This requires us to implement the drop method of the Drop trait, as follows:

impl Drop for Frame {
    fn drop(&mut self) {
        unsafe {
            match self.frame_count {
                0 => {} // Do not deallocate when use Frame::from_paddr()
                1 => FRAME_ALLOCATOR.lock().dealloc(self.start_paddr),
                _ => FRAME_ALLOCATOR
                    .lock()
                    .dealloc_contiguous(self.start_paddr, self.frame_count),
            }
        }
    }
}

impl FrameAllocator{
    unsafe fn dealloc(&mut self, target: PhysAddr) {
        trace!("Deallocate frame: {:x}", target);
        self.inner.dealloc((target - self.base) / PAGE_SIZE)
    }
}

In drop, you can see that page frames with a frame count of 0 do not need to release the corresponding physical pages, and those with a frame count greater than 1 indicate that they are continuously allocated page frames, requiring the recycling of more than one physical page.

With the knowledge of Armv8 memory management mentioned above, we know that the process of building page tables is divided into two parts: the page table used by hvisor itself and the Stage-2 conversion page table. We will focus on the Stage-2 page table.

Before that, we need to understand a few data structures that will be used.

Logical Segment MemoryRegion

Description of the logical segment, including starting address, size, permission flags, and mapping method.

pub struct MemoryRegion<VA> {
    pub start: VA,
    pub size: usize,
    pub flags: MemFlags,
    pub mapper: Mapper,
}

Address Space MemorySet

Description of each process's address space, including a collection of logical segments and the corresponding page table for the process.

pub struct MemorySet<PT: GenericPageTable>
where
    PT::VA: Ord,
{
    regions: BTreeMap<PT::VA, MemoryRegion<PT::VA>>,
    pt: PT,
}

4-Level Page Table Level4PageTableImmut

root is the page frame where the L0 page table is located.

pub struct Level4PageTableImmut<VA, PTE: GenericPTE> {
    /// Root table frame.
    root: Frame,
    /// Phantom data.
    _phantom: PhantomData<(VA, PTE)>,
}

Building the Stage-2 Page Table

We need to build a Stage-2 page table for each zone.

Areas to be Mapped by the Stage-2 Page Table:

  • The memory area seen by guestVM
  • The IPA of the device tree accessed by guestVM
  • The memory area of the UART device seen by guestVM

Adding Mapping Relationships to the Address Space

/// Add a memory region to this set.
    pub fn insert(&mut self, region: MemoryRegion<PT::VA>) -> HvResult {
        assert!(is_aligned(region.start.into()));
        assert!(is_aligned(region.size));
        if region.size == 0 {
            return Ok(());
        }
        if !self.test_free_area(&region) {
            warn!(
                "MemoryRegion overlapped in MemorySet: {:#x?}\n{:#x?}",
                region, self
            );
            return hv_result_err!(EINVAL);
        }
        self.pt.map(&region)?;
        self.regions.insert(region.start, region);
        Ok(())
    }

In addition to adding the mapping relationship between the virtual address and the logical segment to our Map structure, we also need to perform mapping in the page table, as follows:

fn map(&mut self, region: &MemoryRegion<VA>) -> HvResult {
        assert!(
            is_aligned(region.start.into()),
            "region.start = {:#x?}",
            region.start.into()
        );
        assert!(is_aligned(region.size), "region.size = {:#x?}", region.size);
        trace!(
            "create mapping in {}: {:#x?}",
            core::any::type_name::<Self>(),
            region
        );
        let _lock = self.clonee_lock.lock();
        let mut vaddr = region.start.into();
        let mut size = region.size;
        while size > 0 {
            let paddr = region.mapper.map_fn(vaddr);
            let page_size = if PageSize::Size1G.is_aligned(vaddr)
                && PageSize::Size1G.is_aligned(paddr)
                && size >= PageSize::Size1G as usize
                && !region.flags.contains(MemFlags::NO_HUGEPAGES)
            {
                PageSize::Size1G
            } else if PageSize::Size2M.is_aligned(vaddr)
                && PageSize::Size2M.is_aligned(paddr)
                && size >= PageSize::Size2M as usize
                && !region.flags.contains(MemFlags::NO_HUGEPAGES)
            {
                PageSize::Size2M
            } else {
                PageSize::Size4K
            };
            let page = Page::new_aligned(vaddr.into(), page_size);
            self.inner
                .map_page(page, paddr, region.flags)
                .map_err(|e: PagingError| {
                    error!(
                        "failed to map page: {:#x?}({:?}) -> {:#x?}, {:?}",
                        vaddr, page_size, paddr, e
                    );
                    e
                })?;
            vaddr += page_size as usize;
            size -= page_size as usize;
        }
        Ok(())
    }

Let's briefly interpret this function. For a logical segment MemoryRegion, we map it page by page until the entire logical segment size is covered.

The specific behavior is as follows:

Before mapping each page, we first determine the physical address paddr corresponding to this page according to the mapping method of the logical segment.

Then determine the page size page_size. We first try the 1 GiB size: if both the virtual and physical addresses are 1 GiB aligned, the remaining unmapped size is at least 1 GiB, and huge-page mapping is not disabled, then 1 GiB is chosen as the page size. Otherwise the same check is made for 2 MiB, and if that also fails, the standard 4 KiB page size is used.
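
As a worked example under the non-root zone JSON shown earlier (assuming the NO_HUGEPAGES flag is not set): the RAM region starts at 0x50000000 with size 0x30000000. 0x50000000 is 2 MiB aligned but not 1 GiB aligned (1 GiB = 0x40000000), and no address inside the region ever becomes 1 GiB aligned with at least 1 GiB left to map, so the loop never takes the 1 GiB branch; the region is covered by 0x30000000 / 0x200000 = 384 mappings of 2 MiB each.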

We now have the information needed to fill in the page table entry. We combine the page starting address and page size into a Page instance and perform mapping in the page table, which is modifying the page table entry:

fn map_page(
        &mut self,
        page: Page<VA>,
        paddr: PhysAddr,
        flags: MemFlags,
    ) -> PagingResult<&mut PTE> {
        let entry: &mut PTE = self.get_entry_mut_or_create(page)?;
        if !entry.is_unused() {
            return Err(PagingError::AlreadyMapped);
        }
        entry.set_addr(page.size.align_down(paddr));
        entry.set_flags(flags, page.size.is_huge());
        Ok(entry)
    }

This function briefly describes the following functionality: First, we obtain the PTE according to the VA, or more precisely, according to the page number VPN corresponding to this VA. We fill in the control bit information and the physical address (actually, it should be PPN) in the PTE. Specifically, you can see in the PageTableEntry's set_addr method that we did not fill in the entire physical address but only the content excluding the lower 12 bits, because our page table only cares about the mapping of page frame numbers.
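
For example, with a 4 KiB page, page.size.align_down(0x50401234) yields 0x50401000: the low 12 bits are cleared, and only the frame-number part of the address is stored in the entry.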

Let's take a closer look at how to obtain the PTE:

fn get_entry_mut_or_create(&mut self, page: Page<VA>) -> PagingResult<&mut PTE> {
        let vaddr: usize = page.vaddr.into();
        let p4 = table_of_mut::<PTE>(self.inner.root_paddr());
        let p4e = &mut p4[p4_index(vaddr)];

        let p3 = next_table_mut_or_create(p4e, || self.alloc_intrm_table())?;
        let p3e = &mut p3[p3_index(vaddr)];
        if page.size == PageSize::Size1G {
            return Ok(p3e);
        }

        let p2 = next_table_mut_or_create(p3e, || self.alloc_intrm_table())?;
        let p2e = &mut p2[p2_index(vaddr)];
        if page.size == PageSize::Size2M {
            return Ok(p2e);
        }

        let p1 = next_table_mut_or_create(p2e, || self.alloc_intrm_table())?;
        let p1e = &mut p1[p1_index(vaddr)];
        Ok(p1e)
    }

First, we find the starting address of the root (L0) page table, then obtain the corresponding page table entry p4e according to the top-level index of the VA. However, we cannot directly obtain the starting address of the next-level page table from p4e, because that table may not have been created yet. If it has not, we create a new page table (this also requires page frame allocation) and return its starting address, and so forth, until we reach the last-level page table and obtain the final PTE for this VA.

After mapping the memory (the same applies to UART devices) through the process described above, we also need to fill the L0 page table base address into the VTTBR_EL2 register. This process can be seen in the activate function of the Zone's MemorySet's Level4PageTable.

Why isn't guestVM allowed to directly access the memory areas of MMIO and GIC devices, as it would in a non-virtualized environment?

This is because, in a virtualized environment, hvisor is the manager of resources and cannot arbitrarily allow guestVM to access areas related to devices. In the previous exception handling, we mentioned access to MMIO/GIC, which actually results in falling into EL2 due to the lack of address mapping, and EL2 accesses the resources and returns the results. If mapping was performed in the page table, it would directly access the resources through the second-stage address conversion without passing through EL2's control.

Therefore, in our design, only MMIOs that are allowed to be accessed by the Zone are registered in the Zone, and when related exceptions occur, they are used to determine whether a certain MMIO resource is allowed to be accessed by the Zone.

ARM GICv3 Module

1. GICv3 Module

GICv3 Initialization Process

The GICv3 initialization process in hvisor involves the initialization of the GIC Distributor (GICD) and GIC Redistributor (GICR), as well as the mechanisms for interrupt handling and virtual interrupt injection. Key steps in this process include:

  • SDEI version check: Obtain the version information of the Software Delegated Exception Interface (SDEI) through smc_arg1!(0xc4000020).
  • ICCs configuration: Set icc_ctlr_el1 to only provide priority drop functionality, set icc_pmr_el1 to define the interrupt priority mask, and enable Group 1 IRQs.
  • Clear pending interrupts: Call the gicv3_clear_pending_irqs function to clear all pending interrupts, ensuring the system is in a clean state.
  • VMCR and HCR configuration: Set ich_vmcr_el2 and ich_hcr_el2 registers to enable the virtualization CPU interface and prepare for virtual interrupt handling.

Pending Interrupt Handling

  • The pending_irq function reads the icc_iar1_el1 register and returns the current interrupt ID being processed. If the value is greater than or equal to 0x3fe, it is considered an invalid interrupt.
  • The deactivate_irq function writes to the icc_eoir1_el1 and icc_dir_el1 registers to drop the priority and deactivate the interrupt, so that subsequent interrupts can be taken.

Virtual Interrupt Injection

  • The inject_irq function checks for an available List Register (LR) and writes the virtual interrupt information into it. This function distinguishes between hardware interrupts and software-generated interrupts, appropriately setting fields in the LR.

GIC Data Structure Initialization

  • GIC is a global Once container used for the lazy initialization of the Gic structure, which contains the base addresses and sizes of GICD and GICR.
  • The primary_init_early and primary_init_late functions configure the GIC during the early and late initialization stages, enabling interrupts.

Zone-Level Initialization

In the Zone structure, the arch_irqchip_reset method is responsible for resetting all interrupts allocated to a specific zone by directly writing to the GICD's ICENABLER and ICACTIVER registers.

2. vGICv3 Module

hvisor's VGICv3 (Virtual Generic Interrupt Controller version 3) module provides virtualization support for GICv3 in the ARMv8-A architecture. It controls and coordinates interrupt requests between different zones (virtual machine instances) through MMIO (Memory Mapped I/O) access and interrupt bitmaps management.

MMIO Region Registration

During initialization, the Zone structure's vgicv3_mmio_init method registers the MMIO regions for the GIC Distributor (GICD) and each CPU's GIC Redistributor (GICR). MMIO region registration is completed through the mmio_region_register function, which associates specific processor or interrupt controller addresses with corresponding handling functions vgicv3_dist_handler and vgicv3_redist_handler.

Interrupt Bitmap Initialization

The Zone structure's irq_bitmap_init method initializes the interrupt bitmap to track which interrupts belong to the current zone. Each interrupt is inserted into the bitmap by iterating through the provided interrupt list. The insert_irq_to_bitmap function is responsible for mapping specific interrupt numbers to the appropriate positions in the bitmap.
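
A minimal sketch of the bitmap insertion, assuming a fixed-size word array (the names and sizes here are illustrative, not hvisor's exact definitions):

const MAX_IRQS: usize = 1024;

pub fn insert_irq_to_bitmap(bitmap: &mut [u32; MAX_IRQS / 32], irq: u32) {
    let word = (irq as usize) / 32; // which 32-bit word holds this interrupt
    let bit = irq % 32;             // position inside that word
    bitmap[word] |= 1 << bit;
}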

MMIO Access Restriction

The restrict_bitmask_access function restricts MMIO access to the GICD registers, ensuring that only interrupts belonging to the current zone can be modified. The function checks whether the access is for the current zone's interrupts and, if so, updates the access mask to allow or restrict specific read/write operations.

VGICv3 MMIO Handling

The vgicv3_redist_handler and vgicv3_dist_handler functions handle MMIO access for GICR and GICD, respectively. The vgicv3_redist_handler function handles read/write operations for GICR, checking whether the access is for the current zone's GICR and allowing access if so; otherwise, the access is ignored. The vgicv3_dist_handler function calls vgicv3_handle_irq_ops or restrict_bitmask_access based on different types of GICD registers to appropriately handle interrupt routing and configuration register access.

Through these mechanisms, hvisor effectively manages interrupts across zones, ensuring that each zone can only access and control the interrupt resources allocated to it while providing necessary isolation. This allows VGICv3 to work efficiently and securely in a multi-zone environment, supporting complex virtualization scenarios.

Interrupt Sources

In hvisor on RISC-V, there are three sources of interrupts: timer interrupts, software interrupts, and external interrupts.

Timer Interrupt: A timer interrupt is generated when the time register becomes greater than the timecmp register.

Software Interrupt: In a multi-core system, one hart sends an inter-core interrupt to another hart, implemented through an SBI call.

External Interrupt: External devices send interrupt signals to the processor through interrupt lines.

Timer Interrupt

When a virtual machine needs to trigger a timer interrupt, it traps into hvisor through the ecall instruction.

        ExceptionType::ECALL_VS => {
            trace!("ECALL_VS");
            sbi_vs_handler(current_cpu);
            current_cpu.sepc += 4;
        }
        ...

pub fn sbi_vs_handler(current_cpu: &mut ArchCpu) {
    let eid: usize = current_cpu.x[17];
    let fid: usize = current_cpu.x[16];
    let sbi_ret;
    match eid {
        ...
        SBI_EID::SET_TIMER => {
            sbi_ret = sbi_time_handler(fid, current_cpu);
        }
        ...
    }
}

If the sstc extension is not enabled, it is necessary to trap into machine mode through an SBI call, set the mtimecmp register, clear the virtual machine's timer interrupt pending bit, and enable hvisor's timer interrupt; if the sstc extension is enabled, stimecmp can be set directly.

pub fn sbi_time_handler(fid: usize, current_cpu: &mut ArchCpu) -> SbiRet {
...
    if current_cpu.sstc {
        write_csr!(CSR_VSTIMECMP, stime);
    } else {
        set_timer(stime);
        unsafe {
            // clear guest timer interrupt pending
            hvip::clear_vstip();
            // enable timer interrupt
            sie::set_stimer();
        }
    }
    return sbi_ret;
}

When the time register becomes greater than the timecmp register, a timer interrupt is generated.

After the interrupt is triggered, the trap context is saved and the interrupt is dispatched to the corresponding handler.

        InterruptType::STI => {
            unsafe {
                hvip::set_vstip();
                sie::clear_stimer();
            }
        }

Set the virtual machine's timer interrupt pending bit to 1 to inject a timer interrupt into the virtual machine, and clear hvisor's timer interrupt enable bit to complete the handling.

Software Interrupt

When a virtual machine needs to send an IPI, it traps into hvisor through the ecall instruction.

        SBI_EID::SEND_IPI => {
            ...
            sbi_ret = sbi_call_5(
                eid,
                fid,
                current_cpu.x[10],
                current_cpu.x[11],
                current_cpu.x[12],
                current_cpu.x[13],
                current_cpu.x[14],
            );
        }

Then, through an SBI call, hvisor traps into machine mode to send an IPI to the specified hart; machine mode sets the SSIP bit in that hart's mip register, injecting an inter-core interrupt into hvisor.

After the interrupt is triggered, the trap context is saved and the interrupt is dispatched to the corresponding handler.

pub fn handle_ssi(current_cpu: &mut ArchCpu) {
    ...
    clear_csr!(CSR_SIP, 1 << 1);
    set_csr!(CSR_HVIP, 1 << 2);
    check_events();
}

Set the virtual machine's software interrupt pending bit to 1 to inject a software interrupt into the virtual machine. Then determine the type of inter-core event and, accordingly, wake or block the CPU or handle Virtio-related interrupt requests.

External Interrupt

PLIC

RISC-V implements external interrupt handling through PLIC, which does not support virtualization or MSI.

The architectural diagram of PLIC.

The interrupt process of PLIC is shown in the following diagram.

The interrupt source sends an interrupt signal to the PLIC over its interrupt line; the signal passes the threshold register's filter only if the interrupt's priority is greater than the threshold.

The PLIC then notifies the target hart; the hart reads the claim register to obtain the highest-priority pending interrupt, which clears the corresponding pending bit, and handles the interrupt.

After handling it, the hart writes the interrupt number to the complete register so the PLIC can deliver the next interrupt request.
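
A sketch of this claim/complete handshake for one hart context, following the common SiFive-style PLIC layout (threshold at 0x20_0000 + 0x1000 * context, claim/complete at offset 4 from it); the base address and function names below are assumptions for illustration:

const PLIC_BASE: usize = 0x0c00_0000; // QEMU virt's PLIC base (assumed here)

fn plic_claim(context: usize) -> u32 {
    let claim = (PLIC_BASE + 0x20_0004 + 0x1000 * context) as *const u32;
    // Reading returns the highest-priority pending interrupt and clears its pending bit.
    unsafe { claim.read_volatile() }
}

fn plic_complete(context: usize, irq: u32) {
    let complete = (PLIC_BASE + 0x20_0004 + 0x1000 * context) as *mut u32;
    // Writing the interrupt number signals completion so the next request can be delivered.
    unsafe { complete.write_volatile(irq) }
}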

Initialization

The initialization process is similar to AIA.

Handling Process

When an external interrupt is triggered in the virtual machine, the guest accesses the vPLIC address space. However, the PLIC does not support virtualization and this address space is unmapped, so a page fault exception is triggered and the CPU traps into hvisor for handling.

After the exception is triggered, the trap context is saved and control enters the page fault exception handler.

pub fn guest_page_fault_handler(current_cpu: &mut ArchCpu) {
    ...
    if addr >= host_plic_base && addr < host_plic_base + PLIC_TOTAL_SIZE {
        let mut inst: u32 = read_csr!(CSR_HTINST) as u32;
        ...
        if let Some(inst) = inst {
            if addr >= host_plic_base + PLIC_GLOBAL_SIZE {
                vplic_hart_emul_handler(current_cpu, addr, inst);
            } else {
                vplic_global_emul_handler(current_cpu, addr, inst);
            }
            current_cpu.sepc += ins_size;
        } 
        ...
    }
}

The handler determines whether the faulting address falls within the PLIC's address space, parses the instruction that caused the exception, and then updates the PLIC state according to the access address and instruction in order to emulate the vPLIC configuration.

pub fn vplic_hart_emul_handler(current_cpu: &mut ArchCpu, addr: GuestPhysAddr, inst: Instruction) {
    ...
    if offset >= PLIC_GLOBAL_SIZE && offset < PLIC_TOTAL_SIZE {
        ...
        if index == 0 {
            // threshold
            match inst {
                Instruction::Sw(i) => {
                    // guest write threshold register to plic core
                    let value = current_cpu.x[i.rs2() as usize] as u32;
                    host_plic.write().set_threshold(context, value);
                }
                _ => panic!("Unexpected instruction threshold {:?}", inst),
            }
            ...
        }
    }
}

Overall Structure

AIA mainly consists of two parts, the Incoming Message-Signaled Interrupt Controller (IMSIC) and the Advanced Platform-Level Interrupt Controller (APLIC), with the overall structure shown in the diagram below.

Peripherals can choose to send message interrupts or send wired interrupts via a connected line.

If peripheral A supports MSI, it only needs to write the specified data into the interrupt file of the designated hart, after which IMSIC will deliver an interrupt to the target processor.

For all devices, they can connect to APLIC via an interrupt line, and APLIC will choose the interrupt delivery mode based on the configuration:

  • Wired interrupt
  • MSI

In hvisor, the interrupt delivery mode is MSI.

After enabling the AIA specification with IRQ=aia in hvisor, timer interrupts are handled as before, while the handling of software interrupts and external interrupts changes.

External Interrupts

IMSIC

In hvisor, each physical CPU corresponds to one virtual CPU, and each has its own interrupt file.

Writing to an interrupt file can trigger an external interrupt for a specified hart at a specified privilege level.

A two-stage address mapping is set up for the IMSIC interrupt files:

        let paddr = 0x2800_0000 as HostPhysAddr;
        let size = PAGE_SIZE;
        self.gpm.insert(MemoryRegion::new_with_offset_mapper(
            paddr as GuestPhysAddr,
            paddr + PAGE_SIZE * 1,
            size,
            MemFlags::READ | MemFlags::WRITE,
        ))?;
        ...

APLIC

Structure

There is only one global APLIC.

When a wired interrupt arrives, it first reaches the root interrupt domain in machine mode (OpenSBI), then the interrupt is routed to the sub-interrupt domain (hvisor), and hvisor sends the interrupt signal to the corresponding CPU of the virtual machine in MSI mode according to the target registers configured by APLIC.

The AIA specification manual specifies the byte offsets for various fields of APLIC. Define the APLIC structure as follows, and implement read and write operations for APLIC fields using the following methods:

#[repr(C)]
pub struct Aplic {
    pub base: usize,
    pub size: usize,
}
impl Aplic {
    pub fn new(base: usize, size: usize) -> Self {
        Self {
            base,
            size,
        }
    }
    pub fn read_domaincfg(&self) -> u32{
        let addr = self.base + APLIC_DOMAINCFG_BASE;
        unsafe { core::ptr::read_volatile(addr as *const u32) }
    }
    pub fn set_domaincfg(&self, bigendian: bool, msimode: bool, enabled: bool){
        ...
        let addr = self.base + APLIC_DOMAINCFG_BASE;
        let src = ((enabled as u32) << 8) | ((msimode as u32) << 2) | (bigendian as u32);
        unsafe {
            core::ptr::write_volatile(addr as *mut u32, src);
        }
    }
    ...
}

Initialization

Initialize APLIC based on the base address and size in the device tree.

pub fn primary_init_early(host_fdt: &Fdt) {
    let aplic_info = host_fdt.find_node("/soc/aplic").unwrap();
    init_aplic(
        aplic_info.reg().unwrap().next().unwrap().starting_address as usize,
        aplic_info.reg().unwrap().next().unwrap().size.unwrap(),
    );
}
pub fn init_aplic(aplic_base: usize, aplic_size: usize) {
    let aplic = Aplic::new(aplic_base, aplic_size);
    APLIC.call_once(|| RwLock::new(aplic));
}
pub static APLIC: Once<RwLock<Aplic>> = Once::new();
pub fn host_aplic<'a>() -> &'a RwLock<Aplic> {
    APLIC.get().expect("Uninitialized hypervisor aplic!")
}

There is only one global APLIC, so locking is used to avoid read-write conflicts, and the host_aplic() method is used for access.

When the virtual machine starts, it initializes the APLIC address space, which is not mapped in the second-stage page table. This triggers a page fault that traps into hvisor for handling.

pub fn guest_page_fault_handler(current_cpu: &mut ArchCpu) {
    ...
    if addr >= host_aplic_base && addr < host_aplic_base + host_aplic_size {
        let mut inst: u32 = read_csr!(CSR_HTINST) as u32;
        ...
        if let Some(inst) = inst {
            vaplic_emul_handler(current_cpu, addr, inst);
            current_cpu.sepc += ins_size;
        }
        ...
    }
}

Determine if the accessed address space belongs to APLIC, parse the access instruction, and enter vaplic_emul_handler to simulate APLIC in the virtual machine.

pub fn vaplic_emul_handler(
    current_cpu: &mut ArchCpu,
    addr: GuestPhysAddr,
    inst: Instruction,
) {
    let host_aplic = host_aplic();
    let offset = addr.wrapping_sub(host_aplic.read().base);
    if offset >= APLIC_DOMAINCFG_BASE && offset < APLIC_SOURCECFG_BASE {
        match inst {
            Instruction::Sw(i) => {
                ...
                host_aplic.write().set_domaincfg(bigendian, msimode, enabled);
            }
            Instruction::Lw(i) => {
                let value = host_aplic.read().read_domaincfg();
                current_cpu.x[i.rd() as usize] = value as usize;
            }
            _ => panic!("Unexpected instruction {:?}", inst),
        }
    }
    ...
}

Interrupt Process

After hvisor completes the simulation of APLIC initialization for the virtual machine through a page fault, it enters the virtual machine. Taking the interrupt generated by a keyboard press as an example: the interrupt signal first arrives at OpenSBI, then is routed to hvisor, and based on the configuration of the target register, it writes to the virtual interrupt file to trigger an external interrupt in the virtual machine.

Software Interrupts

After enabling the AIA specification, the Linux kernel of the virtual machine sends IPIs through MSI, eliminating the need to trap into hvisor using the ecall instruction.

As shown in the diagram, in hvisor, writing to the interrupt file of a specified hart can trigger an IPI.

In the virtual machine, writing to a specified virtual interrupt file can achieve IPIs within the virtual machine, without the need for simulation support from hvisor.

LoongArch Interrupt Control

Because different Loongson processors and development boards use different interrupt controller designs (embedded processors such as the 2K1000 have their own interrupt controllers, while the 3-series processors rely on the 7A1000 and 7A2000 bridge chips for external interrupt control), this article mainly introduces the interrupt controller inside the latest Loongson 7A2000 bridge chip[1].

CPU Interrupts

The interrupt configuration of LoongArch is controlled by CSR.ECFG. The interrupts under the Loongson architecture use line interrupts, and each processor core can record 13 line interrupts. These interrupts include: 1 inter-core interrupt (IPI), 1 timer interrupt (TI), 1 performance monitoring counter overflow interrupt (PMI), 8 hardware interrupts (HWI0~HWI7), and 2 software interrupts (SWI0~SWI1). All line interrupts are level interrupts and are active high[3].

  • Inter-core Interrupt: Comes from an external interrupt controller and is recorded in the CSR.ESTAT.IS[12] bit.
  • Timer Interrupt: Originates from an internal constant frequency timer, triggered when the timer counts down to zero, and recorded in the CSR.ESTAT.IS[11] bit. It is cleared by writing 1 to the TI bit of the CSR.TICLR register through software.
  • Performance Counter Overflow Interrupt: Comes from an internal performance counter, triggered when any performance counter enabled for interrupts has its 63rd bit set to 1, and recorded in the CSR.ESTAT.IS[10] bit. It is cleared by either clearing the 63rd bit of the performance counter causing the interrupt or disabling the interrupt enable of that performance counter.
  • Hardware Interrupts: Come from an external interrupt controller outside the processor core, 8 hardware interrupts HWI[7:0] are recorded in the CSR.ESTAT.IS[9:2] bits.
  • Software Interrupts: Originating from within the processor core, triggered by writing 1 to the CSR.ESTAT.IS[1:0] through software instructions, cleared by writing 0.

The index value recorded in the CSR.ESTAT.IS field is also referred to as the interrupt number (Int Number). For example, the interrupt number for SWI0 is 0, for SWI1 is 1, and so on, with IPI being 12.
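
For example, the pending line interrupt with the highest number can be derived directly from the IS field (bits [12:0] of CSR.ESTAT), since the bit index equals the interrupt number; the helper below is only an illustration:

fn highest_pending_int(estat: usize) -> Option<usize> {
    let is = estat & 0x1fff; // CSR.ESTAT.IS occupies bits [12:0]
    if is == 0 {
        None
    } else {
        // Highest set bit index == interrupt number (e.g. 12 for IPI, 11 for TI).
        Some(usize::BITS as usize - 1 - is.leading_zeros() as usize)
    }
}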

Traditional IO Interrupts

The diagram above shows the interrupt system of the 3A series processor + 7A series bridge chip. It illustrates two types of interrupt processes, the upper part shows the interrupt through the INTn0 line, and the lower part shows the interrupt through an HT message packet.

Interrupts intX issued by devices (except PCIe devices working in MSI mode) are sent to the 7A chip's internal interrupt controller. After interrupt routing, they are either driven onto the bridge chip's pins or converted into HT message packets sent to the 3A's HT controller. The 3A's interrupt controller receives the interrupt through its external interrupt pins or its HT controller and routes it to a specific processor core[1].

The traditional IO interrupts of the Loongson 3A5000 chip support 32 interrupt sources, which are managed in a unified way as shown in the diagram below. Each IO interrupt source can be individually configured: enable/disable, trigger mode, and the target processor core interrupt pin to which it is routed. Traditional interrupts do not support cross-chip distribution; they can only interrupt processor cores within the same processor chip[2].

Extended IO Interrupts

In addition to the traditional IO interrupt mode, processors starting from the 3A5000 also support extended IO interrupts, which distribute the 256-bit interrupt vector on the HT bus directly to each processor core instead of going through the HT interrupt line, improving the flexibility of IO interrupt usage[2].

References

[1] Loongson Technology Corporation Limited. Loongson 7A2000 Bridge Chip User Manual. V1.0. Chapter 5.

[2] Loongson Technology Corporation Limited. Loongson 3A5000/3B5000 Processor Register Usage Manual - Multi-core Processor Architecture, Register Description and System Software Programming Guide. V1.3. Chapter 11.

[3] Loongson Technology Corporation Limited. Loongson Architecture Reference Manual. Volume One: Basic Architecture.

ARM-SMMU Technical Documentation

Abstract: Introducing the development process of ARM-SMMU.

Background Knowledge

A brief introduction to the principle and function of SMMU.

What is DMA? Why is IOMMU needed?

Virtual machines running on top of the hypervisor need to interact with devices, but if the CPU had to shepherd every transfer, processing efficiency would suffer. The DMA mechanism addresses this: DMA allows devices to exchange data directly with memory without CPU involvement.

This gives a rough picture of how virtual machines interact with devices through DMA: the virtual machine issues a DMA request telling the target device where to write the data, and the device then writes to memory at that address.

However, some issues need to be considered in the above process:

  • The hypervisor has virtualized memory for each virtual machine, so the target memory address of the DMA request issued by the virtual machine is GPA, also called IOVA here, which needs to be converted to the real PA to be written to the correct position in physical memory.
  • Moreover, if the range of IOVA is not restricted, it means that any memory address can be accessed through the DMA mechanism, causing unforeseen severe consequences.

Therefore, we need a component that performs the address translation and checks the legality of the accessed addresses, much like the MMU memory management unit does for the CPU. This component is called the IOMMU, and in the Arm architecture it is known as the SMMU (referred to as SMMU hereafter).

Now you know that SMMU can convert virtual addresses to physical addresses, thus ensuring the legality of devices directly accessing memory.

Specific Tasks of SMMU

As mentioned above, the SMMU performs a role similar to the MMU: the MMU serves virtual machines or applications, while the SMMU serves devices. Each device is identified by a stream id (sid), and the corresponding table is called the stream table. This table is indexed by the device's sid; for PCI devices, the sid can be obtained from the BDF number: sid = (B << 5) | (D << 3) | F.

Development Work

Currently, we have implemented support for stage-2 address translation of SMMUv3 in Qemu, created a simple linear table, and conducted simple verification using PCI devices.

The IOMMU work has not yet been merged into the mainline; you can switch to the IOMMU branch to check it.

Overall Idea

We pass through the PCI HOST to zone0, that is, add the PCI node to the device tree provided to zone0, map the corresponding memory address in the second-stage page table of zone0, and ensure normal interrupt injection. Then zone0 will detect and configure the PCI device by itself, and we only need to configure SMMU in the hypervisor.

Qemu Parameters

Add iommu=smmuv3 in machine to enable SMMUv3 support, and add arm-smmuv3.stage=2 in global to enable second-stage address translation.

Note that nested translation is not yet supported in Qemu. If stage=2 is not specified, only the first stage of address translation is supported by default. Please use Qemu version 8.1 or above, as lower versions do not support enabling second-stage address translation.

When adding PCI devices, please ensure to enable iommu_platform=on.

The addr parameter specifies the device's slot and function on the PCI bus (and thus its BDF number).

In the PCI bus simulated by Qemu, in addition to the PCI HOST, there is a default network card device, so the addr parameter of other added devices must start from 2.0.

// scripts/qemu-aarch64.mk

QEMU_ARGS := -machine virt,secure=on,gic-version=3,virtualization=on,iommu=smmuv3
QEMU_ARGS += -global arm-smmuv3.stage=2

QEMU_ARGS += -device virtio-blk-pci,drive=Xa003e000,disable-legacy=on,disable-modern=off,iommu_platform=on,addr=2.0

Consulting the Qemu source code reveals that the memory area corresponding to VIRT_SMMU starts at 0x09050000 and is 0x20000 in size. We need to access this area, so it must be mapped in the hypervisor's page table.

// src/arch/aarch64/mm.rs

pub fn init_hv_page_table(fdt: &fdt::Fdt) -> HvResult {
    hv_pt.insert(MemoryRegion::new_with_offset_mapper(
        smmuv3_base(),
        smmuv3_base(),
        smmuv3_size(),
        MemFlags::READ | MemFlags::WRITE,
    ))?;
}

SMMUv3 Data Structure

This structure contains a reference to the SMMUv3 register region that will be accessed, whether 2-level stream tables are supported, the maximum number of sid bits, and the base address and allocated page frames of the stream table.

The rp is a reference to the defined RegisterPage, which is set according to the offsets in Chapter 6 of the SMMUv3 manual. Readers can refer to it on their own.

// src/arch/aarch64/iommu.rs

pub struct Smmuv3{
    rp:&'static RegisterPage,

    strtab_2lvl:bool,
    sid_max_bits:usize,

    frames:Vec<Frame>,

    // strtab
    strtab_base:usize,

    // about queues...
}

new()

After completing the mapping work, we can refer to the corresponding register area.

impl Smmuv3{
    fn new() -> Self{
        let rp = unsafe {
            &*(SMMU_BASE_ADDR as *const RegisterPage)
        };

        let mut r = Self{
            ...
        };

        r.check_env();

        r.init_structures();

        r.device_reset();

        r
    }
}

check_env()

Check which stage of address translation the current environment supports, what type of stream table it supports, how many bits of sid it supports, etc.

Taking the check of the supported stream table format as an example: the supported table type is reported in the IDR0 register. Obtain the value of IDR0 via self.rp.IDR0.get() as usize, then use extract_bits to read the ST_LEVEL field. According to the manual, 0b00 means only linear tables are supported, 0b01 means both linear and 2-level tables are supported, and 0b1x is reserved. Based on this information we can choose which type of stream table to create.

impl Smmuv3{
    fn check_env(&mut self){
        let idr0 = self.rp.IDR0.get() as usize;

        info!("Smmuv3 IDR0:{:b}",idr0);

        // supported types of stream tables.
        let stb_support = extract_bits(idr0, IDR0_ST_LEVEL_OFF, IDR0_ST_LEVEL_LEN);
        match stb_support{
            0 => info!("Smmuv3 Linear Stream Table Supported."),
            1 => {info!("Smmuv3 2-level Stream Table Supported.");
                self.strtab_2lvl = true;
            }
            _ => info!("Smmuv3 doesn't support any stream table."),
        }

	...
    }
}
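
The extract_bits helper used throughout these snippets can be assumed to behave as follows (a sketch, not necessarily hvisor's exact implementation):

// Return `len` bits of `val` starting at bit offset `off`.
fn extract_bits(val: usize, off: usize, len: usize) -> usize {
    (val >> off) & ((1usize << len) - 1)
}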

init_linear_strtab()

We need to support the second stage of address translation, and there are not many devices in the system, so we choose to use a linear table.

When allocating the space needed for the linear table, the number of table entries (determined by the current maximum number of sid bits) multiplied by the size of each entry STRTAB_STE_SIZE tells us how many page frames to request. However, SMMUv3 places a strict alignment requirement on the starting address of the stream table: its low (5 + sid_max_bits) bits must be 0.

Since the current hypervisor does not support allocating memory with such an alignment directly, we over-allocate a region and pick an address inside it that satisfies the alignment as the table base, at the cost of some wasted space.
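
Selecting an aligned base inside the over-allocated region amounts to rounding up to the required alignment; a sketch under these assumptions (the function name is illustrative):

// The stream table base must have its low (5 + sid_max_bits) bits equal to zero.
fn pick_strtab_base(region_start: usize, sid_max_bits: usize) -> usize {
    let align = 1usize << (5 + sid_max_bits);
    (region_start + align - 1) & !(align - 1) // round up to the next aligned address
}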

After applying for space, we can fill in this table's base address into the STRTAB_BASE register:

	let mut base = extract_bits(self.strtab_base, STRTAB_BASE_OFF, STRTAB_BASE_LEN);
	base = base << STRTAB_BASE_OFF;
	base |= STRTAB_BASE_RA;
	self.rp.STRTAB_BASE.set(base as _);

Next, we also need to set the STRTAB_BASE_CFG register to indicate the format of the table we are using, whether it is a linear table or a secondary table, and the number of table items (represented in LOG2 form, i.e., the maximum number of bits of SID):

        // format: linear table
        cfg |= STRTAB_BASE_CFG_FMT_LINEAR << STRTAB_BASE_CFG_FMT_OFF;

        // table size: log2(entries)
        // entry_num = 2^(sid_bits)
        // log2(size) = sid_bits
        cfg |= self.sid_max_bits << STRTAB_BASE_CFG_LOG2SIZE_OFF;

        // linear table -> ignore SPLIT field
        self.rp.STRTAB_BASE_CFG.set(cfg as _);

init_bypass_ste(sid:usize)

Currently, we have not configured any relevant information yet, so we need to set all table entries to the default state first.

For each sid, locate its table entry from the table base address and fill in the default state: the entry is marked valid and its address translation configuration is set to BYPASS.

	let base = self.strtab_base + sid * STRTAB_STE_SIZE;
	let tab = unsafe{&mut *(base as *mut [u64;STRTAB_STE_DWORDS])};

	let mut val:usize = 0;
	val |= STRTAB_STE_0_V;
	val |= STRTAB_STE_0_CFG_BYPASS << STRTAB_STE_0_CFG_OFF;

device_reset()

We have done some preparatory work above, but a few additional configurations are still needed, such as enabling the SMMU; otherwise, it remains disabled.

	let cr0 = CR0_SMMUEN;
	self.rp.CR0.set(cr0 as _);

write_ste(sid:usize,vmid:usize,root_pt:usize)

This method is used to configure specific device information.

First, we need to find the address of the corresponding table entry based on sid.

	let base = self.strtab_base + sid * STRTAB_STE_SIZE;
        let tab = unsafe{&mut *(base as *mut [u64;STRTAB_STE_DWORDS])};

In the second step, we need to indicate that the information related to this device is used for the second stage of address translation, and this table entry is now valid.

        let mut val0:usize = 0;
        val0 |= STRTAB_STE_0_V;
        val0 |= STRTAB_STE_0_CFG_S2_TRANS << STRTAB_STE_0_CFG_OFF;

In the third step, we specify which virtual machine this device is allocated to and enable second-stage page table walks; S2AA64 indicates that the stage-2 translation tables use the AArch64 format, and S2R enables stage-2 fault recording.

        let mut val2:usize = 0;
        val2 |= vmid << STRTAB_STE_2_S2VMID_OFF;
        val2 |= STRTAB_STE_2_S2PTW;
        val2 |= STRTAB_STE_2_S2AA64;
        val2 |= STRTAB_STE_2_S2R;

The last step is to point out the basis for the second-stage translation, which is the page table of the corresponding virtual machine in the hypervisor. Just fill in the base address of the page table in the corresponding position, i.e., the S2TTB field.

We also need to describe the configuration of this page table in the VTCR field, so that the SMMU knows the page table's format and can walk it correctly.

        // VTCR fields (SMMUv3 STE.S2VTCR layout): T0SZ=20, SL0=2, IRGN0=1,
        // ORGN0=1, SH0=3, TG0=0 (4KB granule), PS=4
        let vtcr = 20 + (2<<6) + (1<<8) + (1<<10) + (3<<12) + (0<<14) + (4<<16);
        let v = extract_bits(vtcr as _, 0, STRTAB_STE_2_VTCR_LEN);
        val2 |= v << STRTAB_STE_2_VTCR_OFF;

        let vttbr = extract_bits(root_pt, STRTAB_STE_3_S2TTB_OFF, STRTAB_STE_3_S2TTB_LEN);

Initialization and Device Allocation

In src/main.rs, after the hypervisor's page table is initialized (mapping the SMMU-related area), SMMU can be initialized.

fn primary_init_early(dtb: usize) {
    ...

    crate::arch::mm::init_hv_page_table(&host_fdt).unwrap();

    info!("Primary CPU init hv page table OK.");

    iommu_init();

    zone_create(0,ROOT_ENTRY,ROOT_ZONE_DTB_ADDR as _, DTB_IPA).unwrap();
    INIT_EARLY_OK.store(1, Ordering::Release);
}

Next, we need to allocate devices, which we complete synchronously when creating the virtual machine. Currently, we only allocate devices for zone0 to use.

// src/zone.rs

pub fn zone_create(
    zone_id: usize,
    guest_entry: usize,
    dtb_ptr: *const u8,
    dtb_ipa: usize,
) -> HvResult<Arc<RwLock<Zone>>> {
    ...

    if zone_id==0{
        // add_device(0, 0x8, zone.gpm.root_paddr());
        iommu_add_device(zone_id, BLK_PCI_ID, zone.gpm.root_paddr());
    }
  
    ...
}

Simple Verification

Start qemu with the parameter -trace smmuv3_* to see related outputs:

smmuv3_config_cache_hit Config cache HIT for sid=0x10 (hits=1469, misses=1, hit rate=99)
smmuv3_translate_success smmuv3-iommu-memory-region-16-2 sid=0x10 iova=0x8e043242 translated=0x8e043242 perm=0x3

Implementation of the RISC-V IOMMU Standard

RISC-V IOMMU Workflow

For virtualized systems with DMA-capable devices, a virtual machine could compromise the system through malicious DMA configuration, affecting overall system stability. Introducing an IOMMU further strengthens the isolation between Zones and helps ensure system security.

IOMMU supports two-stage address translation and provides DMA remapping functionality. On one hand, it offers memory protection for DMA operations, limiting the physical memory areas that devices can access, making DMA operations safer. On the other hand, device DMA operations only require contiguous IOVAs, not contiguous PAs, allowing for better utilization of scattered pages in physical memory.

To perform address translation and memory protection, RISC-V IOMMU uses the same page table format as the CPU's MMU in both the first and second stages. Using the same page table format as the CPU MMU simplifies some of the complexities associated with DMA in memory management, and allows the CPU MMU and IOMMU to share the same page tables.

In hvisor, the second-stage address translation process supported by IOMMU is the translation from device-side IOVA (GPA) to HPA, and the second-stage page table is shared between the CPU MMU and IOMMU, as illustrated below:

riscv_iommu_struct.png

Before translation, IOMMU needs to find the device context (DC) in the device directory table based on the device identifier (device_id). Each device has a unique device_id; for platform devices, the device_id is specified during hardware implementation, and for PCI/PCIe devices, the BDF number of the PCI/PCIe device is used as the device_id. The DC contains information such as the base address of the two-stage address translation page table and some translation control information. For example, in two-stage address translation, the IO device's IOVA is first translated into GPA in the Stage-1 page table pointed to by the fsc field, then into HPA in the Stage-2 page table pointed to by the iohgatp field, and accesses memory accordingly. In hvisor, only the second-stage translation using the iohgatp field is supported, as shown below:

riscv_iommu_translate.png

RISC-V IOMMU, as a physical hardware, can be accessed using the MMIO method, and its various fields' byte offsets are specified in the IOMMU specification manual. Implementation requires access according to the specified offsets and sizes to correctly retrieve the values of each field. The IommuHw structure is defined to simplify access to the physical IOMMU, as follows:

#[repr(C)]
#[repr(align(0x1000))]
pub struct IommuHw {
    caps: u64,
    fctl: u32,
    __custom1: [u8; 4],
    ddtp: u64,
    cqb: u64,
    cqh: u32,
    cqt: u32,
    fqb: u64,
    fqh: u32,
    fqt: u32,
    pqb: u64,
    pqh: u32,
    pqt: u32,
    cqcsr: u32,
    fqcsr: u32,
    pqcsr: u32,
    ipsr: u32,
    iocntovf: u32,
    iocntinh: u32,
    iohpmcycles: u64,
    iohpmctr: [u64; 31],
    iohpmevt: [u64; 31],
    tr_req_iova: u64,
    tr_req_ctl: u64,
    tr_response: u64,
    __rsv1: [u8; 64],
    __custom2: [u8; 72],
    icvec: u64,
    msi_cfg_tbl: [MsiCfgTbl; 16],
    __rsv2: [u8;3072],
}

The Capabilities of the IOMMU is a read-only register, which reports the features supported by the IOMMU. When initializing the IOMMU, it is necessary to first check this register to determine if the hardware can support IOMMU functionality.

When initializing, the IOMMU first checks if the current IOMMU matches the driver. The rv_iommu_check_features function is defined to check for hardware support for Sv39x4, WSI, etc., as implemented below:

impl IommuHw {
    pub fn rv_iommu_check_features(&self){
        let caps = self.caps as usize;
        let version = caps & RV_IOMMU_CAPS_VERSION_MASK;
        // get version, version 1.0 -> 0x10
        if version != RV_IOMMU_SUPPORTED_VERSION{
            error!("RISC-V IOMMU unsupported version: {}", version);
        }
        // support SV39x4
        if caps & RV_IOMMU_CAPS_SV39X4_BIT == 0 {
            error!("RISC-V IOMMU HW does not support Sv39x4");
        }
        if caps & RV_IOMMU_CAPS_MSI_FLAT_BIT == 0 {
            error!("RISC-V IOMMU HW does not support MSI Address Translation (basic-translate mode)");
        }
        if caps & RV_IOMMU_CAPS_IGS_MASK == 0 {
            error!("RISC-V IOMMU HW does not support WSI generation");
        }
        if caps & RV_IOMMU_CAPS_AMO_HWAD_BIT == 0 {
            error!("RISC-V IOMMU HW AMO HWAD unsupport");
        }
    }
}

The fctl of the IOMMU is a functional control register, which provides some functional controls of the IOMMU, including whether the IOMMU accesses memory data in big-endian or little-endian, whether the interrupt generated by the IOMMU is a WSI interrupt or an MSI interrupt, and the control of the guest address translation scheme.

The ddtp of the IOMMU is the device directory table pointer register, which contains the PPN of the root page of the device directory table, as well as the IOMMU Mode, which can be configured as Off, Bare, 1LVL, 2LVL, or 3LVL. Off means that the IOMMU does not allow devices to access memory, Bare means that the IOMMU allows all memory accesses by devices without translation and protection, and 1LVL, 2LVL, 3LVL indicate the levels of the device directory table used by the IOMMU.

The rv_iommu_init function is defined for functional checks and controls of the physical IOMMU, such as configuring interrupts as WSI, configuring the device directory table, etc., as implemented below:

impl IommuHw {
	pub fn rv_iommu_init(&mut self, ddt_addr: usize){
        // Read and check caps
        self.rv_iommu_check_features();
        // Set fctl.WSI We will be first using WSI as IOMMU interrupt mechanism
        self.fctl = RV_IOMMU_FCTL_DEFAULT;
        // Clear all IP flags (ipsr)
        self.ipsr = RV_IOMMU_IPSR_CLEAR;
        // Configure ddtp with DDT base address and IOMMU mode
        self.ddtp = IOMMU_MODE as u64 | ((ddt_addr >> 2) & RV_IOMMU_DDTP_PPN_MASK) as u64;    
    }
}

The format of the entries in the device directory table is given in the specification manual. To make the hardware work, it is necessary to implement it in accordance with the specification. The DdtEntry structure is defined to represent an entry in the device directory table, representing a DMA device. The iohgatp saves the PPN of the second-stage page table, the Guest Software Context ID (GSCID), and the Mode field used to select the second-stage address translation scheme. The tc contains many translation control-related bits, most of which are not used in hvisor, and the valid bits need to be set to 1 for subsequent higher-level functional extensions. The structure of the device directory table entry is as follows:

#[repr(C)]
struct DdtEntry{
    tc: u64,
    iohgatp: u64,
    ta: u64,
    fsc: u64,
    msiptp: u64,
    msi_addr_mask: u64,
    msi_addr_pattern: u64,
    __rsv: u64,
}

Currently, hvisor only supports a single-level device directory table. The Lvl1DdtHw structure is defined to facilitate access to the device directory table entries. A single-level device directory table can support 64 DMA devices, occupying one physical page, as follows:

pub struct Lvl1DdtHw{
    dc: [DdtEntry; 64],
}

The Iommu structure is defined as a higher-level abstraction of the IOMMU, where base is the base address of IommuHw, i.e., the physical address of the IOMMU, which can be used to access the physical IOMMU. ddt is the device directory table, which needs to be allocated physical pages during IOMMU initialization. Since it only supports a single-level device directory table, only one physical page is needed, as defined below:

pub struct Iommu{
    pub base: usize,
    pub ddt: Frame,		// Lvl1 DDT -> 1 phys page
}

The device directory table and translation page table of the IOMMU are stored in memory and need to be allocated according to actual needs, i.e., the memory of the device directory table needs to be allocated when new is called. In addition, adding device entries to the device directory table is a very important task, because DMA devices perform DMA operations, and the first step is to find the translation page table and other information from the device directory table, and then the IOMMU performs translation based on the page table-related information. The tc, iohgatp, etc. need to be filled in, as implemented below:

impl Iommu {
    pub fn new(base: usize) -> Self{
        Self { 
            base: base,
            ddt: Frame::new_zero().unwrap(),
        }
    }

    pub fn iommu(&self) -> &mut IommuHw{
        unsafe { &mut *(self.base as *mut _) }
    }

    pub fn dc(&self) -> &mut Lvl1DdtHw{
        unsafe { &mut *(self.ddt.start_paddr() as *mut _)}
    }

    pub fn rv_iommu_init(&mut self){
        self.iommu().rv_iommu_init(self.ddt.start_paddr());
    }

    pub fn rv_iommu_add_device(&self, device_id: usize, vm_id: usize, root_pt: usize){
        // only support 64 devices
        if device_id > 0 && device_id < 64{
            // configure DC
            let tc: u64 = 0 | RV_IOMMU_DC_VALID_BIT as u64 | 1 << 4;
            self.dc().dc[device_id].tc = tc;
            let mut iohgatp: u64 = 0;
            iohgatp |= (root_pt as u64 >> 12) & RV_IOMMU_DC_IOHGATP_PPN_MASK as u64;
            iohgatp |= (vm_id as u64) & RV_IOMMU_DC_IOHGATP_GSCID_MASK as u64;
            iohgatp |= RV_IOMMU_IOHGATP_SV39X4 as u64;
            self.dc().dc[device_id].iohgatp = iohgatp;
            self.dc().dc[device_id].fsc = 0;
            info!("{:#x}", &mut self.dc().dc[device_id] as *mut _ as usize);
            info!("RV IOMMU: Write DDT, add decive context, iohgatp {:#x}", iohgatp);
        }
        else{
            info!("RV IOMMU: Invalid device ID: {}", device_id);
        }
    }
}

Since hvisor supports RISC-V's IOMMU and Arm's SMMUv3, two interfaces for external calls, iommu_init and iommu_add_device, are encapsulated during implementation. These two functions have the same function names and parameters as the common call interfaces under the Arm architecture, as implemented below:

// alloc the Fram for DDT & Init
pub fn iommu_init() {
    let iommu = Iommu::new(0x10010000);
    IOMMU.call_once(|| RwLock::new(iommu));
    rv_iommu_init();
}

// every DMA device do!
pub fn iommu_add_device(vm_id: usize, device_id: usize, root_pt: usize){
    info!("RV_IOMMU_ADD_DEVICE: root_pt {:#x}, vm_id {}", root_pt, vm_id);
    let iommu = iommu();
    iommu.write().rv_iommu_add_device(device_id, vm_id, root_pt);
}

Virtio

Note that this document mainly introduces how Virtio is implemented in hvisor. For detailed usage tutorials, please refer to hvisor-tool-README.

Introduction to Virtio

Virtio, proposed by Rusty Russell in 2008, is a device virtualization standard aimed at improving device performance and unifying various semi-virtual device solutions. Currently, Virtio includes over a dozen peripherals such as disks, network cards, consoles, GPUs, etc., and many operating systems including Linux have implemented front-end drivers for various Virtio devices. Therefore, virtual machine monitors only need to implement Virtio backend devices to directly allow virtual machines that have implemented Virtio drivers, such as Linux, to use Virtio devices.

The Virtio protocol defines a set of driver interfaces for semi-virtual IO devices, specifying that the operating system of the virtual machine needs to implement front-end drivers, and the Hypervisor needs to implement backend devices. The virtual machine and Hypervisor communicate and interact through the data plane interface and control plane interface.

Data Plane Interface

The data plane interface refers to the method of IO data transfer between the driver and the device. For Virtio, the data plane interface refers to a shared memory Virtqueue between the driver and the device. Virtqueue is an important data structure in the Virtio protocol, representing the mechanism and abstract representation of batch data transfer for Virtio devices, used for various data transfer operations between the driver and the device. Virtqueue consists of three main components: descriptor table, available ring, and used ring, which function as follows:

  1. Descriptor Table: An array of descriptors. Each descriptor contains four fields: addr, len, flag, next. Descriptors can represent the address (addr), size (len), and attributes (flag) of a memory buffer, which may contain IO request commands or data (filled by the Virtio driver) or the results after the completion of IO requests (filled by the Virtio device). Descriptors can be linked into a descriptor chain by the next field as needed, with one descriptor chain representing a complete IO request or result.

  2. Available Ring: A circular queue, where each element represents the index of an IO request issued by the Virtio driver in the descriptor table, i.e., each element points to the starting descriptor of a descriptor chain.

  3. Used Ring: A circular queue, where each element represents the index of the IO result written by the Virtio device after completing the IO request in the descriptor table.

Using these three data structures, the commands, data, and results of IO data transfer requests between the driver and the device can be fully described. The Virtio driver is responsible for allocating the memory area where the Virtqueue is located and writing its address into the corresponding MMIO control registers to inform the Virtio device. This way, the device can obtain the addresses of the three and perform IO transfers with the driver through the Virtqueue.
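
For reference, the three components described above have the following layout in the split-virtqueue format of the Virtio specification (shown here as an illustrative Rust sketch with an assumed queue size of 256):

#[repr(C)]
struct VirtqDesc {
    addr: u64,  // guest-physical address of the buffer
    len: u32,   // buffer length in bytes
    flags: u16, // NEXT / WRITE / INDIRECT attribute bits
    next: u16,  // index of the next descriptor in the chain
}

#[repr(C)]
struct VirtqAvail {
    flags: u16,
    idx: u16,         // where the driver writes the next entry
    ring: [u16; 256], // indices of descriptor-chain heads
}

#[repr(C)]
struct VirtqUsedElem {
    id: u32,  // head index of the completed descriptor chain
    len: u32, // number of bytes the device wrote
}

#[repr(C)]
struct VirtqUsed {
    flags: u16,
    idx: u16,
    ring: [VirtqUsedElem; 256],
}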

Control Plane Interface

The control plane interface refers to the way the driver discovers, configures, and manages the device. In hvisor, the control plane interface of Virtio mainly refers to the MMIO registers based on memory mapping. The operating system first detects MMIO-based Virtio devices through the device tree and negotiates, configures, and notifies the device by reading and writing these memory-mapped control registers. Some of the more important registers include:

  • QueueSel: Used to select the current Virtqueue being operated. A device may contain multiple Virtqueues, and the driver indicates which queue it is operating by writing this register.

  • QueueDescLow, QueueDescHigh: Used to indicate the intermediate physical address IPA of the descriptor table. The driver writes these two 32-bit registers to inform the device of the 64-bit physical address of the descriptor table, used to establish shared memory.

  • QueueDriverLow, QueueDriverHigh: Used to indicate the intermediate physical address IPA of the available ring.

  • QueueDeviceLow, QueueDeviceHigh: Used to indicate the intermediate physical address IPA of the used ring.

  • QueueNotify: When the driver writes this register, it indicates that there are new IO requests in the Virtqueue that need to be processed.

In addition to the control registers, the MMIO memory area of each device also contains a device configuration space. For disk devices, the configuration space indicates the disk capacity and block size; for network devices, the configuration space indicates the device's MAC address and connection status. For console devices, the configuration space provides console size information.

For the MMIO memory area of Virtio devices, the Hypervisor does not map the second-stage address translation for the virtual machine. When the driver reads and writes this area, a page fault exception will occur, causing a VM Exit into the Hypervisor. The Hypervisor can determine the register accessed by the driver based on the access address that caused the page fault and take appropriate action, such as notifying the device to perform IO operations. After processing, the Hypervisor returns to the virtual machine through VM Entry.

IO Process of Virtio Devices

The process from when a user process running on a virtual machine initiates an IO operation to when it obtains the IO result can be roughly divided into the following four steps:

  1. The user process initiates an IO operation, and the Virtio driver in the operating system kernel receives the IO operation command, writes it into the Virtqueue, and writes to the QueueNotify register to notify the Virtio device.
  2. After receiving the notification, the device parses the available ring and descriptor table, obtains the specific IO request and buffer address, and performs the actual IO operation.
  3. After completing the IO operation, the device writes the result into the used ring. If the driver program uses the polling method to wait for the IO result in the used ring, the driver can immediately receive the result information; otherwise, it needs to notify the driver program through an interrupt.
  4. The driver program obtains the IO result from the used ring and returns it to the user process.

Design and Implementation of Virtio Backend Mechanism

Virtio devices in hvisor follow the Virtio v1.2 protocol for design and implementation. To maintain good device performance while ensuring the lightness of hvisor, two design points of the Virtio backend are:

  1. Adopting a microkernel design philosophy, moving the implementation of Virtio devices from the Hypervisor layer to the management virtual machine user space. The management virtual machine runs the Linux operating system, known as Root Linux. Physical devices such as disks and network cards are passed through to Root Linux, while Virtio devices serve as daemons on Root Linux, providing device emulation for other virtual machines (Non Root Linux). This ensures the lightness of the Hypervisor layer, facilitating formal verification.

  2. The Virtio driver programs located on other virtual machines and the Virtio devices on Root Linux interact directly through shared memory. The shared memory area stores interaction information, known as the communication springboard, and adopts a producer-consumer model, shared by the Virtio device backend and Hypervisor. This reduces the interaction overhead between the driver and the device, enhancing the device's performance.

Based on the above two design points, the implementation of the Virtio backend device will be divided into three parts: communication springboard, Virtio daemon, and kernel service module:

Communication Springboard

To achieve efficient interaction between drivers and devices distributed across different virtual machines, this document designs a communication springboard as a bridge for passing control plane interaction information between the driver and the device. It is essentially a shared memory area containing two circular queues: the request submission queue and the request result queue, which store interaction requests issued by the driver and results returned by the device, respectively. Both queues are located in the memory area shared by the Hypervisor and the Virtio daemon and adopt a producer-consumer model. The Hypervisor acts as the producer of the request submission queue and the consumer of the request result queue, while the Virtio daemon acts as the consumer of the request submission queue and the producer of the request result queue. This facilitates the transfer of Virtio control plane interaction information between Root Linux and other virtual machines.

It should be noted that the request submission queue and the request result queue are not the same as the Virtqueue. The Virtqueue is the data plane interface between the driver and the device, used for data transfer and essentially containing information about the data buffer's address and structure. The communication springboard, on the other hand, is used for control plane interaction and communication between the driver and the device.

  • Communication Springboard Structure

The communication springboard is represented by the virtio_bridge structure, where req_list is the request submission queue, and res_list and cfg_values together form the request result queue. The device_req structure represents interaction requests sent by the driver to the device, and the device_res structure represents interrupt information to be injected by the device to notify the virtual machine driver program that the IO operation is complete.

// Communication springboard structure:
struct virtio_bridge {
    __u32 req_front;
    __u32 req_rear;
    __u32 res_front;
    __u32 res_rear;
    // Request submission queue
    struct device_req req_list[MAX_REQ]; 
    // res_list, cfg_flags, and cfg_values together form the request result queue
    struct device_res res_list[MAX_REQ];
    __u64 cfg_flags[MAX_CPUS]; 
    __u64 cfg_values[MAX_CPUS];
    __u64 mmio_addrs[MAX_DEVS];
    __u8 mmio_avail;
    __u8 need_wakeup;
};
// Interaction requests sent by the driver to the device
struct device_req {
    __u64 src_cpu;
    __u64 address; // zone's ipa
    __u64 size;
    __u64 value;
    __u32 src_zone;
    __u8 is_write;
    __u8 need_interrupt;
    __u16 padding;
};
// Interrupt information to be injected by the device
struct device_res {
    __u32 target_zone;
    __u32 irq_id;
};

Request Submission Queue

The request submission queue is used for the driver to send control plane interaction requests to the device. When the driver reads and writes the MMIO memory area of the Virtio device, since the Hypervisor does not perform second-stage address mapping for this memory area in advance, the CPU executing the driver program will receive a page fault exception and fall into the Hypervisor. The Hypervisor will combine the current CPU number, page fault address, address width, value to be written (ignored if it is a read), virtual machine ID, and whether it is a write operation into a structure called device_req and add it to the request submission queue req_list. At this point, the Virtio daemon monitoring the request submission queue will retrieve the request for processing.

To facilitate communication between the Virtio daemon and the Hypervisor based on shared memory, the request submission queue req_list is implemented as a circular queue. The head index req_front is updated only by the Virtio process after retrieving a request, and the tail index req_rear is updated only by the Hypervisor after adding a request. If the head and tail indexes are equal, the queue is empty; if the tail index plus one modulo the queue size equals the head index, the queue is full, and the driver needs to block in place when adding requests, waiting for the queue to become available.

To ensure that the Hypervisor and the Virtio process have real-time observation of and mutually exclusive access to shared memory, the Hypervisor needs to perform a write memory barrier after adding a request to the queue and only then update the tail index, ensuring that the Virtio process correctly retrieves the request when it observes the tail index update. After the Virtio daemon retrieves a request from the queue, it needs to perform a write memory barrier to ensure that the Hypervisor can immediately observe the head index update. This producer-consumer model and circular queue method, combined with the necessary memory barriers, solve the mutual exclusion problem of shared memory under different privilege levels.

Since multiple virtual machines may have multiple CPUs adding requests to the request submission queue simultaneously, CPUs need to acquire a mutex lock before operating the request submission queue. In contrast, only the main thread of the Virtio daemon operates the request submission queue, so no locking is required there. This solves the mutual exclusion problem of shared memory under the same privilege level.
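
The producer side of this circular queue can be sketched as follows; the field names follow the virtio_bridge structure shown earlier, while MAX_REQ, the Rust mirror types, and the function name are illustrative assumptions:

use core::sync::atomic::{fence, Ordering};

const MAX_REQ: u32 = 32; // assumed queue capacity

#[repr(C)]
struct DeviceReq {
    src_cpu: u64,
    address: u64,
    size: u64,
    value: u64,
    src_zone: u32,
    is_write: u8,
    need_interrupt: u8,
    padding: u16,
}

#[repr(C)]
struct VirtioBridge {
    req_front: u32,
    req_rear: u32,
    // ... other fields of the springboard omitted in this sketch
    req_list: [DeviceReq; MAX_REQ as usize],
}

// Hypervisor side: append one request and publish it with a write barrier.
fn push_req(bridge: &mut VirtioBridge, req: DeviceReq) -> Result<(), ()> {
    let front = unsafe { core::ptr::read_volatile(&bridge.req_front) };
    let rear = bridge.req_rear;
    if (rear + 1) % MAX_REQ == front {
        return Err(()); // queue full: the caller must wait for the daemon to drain it
    }
    bridge.req_list[rear as usize] = req;
    fence(Ordering::Release); // make the request visible before the tail index moves
    unsafe { core::ptr::write_volatile(&mut bridge.req_rear, (rear + 1) % MAX_REQ) };
    Ok(())
}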

Request Result Queue

After the Virtio daemon completes the processing of a request, it will put the related information into the request result queue and notify the driver program. To improve communication efficiency, based on the classification of Virtio interaction information, the request result queue is divided into two sub-queues:

  • Data Plane Result Sub-queue

The data plane result queue, represented by the res_list structure, is used to store interrupt information. When the driver program writes to the Queue Notify register of the device memory area, it indicates that there is new data in the available ring and the device needs to perform IO operations. Since IO operations take too long, Linux requires the Hypervisor to submit the IO request to the device and then immediately return the CPU from the Hypervisor to the virtual machine to execute other tasks, improving CPU utilization. Therefore, after the Virtio process completes the IO operation and updates the used ring, it combines the device's interrupt number irq_id and the virtual machine ID of the device into a device_res structure, adds it to the data plane result sub-queue res_list, and falls into the Hypervisor through ioctl and hvc.

The data plane result queue res_list is similar to the request submission queue in that it is a circular queue whose head index res_front and tail index res_rear determine the queue length. The Hypervisor retrieves all elements from res_list and adds them to the interrupt injection table VIRTIO_IRQS. The interrupt injection table is a key-value collection based on a B-tree, where the key is the CPU number and the value is an array: the first element of the array indicates the valid length, and the subsequent elements indicate the interrupts to be injected into this CPU (a sketch of this table is given after this list). To prevent multiple CPUs from operating the interrupt injection table simultaneously, CPUs need to acquire a global mutex lock before accessing it. Through the interrupt injection table, CPUs can determine which interrupts need to be injected into themselves based on their CPU number. Subsequently, the Hypervisor sends IPI inter-core interrupts to the CPUs needing interrupt injection, and the CPUs receiving these inter-core interrupts traverse the interrupt injection table and inject the interrupts into themselves.

The following diagram describes the entire process, where the black solid triangle arrows represent operations executed by CPUs running other virtual machines, and the black ordinary arrows represent operations executed by CPUs running Root Linux.

data_plane_queue
  • Control Plane Result Sub-queue

The control plane result queue is represented by the cfg_values and cfg_flags arrays; each CPU corresponds to one fixed position in both arrays. cfg_values stores the results of control plane interface interactions, and cfg_flags indicates whether the device has completed the control plane interaction request. When the driver program reads or writes the registers of the device memory area (excluding the Queue Notify register), it sends configuration- and negotiation-related control plane interaction requests. After such a request is added to the request submission queue, the CPU that trapped into the Hypervisor on behalf of the driver must wait for the result to return before going back to the virtual machine. Since the Virtio daemon does not need to perform IO operations for these requests, it can complete them quickly and does not need to update the used ring.

After completing the request, the daemon writes the result value into cfg_values[id] (for read requests) according to the driver-side CPU number id, performs a write memory barrier, then increments cfg_flags[id], and performs a second write memory barrier. This guarantees that when the driver-side CPU observes the change in cfg_flags[id], it also observes the correct result value in cfg_values[id]; it can then conclude that the device has returned the result, read the value from cfg_values[id] directly, and return to the virtual machine. In this way, the Virtio device avoids executing ioctl and hvc, which would cause unnecessary CPU context switches, thereby improving the device's performance. The following diagram describes the entire process, where the black solid triangle arrows represent operations executed by CPUs running other virtual machines, and the black ordinary arrows represent operations executed by CPUs running Root Linux.

control_plane_queue
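
A minimal C sketch of the completion path described above, assuming per-CPU slots indexed by CPU number; cfg_values and cfg_flags are the names used in the text, while the array size, element types, and barrier macro are assumptions.

#include <stdint.h>

#define MAX_CPUS 64                       /* assumed upper bound on CPUs */
#define write_barrier() __sync_synchronize()

volatile uint64_t cfg_values[MAX_CPUS];   /* result of the control plane request */
volatile uint64_t cfg_flags[MAX_CPUS];    /* incremented when the result is ready */

/* Daemon side: publish the result for the CPU that issued the request. */
static void cfg_complete(uint32_t cpu_id, uint64_t value)
{
    cfg_values[cpu_id] = value;   /* for read requests, the value to return */
    write_barrier();              /* make the value visible first */
    cfg_flags[cpu_id]++;          /* then signal completion */
    write_barrier();
}

/* Hypervisor side, conceptually: wait for the flag to change, then read the value. */
static uint64_t cfg_wait(uint32_t cpu_id, uint64_t old_flag)
{
    while (cfg_flags[cpu_id] == old_flag)
        ;                         /* the daemon finishes quickly, so busy-waiting is acceptable */
    return cfg_values[cpu_id];
}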

Kernel Service Module

Since the Virtio daemon located in Root Linux user space needs to communicate with hvisor, this document uses the kernel module hvisor.ko in Root Linux as the communication bridge. In addition to being used by command-line tools, this module also undertakes the following tasks:

  1. When the Virtio device is initialized, it establishes the shared memory area where the communication springboard is located between the Virtio daemon and the Hypervisor.

When the Virtio daemon is initialized, it requests the kernel module to allocate the shared memory where the communication springboard is located through ioctl. At this time, the kernel module allocates a page of continuous physical memory as shared memory through the memory allocation function __get_free_pages and sets the page attribute to the reserved state through the SetPageReserved function to avoid the page being swapped to disk due to Linux's page recycling mechanism. Subsequently, the kernel module needs to make both the Virtio daemon and the Hypervisor able to access this memory. For the Hypervisor, the kernel module executes hvc to notify the Hypervisor and passes the physical address of the shared memory as a parameter. For the Virtio daemon, the process calls mmap on /dev/hvisor, and the kernel module maps the shared memory to a free virtual memory area of the Virtio process in the hvisor_map function. The starting address of this area is returned as the return value of mmap.

  2. When the Virtio backend device needs to inject device interrupts into other virtual machines, it notifies the kernel module through ioctl, and the kernel module calls the system interface provided by the Hypervisor through the hvc command to notify the Hypervisor to perform the corresponding operations.

  3. Wake up the Virtio daemon.

When the driver accesses the MMIO area of the device, it falls into EL2 and enters the mmio_virtio_handler function. This function determines whether to wake up the Virtio daemon based on the need_wakeup flag in the communication springboard. If the flag is 1, it sends an SGI interrupt with event id IPI_EVENT_WAKEUP_VIRTIO_DEVICE to Root Linux's CPU 0. When CPU 0 receives the SGI interrupt, it falls into EL2 and injects the interrupt number of the hvisor_device node in the Root Linux device tree into itself. When CPU 0 returns to the virtual machine, it receives the interrupt injected into itself and enters the interrupt handling function pre-registered by the kernel service module. This function sends the SIGHVI signal to the Virtio daemon through the send_sig_info function. The Virtio daemon, previously blocked in the sig_wait function, receives the SIGHVI signal, polls the request submission queue, and sets the need_wakeup flag to 0.
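
The shared-memory setup described in task 1 above can be sketched in C as follows. The names __get_free_pages, SetPageReserved, and hvisor_map come from the text; the allocation order, the remap_pfn_range-based mmap handler, and the hvisor_call prototype are assumptions about how such a kernel module is typically written.

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/fs.h>
#include <linux/errno.h>

long hvisor_call(unsigned long code, unsigned long arg0, unsigned long arg1); /* hvc/ecall wrapper */

static unsigned long virtio_bridge_va;   /* kernel virtual address of the shared page */

static int hvisor_init_virtio(void)
{
    virtio_bridge_va = __get_free_pages(GFP_KERNEL, 0);   /* one page of contiguous memory */
    if (!virtio_bridge_va)
        return -ENOMEM;
    SetPageReserved(virt_to_page(virtio_bridge_va));      /* keep the page out of reclaim */
    /* Pass the physical address to the Hypervisor (hypercall code 0, hv_virtio_init). */
    return hvisor_call(0, __pa(virtio_bridge_va), 0);
}

/* Called when the Virtio daemon mmap()s /dev/hvisor. */
static int hvisor_map(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long pfn = __pa(virtio_bridge_va) >> PAGE_SHIFT;
    return remap_pfn_range(vma, vma->vm_start, pfn,
                           vma->vm_end - vma->vm_start, vma->vm_page_prot);
}

The daemon side of the wakeup path in task 3 might look like the sketch below; SIGHVI's actual value, the request-polling helper, and the use of sigwait (called sig_wait in the text) are assumptions.

#include <signal.h>
#include <pthread.h>
#include <stdint.h>

#define SIGHVI (SIGRTMIN + 1)   /* placeholder: the real value is chosen by hvisor-tool */

void handle_requests(void);      /* placeholder: poll the request submission queue */

static void virtio_wait_loop(volatile uint32_t *need_wakeup)
{
    sigset_t set;
    int sig;

    sigemptyset(&set);
    sigaddset(&set, SIGHVI);
    pthread_sigmask(SIG_BLOCK, &set, NULL);   /* deliver SIGHVI synchronously via sigwait */

    for (;;) {
        sigwait(&set, &sig);     /* sleep until the kernel service module sends SIGHVI */
        handle_requests();       /* drain the request submission queue */
        *need_wakeup = 0;        /* requests are being polled again, no wakeup needed */
    }
}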

Virtio Daemon

To keep the Hypervisor lightweight, this document does not adopt the traditional approach of implementing Virtio devices in the Hypervisor layer, but instead moves them into the user space of Root Linux as a daemon providing device emulation services. The daemon consists of two parts.

Virtio Block

The implementation of Virtio disk devices follows the conventions of the Virtio specification and adopts the MMIO device access method for discovery and use by other virtual machines. Currently, it supports five features: VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_F_VERSION_1, VIRTIO_RING_F_INDIRECT_DESC, and VIRTIO_RING_F_EVENT_IDX.

Top-level description of Virtio devices - VirtIODevice

A Virtio device is represented by the VirtIODevice structure, which includes the device ID, the number of Virtqueues vqs_len, the ID of the virtual machine it belongs to, the device interrupt number irq_id, the starting address of the MMIO area base_addr, the length of the MMIO area len, the device type, some MMIO registers saved by the device regs, an array of Virtqueues vqs, and a pointer dev pointing to specific device information. With this information, a Virtio device can be fully described.

// The top-level representation of a virtio device
struct VirtIODevice
{
    uint32_t id;
    uint32_t vqs_len;
    uint32_t zone_id;
    uint32_t irq_id;
    uint64_t base_addr; // the virtio device's base addr in non root zone's memory
    uint64_t len;       // mmio region's length
    VirtioDeviceType type;
    VirtMmioRegs regs;
    VirtQueue *vqs;
    // according to device type, blk is BlkDev, net is NetDev, console is ConsoleDev.
    void *dev;          
    bool activated;
};

typedef struct VirtMmioRegs {
    uint32_t device_id;
    uint32_t dev_feature_sel;
    uint32_t drv_feature_sel;
    uint32_t queue_sel;
    uint32_t interrupt_status;
    uint32_t interrupt_ack;
    uint32_t status;
    uint32_t generation;
    uint64_t dev_feature;
    uint64_t drv_feature;
} VirtMmioRegs;

Description of Virtio Block devices

For Virtio disk devices, the type field in VirtIODevice is VirtioTBlock, vqs_len is 1, indicating that there is only one Virtqueue, and the dev pointer points to the virtio_blk_dev structure that describes specific information about the disk device. In virtio_blk_dev, config represents the device's data capacity and the maximum amount of data in a single transfer, img_fd is the file descriptor of the opened disk image, tid, mtx, and cond are used for the worker thread, procq is the work queue, and close indicates when the worker thread should exit. The definitions of the virtio_blk_dev and blkp_req structures are shown below.

typedef struct virtio_blk_dev {
    BlkConfig config;
    int img_fd;
    // Worker thread that executes the read, write and ioctl requests.
    pthread_t tid;
    pthread_mutex_t mtx;
    pthread_cond_t cond;
    TAILQ_HEAD(, blkp_req) procq;
    int close;
} BlkDev;

// A request to be processed by the blk worker thread.
struct blkp_req {
    TAILQ_ENTRY(blkp_req) link;
    struct iovec *iov;
    int iovcnt;
    uint64_t offset;
    uint32_t type;
    uint16_t idx;
};

Virtio Block device worker thread

Each Virtio disk device has a worker thread and a work queue. The thread ID of the worker thread is saved in the tid field of virtio_blk_dev, and the work queue is procq. The worker thread is responsible for data IO operations and calling the interrupt injection system interface. It is created after the Virtio disk device starts and continuously checks whether there are new tasks in the work queue. If the queue is empty, it waits for the condition variable cond; otherwise, it processes tasks.
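
Using the BlkDev and blkp_req definitions above, the worker thread's main loop might look roughly like this; blk_do_request is a placeholder for the preadv/pwritev handling described below, and the locking discipline shown is an assumption.

#include <pthread.h>
#include <sys/queue.h>

void blk_do_request(BlkDev *dev, struct blkp_req *req);   /* placeholder: IO, used ring, irq */

static void *blk_worker(void *arg)
{
    BlkDev *dev = arg;

    pthread_mutex_lock(&dev->mtx);
    while (!dev->close) {
        while (TAILQ_EMPTY(&dev->procq) && !dev->close)
            pthread_cond_wait(&dev->cond, &dev->mtx);      /* sleep until the main thread signals */
        while (!TAILQ_EMPTY(&dev->procq)) {
            struct blkp_req *req = TAILQ_FIRST(&dev->procq);
            TAILQ_REMOVE(&dev->procq, req, link);
            pthread_mutex_unlock(&dev->mtx);               /* do the slow IO without holding the lock */
            blk_do_request(dev, req);
            pthread_mutex_lock(&dev->mtx);
        }
    }
    pthread_mutex_unlock(&dev->mtx);
    return NULL;
}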

When the driver writes to the QueueNotify register in the MMIO area of the disk device, it indicates that there are new IO requests in the available ring. After receiving this request, the Virtio disk device (located in the main thread's execution flow) first reads the available ring to get the first descriptor of the descriptor chain. The first descriptor points to a memory buffer containing the type of IO request (read/write) and the sector number to be read or written. Subsequent descriptors point to data buffers; for read operations, the read data is stored in these data buffers, and for write operations, the data to be written is retrieved from these data buffers. The last descriptor's memory buffer (result buffer) is used to describe the completion result of the IO request, with options including success (OK), failure (IOERR), or unsupported operation (UNSUPP). This parsing of the entire descriptor chain provides all the information about the IO request, which is then saved in the blkp_req structure. The fields in this structure, iov, represent all data buffers, offset represents the data offset of the IO operation, type represents the type of IO operation (read/write), and idx is the index of the first descriptor in the descriptor chain, used to update the used ring. The device then adds the blkp_req to the work queue procq and wakes up the blocked worker thread through the signal function. The worker thread can then process the task.

After obtaining a task, the worker thread reads or writes the disk image referred to by img_fd using the preadv and pwritev functions, according to the IO operation information in blkp_req. After completing the read/write, it first updates the last descriptor of the descriptor chain, which records the completion result of the IO request (success, failure, or unsupported operation). It then updates the used ring, writing the index of the first descriptor of the chain into a new used ring entry, and finally injects an interrupt to notify the corresponding virtual machine.

The establishment of the worker thread effectively distributes time-consuming operations to other CPU cores, improving the efficiency and throughput of the main thread in dispatching requests and enhancing device performance.

Virtio Network Device

The Virtio network device is essentially a virtual network card. Currently supported features include VIRTIO_NET_F_MAC, VIRTIO_NET_F_STATUS, VIRTIO_F_VERSION_1, VIRTIO_RING_F_INDIRECT_DESC, and VIRTIO_RING_F_EVENT_IDX.

Description of Virtio Network Device

For Virtio network devices, the type field in VirtIODevice is VirtioTNet, and vqs_len is 2, indicating there are two Virtqueues: the Receive Queue and the Transmit Queue. The dev pointer points to the virtio_net_dev structure that describes specific information about the network device. In virtio_net_dev, config represents the MAC address and connection status of the network card, tapfd is the file descriptor of the Tap device bound to this network device, rx_ready indicates whether the receive queue is available, and event is used by the receive-packet thread to monitor the Tap device's readable events through epoll.

typedef struct virtio_net_dev {
    NetConfig config;
    int tapfd;
    int rx_ready;   
    struct hvisor_event *event;
} NetDev;

struct hvisor_event {
    void		(*handler)(int, int, void *);
    void		*param;
    int			fd;
    int 		epoll_type;
};

Tap Device and Bridge Device

The implementation of Virtio network devices is based on two types of virtual devices provided by the Linux kernel: Tap devices and bridge devices.

A Tap device is an Ethernet device implemented in software by the Linux kernel. Reading and writing to the Tap device in user space can simulate the reception and transmission of Ethernet frames. Specifically, when a process or kernel performs a write operation on the Tap device, it is equivalent to sending a packet to the Tap device. Performing a read operation on the Tap device is equivalent to receiving a packet from the Tap device. Thus, by reading and writing to the Tap device, packet transfer between the kernel and the process can be achieved.

The command to create a tap device is: ip tuntap add dev tap0 mode tap. This command creates a tap device named tap0. If a process wants to use this device, it needs to first open the /dev/net/tun device, obtain a file descriptor tun_fd, and call ioctl(TUNSETIFF) on it to link the process to the tap0 device. Afterward, tun_fd actually becomes the file descriptor for the tap0 device, and it can be read, written, and polled.
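
The standard sequence for attaching to an existing tap device from user space is shown below; this is the generic /dev/net/tun plus TUNSETIFF idiom rather than code copied from hvisor-tool.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

static int open_tap(const char *ifname)   /* e.g. "tap0" */
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;         /* raw Ethernet frames, no extra packet info header */
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {        /* link this fd to the named tap device */
        close(fd);
        return -1;
    }
    return fd;                                    /* read, write, and poll this fd as the tap device */
}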

A bridge device is a virtual device provided by the Linux kernel that functions similarly to a switch. When other network devices are connected to a bridge device, those devices become ports of the bridge device, which takes over the packet sending and receiving process of all devices. When other devices receive packets, they are sent directly to the bridge device, which forwards them to other ports based on MAC addresses. Therefore, all devices connected to the bridge can communicate with each other.

The command to create a bridge device is: brctl addbr br0. The command to connect the physical network card eth0 to br0 is: brctl addif br0 eth0. The command to connect the tap0 device to br0 is: brctl addif br0 tap0.

Before the Virtio network device starts, Root Linux needs to create and start the tap device and bridge device in advance through the command line, and connect the tap device and Root Linux's physical network card to the bridge device, respectively. Each Virtio network device needs to connect to a tap device, ultimately forming a network topology as shown in the following diagram. In this way, the Virtio network device can transmit packets with the external network by reading and writing to the tap device.

hvisor-virtio-net

Sending Packets

The Transmit Virtqueue of the Virtio network device is used to store send buffers. When the device receives a request from the driver to write to the QueueNotify register, and the QueueSel register points to the Transmit Queue at that time, the driver is informing the device that there is a new packet to send. The Virtio-net device then takes a descriptor chain out of the available ring; each descriptor chain corresponds to one packet, and the memory buffers it points to contain the packet data to be sent. The packet data consists of two parts: the first is the packet header, the virtio_net_hdr_v1 structure specified by the Virtio protocol, which contains descriptive information about the packet; the second is the Ethernet frame. To send a packet, only the Ethernet frame part needs to be written into the Tap device through the writev function. After the Tap device receives the frame, it forwards it to the bridge device, which forwards it to the external network through the physical network card based on the MAC address.
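
A simplified C sketch of the transmit path: the virtio_net_hdr_v1 header at the start of the packet buffer is skipped and only the Ethernet frame is written to the tap fd. The 12-byte header layout below mirrors the virtio header; for simplicity, the sketch assumes the header and frame sit in one contiguous buffer rather than spread over several descriptors.

#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

struct virtio_net_hdr_v1_sketch {                  /* 12 bytes; only the size matters here */
    uint8_t  flags, gso_type;
    uint16_t hdr_len, gso_size, csum_start, csum_offset, num_buffers;
};

static ssize_t net_tx_one(int tapfd, uint8_t *buf, size_t len)
{
    struct iovec iov = {
        .iov_base = buf + sizeof(struct virtio_net_hdr_v1_sketch),   /* skip the virtio header */
        .iov_len  = len - sizeof(struct virtio_net_hdr_v1_sketch),   /* Ethernet frame only */
    };
    return writev(tapfd, &iov, 1);                 /* the kernel hands the frame to the bridge */
}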

Receiving Packets

During initialization, the Virtio network device adds the file descriptor of the Tap device to the interest list of the event monitor thread's epoll instance. The event monitor thread loops on the epoll_wait function to monitor readable events on the tap device. Once a readable event occurs, indicating that the tap device has received a packet from the kernel, epoll_wait returns and the packet reception handler is executed. The handler takes a descriptor chain out of the available ring of the Receive Virtqueue, reads from the tap device, writes the data into the memory buffers pointed to by the descriptor chain, and updates the used ring. The handler repeats this step until reading from the tap device returns a negative value with errno set to EWOULDBLOCK, indicating that there are no new packets in the tap device, after which it injects an interrupt to notify the corresponding virtual machine to receive the packets.
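
The receive path can be sketched as the loop below, using the NetDev definition above; rx_deliver and inject_irq are placeholders for filling a descriptor chain of the Receive Virtqueue and for the interrupt-injection ioctl, and the frame buffer size is an assumption.

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

void rx_deliver(NetDev *dev, const uint8_t *frame, ssize_t len);  /* placeholder */
void inject_irq(uint32_t irq_id, uint32_t zone_id);               /* placeholder */

static void virtio_net_rx(NetDev *dev, uint32_t irq_id, uint32_t zone_id)
{
    uint8_t frame[65536];
    ssize_t n;

    if (!dev->rx_ready)
        return;                          /* the driver has not set up the Receive Queue yet */

    for (;;) {
        n = read(dev->tapfd, frame, sizeof(frame));
        if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
            break;                       /* no more packets queued in the tap device */
        if (n <= 0)
            break;
        rx_deliver(dev, frame, n);       /* fill one descriptor chain and update the used ring */
    }
    inject_irq(irq_id, zone_id);         /* notify the guest that packets have arrived */
}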

Configuring the Environment

Requirements for Disk Images

The disk image of root Linux needs to install at least the following packages:

apt-get install git sudo vim bash-completion \
kmod net-tools iputils-ping resolvconf ntpdate

Requirements for the Linux Image

Before compiling the root Linux image, change the CONFIG_IPV6 and CONFIG_BRIDGE configs to y in the .config file to support creating bridge and tap devices in root Linux. For example:

cd linux
# Add a line in .config
CONFIG_BLK_DEV_RAM=y
# Modify two CONFIG parameters in .config
CONFIG_IPV6=y
CONFIG_BRIDGE=y
# Then compile Linux

Creating a Network Topology

Before using Virtio net devices, you need to create a network topology in root Linux so that Virtio net devices can connect to real network cards through Tap devices and bridge devices. Execute the following commands in root Linux:

mount -t proc proc /proc
mount -t sysfs sysfs /sys
ip link set eth0 up
dhclient eth0
brctl addbr br0
brctl addif br0 eth0
ifconfig eth0 0
dhclient br0
ip tuntap add dev tap0 mode tap
brctl addif br0 tap0
ip link set dev tap0 up

This will create a tap0 device <-> bridge device <-> real network card network topology.

Testing Non-root Linux Network Connectivity

Execute the following commands in the non-root Linux command line to start the network card:

mount -t proc proc /proc
mount -t sysfs sysfs /sys
ip link set eth0 up
dhclient eth0

You can test network connectivity with the following commands:

curl www.baidu.com
ping www.baidu.com

Virtio Console

The Virtio Console device is essentially a virtual console device used for input and output of data, and can serve as a virtual terminal for other virtual machines. Currently, hvisor supports the VIRTIO_CONSOLE_F_SIZE and VIRTIO_F_VERSION_1 features.

Description of the Virtio Console Device

For the Virtio console device, the type field in the VirtIODevice structure is VirtioTConsole, vqs_len is 2, indicating that there are two Virtqueues, the receive virtqueue and the transmit virtqueue, used for receiving and sending data on port 0. The dev pointer points to the virtio_console_dev structure that describes specific information about the console device. In this structure, config represents the number of rows and columns of the console, master_fd is the file descriptor of the pseudo-terminal master device connected to the device, rx_ready indicates whether the receive queue is available, and event is used for the event monitor thread to monitor readable events of the pseudo-terminal master device through epoll.

typedef struct virtio_console_dev {
    ConsoleConfig config;
    int master_fd;
    int rx_ready;
    struct hvisor_event *event;
} ConsoleDev;

Pseudo Terminal

A terminal is essentially an input and output device. In the early days of computing, terminals were called teleprinters (TTY). Now, terminals have become a virtual device on computers, connected by terminal emulation programs to graphics card drivers and keyboard drivers to implement data input and output. There are two different implementations of terminal emulation programs: one is as a kernel module in Linux, exposed to user programs as /dev/tty[n]; the other is as an application running in Linux user space, known as a pseudo-terminal (PTY).

Pseudo-terminals themselves are not the focus of this article, but the two devices they provide, the PTY master and the PTY slave, are used here to implement the Virtio Console device.

Applications can obtain an available PTY master by executing posix_openpt, and get the corresponding PTY slave through the ptsname function. A TTY driver connecting PTY master and PTY slave will copy data between the master and slave. Thus, when a program writes data to the master (or slave), the program can read the same data from the slave (or master).
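
The POSIX calls mentioned above are enough to open a PTY master and learn its slave path; the sketch below shows the typical sequence, with the log message format being an assumption.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int open_pty_master(void)
{
    int master_fd = posix_openpt(O_RDWR | O_NOCTTY);
    if (master_fd < 0)
        return -1;
    if (grantpt(master_fd) < 0 || unlockpt(master_fd) < 0) {   /* make the slave usable */
        close(master_fd);
        return -1;
    }
    /* Log the slave path (/dev/pts/x) so the user can attach to it with `screen`. */
    printf("virtio console backed by %s\n", ptsname(master_fd));
    return master_fd;
}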

Overall Design of Virtio Console

The Virtio Console device, as a daemon on Root Linux, opens a PTY master during device initialization and outputs the path of the corresponding PTY slave /dev/pts/x to the log file for screen session connection. Meanwhile, the event monitor thread in the Virtio daemon monitors the readable events of the PTY slave so that the PTY master can promptly receive user input data.

When a user executes screen /dev/pts/x on Root Linux, a screen session is created on the current terminal, connecting the device corresponding to the PTY slave /dev/pts/x, and taking over the input and output of the current terminal. The implementation structure diagram of the Virtio Console device is shown below.

virtio_console

Input Commands

When a user types commands on the keyboard, the input characters are passed to the Screen session through the terminal device. The Screen session writes the characters into the PTY slave. When the event monitor thread detects through epoll that the PTY slave is readable, it calls the virtio_console_event_handler function. This function reads from the PTY slave and writes the data into the Virtio Console device's Receive Virtqueue, and sends an interrupt to the corresponding virtual machine.

The corresponding virtual machine, after receiving the interrupt, passes the received character data to the Shell through the TTY subsystem for interpretation and execution.

Display Information

When a virtual machine using the Virtio Console driver wants to output information through the Virtio Console device, the Virtio Console driver writes the data to be output into the Transmit Virtqueue and writes to the QueueNotify register in the MMIO area to notify the Virtio Console device to handle the IO operation.

The Virtio Console device reads from the Transmit Virtqueue, retrieves the data to be output, and writes it into the PTY master. The Screen session then retrieves the data to be output from the PTY slave and displays the output information on the monitor through the terminal device.

The PTY master and PTY slave are connected by a TTY driver whose line discipline echoes data written from the PTY master to the PTY slave back to the PTY master. We disable this behavior with the cfmakeraw function, turning off the line discipline's processing.
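
Disabling the line discipline on the PTY master uses the standard termios calls; a minimal sketch:

#include <termios.h>

static int pty_set_raw(int master_fd)
{
    struct termios tio;

    if (tcgetattr(master_fd, &tio) < 0)
        return -1;
    cfmakeraw(&tio);                          /* turn off echo, canonical mode, and other line discipline processing */
    return tcsetattr(master_fd, TCSANOW, &tio);
}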

Virtio GPU

To use the Virtio GPU device in hvisor-tool, you need to first install libdrm on the host and perform some related configurations.

Prerequisites

  • Install libdrm

We need to install libdrm to compile Virtio-gpu, assuming the target platform is arm64.

wget https://dri.freedesktop.org/libdrm/libdrm-2.4.100.tar.gz
tar -xzvf libdrm-2.4.100.tar.gz
cd libdrm-2.4.100

Tips: libdrm versions above 2.4.100 require compilation with tools like meson, which can be complicated. More versions are available at https://dri.freedesktop.org/libdrm.

# Install to your aarch64-linux-gnu compiler
./configure --host=aarch64-linux-gnu --prefix=/usr/aarch64-linux-gnu && make && make install

For loongarch64, use:

./configure --host=loongarch64-unknown-linux-gnu --disable-nouveau --disable-intel --prefix=/opt/libdrm-install && make && sudo make install

  • Configure the Linux kernel

The Linux kernel needs to support virtio-gpu and drm related drivers. Specifically, the following options need to be enabled when compiling the kernel:

CONFIG_DRM=y
CONFIG_DRM_VIRTIO_GPU=y

Other GPU-related drivers do not have to be built into the kernel; compile them according to the specific device, and you can use menuconfig during compilation. Specifically, go to Device Drivers -> Graphics support -> Direct Rendering Infrastructure (DRM). Under Graphics support there are also drivers supporting virtio-gpu; if needed, enable related options such as Virtio GPU driver and DRM Support for bochs dispi vga interface.

At the bottom of the Graphics support entry there is a Bootup logo option. Enabling it displays one Linux logo per CPU core on the screen at startup.

  • Detect physical GPU devices in Root Linux

To detect physical GPU devices in Root Linux, you need to edit the files in the hvisor/src/platform directory so that the GPU device on the PCI bus is visible to Root Linux, and add the interrupt number of the Virtio-gpu device to ROOT_ZONE_IRQS. For example, the PCI devices given to the root zone are listed by BDF in ROOT_PCI_DEVS:

pub const ROOT_PCI_DEVS: [u64; 3] = [0, 1 << 3, 6 << 3];

After starting Root Linux, you can check if your GPU device is working properly by running dmesg | grep drm or lspci. If files like card0 and renderD128 appear under /dev/dri, it means the graphics device is successfully recognized and can be controlled by drm.

  • Check if the real GPU device is supported

If you want to port Virtio-GPU to other platforms, you need to ensure that the physical GPU device on that platform is supported by the drm framework. To see the devices supported by libdrm, you can install the libdrm-tests package using the command apt install libdrm-tests, and then run modetest.

  • qemu startup parameters

If hvisor runs in a qemu aarch64 environment, qemu needs to provide a GPU device to root linux. Add the following to the qemu startup parameters:

QEMU_ARGS += -device virtio-gpu,addr=06,iommu_platform=on
QEMU_ARGS += -display sdl

Also, ensure that the startup parameters include smmu configuration:

-machine virt,secure=on,gic-version=3,virtualization=on,iommu=smmuv3
-global arm-smmuv3.stage=2

PCI Virtualization

PCI devices primarily have three spaces: Configuration Space, Memory Space, and I/O Space.

1. Configuration Space

  • Purpose: Used for device initialization and configuration.
  • Size: Each PCI device has 256 bytes of configuration space.
  • Access Method: Accessed via bus number, device number, and function number.
  • Contents:
    • Device identification information (such as vendor ID, device ID).
    • Status and command registers.
    • Base Address Registers (BARs), used to map the device's memory space and I/O space.
    • Information about interrupt lines and interrupt pins.

2. Memory Space

  • Purpose: Used to access device registers and memory, suitable for high bandwidth access.
  • Size: Defined by the device manufacturer, mapped into the system memory address space.
  • Access Method: Accessed via memory read/write instructions.
  • Contents:
    • Device registers: Used for control and status reading.
    • Device-specific memory: such as frame buffers, DMA buffers, etc.

3. I/O Space

  • Purpose: Used to access the device's control registers, suitable for low bandwidth access.
  • Size: Defined by the device manufacturer, mapped into the system's I/O address space.
  • Access Method: Accessed via special I/O instructions (such as in and out).
  • Contents:
    • Device control registers: Used to perform specific I/O operations.

Summary

  • Configuration Space is mainly used for device initialization and configuration.
  • Memory Space is used for high-speed access to device registers and memory.
  • I/O Space is used for low-speed access to device control registers.

PCI virtualization mainly involves managing the above three spaces. Considering that most platforms have only one PCI bus, and ownership of the PCI bus generally belongs to zone0, hvisor does not interpose on the PCI bus or the PCI devices of zone0 when none of the devices on this bus need to be allocated to other zones; this preserves the access speed of PCI devices in zone0.

When allocating PCI devices to a zone, we need to ensure that Linux in zone0 no longer uses them. As long as the devices are allocated to other zones, zone0 should not access these devices. Unfortunately, we cannot simply use PCI hot-plugging to remove/re-add devices at runtime, as Linux might reprogram the BARs and locate resources in positions we do not expect or allow. Therefore, a driver in the zone0 kernel is needed to intercept access to these PCI devices, and we turn to the hvisor tool.

The hvisor tool registers itself as a PCI virtual driver and claims management of these devices when other zones use them. Before creating a zone, hvisor allows these devices to unbind from their own drivers and bind to the hvisor tool. When a zone is destroyed, these devices are actually no longer in use by any zone, but from the perspective of zone0, the hvisor tool is still a valid virtual driver, so the release of the devices needs to be done manually. The hvisor tool releases the devices bound to these zones, and from the perspective of zone0 Linux, these devices are not bound to any drivers, so if these devices are needed, Linux will automatically rebind the correct drivers.

Now we need to allow zones to access PCI devices correctly. To achieve this as simply as possible, we directly reuse the structure of the PCI bus: the PCI bus appears in the device tree of every zone that needs to use devices on it, but apart from the zone that truly owns the bus, other zones can only access devices through MMIO accesses proxied by hvisor. When a zone attempts to access a PCI device, hvisor checks whether the zone owns the device, which is declared when the zone is created. If a zone accesses the configuration space of a device that belongs to it, hvisor returns the information correctly.

Currently, I/O space and memory space are handled in the same way as configuration space. Because BAR resources are unique, configuration space cannot be allocated directly to a zone, but BAR space is accessed infrequently, so this does not significantly affect efficiency. Directly mapping I/O space and memory space to the corresponding zone is theoretically feasible, however, and doing so would further improve access speed.

To facilitate testing of PCI virtualization in QEMU, we wrote a PCI device.

PCIe Resource Allocation and Isolation

Resource Allocation Method

In each zone's configuration file, num_pci_devs specifies the number of PCIe devices allocated to that zone, and alloc_pci_devs specifies their BDFs. Note that the list must include 0 (the host bridge).

For example:

{
    "arch": "riscv",
    "name": "linux2",
    "zone_id": 1,
    ///
    "num_pci_devs": 2,
    "alloc_pci_devs": [0, 16]
}

virt PCI

pub struct PciRoot {
    endpoints: Vec<EndpointConfig>,
    bridges: Vec<BridgeConfig>,
    alloc_devs: Vec<usize>, // include host bridge
    phantom_devs: Vec<PhantomCfg>,
    bar_regions: Vec<BarRegion>,
}

It should be noted that phantom_devs are devices that do not belong to this virtual machine; bar_regions are the BAR spaces of devices that belong to this virtual machine.

phantom_dev

This part of the code can be found in src/pci/phantom_cfg.rs. When the virtual machine first accesses a device that does not belong to it, a phantom_dev is created.

The handling function can be found in src/pci/pci.rs under mmio_pci_handler, which is our function for handling the virtual machine's access to the configuration space.

hvisor presents the same PCIe topology to every virtual machine, which avoids the complex processing caused by different BAR and bus number allocations, especially the configuration of TLP forwarding in bridge devices, saving a lot of effort.

Endpoints that are not allocated to the virtual machine, however, are virtualized as phantom_devs. When its header is accessed, a phantom_dev returns a specific vendor-id and device-id, such as 0x77777777, and a reserved class-code. For such a device, which exists but for which no driver can be found, the virtual machine only performs some basic configuration during the enumeration stage, such as reserving BARs.

capabilities

The capabilities section involves MSI configuration and more. When the virtual machine accesses the capabilities pointer of a phantom device, hvisor returns 0, indicating that the device has no capabilities; this prevents the virtual machine from overwriting the configuration made by the device's owning virtual machine (for example, the contents of the MSI-TABLE in the BAR space).

command

Additionally, when a virtual machine detects that a device has no MSI capability, it falls back to traditional (legacy) interrupts, which involves programming the DisINTx field of the COMMAND register. The hardware must choose between MSI and legacy interrupts, so contradictory settings from different virtual machines have to be avoided (a non-owning virtual machine should not program this field at all); hence a virtual COMMAND register is needed.

About BAR

This part of the code can be found in src/pci/pcibar.rs.

pub struct PciBar {
    val: u32,
    bar_type: BarType,
    size: usize,
}

pub struct BarRegion {
    pub start: usize,
    pub size: usize,
    pub bar_type: BarType,
}

pub enum BarType {
    Mem32,
    Mem64,
    IO,
    #[default]
    Unknown,
}

If each virtual machine sees the same topology, then the allocation of BAR space is completely the same.

Then, when a non-root virtual machine starts, it directly reads the BAR configured by the root to know which BAR spaces each virtual machine should access (determined by the devices allocated to it).

If every access to a BAR trapped into the hypervisor to be proxied, efficiency would be low, so we let the hardware handle it by writing this space directly into the virtual machine's stage-2 page table. Note that in the pci_bars_register function, when filling in the page table, the BarRegion's BarType is used to look up the mapping between PCI addresses and CPU addresses of that type (described in the device tree and mirrored in the configuration file's pci_config), and the PCI address in the BAR configuration is converted to the corresponding CPU address before being written into the page table.

The method of obtaining the BAR allocation results from the root's configuration, as described above, distinguishes between Endpoints and Bridges (because they have different numbers of BARs). For each device it accesses the configuration space according to the BDF, first reads the root's configuration result, then writes all 1s to obtain the size, and finally writes the configuration result back. The specific code can be read together in endpoint.rs, bridge.rs, and pcibar.rs; handling of 64-bit memory addresses needs special attention.
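
The sizing sequence referred to above is the classic PCI idiom of saving the configured value, writing all 1s, and reading back a mask. hvisor implements it in Rust (endpoint.rs, bridge.rs, pcibar.rs); the C sketch below only illustrates the idea for a 32-bit memory or I/O BAR, with cfg_read32/cfg_write32 standing in for configuration space accessors keyed by BDF and register offset.

#include <stdint.h>

uint32_t cfg_read32(uint32_t bdf, uint32_t off);                 /* placeholder accessors */
void     cfg_write32(uint32_t bdf, uint32_t off, uint32_t val);

static uint64_t bar_probe_size(uint32_t bdf, uint32_t bar_off, uint64_t *base)
{
    uint32_t orig = cfg_read32(bdf, bar_off);    /* the root's allocation result */
    cfg_write32(bdf, bar_off, 0xffffffff);       /* write all 1s ... */
    uint32_t mask = cfg_read32(bdf, bar_off);    /* ... and read back the size mask */
    cfg_write32(bdf, bar_off, orig);             /* restore the root's configuration */

    if (orig & 0x1) {                            /* I/O space BAR */
        *base = orig & ~0x3u;
        return (uint64_t)((~(mask & ~0x3u) + 1) & 0xffff);
    }
    *base = orig & ~0xfu;                        /* 32-bit memory BAR */
    /* A 64-bit memory BAR repeats this sequence for the following dword as well. */
    return (uint64_t)(~(mask & ~0xfu) + 1);
}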

hvisor Management Tool

hvisor manages the entire system through Root Linux, which manages the virtual machines. Root Linux provides services for starting and shutting down virtual machines and for starting and shutting down Virtio daemons through a set of management tools. The management tools include a command-line tool and a kernel module: the command-line tool parses and executes commands entered by the user, and the kernel module handles communication between the command-line tool, the Virtio daemon, and the Hypervisor. The repository address for the management tools is: hvisor-tool.

Starting Virtual Machines

From Root Linux, users can create and start a new hvisor virtual machine by entering the following command:

./hvisor zone start [vm_name].json

The command-line tool first parses the contents of the [vm_name].json file, writing the virtual machine configuration into the zone_config structure. Based on the images and dtb files specified in the file, their contents are read into temporary memory through the read function. To load the images and dtb files into a specified physical memory address, the hvisor.ko kernel module provides the hvisor_map function, which can map a physical memory area to user-space virtual address space.

When the command-line tool executes the mmap function on /dev/hvisor, the kernel calls the hvisor_map function to map user virtual memory to the specified physical memory. Afterwards, the image and dtb file contents can be moved from temporary memory to the user-specified physical memory area through a memory copy function.

After the image is loaded, the command-line tool calls ioctl on /dev/hvisor, specifying the operation code as HVISOR_ZONE_START. The kernel module then notifies the Hypervisor through a Hypercall and passes the address of the zone_config structure object, informing the Hypervisor to start the virtual machine.
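
Putting the steps above together, the zone-start path of the command-line tool can be sketched as follows. The names /dev/hvisor, hvisor_map, zone_config, and HVISOR_ZONE_START come from the text; the ioctl request value, the mmap offset convention, and the function signature are placeholders, not the actual hvisor-tool interface.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

struct zone_config;                       /* defined by hvisor-tool's headers */
#define HVISOR_ZONE_START 0x01            /* placeholder: the real request code is defined by hvisor-tool */

static int load_and_start(const void *img_buf, size_t img_size,
                          unsigned long phys_addr, struct zone_config *cfg)
{
    int fd = open("/dev/hvisor", O_RDWR);
    if (fd < 0)
        return -1;

    /* hvisor_map() in hvisor.ko backs this mapping with the requested physical memory. */
    void *va = mmap(NULL, img_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys_addr);
    if (va == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(va, img_buf, img_size);        /* move the image from temporary memory */
    munmap(va, img_size);

    int ret = ioctl(fd, HVISOR_ZONE_START, cfg);   /* the kernel module issues the hypercall */
    close(fd);
    return ret;
}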

Shutting Down Virtual Machines

Users can shut down a virtual machine with ID vm_id by entering the command:

./hvisor shutdown -id [vm_id]

This command calls ioctl on /dev/hvisor, specifying the operation code as HVISOR_ZONE_SHUTDOWN. The kernel module then notifies the Hypervisor through a Hypercall, passing vm_id, and informs the Hypervisor to shut down the virtual machine.

Starting Virtio Daemons

Users can start a Virtio device by entering the command:

nohup ./hvisor virtio start [virtio_cfg.json] &

This will create a Virtio device and initialize related data structures according to the Virtio device information specified in virtio_cfg.json. Currently, three types of Virtio devices can be created, including Virtio-net, Virtio-block, and Virtio-console devices.

Since the command includes nohup and &, it runs as a daemon, with all of the daemon's output redirected to nohup.out. The daemon's log output has six levels, from low to high: LOG_TRACE, LOG_DEBUG, LOG_INFO, LOG_WARN, LOG_ERROR, LOG_FATAL. The LOG level can be specified when compiling the command-line tool. For example, when LOG is LOG_INFO, outputs at or above LOG_INFO are recorded in the log file, while LOG_TRACE and LOG_DEBUG are not output.

After the Virtio device is created, the Virtio daemon will poll the request submission queue to obtain Virtio requests from other virtual machines. When there are no requests for a long time, it will automatically enter sleep mode.

Shutting Down Virtio Daemons

Users can shut down the Virtio daemon by entering the command:

pkill hvisor

The Virtio daemon, when started, registers a signal handler virtio_close for the SIGTERM signal. When executing pkill hvisor, a SIGTERM signal is sent to the process named hvisor. At this point, the daemon executes virtio_close, recycles resources, shuts down various sub-threads, and finally exits.
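
Registering the SIGTERM handler is plain sigaction usage; virtio_close is the cleanup routine named above, and its body is omitted here.

#include <signal.h>

void virtio_close(int sig);   /* recycles resources and stops the sub-threads */

static void install_sigterm_handler(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = virtio_close;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);
}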

Hypercall Description

As a Hypervisor, hvisor provides a hypercall processing mechanism to the upper layer virtual machines.

How Virtual Machines Execute Hypercall

Virtual machines execute a specified assembly instruction, which is hvc on Arm64 and ecall on riscv64. When executing the assembly instruction, the parameters passed are:

  • code: hypercall id, its range and meaning are detailed in hvisor's handling of hypercalls
  • arg0: the first parameter passed by the virtual machine, type is u64
  • arg1: the second parameter passed by the virtual machine, type is u64

For example, for riscv linux:

#ifdef RISCV64

// according to the riscv sbi spec
// SBI return has the following format:
// struct sbiret
//  {
//  long error;
//  long value;
// };

// a0: error, a1: value
static inline __u64 hvisor_call(__u64 code,__u64 arg0, __u64 arg1) {
	register __u64 a0 asm("a0") = code;
	register __u64 a1 asm("a1") = arg0;
	register __u64 a2 asm("a2") = arg1;
	register __u64 a7 asm("a7") = 0x114514;
	asm volatile ("ecall"
	        : "+r" (a0), "+r" (a1)
			: "r" (a2), "r" (a7)
			: "memory");
	return a1;
}
#endif

For arm64 linux:

#ifdef ARM64
static inline __u64 hvisor_call(__u64 code, __u64 arg0, __u64 arg1) {
	register __u64 x0 asm("x0") = code;
	register __u64 x1 asm("x1") = arg0;
	register __u64 x2 asm("x2") = arg1;

	asm volatile ("hvc #0x4856"
	        : "+r" (x0)
			: "r" (x1), "r" (x2)
			: "memory");
	return x0;
}
#endif /* ARM64 */

hvisor's Handling of Hypercall

After the virtual machine executes a hypercall, the CPU enters the exception handling function specified by hvisor: hypercall. Then hvisor continues to call different processing functions based on the hypercall parameters code, arg0, arg1, which are:

| code | Function Called | Parameter Description | Function Summary |
|------|-----------------|------------------------|------------------|
| 0 | hv_virtio_init | arg0: start address of shared memory | Used for root zone to initialize the virtio mechanism |
| 1 | hv_virtio_inject_irq | None | Used for root zone to send virtio device interrupts to other virtual machines |
| 2 | hv_zone_start | arg0: virtual machine configuration file address; arg1: configuration file size | Used for root zone to start a virtual machine |
| 3 | hv_zone_shutdown | arg0: id of the virtual machine to be shut down | Used for root zone to shut down a virtual machine |
| 4 | hv_zone_list | arg0: address of the data structure holding virtual machine information; arg1: number of virtual machine information entries | Used for root zone to view information about all virtual machines in the system |
| 5 | hv_ivc_info | arg0: start address of ivc information | Used for a zone to view its own communication domain information |
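
As a usage example of the table above, starting a zone corresponds to code 2; the wrapper below simply forwards the physical address and size of the zone_config structure through hvisor_call, with the argument names being illustrative.

/* hv_zone_start: code 2, arg0 = physical address of zone_config, arg1 = its size. */
static inline __u64 start_zone(__u64 cfg_phys, __u64 cfg_size)
{
    return hvisor_call(2, cfg_phys, cfg_size);
}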