Raspberry Pi 4 emulation with QEMU virt

I recently upgraded my RaspberryPi 4, which is running Home Assistant and a bunch of other services, to an x86 mini-PC (a Lenovo ThinkCentre M920). To ease the transition, I wanted to leave the RaspberryPi’s OS running as a virtual machine inside the x86 host. This way I can port services over to the x86 machine gradually, while already disconnecting the RPi hardware and reusing it for some other project.

QEMU can emulate a generic ARM board using the versatilepb machine type, or it can more faithfully emulate the RaspberryPi with one of the raspi machines. The versatilepb system is limited to 1 CPU and 256 MB RAM so this was not going to cut it. The raspi models do not support the RaspberryPi 4, but do go up to the RaspberryPi 3B with 4 CPUs and 1 GB memory so I tried that one first.

QEMU’s raspi3b machine type

Porting a physical RaspberryPi to a QEMU raspi3b is pretty straightforward. There are many tutorials on the internet, and I ended up with the following QEMU command line.

qemu-system-aarch64 \
    -display none \
    -machine raspi3b \
    -cpu cortex-a72 \
    -dtb /rpi/boot/bcm2710-rpi-3-b-plus.dtb \
    -m 1G -smp 4 -serial stdio \
    -kernel /rpi/boot/kernel8.img \
    -append "rw earlyprintk loglevel=8 console=ttyAMA0,115200 dwc_otg.lpm_enable=0 root=/dev/mmcblk0p1 rootdelay=1" \
    -sd /rpi/root.img \
    -device usb-net,netdev=net0 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no

This assumes that you have copied the RPi’s SD card to a file /rpi/root.img on the host, as well as the kernel and relevant DTB file from /boot. Networking goes through an emulated USB Ethernet device, which is what the real Raspberry Pi 3 uses and, as far as I could find, the only network device that works with QEMU’s raspi3b machine type. Since I wanted my emulated RPi to have full networking capabilities, I loosely followed this post to set up a TAP device bridged to the host’s network adapter:

ip tuntap add name tap0 mode tap
ip link set up dev tap0
ip link set tap0 master br0
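
The commands above assume that the bridge br0 already exists and that the host's network adapter is attached to it. A minimal sketch of that step follows, assuming the adapter is called eth0 (a placeholder, check `ip link` for the real name). By default it only prints the commands; APPLY=1 is a hypothetical switch of this sketch, not something ip itself knows about.

```shell
#!/bin/sh
# Sketch: create the bridge that tap0 gets attached to.
# HOST_IF is an assumed name for the host's wired adapter; br0 matches the commands above.
HOST_IF="${HOST_IF:-eth0}"
BRIDGE="${BRIDGE:-br0}"

# Print each command by default; set APPLY=1 (and run as root) to execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

setup_bridge() {
    run ip link add name "$BRIDGE" type bridge      # create the bridge device
    run ip link set dev "$BRIDGE" up                # bring it up
    run ip link set dev "$HOST_IF" master "$BRIDGE" # attach the physical adapter
}

setup_bridge
```

Once the adapter is part of the bridge, the host's own IP configuration (static address or DHCP client) has to live on br0 instead of the adapter, so doing this over SSH on the same interface can cut the connection.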

Since I was already running Home Assistant using Docker on my x86 machine, I also got bitten by this issue which prevented my emulated RPi from communicating with the outside. This was fixed by adding this to my startup script:

sysctl net.bridge.bridge-nf-call-iptables=0

While this setup was fully functional, the speed of the emulated USB Ethernet device was quite terrible: usually less than 1 Mbit/s (I was used to my physical RaspberryPi 4 having a Gigabit Ethernet connection). CPU usage on the host was also very high, even when the RPi wasn’t doing very much. Luckily, QEMU has a better solution.

QEMU virt generic virtual platform

The virt machine type is a generic ARM virtual platform. Unlike the raspi models, it isn’t locked to a particular configuration (number of CPUs, memory, devices, etc.). It also doesn’t really model most devices such as disks and network controllers directly, but relies on the guest system’s kernel to use special virtualization-specific calls. This means a suitable kernel must be used, but it also means IO is much faster since the guest kernel isn’t going through all the motions it usually does to write to device registers, nor does it have to run the entire USB stack to talk to a USB Ethernet controller. Instead, most of the work is done by the host, making things much more efficient.

The kernel you get with Raspbian unfortunately does not support these virtualized disk and network devices. However, the Debian ARM distro has a suitable kernel, and since Raspbian is based on Debian it’s pretty easy to install. So while my RPi was still running in QEMU’s raspi3b mode, I ran the following:

wget http://security.debian.org/debian-security/pool/updates/main/l/linux/linux-image-5.10.0-21-armmp-lpae_5.10.162-1_armhf.deb
sudo dpkg --install linux-image-5.10.0-21-armmp-lpae_5.10.162-1_armhf.deb

There are many kernels to choose from in the Debian package repository, so to maximize the probability of success you should pick one that is pretty close to the one that the RPi is already running (mine was on 5.10.103). My Raspbian was still 32-bit — even though it last ran on a Raspberry Pi 4, it was first installed on a Raspberry Pi B in 2014 — so I picked the 32-bit kernel, although the arm64 one should work similarly.

Running the dpkg --install command inside Raspbian is an important step: it doesn’t just extract the kernel (which we could have obtained by extracting the .deb manually on the x86 host); it also installs the kernel modules in /lib/modules/5.10.0-21-armmp-lpae and it builds the initial ramdisk containing all the required device drivers, including the modules needed to access the virtualized disk and network adapter. We can now copy the kernel and initrd from /boot (vmlinuz-5.10.0-21-armmp-lpae and initrd.img-5.10.0-21-armmp-lpae) to the host machine, and run the VM using a command similar to this one:

qemu-system-arm \
    -nographic \
    -machine virt \
    -cpu cortex-a7 \
    -m 2G -smp 4 \
    -drive file=/rpi/root.img,format=raw,id=hd,if=none,media=disk \
    -device virtio-scsi-device -device scsi-hd,drive=hd \
    -device virtio-net-device,netdev=net0 \
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
    -kernel /rpi/vmlinuz-5.10.0-21-armmp-lpae \
    -initrd /rpi/initrd.img-5.10.0-21-armmp-lpae \
    -append 'root=/dev/sda1 panic=1 console=ttyAMA0,115200'
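
The devices in this command line (virtio-scsi-device and virtio-net-device) only work if the guest kernel has the matching virtio drivers. A small sketch for checking this inside the guest is shown below. It assumes the kernel package ships its build configuration as /boot/config-<version>, as Debian kernels normally do; the function name check_virtio_config is made up for this example, and the option names are the upstream kernel's Kconfig symbols.

```shell
#!/bin/sh
# Sketch: check a kernel config file for the virtio drivers used by the
# QEMU command above. Prints one line per option and returns non-zero
# if any of them is neither built in (=y) nor built as a module (=m).
check_virtio_config() {
    config_file="$1"
    missing=0
    # virtio-mmio transport, virtio SCSI disk, virtio network adapter
    for opt in CONFIG_VIRTIO_MMIO CONFIG_SCSI_VIRTIO CONFIG_VIRTIO_NET; do
        if grep -Eq "^${opt}=(y|m)$" "$config_file"; then
            echo "$opt: ok"
        else
            echo "$opt: missing"
            missing=1
        fi
    done
    return "$missing"
}

# Example call inside the guest (path assumed from the package name above):
#   check_virtio_config /boot/config-5.10.0-21-armmp-lpae
```

If an option is reported as a module (=m), the driver also has to end up in the initial ramdisk, which is what running dpkg inside the guest takes care of.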

While this setup is functionally pretty much the same as the raspi3b machine type, CPU utilization on the host is noticeably lower, and network speeds have gone up to 250 Mbit/s. So everything now runs pretty much as fast as it did on a real Raspberry Pi, while the actual RPi hardware can move on to do other things.


My stash of IN-12B’s should allow for a few more projects…

After a long time of doing mostly virtual (software) stuff, I felt it was time for another hardware project. I’ve always been fascinated by the old Nixie tubes, and had ordered a bunch of them before they got too expensive or became impossible to find. The obvious thing to use them for is a clock. I wanted to go for a mix of old (the Nixies) and new, so the clock is driven by a Raspberry Pi, meaning it will have WiFi and NTP so I never need to set it to the correct time, even on daylight saving time changes.


Sniper: a new multi-core simulator

Sniper is a next generation multi-threaded, high-speed and accurate x86-64 simulator. This microarchitectural simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different micro-architectures. Using this methodology, we are able to achieve good accuracy against hardware for 16-thread applications.

The Sniper simulator allows one to perform timing simulations for multi-threaded, shared-memory applications with 10s to 100+ cores, at a high speed when compared to existing simulators. The main feature of the simulator is its core model which is based on the interval core model, a fast mechanistic core model. The interval model allows for faster simulations than typical cycle-accurate simulators by jumping between miss events because of long-latency operations. On recent machines, we see speeds of up to 2 MIPS for SPLASH-2 benchmarks, and almost 3 MIPS for SpecOMP benchmarks.

This simulator, and the interval core model, are useful for uncore and system-level studies that require more detail than the typical one-IPC models. As an added benefit, the interval core model allows the generation of CPI stacks, which show the number of cycles lost due to different characteristics of the system, like the cache hierarchy or branch predictor, in an easily understood form.

Sniper is available for download at http://snipersim.org

Revenge of the low-cost supercomputer – part 2

Following up on my TOP500 list by interconnect post, here’s an update with the latest TOP500 list, from June 2010.

First, for comparison, here’s the November 2009 version at the same scale:

And this is how the list looks now:

For one, given the number of recent announcements around it, I was expecting 10G-Ethernet to have gained some more adoption by now. Yet (assuming those details in the TOP500 list, and my parsing of it, are correct), only two new systems use 10G-Eth, while those that were in last November’s list have now fallen out.

But something much more interesting seems to be happening in the lower right corner of the graph: look at the three blue (InfiniBand) dots nearest to the legend. Interestingly, all three systems are Chinese. The Tianhe-1 hybrid Intel Xeon + ATI Radeon cluster (now at #7) got some company from two more GPGPU clusters, Mole-8.5 at #19 and even the new #2, Nebulae. They are characterized by a rather low efficiency, especially for this part of the ranking. Note that no-one else in the right half fools around with mere Gigabit Ethernet or otherwise has an efficiency that is lower than 70%. Yet the number two of this list only manages a 43% efficiency, and needs 2984 TFLOPS of raw computing power (28% more than the #1 system, Jaguar) to get to a LINPACK score that is 28% less than Jaguar’s.
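
Efficiency here means the LINPACK score divided by the raw (peak) computing power. As a back-of-the-envelope check, the sketch below derives Jaguar's efficiency using only the figures quoted above (Nebulae's peak of 2984, its 43% efficiency, and the two 28% differences). The results are rounded estimates from those figures, not values read from the TOP500 list, and the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: estimate Jaguar's efficiency from the numbers in the paragraph above.
# All values use the same unit as the 2984 peak figure quoted in the text.
efficiency_estimate() {
    awk 'BEGIN {
        nebulae_peak    = 2984                    # raw computing power of Nebulae
        nebulae_eff     = 0.43                    # LINPACK score / peak
        nebulae_linpack = nebulae_peak * nebulae_eff
        jaguar_peak     = nebulae_peak / 1.28     # Nebulae has 28% more peak
        jaguar_linpack  = nebulae_linpack / 0.72  # Nebulae scores 28% less
        jaguar_eff      = jaguar_linpack / jaguar_peak
        printf "Nebulae LINPACK ~ %.0f\n", nebulae_linpack
        printf "Jaguar peak ~ %.0f\n", jaguar_peak
        printf "Jaguar LINPACK ~ %.0f\n", jaguar_linpack
        printf "Jaguar efficiency ~ %.0f%%\n", jaguar_eff * 100
    }'
}

efficiency_estimate
```

With these inputs the estimate for Jaguar comes out at roughly 76%, which fits the observation that the other systems in the right half of the graph stay above 70%.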

The explanation for this lies in two numbers that are not shown in this graph: power and cost. Just like the commodity-based clusters started to take over from the hard-core custom-built supercomputers some 10 years ago, GPGPU-based systems may very well be on their way to take over the charts. They follow the same basic recipe as those Beowulf-inspired clusters: a not-so-great efficiency, which is cured with loads of cheap, low-power processing power.

The efficiency of clusters improved drastically over time, thanks to better interconnects such as InfiniBand. The cluster idea is now so prevalent that over 80% of the systems in the Top500 today should be categorized as clusters. The next question is how GPGPU will evolve into something that can combine its advantages of low cost and low power with increased efficiency. Nvidia has some clear ideas about HPC being the future for their products (although their first step, Fermi, got executed rather strangely…). ATI/AMD are also hard at work with FireStream and OpenCL. And coming from the other side of the arena, we have the general purpose processors moving towards simpler, but many more cores per chip, an idea embodied in the form of (among many others) Tilera’s TILE family or Intel’s SCC.

Interesting times again loom ahead. Let the games, eh computations, begin!

Historical currency converter web service

Looking for an excuse to try out Google AppEngine, and encouraged by someone on StackOverflow looking for a free web service to convert between currencies at historical dates, I built the Historical currency converter web service. Using a very simple RESTful API, you can convert between all currencies on the ECB’s list, using exchange rates that date back to January 1999.


Questioner or answerer?

Yesterday on StackOverflow, I came across one of those users who kept asking questions, but didn’t really seem to understand much of the responses. Looking at his profile, it turned out he had asked over a hundred questions, but contributed fewer than ten answers. I won’t be tempted to comment on his ability to actually answer any SO questions (although his understanding of others’ answers to his own questions, except when he was able to copy-paste someone’s source code, also didn’t seem to be that great), but it did get me thinking about what a ‘common’ ratio of questions versus answers would be for other SO users (personally, I’m at 1/85 right now). Of course, that triggered my data-analysis and graphing gene…


StackOverflow user diversity

I’ve been wondering what the diversity of knowledge of StackOverflow users would be like. It seemed like an interesting research idea to see how many people have responded only to questions in a very narrow field, and how many others have broader knowledge and can contribute useful answers in more diverse fields. Apparently, there is even supposed to be a badge for that (the Generalist badge), but it hasn’t been implemented yet.

It’s easy to do this using tags: some sort of clustering should be applied according to how often each pair of tags shows up on the same question (a user that knows both ASP and ASP.net shouldn’t be considered a ‘diverse’ person, so this should be factored out first). Next, we can count the number of different clusters in which the user has contributed a good answer.
