APPLICATION NOTE

RZ/A1
XIP for RZ/A1
EU_00181 Rev.1.10
Jun 13, 2016

Introduction

Target Device

RZ/A1

Contents

1. Frame of Reference

2. What is an XIP Linux Kernel

3. RZ/A1 XIP SPI Flash Hardware Support

4. Updating the kernel image

5. Kernel RAM usage

6. Simple Benchmarks

7. Kernel vs Userland

8. File Systems and Storage

9. u-boot Modifications

1. Frame of Reference

Since the Linux kernel and open source community are constantly changing, please keep in mind that this document was written in August of 2014, and the kernel references are to the Linux-3.14 code base.

2. What is an XIP Linux Kernel

When any executable program is compiled and linked, the different portions of the program are combined in the resulting binary image. For the Linux kernel, the order is basically: text (ie, code), read-only data, initialized data variables, and uninitialized BSS variables. You can see this by examining the System.map file. For a traditional standard Linux kernel, this entire image is placed in RAM. The reason for this is that systems that generally utilize Linux are either PCs or high end embedded MPU designs where code is intended to be run from high speed RAM (DDR memory).

A couple of years ago, source code and the linker scripts within the kernel were modified such that ROM and RAM sections could be explicitly defined, as opposed to just letting the RAM sections follow the ROM sections. The main target platform was the Power PC with a parallel NOR Flash. The main purpose was to allow for a faster boot time, since the kernel would not have to be decompressed and copied into RAM before execution could begin. Instead, the kernel code could begin execution immediately. The tradeoff, however, was that NOR execution was slower than DDR execution, and NOR Flash cost more than DDR. Later, some patches were submitted to the mainline kernel for a TI OMAP device (ARM based). Again, the assumption was execution from parallel NOR flash. It should be noted that while traditional kernel utilities like mkimage were modified to produce some level of support for creating XIP kernel images that could be launched using the 'bootm' command in u-boot, that support was specific to the original Power PC experiment (and a bit of a hack). TI did release an app note on getting around the Power PC specific booting nuances, but again, it was somewhat of a workaround.

It is also worth mentioning that there is a section defined in the kernel called 'init' whose scope is only during boot time.
This means that any functions or data structures that are only needed for the boot process, and can be assumed to be used only once, can be assigned to this unique section. The benefit is that the final operation the Linux kernel performs during its boot process, before handing control off to application space, is to free any init sections, thus reclaiming valuable RAM that would otherwise be wasted holding code that will never run again. Therefore, one modification of the XIP kernel build was to ensure that while the init section RAM data could be freed, the init section code could not (since it is located in ROM).

For a RAM based kernel for the ARM architecture, the kernel is generally located at a virtual base address of 0xC0000000. Virtual memory mapping is used to remap the RAM's physical location, say external SDRAM at 0x08000000 (CS2) or internal RAM at 0x20000000 for the RZ/A1. For an XIP kernel for ARM, the kernel's ROM sections (code and constants) are mapped 16MB below the beginning of RAM, ie, at 0xBF000000. This is the same area that is used for loaded modules. See the definition of the macro XIP_VIRT_ADDR in arch/arm/include/asm/memory.h. If you examine the System.map file of an XIP kernel build, you will easily be able to identify which portions are accessed directly from ROM Flash (0xBFxxxxxx) and which portions will be in RAM (0xC0xxxxxx).

One more thing to mention is that when driver modules are loaded at run-time, they are loaded into RAM yet accessed using addresses of the form 0xBFxxxxxx. This may be a little confusing, because we just mentioned that 0xBFxxxxxx was the location of the XIP kernel ROM, but that is the beauty of virtual address mapping. Also, if you have driver code that you need to run as fast as possible, making it a driver module ensures that all of its code will execute out of RAM, which will give you better performance than a static driver that is part of an XIP kernel executing out of Flash ROM.
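As a quick way to see this split, the symbol addresses in System.map can be filtered by their leading hex digits. The sketch below fabricates a miniature System.map (the symbol names and addresses are made up purely for illustration) and counts ROM-resident versus RAM-resident symbols:

```shell
# Illustrative only: a made-up, miniature System.map in XIP layout.
cat > /tmp/System.map.sample <<'EOF'
bf000000 T _stext
bf123456 t some_rom_function
c0000000 D _sdata
c0345678 B some_bss_var
c0400000 B _end
EOF

# Code symbols (T/t) sit in the XIP ROM window at 0xBFxxxxxx:
grep -c '^bf' /tmp/System.map.sample        # prints 2
# Data/BSS symbols (D/B) sit in RAM at 0xC0xxxxxx:
grep -c '^c0' /tmp/System.map.sample        # prints 3
```

Running the same two greps against the real System.map of your build shows at a glance how much of the kernel stays in flash.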

3. RZ/A1 XIP SPI Flash Hardware Support

The RZ/A1 has the ability to memory map the contents of SPI Flash into linearly accessible/executable memory using a peripheral block called the "SPI Multi I/O Bus Controller". Basically this means that when the CPU attempts to read data or fetch code from a specific address range, hardware will automatically use the SPI channel to read the corresponding data from within the SPI Flash. Additionally, the hardware has 16 cache lines (8 bytes each) that can be used to prefetch data from flash in order to reduce latency. There is also the option to automatically fill more than one cache line with contiguous flash data on a cache miss in order to anticipate future reads. From experiments with the XIP Linux kernel, filling 2 cache lines automatically (ie, always reading 16 bytes of SPI Flash) yields the best performance results.


Other features of the XIP interface include support for both 2-bit and 4-bit address/data interfaces to the SPI flash. This greatly increases the speed at which you can retrieve data. Additionally, there are Double Data Rate (DDR) capabilities so that data is read on every clock edge, meaning you effectively send the address and read the data at a rate twice the clock operating speed. Lastly, the 2 channels of this specialized SPI Flash interface can be used in conjunction with each other, meaning you can retrieve data twice as fast since both SPI Flash devices respond to the same flash address: channel 0 holds all the odd addressed bytes and channel 1 holds all the even addressed bytes. Therefore, in theory it is possible to use 2 SPI Flash devices with the 4-bit wide interface and DDR option to retrieve 16 bits of flash data for each 50MHz SPI clock cycle (the maximum clock frequency of the RZ/A1).
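That peak figure works out with simple arithmetic; this is a back-of-the-envelope sketch using the 50 MHz maximum clock quoted above:

```shell
# Peak XIP fetch rate: 2 flash devices x 4-bit bus x 2 edges per clock (DDR)
# = 16 bits per SPI clock cycle. At the RZ/A1's 50 MHz maximum:
CLK_HZ=50000000
BITS_PER_CLK=$((2 * 4 * 2))                            # = 16
echo "$((CLK_HZ * BITS_PER_CLK / 8 / 1000000)) MB/s"   # prints "100 MB/s"
```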

4. Updating the kernel image

The use of an XIP kernel does not restrict the storage or media devices you would like to use. The only exception is that you cannot reprogram the flash device that you are currently running out of. For example, if the RZ/A1 is running in XIP mode using the Quad SPI interface, you cannot modify (erase/write) that SPI Flash device, since that would require taking the SPI peripheral out of XIP mode and putting it back into SPI mode, which would in turn crash your system. Instead, to update your kernel you would first have to save it someplace else and then reboot into u-boot or some other custom bootloader that executes out of RAM. It might be possible, however, to create a loadable kernel module in which you first load the data you want to program into a memory buffer (in kernel space) and then disable all interrupts. Since we know the entire module will be loaded into system RAM at runtime, we can then change the SPI peripheral from XIP mode to SPI mode and erase/program in our data. Of course, at that time we need to make sure no functions are used that are outside of that module, including any kernel utility functions, since we would have to completely disable any kernel code access.

5. Kernel RAM usage

Since the main motivation for moving to an XIP kernel is saving RAM, here are some numbers related to what an XIP kernel uses in terms of RAM. For a traditional kernel, all code and data are kept in RAM; therefore the majority of the kernel image that gets loaded into RAM is static code, which obviously does not require a read/write medium to reside in. Here is a comparison of compiling the same kernel as XIP vs standard. Of course the kernel has many options and drivers that can be turned on and off; for this build, only a small number of drivers were included, the largest being the Ethernet driver and TCP/IP stack. The System.map file was used to determine the amount of RAM used, basically by looking at the address of the very last symbol, '_end'.

  SDRAM kernel: 3,803 KB
  XIP kernel:     332 KB
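The RAM figure falls out of simple address arithmetic: subtract the kernel's RAM base (0xC0000000 on ARM) from the address of '_end' in System.map. The '_end' value below is hypothetical, chosen to reproduce the 332 KB XIP result:

```shell
RAM_BASE=0xC0000000
END=0xC0053000        # hypothetical '_end' address taken from System.map
echo "$(( (END - RAM_BASE) / 1024 )) KB"    # prints "332 KB"
```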

6. Simple Benchmarks

The following simple benchmarks were performed to understand what the performance differences were between an XIP kernel running out of quad SPI flash vs SDRAM.

6.1 Boot Time

To measure boot time, the time was measured from the point in u-boot at which the kernel boot process is started to the point at which the log message "Freeing unused kernel memory" is displayed, because at that point the file system is mounted and the rest of the time depends on what apps you want to start. For the kernel images tested, all RZ/A1 BSP drivers were enabled. For the SDRAM based image, a uImage was used which had to be copied from SPI flash to external SDRAM and then decompressed before the kernel could begin booting, which accounts for the slower boot time. For the XIP boot, however, the xipImage booting procedure starts immediately because there is no copying: the kernel is stored in QSPI flash uncompressed and ready to be run directly. Dual QSPI flash was used for the XIP interface. Below are the recorded boot times:

  uImage SDRAM:  9.40 seconds
  xipImage XIP:  3.05 seconds

6.2 TFTP Files from PC to Board

This test measured how long it took to transfer a file from a PC to the board. This test is relevant because the Ethernet driver and entire networking stack are located in the kernel; therefore the majority of the code that executes is in the kernel, and little is in the application (located in RAM). On the board, since we didn't want to take into account saving the file, we dump the file to /dev/null, which basically just throws it away. The file system caches were cleared before each test in order to make the tests consistent.

Three different file transfer tests were run from the PC to the RZ/A1 board:

- Retrieve a 1 Mbyte file
- Retrieve a 100 Kbyte file
- Retrieve ten 100 Kbyte files one after the other

Three different system configurations were run:

- An XIP kernel with an XIP file system (AXFS) using only internal RAM
- An XIP kernel with an XIP file system (AXFS) using only external SDRAM
- A normal kernel running from SDRAM using a Squashfs file system that copies/runs from SDRAM

Results (times in ms):

  Kernel                         1MB    100KB  100KB*10
  XIP/AXFS w/SDRAM               2533   302    2875
  XIP/AXFS (internal RAM only)   2116   253    2342
  uImage/squashfs w/SDRAM        2473   476    2644

[Chart: Kernel Speed Comparison, time (ms) for the 1MB, 100KB, and 100KB*10 tests; series: XIP/AXFS w/SDRAM, XIP/AXFS (internal RAM only), uImage/squashfs w/SDRAM]


Below are the commands used for the test:

$ ifconfig eth0 192.168.0.55 up

$ echo 3 > /proc/sys/vm/drop_caches ; awk '{print $22}' /proc/self/stat ; tftp -g -r file_1mb.dat -l /dev/null 192.168.0.100 ; awk '{print $22}' /proc/self/stat

$ echo 3 > /proc/sys/vm/drop_caches ; awk '{print $22}' /proc/self/stat ; tftp -g -r file_100kb.dat -l /dev/null 192.168.0.100 ; awk '{print $22}' /proc/self/stat

$ echo 3 > /proc/sys/vm/drop_caches ; awk '{print $22}' /proc/self/stat ; for i in `seq 1 10`; do tftp -g -r file_100kb.dat -l /dev/null 192.168.0.100 ; done; awk '{print $22}' /proc/self/stat
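The awk '{print $22}' /proc/self/stat trick in the commands above deserves a note: field 22 of /proc/<pid>/stat is the process's start time, measured in clock ticks since boot (see proc(5)). Since each awk invocation is a fresh process, printing field 22 before and after the transfer yields two tick stamps, and the elapsed time is their difference divided by the tick rate (typically 100 ticks per second). The tick values below are hypothetical:

```shell
T1=108342      # first tick stamp (hypothetical reading)
T2=108595      # second tick stamp (hypothetical reading)
HZ=100         # ticks per second; confirm on your system with: getconf CLK_TCK
echo "$(( (T2 - T1) * 1000 / HZ )) ms"   # prints "2530 ms"
```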

6.3 OpenSSL Speed Test

OpenSSL was built and added to the file system. Then, using the built-in speed test that comes with OpenSSL, three system configurations were compared:

- An XIP kernel with an XIP file system (AXFS) using only internal RAM
- An XIP kernel with an XIP file system (AXFS) using only external SDRAM
- A normal kernel running from SDRAM using a Squashfs file system that copies/runs from SDRAM

The OpenSSL speed test command lines run were:

$ openssl speed -elapsed -evp AES-128-CBC
$ openssl speed -elapsed -evp AES-256-CBC
$ openssl speed -elapsed -evp DES3
$ openssl speed -elapsed -evp SHA1
$ openssl speed -elapsed -evp SHA256

The results were as follows. The OpenSSL speed test determines how many times the system can perform an operation within 3.00 seconds, and from that result it calculates the effective performance in the form of a throughput. For each crypto operation selected, this is done for a variety of message data sizes ranging from 16 bytes to 8 Kbytes.

SDRAM kernel (uImage) with squashfs file system (execute from SDRAM)

  type           16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
  aes-128-cbc    11188     13253     14024      14261       14336
  aes-256-cbc    8873      10180     10625      10755       10774
  des-ede3-cbc   1623      1653      1667       1664        1662
  sha1           3474      9461      19079      26132       29270
  sha256         2521      6429      12180      15655       17105

XIP kernel with an XIP file system (AXFS) using only internal RAM

  type           16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
  aes-128-cbc    11148     13191     14012      14233       14368
  aes-256-cbc    8855      10143     10664      10759       10796
  des-ede3-cbc   1623      1656      1659       1661        1662
  sha1           3477      9446      19093      26077       29311
  sha256         2521      6414      12180      15660       17102


XIP kernel with an XIP file system (AXFS) using only external SDRAM

  type           16 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
  aes-128-cbc    11141     13265     14024      14250       14315
  aes-256-cbc    8864      10140     10625      10790       10791
  des-ede3-cbc   1622      1652      1661       1669        1662
  sha1           3466      9468      19063      26203       29257
  sha256         2534      6429      12130      15638       17137

Basically, the resulting performances were the same. This was expected, since the kernel isn't really used for these calculation intensive procedures. Therefore, pure user application performance should have no bearing on whether you are using an XIP vs an SDRAM kernel. Additionally, highly computational operations such as crypto often involve iterative loops, meaning that the L1 and L2 caches become heavily used and there is less effect from where the application code was loaded and executed from (internal RAM, external SDRAM, or directly from QSPI).

7. Kernel vs Userland

The reason the kernel image can be converted into an XIP image is that the kernel code always executes in a flat memory model and is always resident (never swapped out). Therefore, it was technically possible to change the address mappings such that the read only sections were mapped to ROM storage and the data sections were mapped into RAM storage.

Userland applications (located in the file system), on the other hand, have a different architecture and are not as easily or universally converted. When an application process is started, the kernel allocates virtual memory in both user RAM and kernel RAM. Executable code is then copied from the file storage area into user RAM so that it can be executed. Note that most file system storage devices are block devices, such as eMMC, SD card, or USB, meaning data is read and written in specified blocks (not byte by byte). From the application's point of view, it appears to have a linear address range that it operates in. However, in reality, as code is needed, a block of it is copied out of the storage device (via the file system) into RAM at any available location and then mapped using the Memory Management Unit (MMU) so that the application appears to have contiguous memory. Generally this is done in 4KB page blocks. Unless code is specifically requested to run, it will not be copied into RAM. Also, as other processes require RAM, old pages will be recycled, allowing new code to be run.
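The 4KB granularity mentioned above is simply the MMU page size, which you can confirm from userland (on the board, or on a typical Linux PC):

```shell
# Demand paging happens one page at a time; query the page size:
getconf PAGESIZE          # prints 4096 on a system with 4KB pages
```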

8. File Systems and Storage

Depending on your system requirements, you may require a lot of storage and therefore will use an eMMC or SD Card for your file system. On the other hand, you may have smaller file system storage requirements and could use an additional SPI Flash device attached to one of the standard SPI peripheral blocks (referred to as 'RSPI' in Renesas terms). The JFFS2 file system comes in handy for this, as it not only allows reading of file system blocks but also allows writing to them, so you can have a file system that you can dynamically change at run-time.

If you don't really have a strong requirement to modify the files in your file system during run-time, you could opt to utilize the unused portion of the QSPI Flash or NOR Flash that your device is currently operating out of. This is done by creating an MTD (Memory Technology Device) that basically maps a physical address range to a block device on which a file system can be mounted. While you then have your choice of file system to use on that mtd device, since it will be a read-only file system you could use something like squashfs, which is very fast and efficient precisely because it is read-only. You could still use others such as JFFS2, but squashfs generally will give you the highest performance. All of these are natively supported in the Linux kernel.

As a suggestion, you could combine mediums if your goal is as small a BOM cost as possible but you still need some R/W capability. For example, you could make the unused area of the XIP Quad SPI Flash an mtd block device with a read-only squashfs file system for your main storage partition, then also have a JFFS2 file system on a separate SPI Flash device (interfaced through an RSPI channel) and simply mount that file system anywhere you want in your root file system.
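As a sketch of that combination, the boot-time mounts might look like the following console session. The mtdblock numbers and mount points are assumptions; they depend entirely on your mtdparts partition layout.

```shell
$ mount -t squashfs /dev/mtdblock2 /opt        # read-only main storage in the spare QSPI area (assumed device)
$ mount -t jffs2 /dev/mtdblock3 /opt/data      # writable area on the RSPI-attached SPI flash (assumed device)
```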
Then you would have the benefits of a fast file system, but still the ability to easily write files back to the system (in your mounted JFFS2 directories only, of course).

The file system cramfs has the ability to flag individual files to be run as XIP right from the storage medium (assuming the file system image is stored in linearly accessible memory). Instead of copying the data to RAM, the MMU is set up to map directly to the physical location of the file. While this file system is read-only (not an issue

for this usage case), it has other limitations, such as a maximum file size of 16MB and a total file system size of 256MB. Of course, if your goal is to use a small 64MB SPI flash device to hold your entire system, then maybe these limitations are not an issue for you. It should be noted that, in terms of kernel support, while cramfs still exists in the mainline kernel, it has been deprecated in favor of squashfs, which has proven faster with more features and fewer restrictions, but which does not have an XIP option.

The Advanced XIP File System (AXFS), on the other hand, attempts to take the XIP functionality that cramfs offered but release you from some of its limitations. Additionally, individual pages of any file can be marked as either XIP or compressed (ie, needing to be copied to RAM), so you can tune your system for the best tradeoff of performance vs RAM savings. For example, since RAM execution is faster than ROM execution (XIP through quad SPI), you could use the tools that come with AXFS to identify heavily used file pages and mark them to be executed out of RAM instead of XIP. Overall performance can then be increased by not copying to RAM code pages that can be executed in place (and only need to be executed once in a great while), while copying to RAM the code pages that are executed many times and would benefit greatly from RAM execution. Also, if a page does not compress well, the AXFS tool will recognize that and leave it uncompressed.

To use AXFS, you will need to patch the kernel to add AXFS support, as it is not in the mainline distribution. AXFS was added to Linux-3.4-LTSI, but not Linux-3.10-LTSI. The AXFS source code can be found here:

http://axfs.sourceforge.net/
https://code.google.com/p/axfs/

9. u-boot Modifications

Since an XIP Linux kernel is not loaded into RAM and run like a conventional kernel, the standard bootm command in u-boot cannot be used. Instead, Renesas has modified u-boot and created a custom command called 'bootx'. The bootx command acts similarly to the bootm command in terms of setting up the R0, R1, and R2 registers with the correct values before jumping directly into the Linux kernel code. Before starting the kernel, you first need to set up the XIP mode of the quad SPI peripheral in the RZ/A1. We've added a command 'qspi' that you can use to change the physical interface to 1 or 2 SPI Flash memories to increase performance. Note that only u-boot needs to take care of this; no Linux kernel driver code needs to deal with any QSPI register settings.

The serial flash driver in u-boot was modified to be able to handle communicating with 2 SPI flash devices at once (both SPI flashes connect to the same channel 0 chip select). When a dual SPI Flash system is used, the odd memory locations are stored in SPI flash 0 and the even memory locations are stored in SPI flash 1. Therefore, the addressing gets a little tricky when programming, because you will send the same write command and address to both chips, but one chip will get the odd data and the other will get the even data. That means for each single physical address, you will be programming 2 bytes of data (meaning your data buffer pointer will increment twice as fast as your address pointer). While the driver rz_spi.c handles this special mode of operation, the spansion.c driver also needed to be modified a little to account for the dual SPI flash scenario. Therefore, if you choose to use an SPI Flash device other than Spansion, you will need to modify the source driver for that SPI Flash manufacturer as well. The rz_spi.c driver was modified such that you can select either single SPI flash or dual SPI flash to erase/program at run time.
This was done by using the chip select parameter in the sf probe command to flag that you want to perform operations on dual flash. The default chip select 0 (sf probe 0:0) is used to select single flash, while chip select 1 (sf probe 0:1) is used to select dual flash.
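The odd/even byte split can be sketched with a tiny example. Below, a 4-byte image (addresses 0 through 3, hex byte values made up) is divided the way the dual-flash mode distributes it: even addresses to flash 1, odd addresses to flash 0, two data bytes per device-side address:

```shell
IMAGE="00112233"                      # bytes 00,11,22,33 at addresses 0..3
FLASH1="${IMAGE:0:2}${IMAGE:4:2}"     # even addresses (0,2) go to flash 1
FLASH0="${IMAGE:2:2}${IMAGE:6:2}"     # odd addresses (1,3) go to flash 0
echo "flash0=$FLASH0 flash1=$FLASH1"  # prints "flash0=1133 flash1=0022"
```

This is why the data buffer pointer advances two bytes for every one-step advance of the flash address pointer.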


Website and Support

Renesas Electronics Website
http://www.renesas.com/

Inquiries http://www.renesas.com/contact/

All trademarks and registered trademarks are the property of their respective owners.


Revision Record

  Rev.   Date          Page   Summary
  1.00   Aug 21, 2014  —      First edition issued
  1.01   Aug 27, 2014  all    Grammar
  1.02   Dec 17, 2014  3      Removed benchmark section
  1.10   Jun 13, 2016  —      Added benchmark section back in
