Implementations, Improvements &Implications of Hardware- Assisted CFG-BasedControl Flow Integrity Schemes

DIPLOMARBEIT

zur Erlangung des akademischen Grades

Diplom-Ingenieur

im Rahmen des Studiums

Technische Informatik

eingereicht von

Mario Telesklav, BSc. Matrikelnummer 01252288

an der Fakultät für Informatik der Technischen Universität Wien Betreuung: Ao.Univ.Prof.Dipl.-Ing. Dr.techn. Andreas Steininger Mitwirkung: Univ.Ass.Dipl.-Ing. Stefan Tauner

Wien, 9. März 2021 Mario Telesklav Andreas Steininger

Technische UniversitätWien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at

Implementations, Improvements &Implications of Hardware- Assisted CFG-BasedControl Flow Integrity Schemes

DIPLOMA THESIS

submitted in partial fulfillmentofthe requirements forthe degree of

Diplom-Ingenieur

in

Computer Engineering

by

Mario Telesklav, BSc. Registration Number 01252288

to the Faculty of Informatics at the TU Wien Advisor: Ao.Univ.Prof.Dipl.-Ing. Dr.techn. Andreas Steininger Assistance: Univ.Ass.Dipl.-Ing. Stefan Tauner

Vienna, 9th March, 2021 Mario Telesklav Andreas Steininger

Technische UniversitätWien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at

Erklärung zur Verfassungder Arbeit

Mario Telesklav, BSc.

Hiermit erkläre ich, dass ichdieseArbeit selbständig verfasst habe,dass ichdie verwen- detenQuellenund Hilfsmittel vollständig angegeben habeund dass ichdie Stellen der Arbeit–einschließlichTabellen,Karten undAbbildungen –, dieanderen Werken oder dem Internet im Wortlaut oder demSinn nach entnommensind, aufjeden Fall unter Angabeder Quelleals Entlehnung kenntlich gemacht habe.

Wien, 9. März 2021 Mario Telesklav

v

Danksagung

IchmöchteinsbesonderemeinemBetreuerStefan Tauner für seine Anleitungbei der Erstellung dieser Arbeitund für all unsere bereicherndenDiskussionen danken. Sein Feedbackhat wesentlichzur Qualitätbeigetragen.Des Weiterenbin ichProf. Steininger für die Beratung zum Themader Arbeitund die Möglichkeit, sie in seinem Forschungsbereich schreiben zu können, sehr dankbar. Außerdem möchte ichmichbei meinenFreunden und meiner Familie für ihre Unterstüt- zung in jederHinsicht während meines Studiumsbedanken,vor allem in herausfordernden und stressigenZeiten. Sie haben diese Arbeit überhaupterst möglich gemacht.

vii

Acknowledgements

Iwouldliketoespeciallythank my supervisor Stefan Taunerfor his guidancethroughout creatingthis thesis and all of our enriching discussions. His feedbackhas contributed greatly to the qualityofthe finalwork.Anotherthank yougoestoProf. Steininger for advising on thetopicand the possibilityofwriting it in his department. MoreoverIwant to thank my dearfriendsand familyfor theirsupportinall respects throughout my studies, particularly in challenging and stressful times. Theymade this work possibleinthe first place.

ix

Kurzfassung

Die Möglichkeit, Speicherbereiche mit böser Absicht zu überschreiben, ist immer noch eine ernstzunehmende Sicherheitslücke vieler moderner Systeme. DieseSchwachstellen ermöglichen Code-Reuse Attacks (CRAs), beidenen bereits existierender Code für einen Angriffanstellevon eingeschleustenAnweisungen verwendetwird. CRAs sindinden letzten Jahrenbesondersrelevantgeworden.InCRAs leiten Angreifer den Kontrollfluss einesProgramms so um,dass er vombeabsichtigtenControl Flow Graph (CFG) ab- weicht,der allegültigen Sprünge, Funktionsaufrufe und Rücksprünge in einemProgramm darstellt. Eine weit verbreitete Maßnahme gegen solcheAngriffe istes, ControlFlow Integrity(CFI) sicherzustellen.Inden letzten 15 Jahren hat es dazuviele Vorschläge fürImplementierungen sowohl in Software als auchinHardware gegeben.Das zugrunde liegende Ziel ist immer zu überprüfen, ob dieProgrammausführungdem vorgesehenen CFGentspricht,indemdie Gültigkeit der Ziele vonindirekten Sprüngenzur Laufzeit sichergestellt wird. Hardware-unterstützteAnsätze weisennormalerweise eine bessere Performance auf als ImplementierungeninSoftware. Die meisten existierendenAnsätze wurdenjedochauf unterschiedlichen Plattformen und mitunterschiedlichen - Anwendungen evaluiert,was quantitative Vergleiche der vorgestelltenErgebnisse kaum aussagekräftigmacht. In dieser Arbeitgehen wirzunächst auf dieGrundlagen vonCRAs einund geben einen Überblick über CFIimAllgemeinen. Wir konzentrieren unsanschließend auf Hardware- unterstützte CFI-Ansätze und beschreibenfünf solcher SchemataimDetail,die ver- schiedeneSchutzkonzepte einsetzen. Unser Hauptbeitrag bestehtinder Implementierung der verschiedenen Ansätze auf einer gemeinsamen Plattform,umeinen aussagekräftigen quantitativenVergleich zu ermöglichen. Außerdem präsentierenwir ein neues Schema, dasauf den unsererMeinungnachbesten Konzepten aufbautund diese verbessert. Unsere Ergebnissezeigen, dassCFG-basiertesDurchsetzen vonCFI in Hardware mit geringeren Kosten in Bezug auf Laufzeitund Codegröße erzieltwerden kann, als zuvor in akademischen Konzeptengezeigt wurde.Zugleichwerdenanderswo vernachlässigte Featuresberücksichtigt und die Hardwareanforderungen auf einem ähnlichen Niveau gehalten.

Schlüsselwörter —CRA, ROP, JOP, CFG, CFI, RISC-V,SoftwareSecurity, Shadow Stack, Control Flow Transfer, EXCEC

xi

Abstract

Potentialmemory corruption errorsare still aserious vulnerability of manymodern systems and can be maliciously exploited. Code-Reuse Attacks (CRAs), where code already presentinsome application is used for an attackinstead of injected instructions, are enabled by sucherrorsand have become particularly relevantinrecent years.InCRAs attackers redirectthe control flowofaprogram in away thatitdeviates from itsintended Control Flow Graph (CFG), thatrepresents all valid jumps,functioncalls and returns in aprogram. The enforcementofControl Flow Integrity (CFI)isawell-established mitigation technique against such CRAs, backedbymanyproposalsfor implementations both in software and hardware in the past 15years. Their underlying goalisalwaysto checkwhether program execution follows itsintendedCFG by checking the validityof control flowtransfer targets at runtime.Hardware-assisted approachesusuallyhavea better performance compared to implementationsinsoftware. However, most schemes proposed by academiawere evaluated on differentplatforms and differentbenchmark applications were used, whichmakes quantitativecomparisons of thepresented results hardlymeaningful. In this work we firstreprise the basics of CRAs and giveanoverviewofCFI in general. We then focus on hardware-assisted CFI approaches and describefive concreteschemesin detail, thatcoverawide range of differentprotection concepts. The maincontributionof this work is ameaningfulquantitative comparisonbasedonour implementations of these approachesonacommon platform.Inaddition, we presentanovel hardware-assistedCFI scheme called EXtensiveCFI EnforcementConcept(EXCEC), whichisbased on what we consider thebest conceptsand ideas of theexisting approaches. EXCEC improves some details and offers amore sound protection than most other schemes. Our finalresults showthathardware-enforcedCFG-based CFI can be achieved with significantlylower overheads on runtime and code size than previously presentedin academicconcepts, while stillimplementingfeatures that have been neglected elsewhere and keepinghardware utilization at asimilarlevel.

Keywords —CRA, ROP, JOP, CFG, CFI, RISC-V,SoftwareSecurity, Shadow Stack, ControlFlowTransfer, EXCEC

xiii

Contents

Kurzfassung xi

Abstract xiii

1Introduction1 1.1 Motivation ...... 2 1.2 Code-Reuse Attacks (CRAs) ...... 5 1.3 Defense Concepts ...... 10 1.4 Aims and Contributions...... 11

2Control Flow Integrity(CFI)13 2.1 Threat Model...... 13 2.2 Introduction to Control Flow Integrity (CFI)...... 14 2.3 Control Flow Graph (CFG) ...... 18 2.4 Implementing CFI ...... 21

3Hardware-Based CFI23 3.1 Concepts ...... 23 3.2 Challenges ...... 29 3.3 Hardware-BasedCFI Implementations ...... 31

4Implementation 41 4.1 DevelopmentPlatform &Environment...... 41 4.2 Instruction SetArchitecture (ISA) Extension ...... 43 4.3 Implementation Details...... 44 4.4 Common Implementation Parameters ...... 48 4.5 EXtensiveCFI EnforcementConcept (EXCEC)...... 49

5Instrumentation 57 5.1 Introduction ...... 57 5.2 Challenges for CFI Instrumentation...... 60 5.3 GCC Plugin for CFI Instrumentation...... 62

6Evaluation 67

xv 6.1 Methodology ...... 67 6.2 Benchmarks ...... 69 6.3 Security ...... 73 6.4 Performance ...... 74 6.5 Code Size ...... 78 6.6 Utilization ...... 81 6.7 Final Comparison ...... 84

7Conclusion 87 7.1 Future Work ...... 88

ARaw Data91 A.1 Performance Overhead ...... 91 A.2CodeSizeOverhead ...... 92

ListofFigures 95

ListofTables 97

List of Listings 99

Acronyms 101

Bibliography 105 CHAPTER 1

Introduction

To this day, errorsinprograms can stillbeexploited on most processors.Vulnerabilities, whichenable the manipulationofmemorycontents, are particularly common. We refer to such errorsasmemory errors in theremainderofthis work. An attackertypically wants to change program behavior, e.g., in order to execute malicious code with the privileges of some application underattack. Memory errorscan be exploited to that end. The 2020 Common WeaknessEnumeration (CWE) by MITRE1 reportssuchvulnerabilities as the second most common family of software errors [MIT20]. Recentstatistics released by Microsoftand Google showsimilar results. According to theirstatistics,about 70 %of allsecuritybugsare related to memory errors [ZDN20]. Especially in the embeddedworld with the Internet of Things (IoT) gaining moreand moreattention, where CentralProcessing Units (CPUs) need to be small, efficientand cheap and sometimes no operating system (OS) support for advanced securityfeaturesis available, this becomesanincreasinglyserious threat. Whileitisnot alwayseasily possibletopreventmemoryerrors,conceptsfor detecting and limiting the consequences evolvedinrecentyears. Their maingoal is to detect (intentionaland malicious) deviationsfrom normal program behavior enabled by software flaws.This thesis deals with ControlFlowIntegrity(CFI) and morespecifically with hardware-basedapproachesthereof, which areconsidered an effectiveand promising approachfor enforcingcorrectprogram execution. We first describe the underlying problem in this chapter.InChapter2,the concept of CFI is explained ingeneral with alaterspecialization on hardware-based CFI in Chapter3,wheresomestate-of-the-art approaches are described.Wethen present implementationsthereof andthe design of anoveland extensiveCFI scheme, whichis basedonjoinedconceptsand ideas of existingapproaches,inChapter4.Chapter 5

1http://cwe.mitre.org/top25/archive/2020/2020_cwe_top25.html

1 1. Introduction

describes howprograms can be enriched with additional instructionsrequired for CFI. An extensive quantitative analysisofmultiple CFI implementationsispresented in Chapter 6. Chapter 7followswith afinalconclusion on the topics coveredinthis thesis.

1.1 Motivation

Memory unsafelanguages like Cand ++ potentially entail memory errors because there is no automaticbound-checking on arraysand pointer references. Aparticularlyrelevant error is memorycorruption,where an attackerisable to overwrite memory locations with arbitrarycontents.Sucherrorsare presentinavast number of applications [dCV17]. Table 1.1 listssome unsafebut frequently usedfunctions of theStandard CLibrary,which can be exploited by an attacker. It is up to the programmertohandle thesefunctions withcare [BST+00].

FunctionprototypePotential problem strcpy(char *dest, const char *src) Mayoverflowthe dest buffer strcat(char *dest, const char *src) Mayoverflowthe dest buffer getwd(char *buf) Mayoverflowthe buf buffer fscanf(FILE *f, const char *format, ...) Mayoverflowits arguments scanf(constchar *format,...) Mayoverflowits arguments sprintf(char *str, constchar *format ...) Mayoverflowthe str buffer

Table 1.1: Some unsafefunctions in theStandard CLibrary [BST+00]

Avery simple usage of suchanunsafe function is shown in Listing 1.1. Abuffer of limited size is allocated on the stackand laterfilled with userinputfrom the scanf function. If thestring readbyscanf is longer thanthe buffer size, dataiswritten beyond the allocatedmemory resulting in what is called a buffer overflow.This overwrites subsequent memory locations on thestackbecause stacks usually growdownwards.

1 char buffer[8]; 2 scanf("%s, buffer); Listing 1.1: Potentialbuffer overflow

We showanexemplary stack layout duringanarbitrary function call in Figure 1.1. First, some registersofthe caller are saved-including the return addresswherethe function should jump to after finishing. Parameters forthe callee follow. Local variablesare pushedtothe top of thestack (meaning to thelowest address) during function execution. When afixed-size buffer is declared (aspreviouslyshown in Listing 1.1), itssize is allocatedontop of thestackasalocal array.Functions like scanf readdatainto thatbuffer, but don’t checkbounds.Abuffer overflowoccurs whenthe buffer size is exceeded, leading to overwriting datawritten on the stack before (e.g., the function’s return address).

2 1.1. Motivation

Figure 1.1: RISC-V stacklayout during afunctioncall [PH]

Stack-based buffer overflows, suchasshown in the example above, are arguablythe most prominent examplefor memory corruption vulnerabilities. But there exist other forms as well. Literature listsheap-basedbuffer overflows, format string vulnerabilities, use-after-free vulnerabilities and numeric problems suchasinteger overflows as some otherpotentially exploitable weaknesses [SWK+16, SPWS13]. Manyofsucherrors could be prevented by attentive programmers.For example, there exist safe alternatives for thefunctions listed in Table 1.1 and manual bound checks could oftenbeapplied. An essential problemhere is that it is not feasibletoupdateand rework huge legacy code bases. In addition,some compilers offer features such as address sanitation fordetecting somesorts of memory corruption. These can usuallyonly be usedwith considerable run-time overhead though [SBPV12]. Unfortunately,thereexists no silver bullet forimmediate preventionofmemory corruption errors. Agreatdeal of academicand industrialeffort therefore went into techniquesfor detecting the consequences of attacks rather than preventing them in thefirst place.In[CBP+15], defense methods are classifiedinto prevent-the-corruption and prevent-the-exploit ap- proaches. We will presentdifferenttechniques for the latterinthe subsequentchapters of this thesis. Memory errorsare often triggered by mistake, e.g., by overwriting thereturn address and causing theprogram to jump to an unintendedposition or accessingsome portionof memory after it has been freed.This typicallyleadstounexpected behaviorand,most prominently,program crashes. Butmemoryerrorscan also be exploited. Figure 1.2 gives an overviewofattacks enabled by suchvulnerabilities.Theseattackscan be roughly classified into four families where most of themrequire differentmitigation concepts: Code corruptionattacks try to modify existing code. Control-flow hijackattacks

3 1. Introduction

want to alter program behaviorbychanging the controlflow. Data-only attacks aim to change program logic, e.g., by manipulatingflags used for branches. Andfinally information leaks are used to gain access to dataotherwise never shown to the user, such as passwords[SPWS13].

Figure 1.2: Attackmodel with different kindsofexploits[SPWS13]

We focusonthe second of beforementioned attackfamiliesinthiswork,i.e., on attacks aimingtohijack an application’s control flow. Suchattacks are not an endinthemselves though but usually aim to execute malicious code of theadversary’s choice. In acommon memory corruption attack, twogoals need to be achieved.The attackerhas to inject malicious code andmustmanipulatethe legit control flowinawaythatitisredirected to thepayload [CPM+98]. Mitigation techniquesfor code injection attacks exist and are widely used in commercially available computers. Suchare, forexample,DataExecution Prevention(DEP), Address Space Layout Randomization (ASLR) and stackcanaries, which are allbriefly describedinSection1.3.

There is, however, afamily of attacks called Code-Reuse Attacks (CRAs) where no code needs to be injected but instructions alreadypresentinthe program are used.

4 1.2. Code-Reuse Attacks (CRAs)

1.2Code-Reuse Attacks (CRAs)

In contrast to attacks wheremalicious code isinjected as partofthe attack, e.g., with the help of abufferoverflow, CRAs exploitlegitimatecodethatisalready presentina program. In aCRA, theattackertries to hijack the program’scontrol flowbyredirecting it to avalidaddress of hisorher choice, therebychanging program behavior. Suchcode can reside in theapplication’s sources itself butalso in shared libraries like libc. Atypical attackconsistsoftwo tasks:First, an attackermust manipulate aprogram’s control flowsuchthatitleavesits intended path. This redirection canbeachieved by manipulatingthe target addressofsome indirect control flowtransfer in the program. In general, indirect jumps,indirectfunctioncalls (e.g.,when using function pointers)and function returnscan be exploited to that end. We briefly listed attacks basedonmemory corruption errorsinthe previous section. These attacks can be used to manipulatethe target of an indirect control flowtransfer. As asecondstep, theadversary must findaway to trickthe program into executing some malicious code [Sha07]. CRAs build chains of smallcodesnippets already present in the program called gadgets in order to build malicious functionality. Early exploits usedwhole functions in libc,e.g., for system calls, instead of smallsnippets. Suchattacks are called return-into-libc (RILC) attacks [TEB+11]. Three techniqueshave evolved in recent years for launching CRAs: Return-OrientedPro- gramming (ROP), Jump-OrientedProgramming(JOP) and Call-OrientedProgramming (COP). They are not necessarilyused inanisolated waybut are sometimes combined [CW14]. We describethe three formsofCRAs in thefollowing sections.

1.2.1Return-Oriented Programming(ROP) ROPisaspecial formofCRA where theadversary creates achain of gadgets, each with adesignated functionality, whichall endinreturn statements.Theycan potentiallystart at arbitrary positionsand don’t necessarily beginatfunctionentries. Listing 1.2 shows a simpleROP gadgetfor addingtwo register values.

1 add a4,a4,a3 2 sw a5,0(a0) 3 sw a4,4(a0) 4 ret Listing 1.2: ROPgadgetfor adding tworegister values

The principle behind ROPwas first presented by Shacham in [Sha07]. Incontrast to previous RILC attacks,this approachdoesnot requirefullfunctions as gadgets. Short code sequences ending in return statementsare identified,typically onlytwo or three instructions long. Shacham’sinitialapproachwas limitedtox86 and exploited the geometry of theInstruc- tionSet Architecture (ISA),meaningthe factthat uses variable-length instructions.

5 1. Introduction

Also, x86 is considered avery dense ISA: There is ahigh probabilitythatany random byte stream can be interpreted as asequence of validinstructions. Listing 1.3 showsan examplethereof which is interpretedastwo validinstructionsonx86. If thesame stream is interpreted starting only onebyte later, the code has acompletelydifferentmeaning, as shown in Listing1.4 [Sha07].

1 f7 c7 07 00 00 00 test $0x00000007,%edi 2 0f 95 45 c3 setnzb -61(%ebp) Listing 1.3: Twovalidx86 instructions[Sha07]

1 c7 07 00 00 00 0f movl $0x0f000000,(%edi) 2 95 xchg %ebp,%eax 3 45 inc %ebp 4 c3 ret Listing 1.4: Sequence without thefirst byte hasadifferentmeaning on x86 [Sha07]

In order to be useful foraROP attack, suchasequence onlyneeds to endinareturn statement.Usuallythis is the RET instruction(C3 in Listing1.4) on x86, butother return-likeinstructions arealso possible. Shacham claimed that anysufficientlylarge body of x86 instructions, suchaslibc,contains enough gadgets for doing arbitrary computations. He identifiedacatalog of required gadgets, eachwith awell-defined operation.Theseare linked to achain by using manipulated return addresses for jumpingfrom one gadget to the next one. Thememory corruption exploited for creatingsuchachain is also usedfor enriching it with register values, actingasparameters to thegadgets, at the rightpositions. Shacham acknowledged that there exist multiple solutions for getting a fully-functionalset of gadgets, but theone he described is as follows [Sha07]:

• Loadand Store:Loading and storingregister valuesfrom/tothe memory and loading constantsintoaregister • Arithmetic and Logic:Addition, negation, logicand/or, exclusive or, shifts etc. • ControlFlow:Conditional and unconditional jumps • System and FunctionCalls

Agroup of researchersincludingShacham himselfgeneralized theconcept in 2008 and showedthat ROPattacksare not limitedtox86 and don’t require an ISA with instructions of variable length[BRSS08]. Theyshowedthatitispossible to find aTuring-complete library of gadgetsinaprogram compiled for theSPARC ISA, which only has instructions of fixed length, and concluded that theattack affects all kinds of OSes and platforms. Averybasicexample forhow aROP attackcan be started is shown in Listing 1.5. The vulnerable code is in line 3, where writing to aboundedbufferissimulated with the memcpy function. This would often be the readingofsome user inputinareal-world application.

6 1.2. Code-Reuse Attacks (CRAs)

1 int foo() { 2 char buff[4]={0}; 3 memcpy(buff, "\x00\x00...\x00\x00\x00\x00\x00\x30\x84\x00\x1c", 20); 4 printf("input:%s\r\n", buff); 5 6 ... 7 } Listing 1.5: SimpleROP attackexample

Whenexecuting foo,atfirst the return address, theframepointer and potentially other registersare pushed to thestackfor later recovery.The corresponding stack layout for an exemplarymachine is indicated in Figure 1.3. When entering thefunction, thestack pointer is first increasedand then the registers, whichare modifiedinthis function, are savedtothe stack.This is especiallyrelevantfor register $ra,whichholdsthe return address. Thenlocal variablesare created on thestack. In this example,memory for the local buff array is allocated.

0x FFFF 2006 ... 0x FFFF 2002 ... 0x FFFF 1FFE $ra (return address) $sp (stack pointer) 0x FFFF 1FFA $fp (frame pointer) $sp -4 0x FFFF 1FF6 ... $sp -8 0x FFFF 1FF2 ... 0x FFFF 1FEE ... 0x FFFF 1FFA buff $sp -20

Figure 1.3: Stacklayout for theCcode of Listing 1.5

As memcpy is executed, not only4characters are written to buff,but alsoadjacent valuesonthe stack are overwritten, namelythe savedframe pointer and return address. Laterwhen foo returns to theaddresssaved in the return addressregister,itdoesnot transfer controlbacktoits original call site but to the addresswritten to theregister with themanipulated argument for memcpy. Real-worldROP attacksare more complex and use chainsofgadgetsinorder to perform some task. However, the basicprinciple -overwriting the return address-isalwaysthe same. Figure 1.4 shows howamoreadvanced ROPattack based on chainingofcodegadgets with theintentofmaking asystemcall with certainparametersworks. First, abuffer overflow(as for examplepreviously shown in Listing 1.5) is used to overwrite legit stack contentswith the addresses of the gadgetsused inthe attack, theaddress of thefinal system call andthe parameters required forit. Theresultingstack layout after thebuffer overflowisshown in the right partofthe figure. Instead of causingareturn to the function’s call site, thefirst return instructionnow transfers controltogadget 1,whichpops parameter 1 fromthe stack,storesitin

7 1. Introduction

register ecx and further jumpstogadget 2.The second gadget then movesthis value to register eax and jumpsbacktogadget 1,whichpops parameter 2 from thestack and writesittoregister ebx using gadget 3.Now both parametersare available in the required registers eax and ebx for the finalsystem call, whichisthe target of the finaljump [ZQQQ18].

Figure 1.4: Example of agadgetchain for aROP attack[ZQQQ18]

Both[Sha07]and [BRSS08]build theirproofsofconcept on gadgetsfoundinlibc,which offersavery extensivebodyofcodeand is presentinmost Unix systems.Statically-linked applications,e.g.onbare-metal systems, arelessvulnerable becausethe availabilityof gadgetsislimited. Given aprogram large enough to find asufficientset of gadgets, such attacks are still possiblethough.

1.2.2 Jump-Oriented Programming (JOP) Similarly to return statements jumpingtotargetsbasedonthe return addressregister, alsoindirectjumps transfer controltothe addresscontained in some register.JOP exploits theses indirect jumps. In contrasttoROP,where gadgets must end in return instructions,the gadgets used for JOPattacks have to end in indirect jumps.This concept wasfirstdescribedbyBletschetal. in [BJFL11]. The JOPapproachdoesnot rely on stack-based buffer overflowsand is therefore immune to most defense mechanisms for ROPattacks. Instead of using amanipulatedstack forchaining gadgets, adispatchtable(sometimes called dispatcher gadget)isrequired, which orchestrates jumpsbetween gadgets [KOGP12]. The basic workingsofaJOP attackare shown in Figure 1.5. At the beginning, an initializer is requiredtobreak out of theintendedcontrol flowofthe program. The initializermanipulates thetarget of some indirect control flowtransfer and loadsvalues to registers required forthe attack. To thatend, memory vulnerabilities -suchaspreviously

8 1.2. Code-Reuse Attacks (CRAs) discussed -can be used. Thegadget then redirects the control flowtoadispatcher,which is responsible for chaining thegadgetstogether. Allcodesnippets end in indirect jump instructions,where their target addresses pointbacktothe dispatcher [SAS15].

Figure 1.5: WorkingsofaJOP attack[SAS15]

The dispatchergadget holds a virtual programcounter in memoryorsome register, which is usedfor indexing the gadgetaddresses in adispatchtable. Note that thereisno connection between theCPU’s Program Counter(PC) and this counter. When control flowreachesthe dispatcher,itadvances theprogram counter.This canbeaseasy as making it pointtothe next adjacentword in memory,but alsoadvanced functions are possible, depending on thedispatcher gadget andits associatedtable. The basic principle of adispatcherisshown inFigure 1.6. The program counteristhen used to indexa dispatchtable, which is providedbythe attacker(e.g., by exploitingabuffer overflow), and holds theaddressesofall gadgetsrequired for theattack [BJFL11].

Figure 1.6: Functionalityofadispatcher gadget [BJFL11]

In Cand C++ labels are used for direct jumps, which are checkedbythe compilersothat they transfer controlonly within the function. Thisisnot alwayspossiblefor indirect jumps.Gadgets usedfor aJOP attackare abletotransfer control to arbitraryaddresses also outside of thefunction they themselvesare located in.

9 1. Introduction

1.2.3 Call-Oriented Programming (COP) Yetanotherform of CRA is COP,whereindirectfunctioncalls are exploited.Although JOP and COP both exploit forward control flowtransfers (in contrasttofunctionreturns usedinROP), COP stands out for two reasons. First and more obvious, indirect function callsare usedfor theattackinstead of indirect jumps.Second,nodispatcher gadget is required in many cases. This is because indirect callsare memory-indirect on some ISAs, most prominently x86, meaningthatthe target of some indirect function callisdirectly determined from memoryand notfromsome register [CW14]. Suchcallsappearvery frequently on affected ISAs because indirect calls are alsoused for functions in dynamicallylinked libraries and vtables in C++[CW14].

1.3DefenseConcepts

In order to tackle the possibleattacks mentioned in Section1.1 and Section 1.2, many mitigation techniqueshavebeenproposed by industry and academia. Eachtechnique comes at some cost,likeaperformance penaltyorrequiredhardwaresupport. Further- more, asingle technique usually doesn’t offeruniversalprotection: Each of them aimsat averyspecificthreat model[ABEL09]. The relevanceand usabilityofdefense conceptsalso dependsonthe platform they are usedin. Forexample, CPUsusing the Harvard architecture don’t suffer from code injectionattacks in the same extent as VonNeumann-based CPUs becausedataand instructionmemoryare separated by design.Ithas been shown tough that special forms of code injection are possibleaswell [FC08]. Also, OS support is required forsome of the approaches.Webriefly list various prominentprotectionconcepts:

• DataExecutionPrevention (DEP) is an approachfoundinprocessorsusing the VonNeumann architecture (i.e.,processors, which have ashared memory for data and instructions).The ideaistomark eachmemorylocation to be either writable (for data segments)orexecutable (forinstructions), but never both at the same time. This way, code injected by exploitingsome memory error cannot be executed. The technique is alsoknown as Execute-Never (XN),Non-Executable (NX)Memory and W ⊕ X(Write XOReXecute) [ARM15, dCV17]. DEPis supportedbybothAMD and IntelCPUsand nowadays commonly usedbymodern OSes [ZQQQ18]. In the ARM ISA, theXNbit canbeused to define whether some memory region is executable[ARM15]. In principle, concepts similartoDEP also exist for baremetal systems which have aMemoryProtectionUnit (MPU)for enforcing certainpermissionsonphysical memory regions [CAS+17]. CRAs cannot be prevented with DEPthoughbecause only existingcodeisused for theattacks.

• Address Space Layout Randomization (ASLR) randomly positions program segments like .data, .text and thestackinmemoryinorder to prevent an adversaryfrom knowingcertainaddresses required forattacks. Everytime the

10 1.4. Aims andContributions

program restartsthe address-space layout changes, therebymaking attacks based on static addresses very unlikely.However, it has been shown thatASLR is ineffective on 32-bitarchitecturesbecauseaddresses canbedetermined using brute force attacks.Also,knowledgeabout asingle address is sufficient to deductthe segment’s base addresssince onlythe starting positions of segments are randomized [SPP+04].

• Dynamic Information FlowTracking (DIFT) marks datafrom insecure sources (e.g.,user inputs) and tracks flagged values throughout the CPUpipeline and memory.Data from such untrusted inputs is considered tainted.The idea is to followtainteddatainorder to seehow it propagates (i.e.,taintsother data)and, most importantly,toknowwhen it is usedindangerousways[NS05]. DIFT can be implemented in hardware, whichrequiressignificant changes to the datapath and memory,orinsoftware, whichintroduces heavy performance overhead.The methodisalso vulnerable to many false positives [KOGP12].

• Stack canaries are specific words placedonthe stack next to thereturn address. Before returning from afunction, this word is checkedagainst an expected value. Using abuffer overflowinordertooverwrite the return addresschanges this canary word tooinmost casesand leads to adetection of theattack. It is very important to choose canarywordsinaway they cannotbeguessed, otherwise theword could be included in abuffer overflow and an attackwould not be detected [CPM+98].

• Gadget Checking dynamically tracksthe targets of indirect control flowtransfers and monitorscall frequency and lengthofthe targets.Exceeding certain thresholds is considered aviolation. Theapproachisvulnerable to ahigh false positiverate though, especiallyinprograms with manysmallfunctions [ZQQQ18, CXS+09].

• ControlFlowIntegrity(CFI) makes surethat the execution flowofaprogram follows acertainpathcalled Control Flow Graph (CFG), whichisoften determined ahead of execution (e.g.,through source code analysis, binaryanalysis or execution profiling) [ABEL09]. To that end, run-time checks are added to theprogram in order to compare theactual control flowwith the intended one. That affectsall kindsofcontrol flowtransfers,most importantlyindirectfunctioncalls,returns from functionsand indirect jumps. Chapter 2goesintodetailonCFI in general and Chapter 3describessome concrete implementations.

1.4Aimsand Contributions

The most important aimofthis thesis is to establish ameaningfulquantitative com- parison of various existing hardware-assisted CFG-based CFI approaches,whichwere originallyevaluated on differenthardware platforms and withdifferentsetsofbenchmark applications in theirrespective originalproposals. Afurther goal is designing anovelCFI schemeand comparing it to existing approaches in termsofperformance, code size and hardware utilization.

11 1. Introduction

The following steps are taken to achievethesegoals:

1. Fiveexisting CFI schemes are implementedand evaluated on acommon hard- ware platform.This involves,amongother things, the considerationofhardware constraintsand the (un-)availabilityofplatform characteristics like branchdelay slots.

2. The best conceptsand ideas of existingschemesare identified and combinedina noveldesign called EXtensiveCFI EnforcementConcept (EXCEC), thatisalso implemented in hardware.Theseused conceptsare distributed across multiple independent approachesand acombination resultsinanextensiveand powerful CFI protection. EXCEC alsoimprovesseveral implementation details, comes with dedicated recursion handling, support for setjmp callsand alsoenforcesCFI duringInterruptService Routines(ISRs).

3. Aconcept for instrumentation,which is requiredfor all hardware-assisted CFI schemes, is realized by meansofaGNU Compiler Collection (GCC) plugin. Custom CFI instructions are injected at compiletime basedonsourcecode. It is particularly relevanttoidentify the correct and precise positions forCFI instructions.

4. Almost 40 benchmarksfor embeddedsystemstestsare ported to our development platform,resultinginarepresentativeand comprehensiveset of applications. Timing of thebenchmarkbodyexecutionisunifiedfor all programs and limited hardware resources on thedevelopment platform are respected.CFG information forindirect control flowtransfers,whichisrequiredfor instrumentation,isextractedfor all benchmark applications.

Our evaluation resultsshowthat CFG-based CFI can be achieved with significantlylower overheads on runtime and code size than previously presentedinotheracademic concepts while hardware utilization can be kept at asimilar leveland elsewhere neglectedfeatures can be implemented.The evaluation alsohints at various weaknessesand strengths of existingschemesand allowsfor obtaininganoptimalcombination of CFI enforcement techniques. To the best of our knowledge, we are thefirst to establish suchcomparisons. Previous works focused on qualitativeevaluations and onlyused the overheads stated inthe evaluationsofthe original proposals of therespectiveCFI schemes forcomparisons. Thoseare of course not particularly meaningfulbecause theevaluationswere performed using completelydifferentsetups.

12 CHAPTER 2

Control FlowIntegrity(CFI)

In recentyearsmanyapproacheswere presented fordealingwith Code-Reuse Attacks (CRAs).Alargepart of them are based on theconcept of Control Flow Integrity (CFI), which can be implemented on software side,with dedicated hardware support or as a mixture of both. CFI does not prevent attacksbasedonflawsinsoftware. Theconcept ratherdealswith the possible consequences thereof by detecting and consequentlypreventingCRAs. Thischapter first explainsthe underlying threatmodel used for CFI in Section2.1. We then go into detailonthe concept of CFI itselfinSection 2.2 and alsodescribeproperties and challengesconnectedtothe necessaryControl Flow Graphs (CFGs). As CFI is an abstract concept andnot limitedtoeither software or hardware approaches, different conceptsare eventually discussed in Section2.4.The subsequentChapter3latergoes into detail on hardware-enforced solutions.

2.1Threat Model

The commonly used threatmodel assumesthatanadversary is in fullcontrol of the entire data memoryofanapplication underattack when working on CFI [ABEL09]. This specifically applies to (but is not limitedto) buffer overflows(as described in Section 1.2), which canbeusedtomanipulatethe targets of indirect jumps,indirect callsand function returns. Thisthreatmodel is mainlyrelevantfor programs written in memory unsafelanguages like Cand C++.High-level languages,suchasJavaand C#, trytoprevent this kind of directlyexploitable memoryerrors by design duetotheirtypesafety.Incase their type checking system or theirinterpreter are attacked, even suchlanguages can lead to vulnerable programs though [CPM+98, dCV17].

13 2. ControlFlowIntegrity (CFI)

De Clercq et al. list trusted hardware, full knowledgeofmemory and code layout by the adversaryand the absence of other attackvectors than thepossibilityfor usualmemory corruption (e.g.,buffer overflows) as additionalassumptions[dCV17].

2.2Introduction to Control FlowIntegrity(CFI)

The original CFI concept referred to throughoutthis thesis waspublished by Abadiet al. in 2005 in [ABEL05]and laterextended in [ABEL09]. The authors described the wholepurpose of theirconcept as follows: "The CFI securitypolicy dictates that software execution must followapath of aCFG determined ahead of time".Thissection first clarifies some terminology and later goes into detailonthe concept proposed by Abadi and hiscolleagues. AControl Flow Graph (CFG) is an abstract representation of theallowedcontrol flowof aprogram. Averybasicexample is shown in Figure 2.1. Nodesinthe graph are basic blocks, which arecontinuous sequences of instructions. Basicblockshave exactly one entrypoint andone exit point, where onlythe instructionatthe exit canchange the program’scontrol flow[dCV17]. Section 2.3coversCFGsinmore detail.

1 2 5

3 4

Figure 2.1: Abstract CFGexample

The edges in aCFG can be classifiedintoforward facing- and backward facingedges. Function callsand jumps,regardless whether they are direct or indirect, are considered forward edges, returns from functions are backward edges. The identifying property of the latteristhatthese return to thecall sitewherethe function call originated from. Loops,suchasthe one in theCFG exampleinFigure 2.1, or other forms of jumpsare notconsidered backward edges because they are not preceded by afunction call. Burow et al. state thatmost CFI mechanisms are based on atwo-stage process [BCN+17]. As afirst step, the CFGofaprogram has to be generated duringananalysis phase, which is often done in astatic and offline way. The processofgenerating theunderlying CFGishighly complex. Section 2.3addresses this topic and points out some limitations. Then, the enforcement phase ensures thatthe actual control flowduring executionisin sync withwhatthe CFGcreated in the first phase allows. We distinguish between coarse-grained and fine-grainedCFI,whereDeClercq et al. define fine-grained CFI as "a policy whichonly allowscontrol flowalong valid edges of a CFG" and coarse-grained CFI as arelaxedpolicy, whereonly certain rulesare enforced, suchasthat the target of function callsalwayshas to be the first instructionofafunction

14 2.2. IntroductiontoControl Flow Integrity (CFI)

[dCV17]. Implementations for coarse-grained policiesare usually more efficientinterms of performance overhead and significantly less complex becausenoCFG information is required. It hasbeen shown thoughthatsuchpoliciesnot alwaysprovide effective protection [GABP14]. Every program contains instructionswhichtransfer control to some other position: jumps, function callsand returns. Conditionalbranches are considered as jumpshere. Targets of suchinstructions caneither be determined statically or dynamically. Explicit, direct jumps and callshavestatic targets.Those are hard-coded into theprogram, already known at compiletime and need no further run-time checks whenitcomes to forward edge CFI enforcement.This is becausethe targets cannotbemanipulated underthe assumptions of acertainthreatmodel,aspresented in Section2.1. Thingsget more complicated with dynamictargets, which areonly determined at run-time, such as when usingfunctionpointersand returning fromfunctions. TheCFI concept by Abadiand his colleagues suggests to add run-time checks for all of the beforementionedvariantsofindirectcontrol flowtransfers.The main ideaistoenforce that every jump, call andreturnonly targets oneofits alloweddestinations, i.e., nodes within the bounds of theprogram’sCFG,byaddingdynamic checks to the application. These checks are based on thethree instructions shown in Table2.1, whichare to be understood as abstract and implementation-independent at that point[ABEL09].

Instruction Description label ID Insertedatfunctionentries.Only announces the ID call ID, DST Transfers control to theaddress in register DST and checks whether the ID matches the ID announced by label ID ret ID Transfers control backtothe calling function and checks whether the ID matches the ID announced at thetarget

Table 2.1: CFI instructions proposed by Abadi et al. [ABEL09]

Figure 2.2 shows asmallexample program alongwith its CFGand the required CFI instructions fromTable2.1. sort2 callsasort functiontwice,eachtime with adifferent function pointer to some comparator function. The code of sort itselfisomitted in the figure. This function can be anysort algorithm. The comparators form an equivalence class, indicated by sharing thesame label 17 at theirentry,meaning that callingany of them is allowedfromthe same call instruction. At compiletime,itisonly known that sort callsany function of that equivalence class. Therefore the dynamic run-time check call 17,R at thefunctioncallisadded, which compares the ID contained in thecall (17 in this example) with the ID at thetargetofthe dynamic call (label 17). Checking CFG-compliantfunctionreturnsworkssimilarly,but every functionreturn is instrumented with the proposed ret ID instruction.This is duetothe fact thatevery returnisbasically an indirect jump, where the jump target is popped from thestack.

15 2. ControlFlowIntegrity (CFI)

Figure 2.2: CFI example with imprecise CFG[ABEL09]

The exampleinFigure 2.2 alsoshows thatCFG-based CFI still allowssome control flow transfers, which are actuallynot valid at run-time: sort could return to anyofits two call sites in sort2 and couldcall anyifthe comparator methods in theequivalence class 17.Eventhoughonly one is actuallyvalid duringexecution, both are partofthe legitimate CFG. This imprecisionallows exploits within avalidCFG.

Figure 2.3 presents asimilar CFGwith differentlabelsfor thetwo comparator functions. However, this introduces more complexitybecause CFI nowneeds to be able to handle multiplelabelsfor thesame control flowtransfer.

Figure 2.3: Example with imprecise CFG[GABP14]

Even the most precise CFGcannot guaranteethatafunction, whichiscalled from multiple call sites, returnstothe call site last usedfor itsinvocation. ACFG as the one shown in Figure 2.3 is considered being fully-precise,i.e., removing asingle edge fromthe graph would breakfunctionality. Attacks based on control flowtransfers within the boundsofaCFG are called Control-Flow Bending (CFB).Manydefense concepts

16 2.2. IntroductiontoControl Flow Integrity (CFI)

cannot preventCFB, and even CFI basedonafully-preciseCFG does not offer perfect protection in all cases [CBP+15].

Abadietal. proposedtocomplement the label-based forward edgeCFI with a shadow stack,whichisastackonly forreturn addressesinparallel to the actual stack. Integrity of theaddresses canbecheckedbycomparingthe valuesonbothstacks [ABEL09]. ThisimprovesCFI protectionsignificantly [CBP+15]. We discussboth concepts, i.e., label-based checks as well as shadowstacks, in Section 3.1.

Towards aconcreteimplementation of theabstractinstructions, the authorsonly discuss CFI implementationsinsoftware becausehardware supportfor new instructionswas considered unrealistic at the timeofwriting the paper. Listing2.1 shows an exemplary software implementationofthe label ID instructionfor the86Instruction Set Archi- tecture (ISA).The new instructionisadded right at the beginning of indirectly called functions.

1 mov eax, [esp+4];first instruction of destinationcode 2 3 ------ischanged to ------4 5 DD 12345678h ;label ID, as data 6 mov eax, [esp+4];destination instruction Listing 2.1: CFI instrumentation of avaliddestination (x86) [ABEL09]

The changesrequiredfor an indirectfunctioncallare presented in Listing 2.2. Before actuallytransferring controltothe destination,the label at beginning of thetarget function is loaded and compared with the label assignedtothe call site. The call is only executedwhen no mismatchisdetected, otherwise control is transferredtosome error handler.

1 call [ebx+8];call afunction pointer 2 3 ------ischanged to ------4 5 mov eax, [ebx+8];load pointer into register 6 cmp [eax+4], 12345678h ;compare opcodes at destination 7 jne error_label ;ifnot ID value, then fail 8 call eax ;call function pointer 9 prefetchnta [AABBCCDDh] ;label ID, used upon return Listing 2.2: CFI instrumentationofanindirectfunctioncall (x86) [ABEL09]

In addition,another label is definedrightafter the call at theposition to which thecallee returns. This label is usedfor checking validityofareturn target, as shown inListing2.3. Instead of simply transferring control backtothe return address, labelsare compared again-following the same principle as before.

17 2. ControlFlowIntegrity (CFI)

1 ret 10h ;return, and pop 16 extra bytes 2 3 ------is changed to ------4 5 mov ecx, [esp] ;load address into register 6 add esp, 14h ;pop 20 bytes off the stack 7 cmp [ecx+4], AABBCCDDh;compare opcodes at destination 8 jne error_label ;ifnot ID value, then fail 9 jmp ecx ;jump to return address Listing 2.3: CFI instrumentationofafunctionreturn (x86) [ABEL09]

Ideal forms of CFI, alongwith other securitymechanisms suchasDataExecution Prevention(DEP) and Address Space Layout Randomization (ASLR), offer strong protectionagainst awide rangeofattacks [GABP14]. The CFI concept alsosupports securitymechanisms other than theprotectionagainst CRAs. [ABEL09]lists protection against thecircumvention of Inlined ReferenceMonitors (IRMs), Software Fault Isolation (SFI)and Software Memory Access Control(SMAC)as examples thereof. An IRM enforcessome securitypolicy (e.g.,limitedaccesstothe stack) by injecting checking code directly into an application [ES00]. Similarly,SFI protects memory access operationsbychecking whetherthe respectiveaddress lies within an allowed range. Theconcept basicallyemulates ahardware-basedMemoryManagement Unit(MMU) and therebyenablescertainprotectionmechanisms on systems without dedicated hardware supportfor thisfeature.SMACisamore advanced SFIapproach which canenforce awiderrange of securitypolicies, suchaslimitedaccess to certain memory regions for specificpieces of code [ABEL09]. The abstract idea of CFI is also usedfor reasonsnot necessarilyrelated to malicious deviationsfrom the intended control flow. [VTVW+17]describeshow to inject faults into software forfaulttolerancetesting using CFGinformation of aprogram. [Van20]applies CFI techniques fordetecting invalid controlflow transfers due to externaldisturbances. Forinstance, electromagnetic interference canlead to bitflips, whichpotentially impact branchdecisions and targets.

2.3Control FlowGraph(CFG)

CFI enforcement for an application is based on itsCFG.Wealready brieflyintroduced suchgraphsbefore,but to be more precise:CFGsinthe contextofprogram execution are directed and potentially cyclic graphswherethe setofnodes is formed by all basic blocksofthe application and theset of edges by all control flowtransfers between these blocks. ExemplaryCFGswere shown in Figure2.1, Figure 2.2and Figure2.3. ACFG is needed in order to perform CFI run-time checks.Thesechecksmakesure thatany branchonly targets nodes thatare consistentwith the CFGinordertodetect and preventamanipulated controlflow.The principle is pictured in Figure 2.4, where CFIhas to makesure thatprogram execution onlyfollows theCFG depicted with its

18 2.3. Control Flow Graph (CFG) blue nodes and edges. CFG-basedchecks for CFI duringprogram execution can be implemented in manyways. Some approachesare presented in the followingsection and in the subsequent Chapter 3.

Figure 2.4: Exemplary flowfor breakingout of theCFG [DDE+12]

In apreparatory step, however, the CFGmust first be generated by extractingbasic blocks, control flowtransfers andthe potentialtargetsofthese. Identification of blocks and control flowtransfer is usually trivialand onlydepends on theinstructionencoding of theusedISA.

Target nodes of direct branches can be easily derived: Those are hardcoded into the application and need no CFI protection on forward edges. Alsothe targets of all function returns areimplicitly known at run-time for most cases, even though they are indirect branches.This is becausereturns target theirthe original call site of theirfunction in most cases.Rare exceptions to this statement are brieflydescribed in Section 3.2. Protectionconcepts, suchasthe beforementionedshadowstack, can be effectivelyapplied forthese backward edges [ABEL09].

Compilingaset of targetsfor indirect jumps and calls is very complex thoughand not possibleinall cases.Indirectcallsare notonly usedexplicitly likeinthe example from Figure 2.2, but also havesome importantimplicit uses. Calls to functions in dynamically linked libraries are implementedasindirect control flowtransfers since the target address is not known at compiletime.Also the resolutionofvirtualcalls in C++with vtables is basedonindirectfunctioncalls.

Depending on theeffort made it is often only possible to generateequivalenceclasses forthe setofvalidjumporcall targets[BCN+17]. One simple approachistomatchthe type of afunction pointer with the typesofall available functions. However, this often leads to avastly over-approximated setsofpotential targets.

19 2. ControlFlowIntegrity (CFI)

Listing 2.4shows asimple example for this problem. Static analysissolelybasedon function typesisnot abletofind the exact call targets of theindirectcontrol flowtransfers in lines14and 17, although thereader intuitively knowsthat either of thetwo callsonly hasone possibletarget.

1 void foo(int a) { 2 return; 3 } 4 5 void bar(int a) { 6 return; 7 } 8 9 void baz(void) { 10 int a=input(); 11 void (*fptr)(int); 12 if (a) { 13 fptr =foo; 14 fptr(); 15 }else { 16 fptr =bar; 17 fptr(); 18 } 19 } Listing 2.4: Example forproblemswith static source code analysis[BCN+17]

Moreadvancedanalysis approachesconsiderthe executionflow leading to an indirect jumporcall and can thereby limit the setofcall targets more precisely [BCN+17].

The efficiency of CFI directlydependsonthe precisionofthe underlying CFGbut static analysis is notalwaysable to deriveaprecise graph.However, this restrictionisoften accepted in theinterests of simplicity [GABP14].

It is suggestedtouse only one labelfor CFI protection of allindirect control flowtransfers in [ABEL09]. Thisofcourse introducesanextreme relaxation of theCFG,but does notrequire complexstatic analysis. Practical implementationsofthis idea, suchas IntelControl-FlowEnforcement Technology (CET)[Int19]orBranchRegulation (BR) approachesasin[KOGP12]evengoone step further and mark all functions with the same label,thereby ensuring thatsome indirect branch at least targetsthe very beginning of afunction andnot some arbitraryposition. We describethe two concepts in Chapter 3.

There alsoexist slight improvements of this simplifiedCFG assumption in literature: Using differentlabelsfor function returns,indirectfunctioncallsand indirect jumps at least createsthree equivalency classes. Anothermethodfor simplifyingthe CFGiscode duplication, i.e., inlining methods in order to preventindirect control flowtransfers in the first place.The drawbacks are apparentthoughand make this approachnot practicable in manycases [ABEL09, GABP14].

20 2.4. Implementing CFI

2.4Implementing CFI

CFI can be implementedinsoftware, with hardware enforcementorasacombination of both. After theoriginal CFI proposal in [ABEL09], support for CFI has also been added to some compilers, tools and operatingsystems (OSes) in recentyears [BCN+17]. Microsoftadded afeature called Control Flow Guard (CFG) to its Visual C++ compiler with Visual Studio2015 for basic forward edgeprotection. CFGruns on Windows8.1 and Windows10and canbeenabled by setting abuild option,whichisdisabledby default. The compilerthen prepares aset of valid targetsfor indirectfunctioncalls, which arewritten to adedicated section of thebinary header. Duringexecution, the kernel checks whether indirectcallsadhere to thisstatically determined set[msc18]. AlsoClang offers optional CFIenforcement options for forward edges.Differentchecks can be enabled, e.g. for type-checking of pointers and some casts, but require Link Time Optimization(LTO) to be active[claa].Indirectcallsare securedbyreplacing the calls with jump tables, which containonly valid targets,and leadtoanexceptionincase none of thetableentries matches the dynamic call target at run-time [clab]. Clang’s support for CFI is used in parts of theAndroid OS [and20]. Backward edge protectionisneither providedbyMicrosoft’s CFGnor by Clang’sCFI concept. GNU Compiler Collection (GCC) does not come with similar conceptsatall but requires third-partyextensions or hardware mechanisms, suchasIntel CETorARM Pointer Authentification Code (PAC) [Cor19, Ben20, QT17]. The academiccommunity also proposed manysoftwareimplementationswith different approaches. We onlybrieflylist some distinguished techniques here. To the best of our knowledge, none of theseconceptsachieved recognitionbeyond thescientific field.The reader is referred to works suchas[BCN+17]for adetailedlookonsoftware-based CFI. HyperSafe [WJ10], whichwas presentedin2010, replacesindirect branch addresses with indicestoajump table and strictly safeguards thetable. The concept introduces an overhead of less than 5%on average. The CF-Locking [BJF11]approachfrom 2011 marks all indirectcontrol flowtransfers with locks (i.e.,simple flags).The locks are set rightbefore an indirect branchand reset immediately afterwards. Detectionof CFI violationisbasedonthese locksalwaysbeing unset before an indirect control flow transfer occurs.The bin-CFI scheme [ZS13]proposed in 2013 is alsobasedonjump tablesand instruments indirect control flowtransfersinbinariesinaway thatthe original branchinstruction is replaced with ajumptablelookup. An overhead of about 4%is entailed forCprograms, respectively 8.5 %for C++ applications. In thesame year CCFIR [ZWC+13]was presented, which replaces indirect control flow transfers with directjumps whereverpossible, specifically for function returns, by adding springboards to the code. Indirect branches are onlyallowedwithin the addressrange of these springboards.Every outside branch,including those jumpingintothe springboard and from there to an external address, is adirectone. CCFIR hasaperformance decrease of less than 4%.

21 2. ControlFlowIntegrity (CFI)

MoCfi [DDE+12]followsacompletelydifferentapproachbyreplacing branchinstructions with jumpstoasharedlibrary,whichisdynamically linked and performs CFI checks. The concept both protects backward edges by meansofashadow stack as well as forward edges by checking eachindirectbranchtargetagainst aset of allowed targets. The ROPecker [CZM+14]scheme from 2014 is implementedasanOS kernelmodule and analyzes control flowbytriggeringeventsatrun-time. The modulereacts to theseevents and detectsCFI violations by looking for anomaliessuchasahighnumberofsmall, chained gadgetswith areported performanceoverhead of about 2.6 %. In 2014 MCFI [NT14]was presentedasanapproachfor indirectbranchprotection with labels similar to theoriginal CFI concept from [ABEL09], but with the significant advantage of support forseparatelycompiledcodemodules. The concept introduces about 5%overhead. Most software-sideCFI implementationsdegrade performance significantly.Different studies report an average performance overhead of up to 20 %[KOGP12, CXS+09, ZWC+13]for some concepts,which is not accepted in most productionenvironments [dCV17]. The functional scopeand protection precision of course also impactperformance. Meaningful comparisons forthis aspect are difficult though, because differentplatforms and differentmethodologies for benchmarks are usedand some of theproposalsentail strong dependencies. Other limitations introduced by some of theconcepts listedabove are considerably increasedbinary sizeand required OS support(e.g., forkernel modules). In addition,all CFI-related memory elementsresideinthe normal memory,but needtobeprotected form outside manipulation in some way.This is atypical taskfor theOS, e.g.for kernel-levelprotection[KOGP12]. PracticalimplementationsofCFI in commerciallyavailable hardware are not widely used yet. Theconcept for Intel’shighly anticipated CET wasalready specifiedin2016, but the Tiger Lake (CPU) family released in late2020 is thefirst to be equipped with theCFI protection technique [Gar20]. Intel Control-FlowEnforcement Technology (CET)isdiscussed in Section 3.3.5.ARM introducedafeature called Pointer Authentification Code (PAC)with its ARMv8.3additionin2016, which authenticates targets of indirect control flowtransfers before using them.Tothatend, the upper bits of 64-bit addresses are used to store aPAC for detecting corrupted addresses.The ISA is extendedwith instructions forcomputing and checking thecodes. Thefeature is only available for theARM 64-bitextensionthough, because thestandard 32-bitvariantdoes notofferenough availablebits in the registers for storing thePAC [Bra16, QT17] Hardware-enforcedCFI is covered in the followingChapter3andmultipleapproaches thereofare listed, some of which in more detail. The remainder of this work focuses on fine-grained and hardware-enforced variantsofCFI becausethis is still averyactive researcharea.

22 CHAPTER 3

Hardware-BasedCFI

Literatureconcerning hardware-based CFI describesmanydifferentapproaches. For example, De Clercq et al. compiled asurvey of 21 techniquesin[dCV17]with both very coarse-and fine-grainedimplementations. Thischapter sums up some advanced academicapproaches for fine-grainedCFI enforce- ment, both fromthe perspectiveofabstract concepts in Section 3.1aswell as embedded in concrete implementations in Section 3.3. Section3.2 alsolists challengesarising when workingonhardware-enforced CFI.

3.1Concepts

Concepts for hardware-assisted CFI protection can be classifiedintotwo groupsbasedon the directionofcontrol flow transfers theyaim to protect. Forward edge protection is about CFIenforcementonindirectjumps and calls. While the target is statically known for directcontrol flowtransfers,itisonly determinedat run-time forindirectones. Thismakes some sort of checkagainst theintended CFGso important. Techniques forforwardedgeprotectionneedtomakesure thatajump or call only targetsanallowedaddress, where the setofallowedaddressesisdetermined at compile-time or even before. Backward edges occurwhenever some function returns.From atechnical pointof view,every backward edgeisanindirectjump because thetarget is onlydetermined at run-timeand read fromaregister.Ideally,backward edge protection makessurethata functionreturnmay only target the call site where the function call originated from. It is sometimes arguedthatprotectionofforward edgesislessrelevantcompared to securingbackward edges, simply duetothe fact thatindirectcontrol flowtransfers on forward edges are rare in comparison to function returns [DHP+15]. Inall of the

23 3. Hardware-Based CFI

benchmark applications used for evaluations in this thesis(see Section 6.2 for alist), only about 3%ofall control flowtransfers are indirectjumps or callsand 15 %are function returns. Therestare direct function callsand branches.The evaluation wasperformed on aRISC-VCPU with only statically linked libraries and source code only writteninC. Using C++ and dynamic linkingcan increase the shares of indirect control flowtransfers. The dominance of theirdirect variantsis, however, notinquestion. AnotherimportantpropertyofCFI concepts is their statefulness.Manyfunctionsare invokedbymultiple callers,whichleads to backward edges to eachofthese call sites in the CFG. A stateful policy onlyallows returning to the most recentcall siteand triggersaCFI violationfor all other cases, even for targets validwith respecttothe CFG. Figure3.1 depicts suchacase, where thefunction foo can be called from two functions. The invocation of foo fromfunction bar,representedbythe green arrow, must be followedbyareturninstruction targeting bar again.The other backward edge constitutes aCFI violationinthis case at run-time, although being presentinthe statically determinedCFG.

bar foo

baz

Figure 3.1: Example for statefulCFI

Someprominentconceptsfor CFI enforcementonforward and backward edges are listed in the followingsections. Note that noneofthe approachesiscapable of enforcing full CFIonits own.

3.1.1 ActiveSet The active-set approachfor backward edge protection keepstrackofall currently active functions by marking them activeinaninternal array,asindicated in Figure3.2. This can be done by assigning auniquelabel to eachfunction, whichisused to indexthe active-set.Noexplicit CFGisrequiredfor this approach. Everytime the executionofa function starts, itsflag is markedactiveinthe active-set.Uponreturning fromafunction, its corresponding flag is marked inactive. It is onlyallowedtoreturntoactive functions: Whenever afunctionreturns to some return address, it is immediatelycheckedwhether the correspondingfunctionatthe targetiscurrently active[dCV17, DHP+15]. Recursion needstobespecifically consideredfor active-setsbecausewhen the most deeply nested call returns, it would unset the flag of its function in theset.Inthis case, all subsequentreturns of thesame function failbecausetheir respectiveflag is not set anymore. Counters for recursion-depth,asusedin[DHP+15], are one possible solution.

24 3.1. Concepts

... main foo bar baz ...

Figure 3.2: Active-setexample

It has been shown thoughthat keeping aset of activecall sites does not providesufficient backward edgeCFI protection. Theapproachisonlypartially stateful becausewhile the set of activecall sites is known,the order in which these aretraversed is not. Since returning to anyactivecallsite is possiblewithin the active-set policy,exploits like skipping functions whenreturning or directly jumping backtomain are possible[TW17].

3.1.2 ShadowStack Ashadowstack(or return addressstack) is an additional memory beside the usualstack for redundantlykeepingtrack of return addresses. This approach is used forprotecting backwardedges and introducesastateful CFI protection. Wheneverafunction is called directly or indirectly, the addressthis function should return to is pushed to theusual stackand additionallytothe shadowstack. Figure 3.3 showsthe connectionbetween the two stackelements. Rightbefore returning from thefunction, thetopmost valuesare popped off both stacksinorder to get thereturn addressand theresults are compared. The return is onlyvalidifthe two addresses match [dCV17]. The basicconcept of shadow stacks wasalready proposed two decades ago in [CH01].

stack ... parameters 1 shadowstack return address 1 ...... returnaddress 1 parameters 2 returnaddress 2 return address 2 returnaddress 3 ...... parameters 3 return address 3 ...... Figure 3.3: Shadowstackexample

In other words, ashadowstack makes surethat afunction can onlyreturn to itsoriginal call site. Thisisoften referred toasacall-ret pair and constitutesperfectstatefulness. It has been established that ashadowstack (or anyother stateful approach) is required for an effectiveCFI protection[CCAI16]. Anotheradvantage of this concept is that no CFG

25 3. Hardware-Based CFI

informationisneeded beforehand because theallowedvalues and thesequenceofreturn addresses is implicitlygiven by the last in -first out (LIFO) propertyofthe stack. However, there are constructs in Cand C++ whichbreakthe strict concept of call- ret pairs.For example, C++exceptions, setjmp/longjmp and tail-call elimination introducespecial cases, which have to be handled separately when using ashadowstack [CCAI16]. Suchchallengesfor CFI are listedinSection 3.2. In addition,shadowstacks require significantamounts of memory.This canbeimple- mentedusing registers, which comeswith additional costsinterms of hardware utilization but memory access does not noticeablyaffect performance.The other possibilityissome fencedoffsegmentofthe main memory,whichischeapbut access is slowerand it must be guaranteed that theshadowstack’s memory sectionisprotected fromunauthorized modification [DBGJ19, DHP+15].

3.1.3 Labels Label-based approachesuse unique identifiers for checking whether thetarget of an indirect jumporcall matches itsintendeddestination. To thatend, labels are added to the very beginning of functions and code blocks, which arethe target of indirect control flowtransfers.Theselabelsare checkedwhen performingthe transfer. In case an attacker manipulated thecontrol flowofthe program, alabel mismatchisdetected[dCV17]. Depending on the implementation,the label checkisperformed rightbefore jumping(as in the initial CFIconcept [ABEL09]), or immediatelyafter (likein[CCAI16]). Listing 3.1 and Listing 3.2 showpseudo-assembly code forchecking alabel rightafter an indirect jump. Thecallingfunctionsetsthe expected label of thecallee. It dependson the concrete implementation howthe indicated setlabel functionalityisrealized. The callee compares this label with theone at itsentry point(indicated by checklabel). If thecall waswithin the intended controlflow,the two labels match.

1 : 1 : 2 ... 2 checklabel 0x42 3 setlabel 0x42 3 ... 4 call $reg 4 ... 5 ... 5 ... Listing (3.1)Setting label in caller Listing (3.2) Label checkincallee

In principle,label checks can be usedfor both forward and backward edge protection. Securingafunction return can be implemented in the same way as theexemplary indirect call in Listing3.1 and Listing 3.2. There is one issuethough: Backward edgeprotection using labels is stateless, meaning it can onlycheck whether thereturntoaspecific label is legitimate according to theCFG,but not if thereturn targetsthe most recentcall site. Thisisaproblemfor functions, which are called from multiple call sites because areturn to anyofthem appears to be validaccordingtothe CFG[dCV17].

26 3.1. Concepts

Functions,whichare indirectly called frommultiple differentcallers,are difficult for protectionusing labelsbecause thecallee can onlycarryone label checkinits entry.This introduces acoarsegrained CFGsinceall callers must use this very same label too, thus rendering them indistinguishable for the CFI checkatthe callee. Consider the example in Listing 3.1and Listing 3.2 again.Ifbar is to be called notonlyfrom foo,all other callers must use thesame label. However, there existsolutions for this problem, such as the trampolines usedbySullivan et al. in theirhardware-assisted CFI schemepresented in [SAD+16]. When using trampolines,indirectcallsare redirected to an injected pieceofcodeunique foreachcall, which performs the CFI checkand then forwards controlflow to the intended target, if it is consideredvalid.The trampoline code checks whether thetarget addressisin accordance with theallowedCFG and triggersanexceptionifaviolationisdetected. Figure 3.5 illustrates thebasicoperationofatrampoline. Section 4.5 later goes into moredetailontrampolines.

Figure 3.5: Operation of atrampoline for indirect function calls [SAD+16]

3.1.4 PolicyMatrix Apolicy matrixisatwo-dimensional array,whereone dimensioncorrespondstothe addresses of all indirect function callsofaprogram and theotherdimensiontothe start addresses of all indirectlycalled functions. Figure 3.6 shows an exemplary policymatrix. Each entrycontains aflag indicatingwhether the function call at the respectiveaddress is allowedtotarget acertaincallee. Direct callsare notpart of thematrix [DBGJ19]. The matrixneeds to be filledwith CFGinformation on indirect callsatrun-time, e.g.at the very start of program execution. Giventhe usualaddresswidth of 32- or 64 bits, neither caller- nor callee addresscan be usedasindices to the matrix directly but needto be translated to (smaller)indices in some way.This computationcan be very expensive forlarge matrixdimensions in termsofhardware requirements and critical pathlength. Instead of using thedecoded caller addressasindexfor rows, labelscan be used. These needtobeknown in advance, i.e.,when the matrix is loaded into memory,and indirect control flowtransfers are to be annotated withthe labels.

27 3. Hardware-Based CFI

callees ... lt gt ...... main+48 s foo+12 er l bar+128 cal baz+16 ...

Figure 3.6: Policy matrix example

Similarly to labels, apolicy matrixcan be usedfor both forward and backward edge protectioninprinciple. However, the same limitation, namely the statelessnessof backward edgeCFI checks,applies.

3.1.5 Furtherconcepts Forthe sake of completenesswebrieflylist some furtherhardware-basedCFI concepts in this section,whichare either considered beingtoo coarse grainedordonot use CFG information. We makethis limitationgiven the title of this thesis.

BranchRegulation (BR) BR defendsagainst CRA attacks by checking if control flowtransfers target an addresses within the same function (i.e.,for local jumps) or theverybeginningofanother function (i.e., function calls). The conceptsthereby preventsjumps to arbitrary addresses, making CRA attacks with chainsofgadgets significantlyharder.BRdoesnot require anystatic analysissteps or knowledge about theprogram’sCFG but only enforcescoarse-grained CFI [KOGP12]. Theconcept blurswith the original label-approachdescribedinChapter 2when only a smallnumberofdifferentlabelsisused.

Cryptographic approaches Cryptographicapproachesuse encryption or signing techniquesfor CFI enforcementand not necessarily rely on CFGs. The mainidea is to detect manipulated control flowtarget addresses by meansofcryptography.Suchconcepts are able to provide very fine grained CFI butinvolvechallenges likeasignificant performance- andhardware overhead. In addition, the protection of keys needstobeconsidered [ZQQQ18]. Twoexamples forthisclassofCFI enforcement techniques are HCIC and SOFIA: HCIC usescryptographyfor detectingmanipulated branch targets and solvesthe problem of protecting encryptionkeysbyusing physical unclonable functions (PUFs) in avery

28 3.2. Challenges innovativeway [ZQQQ18]. SOFIAencrypts instructions with information dependent on the control flow. When executingthe program, control flowinformation is again used for decryption and onlypaths are abletoperformavaliddecryption.Although introducing averyfine-grainedCFI,SOFIA entails morethan 100 %runtime overhead [DCGÜ+16].

FiniteState Machine(FSM) Concepts basedonFinite State Machines (FSMs)checkfor correctsystem behaviorby onlyallowingcertainsequences of execution, which arerepresented by internalstates. Implementationsthereofcan be very coarse-grained, i.e.,mightonly checkspecific properties like thefact that function callsare onlyallowedtotargetthe first instruction of some function, quitesimilartoBR. Morefine-grained approacheslikethe one presentedin[RKHB12]enforce asecure sequenceoffunctioncalls by comparing thecall order at runtime with apre-computed modelofacorrect execution.

3.2Challenges

Implementing theconceptsdescribed in Section3.1 entails to consider and solve some challenges, whichare briefly described here.

• Recursion introduces problems forCFI [CCAI16, DHP+15]. Eveninsmall pro- grams, recursivefunctioncallscan easily stackupveryhighand therebyoverflow ashadowstack. The active-set approachonthe other handisnot vulnerable to overflows,but does notkeep trackonrecursion depth. This introduces some imprecision forreturnsfrom recursivelycalled functions.

• Tail-Call Elimination is acompileroptimization, whichintroduces an exception to the commonprinciple thatevery function returns to thesite it wascalled from. When enabled, theoptimization generates code for certainfunctioncalls,where a functiondoesnot return to itsown call site but to the call site of itscaller.Naturally, this breaksbackward edge CFI with shadowstacks [CCAI16]. Ways to handle tail-call eliminationwhen enforcing CFI are, for example, disabling this compiler optimization (e.g.,with the GCC flag -fno-optimize-sibling-calls)or excluding theaffected code from instrumentation.

• C++exceptions alsodon’tfollowthe principle of call-ret pairs.When an exception occurs,the function causingitjumps to the end of the try block by unwinding the stack until the addressofthe last instructioninthis block is foundwithoutexecutingthe functionreturns in between. The simplified control flowthereof is depicted in Figure3.7. If notspecifically handled, this leadsto inconsistencies between shadowstackand the common stack[DZL16].

29 3. Hardware-Based CFI

Figure 3.7: Simplifiedexecutionflow with aC++ exception[DZL16]

• setjmp/longjmp is alanguage feature of Cfor performingjumps across function boundaries. Acalltosetjmp stores thecurrentprogram state, e.g., important registers, in abuffer.This state canthen be restored with longjmp.The two functions allow to jump to aprevious position in theprogram without unwinding the stackstepbystep[JTC18], whichagain introducesaproblemfor CFI enforcement with shadow stacks.

•The exit function available in Centails asimilar problem.Itcan be calledfrom arbitrarypositions in the program, terminates executionimmediatelyand onlycalls functions registered with atexit before.This leavesCFI components, suchasa shadow stack, in apotentially dirtystate and mightlead to problems for subsequent executions. The _Exit functionposes aspecial formofthis problemand leads to termination without even calling atexit functions [JTC18]. However, CFI-related components canbereset upon an exit because theprogram terminates anyways at this point.

• Multithreading needs to be considered on some platforms and requires the duplicationofCFI-relatedcomponents in the CPUormore advanced solutions. Thiscan be problematic formemory-intenseelements like shadowstacks [CCAI16].

• CFGgeneration is averycomplex topic.The edges forfunctionreturns,direct callsand jumps canbeeasily deduced. Fortheirindirectvariants on theotherhand it is very hard, andsometimes even not possible, to findout the setofpossible branchtargets. Section 2.3describedsome basics of CFGgeneration.

• Instrumentation of source code or programs is requiredfor all concepts presented in this chapter. Thisisthe process of injecting(custom) instructionsinto the program in order to control CFI enforcement.Instrumentationisparticularly challenging for shared libraries and programs, where no source code is available. Also issues like dynamic linking and separatecompilation of therequiredCompilation Units (CUs) needstobeconsidered.Chapter5covers the process of instrumentation in detail.

30 3.3. Hardware-BasedCFI Implementations

• Access to required run-timeinformation,e.g., pipelineinternalsormemory elements, mightbelimitedand canalsobeproblematic. Internalsignals like the ProgramCounter(PC) might not be accessible and changestothe CPUcore are notalwayspossible. • Hardware constraints limitthe possibilities for additional elementsfor CFI (e.g., shadow stack, policy matrix or controllers),whichhavetobecarefully dimensioned. Boththe requirements forproperCFI as well as theavailable hardware resources establish lower, respectivelyupper limits. Storage elementscan be implemented in multiple ways. While,for instance,using existingmainmemory implemented with BlockRAMs (BRAMs)blocksfor ashadowstackisvery efficientinterms of hardware requirements, it introducesnew challengeslikeaccess latencyand security concerns, which have to be handled specifically.Flip Flops (FFs) on theotherhand are very fast but expensiveand particularly limited.

3.3Hardware-Based CFI Implementations

Thissection summarizes aselection of concretehardware-basedCFI implementationsof the concepts listed in Section 3.1 and alsodescribes an already specified and released implementation by Intel. Most of theapproaches listed here use multipleconceptsin order to support both forward and backward edge protection, while others focusoneither. We selected fouracademic implementation (FIXER [DBGJ19], HAFIX [DHP+15], HCFI [CCAI16]and HECFI[SAD+16]) and one industry approach(IntelCET [Gar20]) fora more detailed evaluation based on the mixofCFI conceptsand implementation details they use.These schemes pose agood sample of differentconcepts. However, there exist more hardware enforcedCFI concepts thanthe subsetcovered in this section. EspeciallyBRand cryptographic approaches, which were both briefly mentioned in Section3.1.5,are highly activeresearchtopics. Giventhe title of this thesis, we limit our focusonfine-grained CFG-based CFI concepts herethoughand refer the reader to further literature on other approaches, e.g., theextensivesurvey of hardware-based CFI by de Clercq and Verbauwhedein[dCV17]. It is worth noting at this pointthat software-sideforward edgeCFI,asdescribed in Section 2.4, is sometimesconsideredtobesufficiently efficientbecause function returns appear waymore frequentthan indirectjumps and calls[TRC+14]. Afocus on backward edges therefore seems to be areasonablerestriction. The implementationsdescribed here shareacommonunderlying principle: All of them extend some CPU with amodule forCFI enforcement,eitherbymodifyingthe core itself or using some kind of co-processor,and introduceISA extensionswith custom instructions forannotatingaprogram withCFG information. Whileextending core internalsoffersagreatdeal of possibilities,its practicalitymight be limited because existing Intellectual Property (IP) soft-core CPUs are often used in real-worldapplications.Modifying these is complex, expensiveornot possibleatall in

31 3. Hardware-Based CFI

manycases.Anotherapproachfor enforcing CFI in hardware is utilizing thecore’s debug interface. Not all required signals mightbeavailable forsomeinterfaces though [dCV17]. Adirectcomparison between theimplementationsseems temptingatthis point, but cannot be established in ameaningfulway yet. Theimplementationsare partly based on differentplatforms and the authorseachuse differentbenchmarksfor evaluation. Chapter4presents implementationsfor thefive conceptsofthis section on acommon platform,namelya32-bit RISC-V System-on-a-Chip(SoC). Based on those,meaningful comparisons in termsofperformance,codesize, hardware utilization and securityare possible, which are eventually discussed in Chapter 6.

3.3.1 FIXER:FlowIntegrityExtensionfor EmbeddedRISC-V De et. al proposed FIXER in [DBGJ19], whichprotects both forward and backward edges by combiningthe well established concepts of ashadowstack(as described in Section3.1.2)and apolicy matrix(Section 3.1.4).They implemented theirapproachon RocketChip [AAB+16], a64-bitRISC-V SoC platform, and benchmarked on aXilinx ZynqField Programmable GateArray(FPGA). RocketChip offersthe possibilityto extend theprocessor’sfunctionalitywithout modifyingthe core itself by using aco- processorcalled Rocket Custom Coprocessor (RoCC) for custom instructions. All CFI elements, including therequired memorymodules, reside there. Forbackward edge protection, theinstructionsshown in Listing 3.3 (lines3-5) are inserted in frontofevery function call.First, the current PC is written to register t0 and incremented in order to pointtothe return addressofthe callusingstandard RISC-V instructions. Then the RoCC custominstructioninline 5pushes the contentofregister t0 to the shadowstack. In addition,the instructions shown in Listing 3.4 (lines5-6) are insertedinfrontofevery function return. The RoCC instruction in line 5pops thetopmost element off the shadow stackand writes it to register t0.The conditional branchinline 6thencompares t0 to the return addressstoredinregister ra and jumpstosome error handlingcodeincase a mismatch is detected.

1 : 1 : 2 ... 2 ... 3 auipc t0,0 3 ... 4 add t0,t0,14 4 ... 5 CFI_CALL 5 CFI_RET 6 call bar 6 bne t0,ra,_cfi_error 7 ... 7 jr ra Listing (3.3) FIXER: Directfunctioncall Listing (3.4) FIXER: Functionreturn [DBGJ19] [DBGJ19]

Forward edge protection is performed by using apolicy matrix, whichispopulated before executing theprogram and queriedatevery indirect functioncall.The matrix contains a

32 3.3. Hardware-BasedCFI Implementations

bit foreachcaller-calleepair indicating whether therespectivecall is allowed. To that end, acustom CFI_FWD instruction containing both the addressofthe callaswellasthe dereferenced function pointer is inserted right before an indirect call.The CFI module in the RoCCthen checks andreturns thecorresponding bitinthe policymatrix.

Although theselected CPU platformwith the RoCC appearstoofferavery convenient way of extendingcore functionality, it comes at significantperformance costs. The CFI instrumentation for this approachshown in theabove listingsrequiresmultiple instructions in each case.The overheadcouldbereduced to asingle instructionineach caseifaccesstothe core itselfisavailable.

3.3.2HAFIX: Hardware-AssistedFlowIntegrityExtension

Davi et al. presentedahighlynoticed approachcalled HAFIX [DHP+15], which uses the concept of active-sets (as described in Section 3.1.1).They focus on backward edge protection andargue that forwardedge CFI is less challenging interms ofperformance and cantherefore be efficientlyimplementedinsoftware. This is because function returns occursignificantly more often than indirect callsand jumps.

The paper describes theimplementation of two differentconceptsfor backward edgeCFI: AshadowstackapproachimplementedonaLEON3 microprocessor with theSPARC V8 instruction set and an active-set approachonsome IntelSiskiyou Peak x86 platform. We focus on thelatterinthis summary because the first does not significantlydifferfrom othershadowstack implementations.

In contrast to most other (hardware-based)CFI approaches, HAFIX does not required apre-computed CFG. Each function is assigned auniquelabel,whichcan be as trivial as astrictlyincreasingcounter. At thebeginning of eachfunction, a CFIBR instruction with that label is insertedatthe very first position. TheCFI module contains amemory element directly indexed by this label, whichholdsone bit foreachpossiblefunction indicating whetherthe function corresponding to the label is currently active. An internal state machine makesure thateachfunctioncallisimmediatelyfollowedbysuchan instruction, otherwise acontrol flowviolationisdetected.Listing3.5 and Listing 3.6 show therequiredadditionalinstructions.

1 : 1 : 2 CFIBR 0x10 2 CFIBR 0x14 3 ... 3 ... 4 call bar 4 ... 5 CFIRET 0x10 5 ... 6 ... 6 ... 7 CFIDEL 0x10 7 CFIDEL 0x14 8 ret 8 ret Listing (3.5) HAFIX: Functioncall Listing (3.6)HAFIX: Called function

33 3. Hardware-Based CFI

Rightbefore afunctionreturns, it is flagged inactivewith a CFIDEL instructionusing the same label as the initial CFIBR.When afunctionreturns, the backward edgehas to targetaCFIRET instruction.This is again enforcedbythe state machine internalto the CFI module. CFIRET checks if thefunctioncorresponding to its label is currently flagged active and raises an exception if the return targets an inactivecallsite. As discussed in Section3.2,recursionischallenging whenitcomes to CFI. In imple- mentationswith shadowstacks, recursion can overflowthe stack and therebybreak CFI checks.However, active-set approacheshave adifferentproblem: They flagsome recursivelycalled function inactivewhen the first return of that function occurs,e.g.with the CFIDEL usedinHAFIX. When thenextreturn forthe samefunction happens,a CFI violationiswrongly detectedbecause theassociated flagwas already reset. Functions called recursivelyare instrumented differentlyinHAFIX for that very reason: Instead of usingaCFIBR instructions at thefunctionentry,aCFIREC containing the usuallabel is issued. The CFI module increases an internal counter associated to the label of thisfunction on everyrecursivecall,and decreasesitonevery return. The counter is implementedasaregister forfast access.Itisargued that using aseparate instructionfor recursivecallsprevents many memory operations on the active-set, which is implementedusing BRAMs, because only the counter registerisaccessed instead. The implementation onlysupportsnon-nested recursivefunctions because it uses asingle counterfor keepingtrackofrecursion depth. Oneofthe benchmark applications used in our evaluation,namely slre1,contains suchaform of recursion where function calls are nested like foo-bar-bar-foo-bar-bar-foo ...,whichisnot compatiblewith HAFIX for thereason describedabove. The active-set concept used here has been proventobeineffectiveinsome cases. Theodor- ides etal. showedthatmultipledifferent attacks on thecontrol floware sill possible despite having CFI backed with an active-set [TW17]. Itisalso possible to bypass CFIDEL instructionsusing stack unwinding, which eventually leads to all functions being flagged active [CCAI16].

3.3.3 HCFI: Hardware-enforced Control-Flow Integrity Christoulakisetal. presented an approachcalled HCFI in [CCAI16]. Theyusedthe concept of labels(as described in Section 3.1.3) for protecting forward edges andashadow stack(Section3.1.2) for backward edges. To thatend, they implemented theirapproach on aLeon3 32-bit SoC with the SparcV8 ISA andevaluated it on an FPGA. Listing 3.7and Listing 3.8showexample code for howcaller and callee functions are instrumented forindirectfunctioncalls in pseudoassembly. Note thatthe platform used heresupportsabranchdelay slot,whichinmanycases allowsfor averyefficientCFI implementation because the compilerisnot alwaysable to fill this slotwith meaningful instructions and would onlyissue a NOP instead. Rightafter calling afunction, a

1https://github.com/embench/embench-iot/tree/master/src/slre

34 3.3. Hardware-BasedCFI Implementations

SETPCLABEL instruction carrying alabel is executed. Thisinstructionnot onlysetsthe label used for CHECKLABEL at function entries,but alsopushes the currentPCtothe shadowstack. The delayslot is alsoused for backward edge protection, i.e.,for the CHECKPC instruction at function returns,whichpops off the shadowstackand compares thevaluewith the returntarget from theactual stack.

1 : 1 : 2 ... 2 CHECKLABEL 0x13 3 call $s2 3 ... 4 SETPCLABEL 0x13 4 ret 5 ... 5 CHECKPC Listing (3.7) HCFI: Indirect functioncall Listing (3.8) HCFI: Calledfunction

Similarly to indirect function calls, alldirect ones needtobeinstrumented too. Only nowaSETPC instruction is used, whichworks the same wayasSETPCLABEL but does not carryalabel and alsosuppresses the CHECKLABEL checkatthe entry of callee. In addition,aninternal FSMmakes sureonly valid sequences of CFI instructionsare executed. Forinstance, SETPCLABEL must immediatelybefollowed by CHECKLABEL. Mismatchescause exceptions inany case. The authorsalsoadded support for setjmp / longjmp constructs and specific recursion handling to showthat theseknown challenges, whichwedescribed in Section 3.2,can be overcome. Additional instructions announcethat a setjmp,respectivelyalongjmp,is on theway.While the former stores the currentshadowstackpointer into aseparate register, the latter unwinds the shadowstack down to that position. Recursive function callsare flagged on theshadowstackinorder to prevent astack overflow. WhenaSETPC instructionarrives and itsPCmatches thetop of theshadow stack, therecursion flag for this position is set and thePCisnot pushed to theshadow stack. Upon a CHECKPC instruction, the top of theshadowstackiscompared again and in case itsrecursion flag is set, it is notpoppedfrom the stack.Todetectthe end of recursion,alsothe second addressonthe stack is compared if comparisonwith the topmost address resulted in amismatch. Consider therecursivefactorial calculation shown inListing3.9 as abasicexample.The stacknever grows throughout thecompleteexecutionofthe recursivefunction, whichis the intended behaviorand very effective. HCFI does not keep trackofrecursion depth though, whichintroduces aproblemfor certainkindsofapplications. The problemis thatthe concept only allows forsimple formsofrecursion. Somemore complexrecursivefunctions seem to break theconcept for recursion handling suggestedbyChristoulakisand his colleagues. Forinstance, the recursive towers implementation from theRISC-V Benchmarksuite2 apparentlycannot be handled by 2 https://github.com/ucb-bar/riscv-benchmarks

35 3. Hardware-Based CFI

1 int factorial(int n) { 2 if (n >= 1) { 3 returnn*factorial(n-1); 4 }else { 5 return1; 6 } 7 } Listing 3.9: Recursive factorial function

HCFI andwrongly leadstoanexception. Listing 3.10 showsthe simplified Ccodeof the affected function.The problemhere is thatthe end of recursion cannot be detected properly because of themultiple nested recursivefunctioncallswithin the towers_- solve_h function, which allhavedifferentreturn addresses. Since there is no counter of recursion-depth associated with eachitem on theshadowstack, there is no way of knowingwhen arecursion is finished. The authorssuggestion to alsocomparethe second item on thestackdoesnot solve this problem here.

1 void towers_solve_h( ... ){ 2 int val; 3 ... 4 5 if (n==1){ 6 ... 7 val =list_pop(...); 8 list_push(...); 9 ... 10 }else { 11 towers_solve_h( n-1, ... ); 12 towers_solve_h( 1, ... ); 13 towers_solve_h( n-1, ... ); 14 } 15 } Listing 3.10: Simplifiedrecursive towers function

3.3.4 Strategy Without Tactics -Policy Agnostic Hardware- EnhancedControl-FlowIntegrity Sullivan et al. suggested avery extensiveapproachin[SAD+16], whichfollowsthe same basicprinciple as HCFI [CCAI16]but alsoprotects indirect jumps anduses theconcept of trampolines forhandling indirectcalls of functions used by multiple callers. This allowsfor greater CFGprecision.The authorsalso extendedaLEON3 core and the SPARCV8 ISA with custom instructionsfor marking indirect calls and jumps,function entries,and the targets of indirect jumps and backward edges.Sullivan et al.’s concept is called HECFI in the remainder of this thesisfor easierreference.

36 3.3. Hardware-BasedCFI Implementations

Both directand indirectfunctioncallsare extendedwith apreceding CFIBR instruction containing alabel,whichispushed onto aLabel StateStack(LSS). TheLSS is basically ashadowstackfor labels. Thisapproachisamajor difference to other conceptsusing ashadowstackbecause most of them push fullreturn addresses.The first instruction executedafter afunctionreturn has to be a CFIRET containing thesame label as the preceding CFIBR instructionofthe corresponding function call.Everything else constitutesaCFI violation.Listing3.11 and Listing3.12showexemplary functions.

1 : 1 : 2 ... 2 CFICHK 0x3 3 CFIBR 0x2 3 ... 4 CFIPRC 0x3 4 ... 5 CALL $s2 5 ... 6 CFIRET 0x2 6 ... 7 ... 7 RET Listing (3.11) HECFI: Indirect call Listing (3.12) HECFI: Callee function

Forsecuring indirect calls, additional CFIPRC instructions containing adifferentlabel are also inserted rightbefore thecall instruction.This label must matchthe label of a CFICHK instructionrightatthe entry of thecallee.The checkperformed there is only activeimmediatelyafter CFIPRC instructions,but suppressed for direct function calls. Sullivan et al. argue that using two instructions with differentlabels, i.e. one eachfor the forward-and backward edge policy,benefits security. This is necessary because thecallees mightbecalled from multiple sites, whichwouldotherwise needtouse thesame label for backward edgeprotection. However, similarconcepts(such as HCFI, see Section 3.3.3) push the currentPCtothe shadowstackinsteadoflabels. Thisenables them use the same instruction for both announcinganindirectcall and push thePCtothe stack, thereby saving one instructionincomparison to theconcept by Sullivanetal. Indirect jumpsuse thevery same concept with the onlydifference thataCFIPRJ instruction is usedinstead of CFIPRC and no stackisinvolved because jumpsdon’timply abackward edge. Thisisimportantbecause theimplementation uses an internal state machine to ensure the correct order of CFI instructions. When multiple indirect calls legitimately target the same function, all CFIPRC - CFICHK pairs would have to use the same label. This would therebyintroduceacoarse-grained CFG, potentially enabling exploits on thecontrol flow. Sullivan et al. proposed to use trampolines to solve thisproblem,suchasbriefly introduced in Section 3.1.3. The authorsonly describeavery basicuse case though, namelyone where an indirect call only hasone potentialtarget.The more general case for indirectcalls is havinga set of possibletargets. In order to support this case, trampolines need to be organized as ajumptable, i.e. by comparing theactual call targetwithalistofallowedones and redirect the jump accordingly. Afailedlookupinthis table leads to aCFI exception.

37 3. Hardware-Based CFI

3.3.5 Intel Control-FlowEnforcementTechnology (CET)

Intelpublished thespecification of an upcoming commercialsecurityfeature called CET forits x86CPUsin2016 [Int19, Gar20]. Thisrepresented amilestonebecause most other hardware-enforced CFI approaches neverleft the academicstage. As depicted in Figure 3.12, Intel’sconcept consistsoftwo components directly integrated into the core.

Figure 3.12: IntelControl-FlowEnforcement Technology [Gar20]

An ordinary shadow stackisused for backward edge protection. Its basicworkings are very similartothe shadowstackimplementationsofthe academicCFI approaches presented above. There is one significantadvantage though:Intel uses thecommon ISA instructions forfunction calls andreturns forinterfacingwiththe shadowstack andcan thereby save alot of otherwise additionally requiredinstructions.

Forforward edges, Intelspecified aconceptcalled Indirect BranchTracking (IBT), which makes surethatall sorts of indirectcontrol flowtransfers are onlyallowedtojumpto some legittarget address. Thisisvery similartoMicrosoft’s CFG, whichwas briefly discussed in Section2.4 [msc18, grs16], andthe general concept of BR,asdescribedin Section3.1.5.

There are no labelsinvolved, meaning IBTonly usesasingle equivalency class (similarly to using asingle label, as presentedin[ABEL09]) for all indirect branchtargets. To thatend,the entriesoffunctions potentially called in an indirect wayaswell as indirect jumptargetsare instrumented with an ENDBRANCH instruction. This new instruction is normallyinterpreted as a NOP on thex86 ISA, ensuringbackward-compatibilityfor CET-enabled applications on older machines. Only newIntel CPUs interpret ENDBRANCH as theonly legit target of indirect control flowtransfers and therebyensureCFI with an internal FSM[Int19].

The Tiger Lake mobile CPU family released in late2020 is thefirst to actuallyimplement CET in hardware after the announcement in 2016 [Gar20].

38 3.3. Hardware-BasedCFI Implementations

Adding the ENDBRANCH instruction to allpossible indirectbranchtargetsremains atask forcompilermakers. Naiveimplementationssimply add ENDBRANCH to all function entries and all jumptargets, therebysacrificing performance.More intelligentsolution would involvecomplex control flowanalysis duringcompilation. We compiledsimple test programs with ICC19.01, GCC 10.2 and Clang 11.0.0having full CETenabled. Disassembled binarieshintthat ENDBRANCH is added to all functions entries and all labels in the usercode, regardlesswhether they are used as targetsofindirectcontrol flowtransfers.This is because thecompilers cannot be surewhether some function is indirectly calledacross differentCUs. Only when functions are declared static or LTO is enabled forcompilation, thecompilers could derivewhichfunctions are actuallythe targets of indirect callsinour tests.Inthese cases,instrumentation with ENDBRANCH is done in amore fine-grainedway.Nohardware with CETwas available fortesting yetat the time of writing this thesis.

39

CHAPTER 4

Implementation

The hardware-enforced CFI approaches previously described in this thesisare not fit for adirectcomparison because theirimplementation environments and evaluations differ toomuch. We therefore implementedaselection of theseschemesonacommon platform, namely a32-bit RISC-V SoC, to enableabetter comparison. In order to coverand examinedifferent CFG-basedconcepts, this set consistsofFIXER [DBGJ19], HAFIX [DHP+15], HCFI [CCAI16], HECFI[SAD+16], which are all academicproposals, as well as Intel’s commercially availableControl-FlowEnforcementTechnology (CET). Thischapter first gives an overview of the software- and hardware platformused for our implementationsinSection 4.1. Section 4.2 describes howthe ISAextensionrequired for custom CFI instructions wasimplemented. Thensome remarks on the original concepts are listed and necessary changesfor an implementation on our platformare discussed in Section4.3. Section4.4 defines and explainsacommonset of parameters,whichis needed forameaningfulcomparison. Eventually,anovelscheme for hardware-enforced CFI called EXtensiveCFI Enforcement Concept(EXCEC) is presented in Section4.5,whichisbased on what we consider the best conceptsand ideas found in theexisting approachesfrom Chapter 3and is enriched with some custom enhancements.

4.1DevelopmentPlatform&Environment

The PULP platform1 and itsPULPissimo microcontroller architecture2 are used as a base forall implementationspresented in this thesis.The platform providesopen-source hardware designs as well as therequiredsoftware-side components.

1https://pulp-platform.org 2https://github.com/pulp-platform/pulpissimo

41 4. Implementation

PULPissimo is an open-source SoC platformdeveloped by ETH Zürich and the University of Bologna. Its implementation in theVerilog hardware description language (HDL) is fullysynthesizableand operates withasingle core[pul].PULPissimo allows developers to select either of theCV32E40P3 (formerly known as RI5CY) or Ibex4 (previously Zero-riscy) CPU cores for theSoC. We decidedtouse CV32E40P,whichisa32-bit in-order corewith four pipelinestages and implements theRISC-V baseISA,including the extensionsC(compressed), M(multiplication) and optionallyF(floatingpoint)[cv3]. Whilethe platform canbefully simulated,all evaluationswere performed on aXilinx ZedBoard5 and simulationwas mainly usedfor testingpurposes.The Zynq Z-7020 SoC on theboardalsocontains an ARM Cortex-A9 with considerablehardware resources, but onlyits programmable logic(PL) part wasused,i.e., the board’s FPGA, which offers 53,200 LookupTables (LUTs) and 106,400 Flip Flops(FFs). ThePULPissimo platform operates with16MHz and has atotal of 512 KiB of memory on theZedBoard. CFI-relatedfunctionalitywas implementedfor all schemes directly within this core, e.g., by addingCFI modulestothe execute-stage of thepipeline, because PULPissimo’sfull sourcecodeisavailable. All of thefive before mentioned CFI approachesand anovel scheme proposed by us (later described in Section4.5) were implemented in amodular way so that they can be exchangedinthe SoCwith minimal effort. Other changes to CV32E40P onlyaffect the instructiondecoder,wheresupport for various custom instructionswas added,and the pipeline controller forexceptionhandling. The latter receives exceptionsfromthe CFI modules andreacts upon them. On software-side, the PULP platform providesamodifiedRISC-V toolchain6 with manyPULP-specifics,including forks of RISC-V GCC andthe respectiveGNU C Library.Anotherimportantcomponentisthe PULP software developmentkit (SDK)7, which contains platform-specific configurations and function implementations. Boththe PULP RISC-V GNU toolchainand the PULP SDKare required forcross-compilation of applications for theRISC-V PULPissimoplatform and execution on theFPGA. The PULP platformalsocomes with acustomRISC-V ISA extension and adds, for example, hardware loops.Applications can be compiled with and without this extension, which affects runtime in some cases.Manyprograms showabetter performance when being compiled with PULP extensions. At the time of writing thisthesis,the PULP SDK onlysupported compilation for programs written in C, but not C++. However, all of theapplications used for evaluation and testingare written in Canyways. Theofficial setofPULP examples8 alsoonly contains Cprograms.Inaddition, usage of C++conceptslikeexceptions and inheritance

3https://github.com/openhwgroup/cv32e40p 4https://github.com/lowRISC/ibex 5http://zedboard.org/product/zedboard 6https://github.com/pulp-platform/pulp-riscv-gnu-toolchain 7https://github.com/pulp-platform/pulp-sdk 8https://github.com/pulp-platform/pulp-rt-examples

42 4.2. Instruction SetArchitecture (ISA) Extension

results in larger binaries. This canbecomeaproblem on platformswith limited hardware resources. As already mentionedabove,our developmentplatform onlycomes with a totalof512 KiBmemory, which constraints the size of executableprograms.

4.2Instruction Set Architecture (ISA) Extension

In general, most hardware-based CFI proposals require custom instructions, i.e.,exten- sions to therespectiveISA.Suchinstructions must be added to thetoolchain on software side, but also hardware support is needed. The RISC-V baseinstructionset supports 32-bitinstructions with the six formatsshown in Figure 4.1. Thissimplifiesdecoding as compared to complex ISAs like x86 [AW17].

Figure 4.1: RISC-V baseinstruction formats[AW17]

The RISC-V GNU toolchainused inour setupcan be easily extendedwith custom instructions. To that end, theformat of newinstructions needs to be defined as exemplary shown in Listing4.1 fortwo instructions similartothe ones usedfor HCFI in [CCAI16].

1 SETPC 31..20=0 19..15=0 14..12=0 11..7=0x006..2=0x1D1..0=3 2 SETPCLABEL imm20 11..7=0x016..2=0x1D1..0=3 Listing 4.1: Examplesfor custom instruction format

Here astill available opcode (bits [6:0], seeFigure 4.1) is used to identify custom CFI instructions in general and thefunctioncodeinbits [11:7] is set unique for each instruction. In accordance to Figure4.1,these bits are usuallyused forthe rd register or for immediatevalues in other instructions. We preferredtouse an instruction format abletotransport a20-bitimmediatevalue though, so therespectivebits are used as a functioncodeinstead. Note that the SETPCLABEL instruction uses the imm20 fieldfor the label, while SETPC does not use aparameter. These instructions canthen be used justasany other assembly instruction, as for instance depicted in Listing4.2.

43 4. Implementation

1 ... 2 SETPC 3 call 4 ... 5 SETPCLABEL 0x42 6 call s3 7 ... Listing 4.2: Examplesfor using custom instructions

However, thesedefinitions only allowfor injectingcustom assembly instructions and are not sufficienttoimplementanextensiontothe programming language.Section 5.3 later describes howinstructionsare internally representedinGCC. In simple words, when addinginstructions onlylikedescribedhere, thecompilerknows about theinstruction’s syntax, but not about itssemantics. This issufficientfor our application. Changes to thetoolchain are onlynecessaryfor compiler-based instrumentation ap- proaches. If no tool support is requiredand instructions are onlyadded after compiling and linking, i.e.,insome binarypost-processing step, it is not necessarilyneeded to modify thetoolchain at all becauseonly the hardware is affected. On hardware-side,the core’sdecoder must be extended in order to support custom instructions. This component is usually implemented as some sortofswitch, whichfirst identifies the instruction format (as shown inFigure 4.1) and denextracts the distinct fields based the thatformat.

4.3Implementation Details

Key components of this thesis are implementationsofexisting hardware-basedCFI conceptsonacommon RISC-V platform, which were presentedbytheir authors for differentISAs. We pickedFIXER [DBGJ19], HAFIX [DHP+15], HCFI [CCAI16], HECFI [SAD+16]and Intel’s CET[Int19]and implemented theschemesasclose to theiroriginal proposalsaspossible. However, some implementation details are not disclosedbythe authorsand therebyleave room for interpretation.Suchambiguities are noted here wherever necessary.Inaddition, porting to our RISC-V platformrequiredcertainchangesdue to differences resulting fromthe used hardware platforms and ISAs, which arealso documented in this section. All CFI schemes were implementedasseparatemodules, whichare simply attachedtothe core’spipeline. Thepipelineitself only requires minorchanges in our setup: Merely the additional custom instructionsneedtobehandledinthe decoder and are then forwarded to the respectiveCFI module.Furthermore,all exceptionsignals for CFI violations are wiredtothe pipelinecontroller.Adedicatedexceptionhandlerfor sucherrorslogs the exception cause, resetsthe coreand terminatesprogram execution. Performance impacts are consideredaproblemofshadowstacks, which are only mapped to the main memory [CCAI16, TW17]. All memory elementswere implemented with

44 4.3. Implementation Details

FFsinstead of BRAM elements for this reason. ImplementationsbasedonFFs hardly introduceany performance overhead, but FFsare arather limited and expensive resource. The functional scopeofanimplementation not onlyimpactsparameters like security and hardware utilization overhead, but also stronglyaffectsthe set of supportedapplica- tions.For example, CFI conceptswithout specific recursionhandling can exceed their shadow stackcapacitymore easily,but recursionhandling is hardfor complexcases. Other challenging language concepts likeC’s setjmp or exit functions and compiler optimizations like call-tail elimination can break CFI. Suchproblemswere describedin Section3.2. Most of thepreviously presentedCFI approachesdependoncustom instructions, partly withoutany parameters.This raisesthe questions as to whether suchinstructions are actuallyneeded.Consider, for example, the SETPC instruction used in HCFI. Its purpose is to push the return address of itsassociatedfunctioncalltothe internalshadowstack and to temporarilydisable forward edge protection for indirect branches.The same task can be achieved by extendingsemantics of actual call instructions, whichare already part of the respectiveISA, if access to internal pipeline signals is possible. We added custom instructionswith general CFI managementcommandstoall of theim- plemented schemes. CFI_ENABLE and CFI_DISABLE are used to turn CFI enforcement on before entering main,respectively off after returning from main.This way, startup and exit code is excludedfrom CFI enforcementbecauseactiveprotection is not required there. An additional CFI_RESET instructionisused to reset theinternal state of the CFI modulesbefore aprogram enters main.This is aprecautionary measure for thecase where thestate of theCFI module is leftdirtybecause of improper program termination. CFI_RESET and CFI_DISABLE are alsoexplicitly usedfor thesame reason when exit callsoccur. The following paragraphs list some findings,required modifications and ideas for improve- mentsinthe various CFI implementations.

FIXER The concept for FIXERwas originally proposedasacoprocessor to a64-bitRISC-V SoC [DBGJ19]. Our developmentplatform (as described in Section 4.1) is a32-bitRISC-V SoCthough, where the CFI module canbeintegrated directlyintothe core. Amain consequenceofimplementing theCFI module as acoprocessor is that multiple instructions are needed for getting thecurrentPC, which is required forbackward edge CFI. Thisisbecause thecoprocessor does not have access to thecore’s internal signals. Figure 4.2 shows howadirect function call andits associatedreturn are instrumented in the original FIXERconcept.Otherapproaches are abletoachievethe same functionality with only oneadditional instruction eachinstead of two or three. In order to enableafair comparison to other CFI concepts, which areintegrateddirectly into the pipeline,weimprovedthis behavior in ourimplementationsothatalsoonlyone

45 4. Implementation

1 : 1 : 2 ... 2 ... 3 auipc t0,0 3 ... 4 add t0,t0,14 4 ... 5 CFI_CALL 5 CFI_RET 6 call bar 6 bne t0,ra,_cfi_error 7 ... 7 jr ra Listing (4.3) Caller instrumentation Listing (4.4)Callee instrumentation

Figure 4.2: Instrumentedcodefor direct function call andreturn in FIXER [DBGJ19]

instructionisneeded for eachaction. Naturally,the number of instructionsrequiredfor the original FIXER has aconsiderablenegative effect on performanceand code size.

The paper is rathervague on howthe policy matrix forforwardedgeprotectionis implemented.Initially fillingthe matrix requires knowledge about both the addresses of indirectly calledfunctions as well as theaddresses of theindirectfunction calls. The latter cannot be easily determined across multiple CUsatcompile time withoutsignificant changestothe sourcecode. We thereforedecidedtonot use the caller-addresses to index the matrix, but simplelabelswhichact as indexintheir place.This tweak onlyhelps whenfilling the matrix and lookingupflags therein,but does notchangethe approach in termsofinstruction countorperformance overhead.

However, it is stillnecessarytotransform 32-bitcallee addresses to matrixindicesbecause the latterhaveasignificantly smaller value range, so some sort of addressspace translation is required.The authorsofFIXER brieflymention address decoders forthis job. Our implementation uses an additional array for decoding, which holds some unique part of indirectly called addresses. When looking up aflag in thematrix, the CFI module queriesthis arrayinordertoget an index for matrixaccess.

Thisisexpensiveinterms of critical pathlengthand hardware utilization. We note that this approachfor addressdecoding does not scalewell and offersroomfor improvement. Forinstance,apipelined implementation of thelookup could solveproblems with the lengthofthe criticalpath. Manyother concepts, suchasset-associative memories like they are usedincaches, arenot applicable herethough.

HAFIX

The proposal for HAFIX suggests to use either ashadowstackoranactive-setfor backward edge CFI [DHP+15]. Only the latter approachisconsidered in this thesissince shadowstacksalready occurinalmost all otherhardware-basedCFI proposals.

Our implementation of theactive-set is as described in theoriginalpaper, with the only difference that our set is smaller because lessmemory is available on ourdevelopment

46 4.3. Implementation Details platform and alsotypical programs containless functions in theembedded world. The selected valuesfor suchparameters are explained in Section 4.4. The approachfor recursion handlinginHAFIX does not support nested recursion, as explicitly statedbythe authors. The problemisthatonly one counter for recursion depth is used, whichcannot be shared for nested recursivecalls. Ourimplementation of HAFIX does not trytofixthis shortcoming. The slre benchmark, which is one of the applications used for evaluation (see Section 6.2),can not be executedwith HAFIX for thatreason.

HCFI The concept for HCFI wasimplementedfor theSPARC ISA [CCAI16], which allowed the authorstomakeuse of branch delay slots. OurRISC-V platformdoesnot support this concept, so it wasnecessarytochangethe position and accompanying semantics of some instructionsinour HCFI implementation.This conceptualdifferencemight have a small negativeperformance impact.Other changes were not necessary forthis approach. The recursion concept of HCFIfails in our implementation and triggersanCFI exception in the towers application from ourset of benchmarks. The reason forthatwas explained in Section3.3.3.

HECFI The proposal for HECFIwas also implementedfor SPARC, butdoesnot use branch delay slots [SAD+16]. Our implementation largely follows theapproachfromthe paper, only trampolines require additional explanation here because onlyavery basiccase with just one potentialcalltarget is described by the authors. We implemented trampolines as jump tablescontaining allvalidjump targets and added therespectivecodeatthe endofthe function where thecorrespondingindirectcall is located. During execution, the actual target addressiscompared with the table entries and onlytransfers to valid targetsare allowed.

Intel Control-FlowEnforcementTechnology (CET) IntelCET is the onlynon-academicapproachconsidered in this thesis. We focus on the basics in our implementation because theoriginal concept targets commercialfull-scale CPUs.Backward edge protection with an internal shadow stackfollowsestablished concepts, but forward edge CFI involves some details to consider. Protectionofindirectfunctioncalls and jumps onlyrequiresthe custom ENDBRANCH instruction(without anylabel or indexasparameter) at callee side. CFI forindirect forward edges is triggered implicitlybythe respectiveinstructionsand does notrequire anyinstrumentation on side of thecaller.This entails averylow performance overhead and theabsence of acustominstruction at caller sidealsomeansthat every indirect forward edge is automatically CFI enforced. OurCET module expects any JALR and

47 4. Implementation

JR instructiononforward edges to be immediately followedbyENDBRANCH.However, this is aproblemfor portions of code,whichcannot be instrumented with ENDBRANCH. The consequenceisthatitwas necessary to disable CFI enforcementfor programs with indirectcontrol flow transfers in affected functions. Basictests in asetup with direct and indirect function callsshowedthat Intel’s ICC 19.0.1, GCC10.2 and CLANG 11.0.0add ENDBRANCH to theentry of every function, including main.Only when functions are declared static or LTOisenabled, the compilers inject ENDBRANCH in amore fine-grainedway.Itremains unclear to us whether this is thecommon compiler behaviour or moreadvancedconcepts are usedfor detecting the targets of indirectcontrol flowtransfers.Weslightly changedthis concept in our implementation to enableafair comparison with other CFI schemes and onlyinstrument entries of functions,whichactually are indirectly called.

4.4Common Implementation Parameters

In order to achieveagoodcomparability, acommonset of parametersisused for all hardware-basedCFI implementations. These depend on actual requirements of the applications to be executedbut are also limited by the available hardware resources of our developmentplatform (seeSection 4.1).The valuesassignedtoeachofthe parameters are based on theset of benchmark applications used in this thesis(later described in Section6.2).

Parameter Value ADDRESS_WIDTH 32 SHADOW_STACK_SIZE 128 INDIRECT_CALLS 64 INDIRECT_JUMPS 64 INDIRECTLY_CALLED 64 SETJMP_CALLS 8 NUM_FUNCTIONS 1024 RECURSION_DEPTH 128

Table 4.1: Common implementation parameters

The SHADOW_STACK_SIZE is requiredbyall implementationsusing ashadowstack. The size allowsfor 128 nested function calls, which is sufficientfor our set of benchmark applications.Inmost cases, 32-bitinstructionaddresses (defined by ADDRESS_WIDTH) are stored on thestack. As later described in Section 4.5, we challenge theassumption that full addresses needtobestoredand presentanalternativeapproachwhereonly 18 bits are pushed. HECFIdoesnot store return addresses on its shadowstack, but labels unique to the caller functions. The NUM_FUNCTIONS parameter is used to derivethe required stack width

48 4.5. EXtensiveCFI EnforcementConcept (EXCEC) forthis case, whichresolvestolog2(NUM_FUNCTIONS) =10.The NUM_FUNCTIONS parameter is alsousedfor thesize of theactive-set used in HAFIX. Only flags are stored in this memory element, so no width parameterisrequired.

The numbers of indirectfunctioncalls andofindirectly called functions are especially rel- evantfor FIXER. INDIRECT_CALLS and INDIRECTLY_CALLED define the dimensions of thepolicymatrixinthis approach. Doublingthese twoparameters quadruples the required memory forFIXER’s matrix. Considering available hardware resources, both are set to 64. Thisisarather optimisticvalue and should be increasedifhardware resources allow.Thesetwo valuesalso define the width of labelsused for forward edgeprotection in other CFI concepts. Label-based approaches, suchasused in HCFI, thereforeonly need to handle 6-bitlabels(log2(INDIRECT_CALLS) =6). INDIRECT_JUMPS follows the same principle and describes the number of supportedindirect jumps.

Given the rare occurrences of setjmp calls, SETJMP_CALLS is set to 8. This allowsfor 8unique appearances of setjmp.

The RECURSION_DEPTH parameter definesthe maximalsupported recursion depth, which is requiredinapproaches using recursion counters. In thecontext of this work it makes no sense to use avaluelargerthan SHADOW_STACK_SIZE because theultimate goal is to achievecomparabilityand some other CFI schemes have no dedicated recursion handling but simply push allrecursivecallstotheir shadowstack.

We note thatthis selection of valuesfor common parameters is not universally applicable but onlyreflects therequirements forour setofbenchmark applications and theavailable resources in our hardware setup.

4.5 EXtensiveCFI EnforcementConcept (EXCEC)

When analyzing the existing hardware-basedCFI schemes previously described in Sec- tion3.3, one comes acrossmanypromising concepts and ideas.Unfortunately,those are distributed across multiple approaches. Thissection presents anovelhardware-based CFI schemecalledEXCEC.Itisbasedonexisting approaches, combinestheir best components and adds some enhancements.

EXCECrepresents an improvementinterms of at least one of theaspectssecurity, performance-orcodesize overhead in comparison to eachofthe existing implementations while keepinghardware utilization at asimilar level. Suchcomparisonsare extensively covered in Chapter 6.

EXCECisdesigned forCFI enforcement on forward and backward edges. As opposed to most other CFI schemes, alsoindirectjumps are protected. In addition,support for setjmp callsand dedicated recursion handlingisadded and interrupts as well as CFI enforcementinInterrupt Service Routines (ISRs) are considered.

49 4. Implementation

CFI Enforcement on Backward Edges

We decidedfor ashadowstackfor theprotectionoffunctionreturnsbecauseitwas established that other conceptslikeactive-sets do not provide sufficientprotectionfor function returns [CCAI16, TW17]. The requiredmemory for thestackresidesdirectly in the CFI module and is implementedwith FFsinsteadofBRAM elements. Thisway, the memory is both secure fromoutsidemodificationand alsodoesnot introduce any performance impact because registers can be accessed withinone clock cycle.

There are some importantdetails involved whenusing shadow stacks though. First, the position of therespectiveCFI instructions has greatimpact on checking theintegrityof backward edges. Some approaches perform thecheckrightwhenafunction returns, i.e., by the function’s secondtolastinstruction before the actual return or in anybranch delay slot of thereturn. Othersdosoatthe target of thereturnand additionallymake surethat the backward edge is immediatelyfollowedbysuchacheckusing some sortof FSM. We consider thefirst concept, i.e. checking thereturn addressbeforeexecuting the return instruction, favorable becauseitislesscomplex.

As asecondissue thequestionarises as to which values should actuallybepushed to the shadowstack. Forinstance,HECFI uses simple labels whileHCFI pushesfull 32-bit addresses to the stack.Anenhanced versionofthe latterapproachisused in EXCEC because it reducescomplexity wheninstrumenting code with the underlying CFGsince no labels are to be managed. Custom instructionsare notused becauseitispossible to utilize theexisting CALL,respectively RET,instructionsofthe RISC-VISA forpushing to and popping from the stack in oursetup.

As another tweak for theshadowstackonly thebits [18:1] of 32-bitaddressesare stored, as depicted in Figure4.3,whichcuts memory demand almost in half. The base address of PULPissimo’s 512 KiB L2 memory is 0x1c000000 forapplications builtwith the PULP SDK. Thismeansthatthe 13 most significantbits (MSBs) of all instruction addresses have aconstantvalue, which makes storingand comparing them meaningless. In addition,instruction addresses arealwaysevenand instructions have afixed lengthof 4bytes in the base RISC-V ISA[PW17], sothe twoleast significantbits (LSBs)carry no information. However, PULPissimosupportsthe RISC-V extension forcompressed (i.e.,16-bit) instructions, so onlythe LSB can be ignored.Thesetweaks are very platform-dependentthoughand mightnot be applicable for others.

Figure 4.3: Part of addressused for shadow stackinEXCEC

50 4.5. EXtensiveCFI EnforcementConcept (EXCEC)

CFI Enforcement on Forward Edges Whileapolicy matrixfor forward edgeprotection, suchasused in FIXER, is very intuitiveand offerssecurityatanoptimalgranularity,its memory demandsoutweigh the benefits of theconcept.Wetherefore optedfor using labelsinaway similar to HCFI instead of apolicy matrix. The labels are compared between caller and callee upon every indirect function call and indirect jump. In addition, Trampolines are used in order to achievefine-grainedCFI forfunctions, which are indirectly called frommore than one call site. Thisconcept wasintroduced in [SAD+16]. The workings of trampolinesinEXCEC areshown foranexemplaryindirect function call in Figure 4.4. Trampoline code is added to thevery end of thefunction, where thetrampoline is used.Normal control flownever reachesthis code,but the indirect function call redirects controlflow there. Thecustom CFI instructions usedin Figure 4.4 are explainedindetailinthe nextsection.

foo: _tr_foo: ... cfi_check 0x42 addi sp, sp, -4 lw s3, 0(sp) lt: sw s3, 0(sp) addi sp, sp, 4 cfi_check 0x60 la s3, _tr_foo la t0, ... cfi_call 0x42 beq s3, t0, lt +4 ret call s3 la t0, ... beq s3, t0, gt +4 gt: ret cfi_check 0x0 cfi_check 0x70 ... ret

Figure 4.4: Workingsofatrampoline for indirect functioncalls in EXCEC

Whenanindirect function call occurs, its target addressisfirst savedtothe normal stack. Thisisdepicted by the first twoinstructions on theleft side inFigure4.4.Then the address(register s3 in this example) is overwritten with the trampoline’saddress so thatthe indirect call transfers control to trampoline code instead of itsactual target.A CFI_CALL instruction carrying auniquelabel is added rightbefore theindirect control flowtransfer. Thecall itself is then redirected to atrampoline, which first checks the label to ensures thatthe trampoline is unique for eachcall and cannot be reused from others. Afterwardsthe trampoline retrievestooriginal call target from thestackagain and compares it to theaddresses in the jump table (lt and gt in the example in Figure 4.4).Incase the12-bitoffsetsencoded in thetable’s beq instructions are not sufficient, direct jumps, whichallowlarger offsets, could be usedincombination with conditional branches instead. Only if amatching table entryisfound, control flowis directed to the respectivefunctionwith adirectbranch, bypassing the CFI checkatits entry. This way, the indirectly called functions (e.g., lt and gt)can stillbeinvoked from other callerswithout using the same label for all of theircall sites. The CFI_CHECK 0x0 instruction at the end of thetrampoline explicitly triggersanexceptionincase the table contains no matching entry.

51 4. Implementation

ISAExtension The RISC-V toolchainofferstoadd custom instructions as an easyway to extend the base instruction set.This is done by adding the 32-bitcustom instructionsfor CFI enforcementshown in Table 4.2.

Instruction instr[31:12] instr[11:7] instr[6:0] CFI_CALL label imm20 0x11 0x77 CFI_JUMP label imm20 0x12 0x77 CFI_CHECK label imm20 0x13 0x77 CFI_SETJMP index imm20 0x14 0x77 CFI_LONGJUMP 0x0 0x15 0x77

Table 4.2: CFI instructions for EXCEC

Custom instructionsfor announcingfunctioncalls,returnsand jumps, suchasthe SETPC and CHECKPC instructionsofHCFI, are not needed because fullaccesstopipeline internalsisavailable and already existingsignals for thevariouscontrol transfer types can be used. It is possibletodirectly use the branchsignalsfromthe core’s decoder as commands for theCFI module and therebysignificantly decrease both performance and code size overhead. The imm20 fieldofRISC-V instructions is usedfor label and index values. CFI semantics of thechanged RISC-V JAL, JALR and RET instructions, as well as of theinstructions listed above in Table 4.2, are defined as follows:

• JAL and JALR are theexisting instructions fordirectand indirect function callsin the RISC-V ISA. When CFI enforcementisenabled, they (also)announce the call to the CFI module,whichpushes the addressofthe subsequentinstruction to the shadowstackfor backward edge protection.

• RET is theexisting RISC-V instructionfor function returns. If CFI enforcementis enabled, this instruction additionally pops the topmost value off the shadowstack and checks whether it matches the addressused for thefunctionreturn. In case a mismatch is detected here, the CFI module issuesanexception.

• CFI_CALL label and CFI_JUMP label announceanindirect function call, respectivelyanindirect jump, forsafeguardingofforwardedges. To thatend, the label encodedasimm20 in the instructionisstoredwithin the CFI module. These instructionsmust be followedbyaCFI_CHECK label with thesame label.

• CFI_CHECK label is placed at the very first positionoffunctions,which are calledindirectly,and alsoatthe target of indirect jumps.This instructionchecks whether the label encodedwithin apreceding CFI_CALL or CFI_JUMP instruc- tion matches itsown label.Incase amismatchisdetectedhere, theCFI module issuesanexception. Theinstruction is usuallysuppressed and onlyenabled by

52 4.5. EXtensiveCFI EnforcementConcept (EXCEC)

preceding CFI_CALL and CFI_JUMP instructionsbecause forward edgeprotection is notrequiredfor direct callsand jumps.However, any CFI_CHECK with label 0x0 is considered aCFI violation.This allowsfor triggering explicit exceptions.

• CFI_SETJMP index is placed immediatelyafter a setjmp call.Depending on whether therewas apreceding CFI_LONGJMP,this instructioneitherstoresthe currentshadowstackpointer in EXCEC’s setjumpmemory,whereby index is usedasanindextothis memory in order to support multiple different setjmp calls, or restores this very pointertoapreviously stored value.

• CFI_LONGJMP setsaninternal flag suchthat asucceeding CFI_SETJMP call unwinds the shadowstacktoapreviously stored position.

Although the imm20 fieldofCFI_CALL, CFI_JUMP, CFI_CHECK and CFI_SETJMP instructionsallowslabelswith up to 20 bits, smallerlabelsare usedinternallyinorder to save memory.The selected values for such parametersare explained in Section4.4. The correctsequenceofCFI instructions, whichisdepicted in Figure4.5,isenforcedwith an FSMinthe CFI module. CFI_CALL and CFI_JUMP must be immediately followed by indirect branchinstructionsand CFI_CHECK.The CFI_CHECK itself canappear anywhere, but is suppressedunless it is enabled by apreceding CFI_CALL or CFI_JUMP or carries thelabel 0x0.This canbeused to explicitly trigger exceptions when reaching invalid positions like the end of trampolines.Moreover, the RET instruction,whichpops thetopmost addressfrom the shadowstack, is onlyvalid if the last JAL instruction had pushedthe very same value.

stack full stack full

JAL JALR CFI_CALL (push to stack) (push to stack)

CFI_CHECK invalid IDLE (compare label) flow

label RET mismatch CFI_JUMP (checkstack) JR

stack empty returnaddress mismatch invalid flow

Figure 4.5: Valid (CFI)instruction sequences in EXCEC

Instrumentation makes surethat CFI instructionsonlyappearatvalidpositions: CFI_- CALL and CFI_JUMP,for example, onlyoccurininfrontofJALR and JR instructions. Other sequences, that don’t comply with what is shown inFigure 4.5, are considered to be CFI violationsand implicitly trigger exceptions.

53 4. Implementation

Architecture &Interface Figure4.6 showshow EXCECisincludedintothe CV32E40P core’spipeline. All of the CFI enforcementfunctionalityispackedintoaseparate module. Theonly modifications to the original coreitself are extensionstothe Decoder in order to support the additional CFI instructions and processingofCFI exceptions in the pipeline Controller,whichacts upon CFI violationsand transfers control to an exception handler.

Figure 4.6: CV32E40P pipeline with EXCEC(modifiedfrom [pul])

An importantfeature of anyCFI module is separationand safeguardingofall CFI related registers and memory elementsinaway thatinstructions unrelated to CFI cannot modify theirstate.Only memory elementsimplementedwith FFsinternal to theEXCEC module are used instead of mainmemory.The parameters defined in Section4.4 are used for dimensioning theseselements. Table 4.3 describes theinterfaceofour EXCEC implementation.The instructions for forward and backward edge protection as well as for thesupport of setjmp callswere described above. Signalswith identicalnames correspond to theseinstructions. The cfi_enable_i, cfi_disable_i and cfi_reset_i signals are used for enabling CFI enforcementbefore entering main,respectivelydisablingCFI after returning from mainand forresetting all registersinthe EXCECmodule. This way, platform setup code can be excludedand program executionalwaysstarts with acleanCFI state. The outbound exception signals in Table 4.3 are routedtoPULPissimo’s pipeline controller and informabout the cause of some CFI violation. The invalid_branch_o signal is asserted whenaCFI_CHECK instructionresults in alabel mismatch. Thisisthe case whenthe label used with CFI_CHECK is differentfrom the label in apreceding CFI_CALL or CFI_JUMP instruction. Deviations fromthe expected sequence of CFI instructions, i.e.,adifference to theflow shown in Figure4.5, are indicated by the invalid_flow_o signal.Anexample thereofisthat indirectfunctioncalls,asannounced with CFI_CALL, must be followedbyJALR and CFI_CHECK.The invalid_ret_o signal shows whether

54 4.5. EXtensiveCFI EnforcementConcept (EXCEC)

Signal Dir. Type Description clk in logic clock signal rst_n in logic reset signal jump_done_i in logic forstalling pc_if_i in logic [31:0] PC in fetch stage ra_i in logic [31:0] return addressfor currentcall jal_i in logic currentinstruction is a JAL ret_i in logic currentinstruction is a RET cfi_call_i in logic upcoming instruction is a JALR cfi_jump_i in logic upcoming instruction is a JR cfi_check_i in logic checkfor indirect branches cfi_sj_i in logic setjmp call upcoming cfi_lj_i in logic longjmp call upcoming cfi_label_i in logic [19:0] label for forwardedgeCFI cfi_index_i in logic [19:0] indexfor setjmp cfi_enable_i in logic enableCFI cfi_disable_i in logic disableCFI cfi_reset_i in logic reset CFI invalid_branch_o out logicexceptionfor label mismatch invalid_flow_o out logicexceptionfor invalid instr. sequence invalid_ret_o out logicexceptionfor return addressmismatch stack_empty_o out logicexceptionfor empty shadowstack stack_full_o out logicexceptionfor fullshadowstack

Table 4.3: EXCEC interface the return addressonthe shadowstackmatches the one used within some function return. And finally, stack_empty_o and stack_full_o indicate an empty,respectivelyfull, shadowstack. Strictly speaking,the latterisnot aCFI errorbut rather an indicator that further checks won’t work becausenomore return addresses can be stored on theshadowstack. In some other approachesthis case is handled by moving theoldestpartsofthe shadow stacktomainmemory in order to free some space. The stack_full_o signal is also asserted in case one of therecursion counters overflow. Adedicated exceptionhandler,whichprints aCFI-related errormessage,resetsthe internalstate so that theCFI module is not left dirtyand then terminates program execution, is implementedfor CFI exceptions. The proposed scheme alsoconsiders interrupts,whichare particularly importantin embeddedsystemsbut mostlyneglected in other academicCFI proposals. The EXCEC moduleinthe CPUcore makes surethat allsequencesofCFI instructions, suchasCFI_-

55 4. Implementation

CALL and thesucceeding JALR and CFI_CHECK,are executed atomically by delaying interrupts if needbe. This ensures avalidstate in the CFI module and onlyadds interrupt latency in the worst case.Also,EXCEC is abletoenforce CFI during ISRs: All control flowtransfers executed there are protected in the same way as transfers occurring during normal program execution. This is possiblebecause CFI enforcementcan be kept activethe whole time sincenon-standard control flowtransfersare used to jump to the ISR, respectivelyback, and neitheraffectsthe CFI state. When an interrupt occurs, the CPU core implicitly sets the PC to the ISRand no return addressispushedtothe shadow stack. At the end of theISR adedicated mret instructionisused for returning, which alsoentailsnoshadowstack operations. EXCECisbasedonconceptsofvarious other CFI proposals. It uses ashadowstack similar to theone in HCFI [CCAI16]and FIXER [DBGJ19], but does not need dedicated CFI instructions for interfacing the shadowstack. The concept does notrequire instru- mentation of function returns and direct callsbecause existing RISC-V ISAinstructions can be usedtoannounce theseoperations.This implicit approachisalso usedinIntel’s CET [Int19]. Furthermore,EXCEC only stores areduced number of bits in orderto reduce memory demands. Forforward edge protection, thescheme uses HCFI’slabel approach, butextenditwith an advancedversion of thetrampolines used in HECFI [SAD+16]. Dedicated recursionhandling is implementedwith counters foreachshadow stackentry,whichisanextended variation of the single recursion counterusedinHAFIX [DHP+15]. Support for setjmp is implementedasinHCFI. We believethatEXCEC offersasoundand efficientprotection against CRAs. Basic attackexamples are immediatelydetected by thehardware module.Aquantitative evaluation of EXCEC in termsofits introduced overheads is presentedinChapter6 alongwith evaluations of other CFI implementations described in this thesis.

56 CHAPTER 5

Instrumentation

Most hardware-enforcedCFI concepts require custom instructions,meaningthatthe respectiveISA needs to be extended.Thesenew instructions must be added at specific positions in the code. Software-based approachesalso require additional instructions, onlyhere already existing ones are used. Thegeneral process of enriching programs with additional instructions is called instrumentation. Thischapter first describes the instrumentationprocedureand possiblemethods thereof in Section 5.1. Then some challengesare discussed in Section5.2 and aGCC plugin developedfor all instrumentation tasks of this thesisispresented in Section5.3.

5.1 Introduction

Inserting or changing instructionsisrequiredinseveral areas, which are not limited to instrumentationfor CFI. Other applications include emulation(translation of instructions between twoISAs), optimization (e.g.,for runtime patching to avoid down times) and observation (profiling and tracing)[WMUW19]. Inthe contextofemulationthe basic concept is alsocalled dynamic binarytranslation [Pro02]. Depending on theuse case and the available artifacts,instructionscan be added or modifiedduring compilation, i.e., based on source code,orbymanipulatingsomealready compiledbinary in the absence of itsassociated code. In principle,some sortofinstrumentation is required formost kindsofCFI concepts, regardlessofwhether they are software-orhardware-based. Forexample, the CFI schemes described in Chapter 3follow very different approaches: In [DHP+15]instrumentation is performed fully automated by acompilerplugin, but the concept onlyadds CFI protectiononbackward edges. [CCAI16]intercepts the compileratassemblylevel and instruments the code with aPythonscript before the linker step is executed. [SAD+16] usesanIDA Pro pluginfor automatically instrumenting backward edges. Forward edges

57 5. Instrumentation

require manual steps though.This last example is solely based on abinary and doesn’t require its source code. Using adisassemblerlikeIDA Pro forinjecting instructions is one approachofan overall instrumentationconcept called binaryrewriting.The main ideaistomodify semantics of aprogram by adding, changing or removing instructionswithout having the program’ssource code at hand. Also, recompilation of modifiedapplications is not necessary.Thereexist static and dynamicvariants:Staticrewriting describes theprocess of instrumentation in an offline waybefore actually executing an application.Dynamic rewriting changesthe instructions at runtime.The availabilityofdebug symbols or unstripped binaries is agreathelphere [WMUW19].

(a)static (b) dynamic

Figure 5.1: Stepsfor binaryrewriting [WMUW19]

Figure 5.1 shows therequired steps forbinary rewriting. The abstract principle is the same for static and dynamic cases.Atfirst, the binary needs to be parsed,whichinvolves translating the rawbinary stream to mnemonics. Thisisparticularly importantfor ISAs with dynamicinstructionlengths because therethe binary stream cannot be simply split into chunks of thesame size for determiningdistinctinstructions. Thegoal of thesecond phase is to analyze the instructions in order to recoverthe structure of the program, which includesidentifying its basicblocks, functions and the program’sCFG.Asathird phase,the program is transformed,meaning thatnew instructionsare added at specific locations or existing ones are altered.Transformation can be done in a minimal-invasive way by addingnew parts to theprogram and redirecting the controlflow there or by performingafull-translation,whichallows manipulationatany position but requires additional steps.The last phase generates code again based on thepreviously transformed program [WMUW19]. The parsing-and analysis phases, i.e., steps 1and 2fromFigure 5.1a, are not explicitly required when performingsource code-based instrumentation during thecompilation processbecause theoutputs of theanalysis phase (basic blocks, function boundariesand CFGinformation) arealready implicitly available in thecompiler. Transformation and codegeneration are of course needed anyways.

58 5.1. Introduction

Techniques, whichdon’thave access to sourcecodeordebug symbols, are not always abletorecover allkindsofcontrol flowtransfers [BCN+17]. Also, morebasictasks like distinguishing code from in-lined dataand detecting function boundaries are hard problems,whichneed to be considered [WMUW19]. On the other hand, concepts depending on sourcecodeordebug symbolshavealimited scopeofapplicationbecause theseartifacts are not alwaysavailable[GABP14]. Furthermore, it is notpossibletoextract all symbol namesinstripped binaries. This limitationbecomes relevantwhen trying to matchaCFG built fromfunctionnamesand labels with aCFG based on basic blocks generated from such abinary. Atypical use casefor instrumentation withoutsymbolinformationisreplacing certain instructionswith other instructionsofthe same length. Offsetsand addressesdon’t have be adapted in this case [WMUW19].

Positions forInstrumentation Depending on therespectiveCFI concept, programs have to be instrumented with additional instructionsatsome very specific positions. These are independent of the technique usedfor instrumentation.The exact positions also dependonthe ISA, because, for instance,the (non-) existence of branch delay slotscan determinewhether instructions needtobeplaced right beforeorafter certainpositionsinthe code. Thefollowing enumeration containsrelevantpositions usedinmanyexistingCFI schemes:

• At function entries,still before anyother instruction. Recursively called functions need specialinstrumentationinsome approaches. • At theend of functions.Insome casesitcan alsobeusefultowrite instructions after the last instructionofsome function. Those are neverexecuted within the normal control flow, but constitute apractical position for placing additional code like trampolines. • At returnstatements.This is the commonplace forchecking whetherthe return addressisinaccordance to theCFG and/or theshadowstack. Note that returns can occurmore than once within afunction. • At direct function calls.Usually this is usedfor shadow stackoperations forbackward edge protection. Aspecial formare recursivefunctioncallsbecause recursion canpotentially overflowa(shadow)stackand might introduceuncertainty in the control flow. Someconceptstherefore support special annotation of recursive calls, for instanceinorder to keep trackofrecursion-depth. • At indirect function calls.Inadditiontothe steps required forbackward edge protection, alsoinstructionsfor forward edges can be inserted here. Some approachesalsoadd CFI instructions right after afunctioncall at theaddress the callee returns to as an alternative approach for checkingbackward edges.

59 5. Instrumentation

• At indirect jumps,similarly to indirectfunctioncalls, but without anyoperations related to backward edges becausejumps are unidirectional. Note that switch statementsinCare sometimes translated to indirect jumps by the compiler. • At targets of indirect jumps,e.g. labels in afunction, forchecking function-local forward edges similarlytoCFI at function entries.Ifthe compilergenerates indirect jumps for switches, their case statements are also consideredtargets. • At setjmp/longjmp calls, which leadtocontrol flowtransfers outside of the strict call-ret pair concept of usual function calls. Thisproblemwas described in Section3.2. • Before thecall of main and after itsreturn.These positions are usedtoenable CFI enforcement, respectivelydisable it again. In addition,CFI canbereset before entering main suchthatevery program execution startsfrom acleanCFI state. •Atexit calls, similarlytowhathappens when returning from main.

5.2 Challenges forCFI Instrumentation

The difficultyofinstrumentation dependsonthenature of therespectiveCFI concept. In principle,the backward edge protection approachesdescribed in Section 3.1 are easier to implement becausethere are no dependencies between sourceand target of control flowtransfers.Consider,for example, thebackward edge CFI instrumentation used for HCFI[CCAI16]shown in Listing 5.1 and Listing 5.2. Theinjected instructions are notbound to certainfunctioncalls or returnsinany way because they don’t carry any parameters.Notethat the CFI instructions in Listing 5.1 and Listing 5.2 are placed after the branchinstructionsthey belong to. This is possiblebecause theISA usedfor the original evaluation of HCFI supportsbranchdelay slots.

1 : 1 : 2 ... 2 ... 3 call 3 ... 4 SETPC 4 ret 5 ... 5 CHECKPC Listing (5.1) HCFI: Instrumentation of a Listing (5.2) HCFI: Instrumentation of a direct functioncall directly called function

Forward edge protection with labelsonthe other hand, as described in Section3.1.3,has dependencies:Suchlabel checks requireinstrumentation of caller and calleewiththe same label.Listing5.3 and Listing 5.4showinjected CFI instructions forHCFI, only nowfor an indirect functioncall.Inthis case, labels specifictothe caller and thecallee are encoded withthe respectiveinstructions. However, there aremore issues to consider apart from the position of instructions. Some other challenges are listed in the followingparagraphs:

60 5.2. Challenges for CFI Instrumentation

1 : 1 : 2 ... 2 CHECKLABEL 0x13 3 call $s2 3 ... 4 SETPCLABEL 0x13 4 ret 5 ... 5 CHECKPC Listing (5.3) HCFI: Instrumentation of Listing (5.4) HCFI: Instrumentation of indirect function call indirectly called function

Availability of artifacts enables or limitsthe various approaches for instrumentation and alsoimpacts the possibleCFI precision. If no source code is available, compiler-based approachesare not viable at all.Onthe other hand, instrumentation difficultyinversely correlateswith the degree of available information: Accesstosymbols, debug information or even sourcecodeallows forfiner analysis and simplifies instrumentation.

Ideally, every CU is available as sourcecodeand all CUs are compiled and linked at the same time. Unfortunately,manylibrariesare onlylinked to applications statically or dynamically but compiled earlier.

Shared libraries can be instrumented at theirown compile time, during alinker phase with LTOorbyperforming static or dynamicbinary rewriting.For someCFI concepts the former approachintroduceseitheravery coarseCFG or alot of overhead in termsof unnecessary instructions because therequiredcall information from theprograms which use the library is not availableyet.

Consider,for example, thedemandfor instrumenting programs with label checks during compilation (i.e.,without binary rewriting)with static linking.When caller and callee are to be found in CUs, whichare compiledatdifferent points in time, fine-grained CFI is usually much harder.Insuchcases, the resulting CFGcan loose itsprecision because the same label has to be usedinmanyplaces. One possiblesolution for this caseis LTO, whichallows to perform instrumentationfor all CUs previously compiled with LTO enabled during linking.

Extraction of CFGinformation,which is required forinstrumentation,isacomplex problem. While function returns and direct jumpsand calls canbeeasily translated to theirrespectiveedges in aCFG,finding the targets of theirindirectvariants is way morecomplex. Afrequently usedmethodfor finding possible indirectcall targetsis matching the type of function pointers with the signatures of functions, whoseaddresses are assignedtofunctionpointers anywhereinthe program. This introduces avery highfalse-positive rate though[LH19]. When neither source code nor anysymbols are available, even extractingthe CFGfor branches is not trivial [WMUW19].

61 5. Instrumentation

5.3GCC Plugin for CFI Instrumentation

All of thehardware-basedCFI schemes implemented for this thesisrequire instrumenta- tion. Given the extent of benchmark applications used for evaluation,itisnot feasibleto manually add therequiredinstructionsfor eachconcept.Anautomated and scalable methodfor injectingCFI instructions is required.Wetherefore performall instrumen- tationtasksfor this thesisinacompiler-based way. Our toolchain, as described in Section4.1, comes with GCC -whichcan be extended with aplugin. This requires the availabilityofsource code for all CUs where CFI enforcement is needed, so the approach is notapplicablefor commercial-off-the-shelf (COTS)applications. However, we believethatthis decisionislegit within the contextofthis thesis fortwo reasons. First, instrumentation is averyimportantand necessarytooltowards CFI enforcement but no focus in thestrict senseofthe topic of our work. The interested reader is referred to further literatureonotherinstrumentation conceptslikethe extensive surveyofbinary rewriting techniques by Wenzl et al. in [WMUW19]. The second reason is that compiler-based code injection allowsfor very fine-grained CFI enforcementand does not suffer from the same problems as the analysis phasesofbinary rewriting,which were explainedbeforeinSection 5.1and Section5.2.

GCC Compilation Phases GCC offers alarge subsetifits internalApplication Programming Interface (API) to plugins and therebyallows to perform analysis, optimization and even modification tasks [StGDC20]. Thelatterisrequiredfor CFI enforcement,whereadditional instructions needtobeinjected at specificpositions into someprogram. When compiling an application, GCC runs throughmultiple phasescalled passes.At the very beginning the Parsing pass is executed,whichreads and parses the input filesand transformsthem to afirst intermediate internalrepresentation.Then,the Gimplification pass is invoked, whichfurther converts to another representation called GIMPLE.Then multiple Inter-Procedural Optimization (IPA) passes followand perform transformations across function boundaries, such as constant propagation, func- tion inlining and symbol cleanup. After that, Tree Static Single Assignment(SSA) passes perform local optimizations like removalofuselessstatements, tail call elimination and loop unrolling. Finally Register Transfer Level (RTL) passes are executed, which perform more optimizations and generatecodeatRTL [StGDC20].

Plugin Registration &Execution GCC pluginsmust be registered forthe desired phase in whichthey are to be executed. Our plugin wasimplementedasaRTL passbecause GCC’s internalrepresentation of the code at this pointisalready close to the finalassemblyoutput. Thebasiccoderequired for registering aGCC pluginisshown in Listing5.5, where GCC_PLUGIN is theclass of the actual pluginand must be derived from rtl_opt_pass.GCC runs through all of

62 5.3. GCC Plugin for CFI Instrumentation

thesepasses separatelyfor eachCUand onlylinks the resultingobjectfiles together at the very end.

1 static struct plugin_info info ={"1.0", "CFI instrumentation}; 2 3 int plugin_init(struct plugin_name_args *plugin_info, 4 struct plugin_gcc_version *version) { 5 register_callback(plugin_info->base_name, PLUGIN_INFO, NULL, &info); 6 7 GCC_PLUGIN *gcc_plugin =new GCC_PLUGIN(context); 8 struct register_pass_infopass_info; 9 pass_info.pass=gcc_plugin; 10 pass_info.reference_pass_name ="*free_cfg"; 11 pass_info.ref_pass_instance_number =1; 12 pass_info.pos_op =PASS_POS_INSERT_AFTER; 13 14 register_callback(plugin_info->base_name, PLUGIN_PASS_MANAGER_SETUP, 15 NULL, &pass_info); 16 17 return 0; 18 } Listing 5.5: Exemplary registration of aGCC plugin

When the respectiveoptimization passisexecuted,GCC callsthe execute function of registered plugins for eachfunctionofthe currentCU. execute contains theactual plugin functionalityand allowstoiterate all basic blocks andinstructions of thecurrently optimized function. Listing 5.6shows asimple example of execute where first the names of thecurrent function and thefile in whichitresidesare extracted. Then the code iterates allinstructions of all basic blocks with the FOR_BB_INSNS and FOR_- EACH_BB_FN macros and prints debug representations.

1 GCC_PLUGIN::GCC_PLUGIN(gcc::context *ctxt) 2 :rtl_opt_pass(cfi_plugin_pass_data, ctxt) {... } 3 4 unsignedint GCC_PLUGIN::execute(function * fun) { 5 char *function_name =(char*)IDENTIFIER_POINTER(DECL_NAME ( current_function_decl) ); 6 char *file_name =(char*)DECL_SOURCE_FILE (current_function_decl); 7 basic_block bb; 8 bool recursiveFunction =false; 9 10 FOR_EACH_BB_FN(bb, cfun){ 11 rtx_insn* insn; 12 FOR_BB_INSNS (bb, insn) { 13 debug_rtx (insn); 14 } 15 } 16 17 return 0; 18 } Listing 5.6: Exemplary execute functionfor aGCC plugin

63 5. Instrumentation

Insertion of Instructions

Instructions are objects of theclass rtx_insn,that represents nodes in GCC’sinternal code tree in general.All of thespecific positions described before in Section5.1 can be identified using fields of the rtx_insn instancesand the macrosprovidedbyGCC. Suchinstancesare typicallynested and contain, for example, sub-nodesfor theregisters required for their respectivefunctionality. Thetypes of each node can be checked. It is possibletoidentify various differentinstruction typeslikefunctioncalls and returns,but alsomore advancednodes likejumptables generated for switchstatements.Listing 5.7 shows an examplefor howcommon return instructionsare composed from rtx_insn nodes duringthe RTLphase of GCC. Here the outer node also containsasub-node for the register ra,whichcontains thereturn address.

1 (jump_insn 466 465 469 (parallel[ 2 (simple_return) 3 (use (reg:SI 1ra)) 4 ]) "kernel/init.c":184974 {simple_return_internal} 5 (nil) 6 -> simple_return) Listing 5.7: Debug representation of areturnstatement

Once the suitable positions are identified,new instructionscan be injected. Forsimplicity assembly instructions are added in string format and not built from distinct rtx_insn nodes.Support for thelatterwouldrequire deeper changestoGCC itself. When adding instructionsonlyasstrings, it is sufficient to define theirformats, as showninSection 4.2. GCCknowsnothingabout the semantics of such instructions.

Assembly instructions inlined in Ccodeare alsorepresented as string nodes, similarly to the instructions injected duringinstrumentation.This is aproblemfor our detection approachthoughbecause anyofthe relevant positions identifiedwith rtx_insn node typescan alsooccur as an inline assembly node, thatcannot be identified in the same way.Itwould be possibletoparsesuchstring-formatinstructions and manually extract the required information.However, thiswas notimplementedfor this thesis.

It is possibletoperformcertainGCC passes in aunifiedway for all statically linked CUs by using LinkTime Optimization(LTO) instead of executing theplugin separately for eachCU. LTOmoves some optimization passes to thelinker phase,whichallows forjoint optimizationsofthe whole program instead of isolatedones forevery CU.This feature canbeenabled for GCC by compiling and linkingwiththe -flto option.

By using LTOitisalso possibletoinstrument libraries,thatare notcompiledatthe same time as theuser codeofsomeprogram butonly linked to the application. LTO must alsobeenabled forall shared libraries where instrumentationisrequired. Thisis particularly relevant for commonlibraries like libgcc and runtime libraries suchasthe ones of the PULP SDKinour setup.

64 5.3. GCC Plugin for CFI Instrumentation

Limitations Unfortunately,itwas not possibletocompile libgcc with LTO. Our GCC plugin therefore onlyinstrumentsthe usercodeofapplications and all functions of thePULPSDK. Exclusions are added to theplugin so thatcallsoflibgcc functionsare specifically handled. Otherwise calls of suchfunctions would push their return addresstothe shadowstack, but this addressisnever popped becauseofthe missing instrumentationoffunction returnsinlibgcc.However, NOP instructions are used instead whereverpossiblein ordertoachievethe same overhead of run time and code size. In principle,the GCC pluginalso works with C++ code. Indirect function calls usedfor vtable implementations, which are avery importantconceptofobject-oriented program- ming in C++,are detected and instrumented as intended.However, the PULP SDK only supportedcompilation of Ccodeatthe time of writing this thesis,soinstrumentation tests were onlypossiblewith the normal RISC-V GNU Toolchain1,i.e., notwith its PULP fork,withoutactuallyexecuting the resulting applications with theCFI hardware. Instrumentation of most indirect control flowtransfersrequires explicit CFGinformation. Only the targets of function returns (which are, strictly speaking, indirect) and indirect jumps resulting fromswitchstatements canbeautomatically determined. We use manually created CFGs and match node nameswith function namesextractedfrom the rtx_insn in our approach. The following section describes howthis information is generated and howthe graph is represented.

5.3.1 CFGInformation Fine-grained CFI instrumentationrequiresCFG information in order to perform the instrumentationwith necessary CFI instructions. Alarge partofthe required information is providedimplicitlybythe sourcecode, specifically thetargetsofdirectjumps and calls but alsothe targets of function returns.However,informationregardingindirectcalls and some jumpsneedstobeprovidedand cannotbeeasily extracted. Compilingthe setofpossibletargetsfor indirect branches is ahard problemand involves some imprecision on theCFG in terms of over-approximation of allowedtargets in most cases[dCV17, BCN+17]. Forthis thesis,CFG generation for indirect control flow transfers is considered outofscope. Manually created graphs for all of theapplications usedfor evaluation (see Section 6.2 for alist) are provided. Theresulting CFGs are probably more precise thanany automaticallygenerated onescan be.However, extraction of CFGinformation is adistinct and very extensivefieldofresearchonits own. We believe it is reasonable to put our focusonCFI itself,including the accompanying hardware- and software implementations. Towards finding ways for encoding the CFGinsome program, we first looked at ways forenriching sourcecodedirectly with CFGinformation. Unfortunately,Cprovidesno suitableannotation mechanism like,for example,Javadoes. Attributes suchasnoline

1https://github.com/riscv/riscv-gnu-toolchain

65 5. Instrumentation

are remotelyrelated to annotations and can carry parameters,but cannotbeplacedat arbitrarypositions[Att]. Anotheridea wastouse comments of awell-define format,but such arestrippedbythe pre-processor and canalso notbeused. These reasonsleft us with using aconfiguration filefor holding CFGinformationof indirect control flowtransfers foreachprogram. Aconvenient sideeffect of this approach is thatthe sourcecodeitself does notneedtobechangedinany way. The configuration file is completelyexternal, does not harm source code semantics or syntax and couldalso be dynamically selected dependentonthe actual CFI implementation. An exemplaryconfiguration filefor theCoreMarkbenchmark2,whichuses an indirect function call for dynamicallyselecting acomparator function, is showninListing5.8. The filefirst lists functions, that are potentially called in an indirectway and assigns labels to them. These labelsare used forforward edge protection on the calleeside in some CFI enforcementconcepts.

1 #indirectly calledfunctions 2 core_list_join.c cmp_idx label:13 3 core_list_join.c cmp_complex label:13 4 5 #indirect function calls 6 core_list_join.c core_list_mergesort line:551 label:13: core_list_join.c: cmp_idxcore_list_join.c:cmp_complex Listing 5.8: Example foraCFGconfiguration file

The second section in Listing 5.8 contains alist of actual indirect function calls, including theirposition in thesource code (file name, function name and line number), the label providedfor forward edgeprotection on callerside andalist of allowed call targets.The same principle isused forexplicit indirectjumps too.

2https://github.com/eembc/coremark

66 CHAPTER 6

Evaluation

We implemented FIXER[DBGJ19], HAFIX [DHP+15], HCFI [CCAI16], HECFI[SAD+16], IntelCET [Gar20]and ournovelCFI schemecalled EXCEC presentedinSection 4.5 on acommon hardwareplatform and executed the same set of benchmark applications for allofthese approaches in order to get comparablebenchmark results. Here we presenta wide range of evaluationsand showthat EXCECoutperforms existing academicschemes in termsofperformanceand code size. Previous works in this field, suchasthe survey of hardware-based CFI by De Clercq and Verbauwhedein[dCV17], list various indicators likeperformance-and hardware overhead, if available at all, as stated by the authorsofthe respectiveproposals. Thoseare based on differentplatformsand are benchmarked withdifferentsetsofapplications though, which makes adirectcomparison not very meaningful. The interested readerisreferred to the original papers as well as to [dCV17]for detaileddescriptionsofimplementedCFI concepts and aqualitativesecuritycomparison of respectiveproposals. Thischapterfirst describes the methodology for all measurements in Section 6.1 and lists the benchmark applications used for evaluation in Section 6.2. Comparisons and benchmark resultsregardingsecurity, performance,codesize and hardware utilization are discussed in Section 6.3, Section 6.4, Section 6.5 and Section 6.6, respectively. Eventually afinal comparison with asummary of thesedifferentaspectsispresented in Section 6.7.

6.1Methodology

The 32-bitPULPissimo1 SoC withits CV32E40P2 CPU corewas used forall implemen- tations.The platform is open-source and fully synthesizable, which enables convenient

1 https://pulp-platform.org 2https://github.com/openhwgroup/cv32e40p

67 6. Evaluation

modificationsofthe core. This microcontroller architecture wasdescribedbeforeinSec- tion 4.1 in moredetail. Platform-specific differencesbetween our setup and the platforms usedfor evaluation by the authorsofthe original CFI proposals need to be considered wheninterpretingbenchmarkresults. We thereforeemphasizethat the rankingspresented hereare onlyvalidinour setup and do not necessarilyrepresentcriticism towardsthe original CFI implementations.

Benchmark Problems &Workarounds The performance and code size evaluations in Section 6.4 and Section 6.5 are based on the set of benchmark applications listed in Section 6.2. The slre and towers applications are excludedfrom all aggregatedperformanceindicators becauseofincompatibilities causedbyrecursive calls with two of the CFI implementations. As described in Section 5.3, we were not able to instrument code in libgcc.Unfortunately, this limitationrequires certain exclusions fromCFI instrumentation. ForCFI concepts, which needexplicit instructions forbackward edge protection, function calls are not instrumented with CFI instructions. This is because it is not possibletoinstrument the respectivebackwardedges too, whichwould result in CFI violations becauseofshadow stackinconsistencies. However, NOPs are added in order to achievethe same number of injected instructions so that these exclusions neither influence performance overhead norcodesize. Themissing possibility to instrumentfunctions in libgcc alsoposes a problemfor indirectforwardedges forIntel CET because with this approachthe CFI hardware automatically enters astate where an ENDBRANCH instruction is expected upon every JALR and JR.Itisnot possibletoinject these ENDBRANCH instructionsintargets residing in libgcc though. As aworkaroundCFI enforcementisdisabledingeneral for IntelCET when collecting benchmark results. Thisisallows to get meaningful run-time results at least.

Applied Optimizations Allapplications arecompiledusing the PULP toolchainwithits GCC7.1.1, thePULP SDKand the compiler options -O2 -g0 in order to enablecommon optimizations and disableoutput of anydebuginformation. However, call-tail eliminationisdisabledwith the GCC option -fno-optimize-sibling-calls because of itsincompatibilitywith CFI. Thisproblemwas described in Section3.2. Most applications are onlylessthan 0.001% slowerwhennot usingcall-taileliminationand some are not affected at all. There are certain casesthoughwith clearlynoticeableeffects: Forasimple factorial implementationthe slowdownisalmost 3000% because callsand returnsare no longer optimized away.This is, however, independentofany CFI enforcementstrategy.

Instruction Alignment Instrumentation influencesthe time required forinstructionfetching frommemory. This is because anyinstrumentation,not necessarily forCFI reasons,changes thealignment

68 6.2. Benchmarks of symbolsinthe .text section of someprogram, whichaffectsand sometimeseven improvesperformance.When comparing run-time results of an application with and withoutCFI enforcement, alarge shareofthe differences, whichcannot be assigned to additional CFI instructions,stems frommemory effects suchasnon-idealfetches from instruction memory.Theseeffectsare notrelatedtothe nature of CFI instructions themselves but canalsobecausedbyadding or removing arbitraryother instructions. RISC-V offers dedicated Control and Status Registers (CSRs)for getting both the current number of processedinstructions and the cyclespassed during execution. We believe thatcounting instructionsinstead of cycles resultsinmore meaningfulstatistics sinceit clearsthese memoryeffectsfromour data.

Effect of ISA Variations Compilingwith the PULP ISAextension affectsrun-time performance of programs in most cases in apositive way.Thereare, however, alsocertaincases where execution of programs compiled with the extension is actuallyslower. These effectsare notrelatedto CFI. Thereforethe average number of executedinstructions forruns with and without PULP extensions are used throughoutthe evaluation, unless statedotherwise. The unmodifiedPULPissimo SoC and program executions without anyCFI enforcement actsasabaseline for all comparisons.

6.2Benchmarks

Ameaningful set of benchmark applications is required fortesting,comparing and evaluating hardware-based CFI implementations. Throughout theevaluation partofthis work CoreMark3,most applications from theUCBerkeley RISC-V Benchmarks4,all programs of theEmbench-IoT Benchmarksuite5,some of thesmallerapplications in the MiBench Benchmarksuite6 as well as threesmallcustomrecursiveapplications are used. We believethis setofbenchmarks constitutesarepresentativemix of applications for embeddedsystems. Given our single-threadedenvironment, the multi-threaded part of theUCBerkeley RISC-V Benchmarks is not used.Furthermore,manyofthe applications of theMiBench suite containcodeorinputdatatoo large for oursetup (see Section 4.1) or requirefile IO, which is not possibleonour platform without major modificationsofthe respective programs. Only asubsetofthis suite is used forthis reason. Table 6.1 lists all benchmark applications together with theirsource and indicators for whether therespectiveapplicationuses indirectfunctioncalls, indirectjumps and/or recursion. Note that theapplicationsmightstill containindirectcontrol flowtransfers in

3https://www.eembc.org/coremark 4https://github.com/ucb-bar/riscv-benchmarks 5https://github.com/embench/embench-iot 6http://vhosts.eecs.umich.edu/mibench

69 6. Evaluation

shared library code.The table alsoshows whichofthe four academichardware-basedCFI proposals described in Section 3.3 used thebenchmarks for evaluation in theiroriginal evaluation. Most of the CFIschemeswereevaluated withother applications, which are notinclude in this work. It becomes apparent that coremark and are frequently used by academia. All of theapplications listed in Table 6.1 work with our implementationsofthe four academicproposals, with two exceptions.The slre benchmark resultsinfalse-positive CFI errors for HAFIX because it contains nested recursivecalls, which arenot supported as alsostatedbythe authorsofthe original proposal in [DHP+15]. Similarly, towers results in wrongCFI errors with HCFI, alsobecause of limitationsregardingrecursion. However, we are not entirelysure whether this is ashortcoming of theoriginalHCFI concept [CCAI16]oranimplementation error of ours. In order to enableafair comparison, slre and towers are not included in aggregated run-time performance results. Our implementation of IntelCET [Int19], which is the onlycommerciallyavailable concept comparable to theacademicproposalsdescribed in Section3.3, can onlybeexecuted with disabledCFI enforcementfor applications containing indirect control flowtransfers in libgcc functions. Thisisbecause CFI enforcementonforward edgesisimplicit, but the calleescannot be instrumented for reasons described in Section5.3.This does,however, not influencethe run-time or code size of applications executedwith our implementation of IntelCET.

6.2.1 Observations We analyzed our set of benchmarks in order to get afeelingfor suitablevaluesfor the common parameters listed in Section 4.4. Thissection gives insights to our observations. Figure 6.1 shows the number of functions in each of thebenchmarks listed in Table6.1 whenbeing compiled with thePULP ISA extension and GCC’soptimizationoption -O2. Only basicmath, cubic, dhrystone, fft, susan and wikisort containmore than 100 functions.Nonefalls below80orexceeds 125 differentfunctions.Concepts likethe active-set require this information in ordertofind aworkingand efficientsize forthe set.

75 80 85 90 95 100 105 110 115 120 125 130 135 140

Figure 6.1: Numbers of functions in our benchmark applications

Figure 6.2 depictsthe share of differenttypes of control flowtransfersinour applications. The numbers arerounded to the first decimal place. Thevast majorityofall control flow transfers are performed in adirect way,i.e., with statically known targets. From our set of benchmarks onlyfive containindirectfunctioncalls in their usercode at all,namely coremark, bitcount, picojpeg, sglib and wikisort. User code is the portionofcode, whichisspecific to some applicationand does notbelong to a

70 6.2. Benchmarks

Indirect transfers Usedinoriginal evaluation

I FIX FI CF Benchmark Calls Jumps Recursion FIXER HA HC HE coremark aha-mont64 crc32 cubic edn huffbench matmult-int minver nbody oT nettle-aes hI

nc nettle-sha256 be nsichneu

Em picojpeg qrduino sglib slre* st statemate ud wikisort dhrystone median multiply qsort rsort RISC-V sort spmv UCB towers* vvadd basicmatch bitcount h dijkstra enc fft qsort† MiB stringsearch susan† factorial

tom nqueens cus

* excluded from aggregated resultsdue to missing support † reduced problem set

Table 6.1: Listofused benchmark applications

71 6. Evaluation

34.7 Function returns Directcalls Indirect calls Directjumps 13.7 1.4 Indirect jumps 1.3 48.9

Figure 6.2: Relativeshares of control flowtransfers typesinbenchmarkapplications

shared library.Inthe coremark benchmark 15% of all function callsare indirect ones, respectively0.1%inpicojpeg,26% in wikisort and even about 61% in bitcount. The sglib benchmark,however, does not execute itsindirectcallsbecause of conditional branchdecisions at runtime.None of theapplication contains morethan10functions, which are (also) called indirectly.

The picojpeg and qrduino benchmarkscontainswitchstatementsinthe timedsection of theircode, which are translated to indirect jumps by the compiler. These hardly reflect on run-time performance at all though.

In total, less than 3% of all control flowtransfers in theprogram (usercodeand libraries combined)are indirect function calls or indirect jumps. Thisinformationhelps, forexample,when decidingfor ameaningfullabel widthfor forward edge protection. Sometimes it is alsoargued thatprotecting indirect function callsand jumps is less relevant in hardware because theshare of these operations is so small[DHP+15].

The depthofnested function callsdoesnot exceed 15 for most applications. Only dedicated recursivebenchmarks like tak and factorial stackfunctioncallshigher. Thoseare custom examplesofoursthoughand notpart of anybenchmark suite. This information about call-and recursion depthisrelevantfor dimensioning shadowstacks and optional elementsfor recursion handling. In general,itcan be saidthatrecursion occurs on multiple benchmark applications,but notinanextensiveway.

6.2.2 Changestothe Benchmarks The benchmark applications listed in Table 6.1 requirecertainminormodifications in order to be executableonour developmentplatform. To the best of our knowledge, none of themodifications presentedherechanges application behaviorinarelevantway.We

72 6.3. Security

first state modifications, which were implemented for all applications, and then list those for specific benchmarks. The timing of all benchmark programs is unified for ideal comparability.RISC-V CSRs are used for counting the number of instructionsbetween startand end of eachbenchmark body execution.The relevant CSR is reset at the beginning of eachrun. Setupand verification code is excludedfrom timingand outputsofdebug information or benchmark results areonlyprinted outside of the benchmark body. Certainincludes and constants of thePULP SDKare required for all applications. The includedheaders link platform specific code like acustomversionofprintf,which outputs to the UniversalAsynchronous Receiver Transmitter (UART) interface of the FPGA board. Some applications requiremore memory on thestackthanthe 2KiB defined in the original PULP SDK. In order to enableexecutionofmore benchmarks, thestacksize was increasedto18 KiB.Two programs of the MiBenchsuite, namely qsort and susan, were still not executable with this stack size.The problemset of thesetwo applications wasdecreased so that they are runable on our platform. Calls to unavailable functions of theCstandard library were replaced with custom versions thereofbecause thePULP platformcompileswith -nostdlib perdefault. The rsort benchmark application, which is part of UC Berkeley RISC-V Benchmarks, requires the __sync_fetch_and_add built-in function foratomic memory access.This function is not available on our platform.All calls were replaced with acustomsoftware implementation,thereby losing atomicity. Thisdoesnot affect the intendedsemantics since our platformissingle-threaded anyways.

6.3Security

Basicsecuritytests were performed in ordertocheck the effectiveness of our respective CFI implementations. Adetailed qualitativesecurityevaluation is no focus of this thesis. We instead refertootherworks like [dCV17], which extensivelyanalyzesecurityconcepts. Exemplaryapplications containing simple memory corruption attacks that manipulate the return address or thetargetaddress of some indirect call or jump are used in order to confirm whetherthe resulting deviationsfrom the expected control floware detected by our CFI modules. All of thesemanipulationswould enableCRAs. In addition to simple custom Return-OrientedProgramming (ROP) and Jump-OrientedProgramming (JOP) examples, some tests of theRuntime Intrusion Prevention Evaluator (RIPE)7 suite were used. RIPE is aRISC-Vport of theconsiderable testbed formemory corruption vulnerabilitiespresented in [WNY+11]. Apossiblemetric for asecurity evaluation and comparison of multiple CFI concepts is the Average Indirect target Reduction (AIR) proposed in [ZS13], which expresses the

7https://github.com/draperlaboratory/hope-RIPE

73 6. Evaluation

reductionofpossibleindirectbranchtargets. However, there is criticism about this approach. It is sometimesconsidered not particularly meaningfulbecausemost CFI approachesreachsimilar numberswhen being evaluated with the AIR metric (usually > 99% reduction of targets) and theremaining unprotected branches mightstill offer enough targets to launch an attack[BCN+17]. While aCPU core withoutany enforcementmechanisms in place acceptsamanipulated return addressand simply continues executionthere,all of theCFI concepts implemented for this thesis are abletodetectthe attackand throwacorrespondingexception. Forward edge protection is implementedinall approaches butHAFIX, where theauthors argue that software-sideforwardedge checks are sufficientlyefficient[DHP+15]. Only HECFI, IntelCET and our EXCEC scheme support CFI enforcementonindirectjumps. Only CET and EXCEC support interruptsand also enforce CFI duringISRs. Table 6.2 brieflysummarizes thescopeofsecurity features of theCFI implementations. Thisfunctional scopemust be considered wheninterpreting the followingevaluation results forperformance,codesize and hardware utilization because thesenumbers on theirown might giveandistorted view.

FIXER HAFIX HCFI HECFI Intel CET EXCEC Function returns Indirect calls Indirect jumps Fine-grained CFG Protected interrupts

Table 6.2: Overviewofthe supported CFI protection scope

6.4Performance

One of themost important parameters whencomparingCPUs is runtime,i.e. the time required forexecuting some application. Thissection examines theeffectsofCFI enforcementonthe runtime of our benchmark applications. Thebasisfor all evaluations presented here is the number of instructions executed between start and endofthe benchmark code.All run-time numbersrepresentthe average valuesofexecutions with and without thePULP extension. Detailed and not aggregatedrun-time overheads of all CFI variants and allapplications are shown in the appendix in Table A.1.

Overview Figure 6.3 presents an overviewofthe introduced performance overhead. The chart shows theaveragerelative increase forthe benchmark applications listed in Section 6.2 in terms of additional instructionsexecuted as compare to abaselinewithout anyCFI enforcementinplace.

74 6.4. Performance

1.96 [%] 2 1.53 1.37 1.47 rhead ve 1 eo

ativ 0.10 0.16 Rel 0 FIXERHAFIX HCFIHECFI CET EXCEC

Figure6.3: Comparisonofperformanceoverhead introduced by CFIschemes

Twoobservations stand out in Figure6.3.The comparatively largerun-time overhead of HAFIX is due to its higher number of instructions required forbackward edge protection. As shown in Section 3.3.2, HAFIX requires CFI instructions atthe very beginning and very end of every function. Furthermore, the concept needs one additional instruction for eachfunction call. This sumsuptoatleastthree instructions forone direct call, whichis morethan anyother CFI concept requires. The particularly lowoverheads of Intel CET and EXCEC on theotherhand can be explainedbythe fact thatthesetwo approaches don’t require additional instructionsfor backward edge CFI. 100 %oftheirintroduced overhead resultsfrom forwardedge protectiononindirectcontrolflow transfers.FIXER, HCFI andHECFIall use shadowstacks forbackward edge protection and need the same number of instructionstointerface with the stack. Therelativelysmalldifferences between thesethree implementationsstem fromforward edge CFI, whereFIXER only requires oneadditional instruction percall, HCFI twoand HECFI three (or even more, in case trampolinesare used).

Comparison for Selected Applications The performance impact of our implementationsdepends on thenumberofadditional instructions required forCFI enforcement. Thisvalue is aproductofthe number of relevant forward and backward edges presentinthe executionflow and thenumberof CFI instructions needed to instrument eachofthem. Forexample, the edn application does notcontainany relevantcontrol flowtransfers in the timedsection of itscode, so naturally noneofthe CFI variantsintroduces anyrun-time overhead. Figure 6.4 shows therun-time increase for some other specific benchmarks,namely basicmath, coremark, drhystone, factorial and bitcount,wheredeviations between the various CFI implementations,but also between different applications, are significant. The basicmath benchmark contains no indirectfunctioncalls and onlysmall number of direct ones,leadingtoaminimal run-time overhead of 0.28% for most CFI imple- mentations. An above-average number of direct callsisexecuted in dhrystone on the other hand. HAFIX is significantlyslowerhere because itsinstrumentation concept adds instructions notonly directly at function calls, but alsotoevery function entry and exit.This leads to significantly moreexecuted instructions in some cases,depending on

75 6. Evaluation 22 . 31

[%] 20 14 . 88 14 . 88 14 . 88 rhead ve 10 . 75 eo

10 8 . 93 8 . 93 8 . 12 ativ 5 . 95 5 . 17 5 . 17 5 . 17 4 . 80 Rel 2 . 98 2 . 13 1 . 64 1 . 53 . 28 1 . 42 0 0 . 28 0 . 28 0 . 28 0 . 22 0 . 11 0 0 . 00 0 . 00 0 . 00 0 . 00 0 . 00 0 . 00 basicmath coremark dhrystone factorialbitcount

FIXER HAFIX HCFI HECFI CET EXCEC

Figure 6.4: Comparison of performance overhead for selected applications

the structure of someprogram. IntelCET and EXCEC don’t need instrumentation for direct calls, so there is no overhead at all for suchapplications. However, coremark and bitcount containindirect function calls, whichmakeupfor as much as 15.4% of all callsfor coremark and even 61.2% for bitcount.HECFI’s instrumentation concept, which requiresatleast three instructionsper indirectcall,isparticularly vulnerable here. Thenumbers forthe recursive factorial application are particularly bad for the academicapproaches.FIXER,HCFI and HECFIall use the same number of instructions for protecting direct function callsand returns. HAFIX requires more instructionsfor the same task,while Intel CETand EXCECget alongwithout anyadditionalones.

ComparisonofForward and Backward Edge CFI CFI proposals differ in the way they protect forward edges in frontofall, while backward edgeCFI is implemented in asimilar waywith shadowstacks in most concepts. We therefore compare the average run-time overhead introducedfor applications,whereonly backward edge protection is required, with theoverhead for applications alsocontaining indirect function callsand/or jumpsinFigure 6.5. HAFIX is not considered in this evaluation becauseitdoesnot offer protection of forward edges.The values forbackward edgeprotectionrepresentthe average overhead for all of theapplications described in Section 6.2, whichdonot containindirectcontrol flowtransfers. Intel CETand EXCEC introducenooverhead at all whensecuring function returns because they don’t require custom instructionsbut onlycouple theirprotection mechanisms to already existingones. The second series in Figure6.5 shows the combined overhead of forward-and backward edge protection for theremaining applications, namely coremark, picojpeg, qrduino, sglib, wikisort and bitcount,whichall containindirectfunctioncalls and/or jumps. Note that indirect calls are not executed for sglib at runtime because of dynamicbranch decisions.Functions,whichcouldpotentially be called in an indirectway,havetobe

76 6.4. Performance

[%] 4 3 . 23 rhead 2 . 86 ove 2 . 28

me 2 ti un 1 . 20 1 . 20 1 . 20 0 . 96 er 0 . 59 ativ 0 0 . 00 0 . 00 Rel FIXERHCFIHECFICET EXCEC

Only backward edge CFI required Backward and forward edge CFI required

Figure 6.5: Comparison of performance overhead for programsonlycontaining CFI- protected backward edges with applications also containing indirect callsand/or jumps instrumented with CFI checks anyway.Theseinstructions contribute to the performance overhead, even if they are not actuallyinterpreted butare in fact treated similarly to NOPs during execution. Significantdifferencescan be seenfor theseapplications. HECFI requires the most additional instructions for indirect functioncalls and stands out here. Indirect jumps, which areused for switch-statementsinpicojpeg and qrduino,are rarely executed andonly CFI-enforced in HECFI, CETand EXCEC.

Comparison with Original Evaluation Results Ultimately we alsocomparethe run-time overhead as measured for our implementations with thenumbers statedinthe original papersfor FIXER, HAFIX, HCFI andHECFI. Concretestatements regardingoverhead for Intel’s CET were not available at thetime of writing thisthesis,sothe concept is omitted here.Figure 6.6 shows adirectcomparison of therespectivenumbers.

The mainreason whyour implementation of HCFI appears to be slowerthanthe original one is presumably thefact that theauthors of theoriginalimplementation could place theirCFI instructions in branchdelay slots, which eliminates some overhead. Thisisnot possibleonour RISC-V platform and is generally not alwaysapplicable in other architectures.However, the numbersfor FIXERinFigure 6.6 need further explanation even though they are very close. Theperformance overhead measured by us representsanoptimized versionofFIXER with fewer additionallyrequired instructions, as previously explained in Section 4.3. The set of benchmark applications used for the original evaluation in [DBGJ19]mightbemorebeneficial for this approachbecause most programs therein contain only few function calls. In general, aportion of thedifferences forall of theschemesshown in Figure 6.6 can be explained by the differentsetsof applications usedfor evaluation.

77 6. Evaluation [%] 2 . 00 2 1 . 96 1 . 75 rhead 1 . 53 1 . 50 1 . 47 1 . 37 ove me 1 . 00 ti 1 un er ativ Rel 0 FIXERHAFIX HCFI HECFI

Measured by us Specified in paper

Figure 6.6: Comparison of measured and specified performance overhead

6.5CodeSize

Especially in the embeddedworld, where memory is limitedcompared to full-scale computers, thesize of applications is of particular relevance. Thereforealso the effects of CFI instrumentation on code size,i.e., sizeofthe .text section in some binary, are considered forthe benchmark applications. Not aggregatednumbers forthe overheads on code size for all CFI implementationsand all benchmark applications are listed in Appendix A.2.

Overview Abasicanalysis and in termsofthe relativesize overheads introducedbythe CFI implementationsisshown in Figure 6.7. Thereason for HECFIbeing theworst here can be explained by its protectionconcept.HECFI injects CFI instructions justbefore and rightafter every function call whilemost other conceptsadd oneinstruction before the call and one before each of thecallee’sreturn statements. Thisdifference means thatHECFI addstwo instructions for every call instead of just one.Similarly,HAFIX addstwo instructions to everyfunction, regardlesswhether it is called or not,and an additional one for everyfunctioncall. FIXER and HCFI handle direct function callsinthe same way.Given the very small shareofindirectcallsinthe benchmark applications, thesetwo approachesbring avery similar number of additionallyinjected instructions with them.Intel CET and EXCEC don’t require instrumentationfor direct function callsatall, so theiroverhead in size is naturallyminimal.The difference between the two of them can be explainedbythe fact thatEXCEC instruments caller and callee with an additional instruction at indirect control flowtransfers while Intel CETonly injects one at thecallee. Also, EXCEC adds trampolines, if needed.

78 6.5. Code Size

8.15 %] 8 7.57 d[ 6.01 6.00

rhea 6 ve

eo 4 iz

es 2

ativ 0.12 0.17 0 Rel FIXERHAFIX HCFIHECFICET EXCEC

Figure 6.7: Comparison of code size overhead for various CFI schemes

It becomes apparentfromFigure 6.7 that instrumentation of indirect functioncalls and jumps onlyaccountsfor aminorshare of all injected instructions. However, all CFI variants showawiderange of .text section size overheads for differentapplications spanning from onlyslightly above0%toanincrease of about 10 %. Figure6.8 shows amore detailedperspectiveand utilizesalogarithmicscalefor abetter overview. The outliersfor CET and EXCEC are of course applications with indirectcontrol flow transfers.

101 [%]

100 rhead ve eo iz −1 es 10 ativ Rel 10−2

FIXER HAFIX HCFIHECFI CET EXCEC

Figure 6.8: Comparison of code size overhead(logscale)

CorrelationofPerformance and Code Size Overhead The overviewnumbers forperformance-and code sizeoverhead shown inFigure 6.3 and Figure 6.7 suggest that thereisacorrelation between thetwo of them. Figure 6.9 shows adirectcomparison of theseindicators.However,this correlation is notalwaystrue in individualcases because theoverheads on performance and code size stronglydependon the natureofaprogram’sCFG.For example, programs with manyrepeated function calls requireminimal CFI instrumentation but showsignificantloss of performance.

79 6. Evaluation 8 . 15

8 7 . 57 [%] 6 . 01 6 6 . 00 rhead ve 4 eo ativ

2 1 . 96 1 . 53 Rel 1 . 47 1 . 37 0 . 17 0 . 16 0 . 12 0 0 . 10 FIXER HAFIX HCFI HECFIICET EXCEC

.text sectionsize Runtime

Figure 6.9: Comparison of code size and run-time overhead

An examplethereof is shown in Figure 6.10a for the factorial application. Here the run-time overhead is significantlylarger than theincrease of code sizebecause thevery same instructionsare executed over and over again in recursively calledfunctions.

Anothercase is acomparison for the bitcount benchmark as showninFigure 6.10b, which contains forward edgeprotectionfor indirectfunctioncalls.Herethe relative performance-and code sizeoverheads are almost aligned for theacademic concepts. Note the significant differencestothe average caseshown in Figure 6.9 for both examples.

Code Code 20 20 Runtime Runtime [%] [%] rhead rhead

ve 10 ve 10 eo eo ativ ativ Rel Rel 0 0

X X ER FI ER FI HCFI CET CEC HCFI CET CEC FIX HAFI HEC EX FIX HAFI HEC EX (a) factorial (b) bitcount

Figure 6.10: Comparison of code size and run-time overheads for specific programs

80 6.6. Utilization

6.6Utilization

Anotherveryimportantaspect to consider whenworking with FPGAsorApplication- specific integrated circuits (ASICs) is utilization of available hardware resources or area, respectively. We compare the effectsofCFI-related changes to theCPU and use the utilization report created by Vivado shown inTable 6.3 as abaseline. These are numbersfor synthesis and implementation of thecompletePULPissimo microcontroller architecture on aXilinxZedBoard without anyCFI enforcementinplace.

Resource Utilization Available Utilization [%] LUT 40484 53200 76.10 FF 18624 106400 17.50 BRAM 128 140 91.43

Table 6.3: PULPissimo’sbase utilization without CFI enforcement

PULPissimo uses 512 KiB of data- and instruction memory,occupying the address space 0x1C000000 - 0x1C080000 [tea20]. Thismemoryisimplemented with BRAM elementsonthe FPGA,using about 91 %(128 of 140 blocks) thereof. BRAM blocks offer considerableamounts of memory,but introduceanaccessdelay of at least one cycle[zyn19]. FFsonthe other handoffer fast access, but are onlyavailable in limited numbersand additionallyrequire accompanying LUTs. The available number of FFs fromTable 6.3 directly corresponds to the available number of registerswithin the whole project. Thesefactors need to be considered when implementing memoryelements. Each of theCFI approachesimplementedfor this thesisentailscertain additional hardware requirements forlogic and memory elements. Whileoverhead in termsofLUTsdepends on implementation details, therequired additional memory is mostly determined by the dimensions of respective memoryelements. The parametersdefined in Section 4.4were usedfor these dimensions in all CFI schemes to enablebetter comparability. All memory elements were implemented with FFs.

Overview Figure 6.11 shows an overviewofthe introduced hardware overhead in termsofthe absolute numbers of additionally required LUTsand FFsfor eachofthe CFI variants. We consider acomparison of theabsolute LUTsand FFsoverheads moremeaningfulwhen comparing theeffectsofCFI implementationsonhardware. Relative numbers are way moreinfluenced by the setup. Forexample, enabling or disabling theFloating PointUnit (FPU), which is an optional componentinthe CV32E40P core, movesrelative overhead numbersnoticeably. It becomes apparentfromFigure 6.11 that FIXERhas thehighest hardware demands, which does notcome by surprise because theconcept includes both ashadowstackand averyexpensivepolicy matrix.

81 6. Evaluation , 370 10 k 9 , 676 9 d rhea ve 6 , 032 eo t u 4 , 485

l 5k 4 , 196 3 , 963 3 , 599 3 , 459 3 , 207 Abso 1 , 418 1 , 288 804 0k FIXERHAFIXHCFIHECFI CET EXCEC

LUT FF

Figure 6.11: Comparison of additionallyrequiredhardware resources

HAFIX on theother hand does notinclude forward edge protection and onlyuses an activeset forCFI enforcementonfunction returns,whichischeaperinterms of required memory but also lesssecure.

Partition of FF Usage These observations suggest to takeacloserlookonhow the overhead on FFsiscomposed in detail. Figure 6.12 shows this partition.The shadowstacks used in FIXER, HCFI and IntelCET forbackward edge protection all require thesame number of FFsbecause they store full32-bitaddresses.They all require4096 FFswith ashadowstacksizeof128, which is used forall of thestackimplementations(as defined in Section 4.4).HECFI does not store fulladdresses but 10-bitlabelsonits stack,whichresultsinasignificantly smallerfootprint. Similarly,only 18 out of 32 addressbits are stored in EXCEC, thereby cuttingits demand of FFsalmost in half forthe shadowstack. EXCEC includesaparticularlythorough handling of recursion though with counters associatedtoevery shadow stackentry, which of course requires additional hardware resources.From the concepts analyzedand implementedfor this thesis, onlyHAFIX does not use ashadowstackbut an active-set. The authorsofHAFIX suggestedtouse an active-set with 16384 entries in [DHP+15], which we considerdisproportionate for embeddedsystems. Given our limited setup, we decided forwithasmaller setwith1024 entries. However, the differences between forward edge protectionconceptsare significantly larger. Label-based approaches, which are used in HCFI, HECFI, EXCEC and roughlyalso in IntelCET,hardly introduce anyoverhead in termsofFFs.Given the minimalsize of the registers required forlabels, therespectiveelements are placed inthe others category in Figure 6.12.FIXER on the other hand implements forward edge CFIwith apolicy

82 6.6. Utilization

EXCEC

CET

HECFI

HCFI

HAFIX

FIXER

1,024 2,048 4,096 8,196 10,000

Numberofadditional FFs

Shadowstack Active-set Policymatrix Matrix index table Recursionelem. Setjmp elem. Others

Figure 6.12: Partition of FFsusage for various CFI schemes matrix, whichisavery fine-grained and expensiveapproach. Thematrix is implemented as an array with 64 × 64 entries,basedonthe parameters defined in Section4.4. This component alone requires 4096 FFs. In addition,FIXER needssome mechanisms for translating 32-bitaddress to matrix indices.Inour implementation this is a matrixindex table with 64 × 18 bits, which storesthe addresses usedinthe matrix and theirassociated indices. HAFIX does not come with forward edge protection at all.

Label-based approachesare superiorcompared to apolicy matrixinterms of required FFs. It can be argued that labelsprovide aless fine-grainedprotection, but improvements like trampolines close this gap. Furthermore,a64 × 64 matrix, as used in our implementation and theoriginalFIXER proposal in [DBGJ19],isratheroptimistic.

Timing

The additional hardware modules required for CFI enforcementalsoimpactthe maximum frequency with whichthe SoCcan be operated.Synthesis of theoriginalPULPissimo SoC for theZedBoard is possiblewith 16 MHz without timing violations with stillsome margin available. All of our CFI implementationsinfluencethe timingtosome extent, but FIXER decreases this marginnoticeably. However, it canstill be synthesized without timingviolations.Inaddition, our FIXERimplementation has thehighest number of logic levels because itsmemory elements requirealarge amountofassociated logic for the lookupoperations in the policy matrix. Higher frequencies can of course lead to problemsinall CFI schemes.

83 6. Evaluation

Floorplans The numbers presented in this section are alsoreflected in theFPGA floorplansshown in Figure 6.13. The turquoise areas represent cellsused for PULPissimo itself, purple areas showthe cells required forthe respectiveCFI implementation and darkareasindicate free resources.FIXER showsaparticularly highutilization. Note that Figure6.13 only represents exemplary floorplansbecause placementisnot deterministic anddependsvery much on theElectronic DesignAutomation (EDA)toolused.

(a)FIXER (b) HAFIX (c) HCFI

(d) HECFI (e) Intel CET (f)EXCEC

Figure 6.13: Visual device utilization comparison for theZedBoard

6.7Final Comparison

Eventually, Figure 6.14 shows therelativeoverheads on runtime, code sizeand hardware utilization in acombinedway forabetter overview. Note that thefigurenow shows the relative increase of hardware utilization (average relativeoverhead of LUTsand FFs) as compared to avanillaSoC without anyCFI hardware here. This is not meaningfulwhen comparing differentplatforms, but canbeusedasanindicator for the impacts of various modifications to thesame system. The figure hintsthatFIXER’s particularly large hardware utilization doesnot payoff because itsrun-time- and code sizeoverheads are stillataverage levels, when being

84 6.7. Final Comparison

40

30 [%]

rhead 20 ve eo

ativ 10 Rel

0 FIXER HAFIX HCFIHECFI CET EXCEC

Runtime Code size Utilization

Figure 6.14: Comparison of runtime,codesize and hardware utilization overheads compared to theother academic approaches. The policy matrix, whichismostly respon- siblefor FIXER’s increased hardware demands, offersveryfine-grained CFI, but similar results canbeachieved with labels and trampolinestoo,ifneeded. In contrast, HECFIentails the lowest hardware overhead because it uses labels forCFI enforcementonforward edges, which are practically free in terms of additional hardware, and ashadowstackwith onlyareduced widthsince HECFI stores 10-bit labelsinstead of full32-bitaddresses.The necessary additional instructionsaffect runtime and code size though. Intel’sCET and EXCEC appear almost identical in thecomparison in Figure6.14. Both introducenoperformance-and code sizeoverhead at all for backward edge protection and onlyaminimal overhead for CFI on forward edges. However, CETonly offers coarse-grained CFI for indirect jumps and function calls.Compared to morefine-grained schemes like EXCEC, this imprecisionofthe CFGleavesmore opportunities for attacks.

85

CHAPTER 7

Conclusion

The maingoal of this thesisistoprovide ameaningfulquantitative comparisonofvarious hardware-assisted CFG-based CFI schemes. Thiswas enabled by implementing five existingapproaches on acommon RISC-V platform. Previous comparativeworks only analyzedsecurityaspectsand listed overheads on performance,codesize and hardware utilization as stated by the authorsofthe respectiveproposals for differenthardware platforms, if at all. In addition to implementationsofexisting approaches, we designed and implementeda novelCFI schemecalled EXCEC based on conceptsand ideas from existingproposals. EXCECoffers asoundprotection for all sortsofcontrol flowtransfers, considers elsewhere seldom supported concepts like setjmp callsand interrupts and comes with dedicated recursionhandling. We alsodeveloped aGCC pluginfor all instrumentation tasks,which injects required CFI instructions into applicationsduring the compilation process. Using theseimplementations, the impactsofCFI enforcementonruntime,codesize and hardware utilization were compared.Tothatend almost 40 benchmark applications for embeddedsystems were ported to our platform, mostly from respected benchmark suites. Whiledifferences in performance and code size are rather small,some concepts stand out whenitcomes to hardware requirements-bothinpositive and negativeways. Approachesfor CFI protection on forwardedges with labels and apolicy matrixwere evaluated. Hardware demandsofthe latterare extreme and theconcept does not offer an otherwise unreachable levelofsecurity. Policy matrices, while being very intuitiveand fine-grained, arenot favorable in comparisontothe very cheap but effectivelabel-based approaches. Forbackward edge protection shadow stacks and an active-set were compared.Itiswell established nowadays thatonly stateful protection, likeitisoffered by shadowstacks, providesperfectCFI enforcementwhile weaker approaches can be bypassed. However,

87 7. Conclusion

shadow stacks are usuallymore expensivecompared to active-setsinterms of hardware utilization. The platformonwhichaCFI concept is to be implementedhas significantconsequences. CFI modules implemented as coprocessorstosome CPU can entailacomparably large performance-and code sizeoverhead. In contrast,adeep integration into some CPU’s pipelineallows foravery efficientimplementation.CFI enforcementonbackward edges is even possiblewithout anyadditional performance-and code sizecostswhen the respective CFI module hasaccess to allrequiredpipelinesignals. Platform-specific knowledgealso allowedustodesign avery efficientshadowstack, which onlystoresapart of instruction addresses. Thisispossible because onlyasmallportion of thewhole addressspace is usedfor instructionsonsome systems, especially in the embeddedworld wherememoriesare rather small. The probablybiggest challenge remains thegenerationofCFG information, whichis required forenforcement of indirectforward-edgeCFI.Incases where no precise CFG is available, it is worth consideringtouse conceptslike Branch Regulation,whereonly certainpropertiesofcontrol flows are enforced. An examplefor suchaconstraint is that everyfunction call mayonlytarget theverybeginning of anyfunction. Intel’s CET basically follows thiscoarse-grained approach. In conclusion it can be saidthatthe actual requirementsand available resources decide which CFI approachisfavorable:Coarse-grained CFI, e.g. with acombination of shadow stackand branch regulation, does not require aCFG,already prevents ROPattacks and significantlyreduces thegadget space for other CRAs. Morefine-grainedapproaches do needaCFG and furtherimprove security. In thecourse of working on this thesiswealsowrote apaper on Comparative Analysis and Enhancement of CFG-basedHardware-AssistedCFI Schemes [TT21]and submitted it to the 48th InternationalSymposium on Computer Architecture.Itwas rated with "Strong Accept" by two out of fivereviewers but wasunfortunately rejected after the rebuttalnevertheless. We are currently lookingfor anothervenue to publishit.

7.1 Future Work

There remainsome possibilities for future work despite the extent of this thesis.Onthe software side,instrumentation is currentlyonly possiblefor applications written in C because thePULP SDKdoesnot providesupport for C++ as of this time. Instrumenting C++ applications would be avaluable additiontoour work, especiallysince concepts likethe vtables usedC++ should alsobeCFI protected. An evaluation of theeffectson runtime would be obvious for suchprograms.Itcan alsobeinteresting to implementa differentconcept for instrumentationlike binaryrewriting becauseour compiler-based requires the availability of source code. The implementationspresented in this thesis are based on abare-metal system. An extension with OS support could giveinterestinginsights, for exampletoevaluate how

88 7.1. Future Work an OS kernelcan interactwith the CFI hardware. This could alsoenablethe evaluation of larger and morecomplex applications like webbrowsers. An increasedperformance overhead for suchprograms is expected becauseofdynamiclinking and the use of object-orientedtechniques. Manuallycreated CFGinformation forindirectfunction calls wasused for instrumentation because CFGgeneration is ahard problemonits ownand no focus of this thesis. A combination of our GCC plugin forinstrumentation with some framework for extracting the control flowofagiven program seems tempting and could solve this shortcoming. The hardware modules for CFI enforcementwere implemented as extensions to asingle- threaded RISC-V SoC. Porting the CFI schemes to other ISAs with differentcharacteristics would offer interesting insights forevaluating the effectsofplatform specifics likebranch delay slots. Implementationsfor multi-threaded environments could alsobeconsidered but require more advanced conceptsbecause CFI statesfor eachthreadcontext must be maintained.

89

APPENDIX A

RawData

A.1 Performance Overhead

Table A.1: Relativerun-time overhead in termsofadditionallyexecuted instructions

Benchmark FIXER HAFIX HCFI HECFI CET EXCEC coremark 1.423 2.135 1.533 1.643 0.110 0.220 aha-mont64 0.056 0.084 0.056 0.056 0.000 0.000 crc32 0.000 0.000 0.000 0.000 0.000 0.000 cubic0.308 0.310 0.308 0.308 0.000 0.000 edn 0.000 0.000 0.000 0.000 0.000 0.000 huffbench0.084 0.126 0.084 0.084 0.000 0.000 matmult-int 0.000 0.000 0.000 0.000 0.000 0.000 minver0.655 0.982 0.655 0.655 0.000 0.000 nbody 1.069 1.069 1.069 1.069 0.000 0.000 nettle-aes 0.008 0.011 0.008 0.008 0.000 0.000 nettle-sha256 0.070 0.105 0.070 0.070 0.000 0.000 nsichneu 0.000 0.000 0.000 0.000 0.000 0.000 picojpeg 1.139 1.708 1.139 1.168 0.028 0.057 qrduino 0.174 0.261 0.174 0.176 0.002 0.004 sglib 2.987 4.481 3.105 3.105 0.118 0.118 slre3.790 n/a 3.790 3.790 0.000 0.000 st 1.563 1.564 1.563 1.563 0.000 0.000 statemate0.000 0.000 0.000 0.000 0.000 0.000 ud 1.837 1.886 1.837 1.837 0.000 0.000 wikisort1.984 2.444 2.261 2.538 0.277 0.554 dhrystone 5.165 8.117 5.165 5.165 0.000 0.000

Continued on next page 91 A. RawData

Table A.1 –continued from last page Benchmark FIXER HAFIX HCFI HECFI CET EXCEC median 0.000 0.000 0.000 0.000 0.000 0.000 multiply 0.000 0.000 0.000 0.000 0.000 0.000 qsort 0.000 0.000 0.000 0.000 0.000 0.000 rsort 0.001 0.002 0.001 0.001 0.000 0.000 sort0.000 0.000 0.000 0.000 0.000 0.000 spmv 1.435 1.435 1.435 1.435 0.000 0.000 towers 5.520 8.280 n/a 5.520 0.000 0.000 vvadd 0.000 0.000 0.000 0.000 0.000 0.000 basicmath0.280 0.284 0.280 0.280 0.000 0.000 bitcount 5.952 8.929 8.929 10.750 2.976 4.797 dijkstra 0.153 0.230 0.153 0.153 0.000 0.000 fft 0.655 0.663 0.655 0.655 0.000 0.000 qsort 0.764 1.145 0.764 0.764 0.000 0.000 stringsearch 0.000 0.000 0.000 0.000 0.000 0.000 susan 0.083 0.083 0.083 0.096 0.000 0.000 factorial 14.876 22.314 14.876 14.876 0.000 0.000 nqueens 0.769 1.153 0.769 0.769 0.000 0.000 tak 7.260 10.890 7.260 7.260 0.000 0.000

A.2CodeSize Overhead

Table A.2: Relativecodesize overhead

Benchmark FIXER HAFIX HCFI HECFI CET EXCEC coremark 7.24 8.67 7.19 10.59 0.26 0.34 aha-mont64 7.04 9.02 7.04 9.69 0.11 0.15 crc32 7.30 9.36 7.30 10.04 0.12 0.16 cubic2.97 3.38 2.97 3.54 0.02 0.03 edn 6.15 7.90 6.15 8.37 0.10 0.13 huffbench 6.85 8.81 6.85 9.30 0.11 0.14 matmult-int 6.85 8.80 6.85 9.31 0.11 0.15 minver6.55 8.18 6.55 8.75 0.09 0.12 nbody 5.64 6.73 5.64 7.12 0.06 0.09 nettle-aes 3.48 4.45 3.48 4.77 0.06 0.07 nettle-sha256 4.84 6.21 4.84 6.67 0.08 0.10 nsichneu 2.82 3.61 2.82 3.90 0.05 0.06 picojpeg 5.15 6.29 5.13 8.02 0.48 0.60 qrduino 4.77 6.08 4.77 6.64 0.24 0.28

Continued on next page

92 A.2. Code Size Overhead

Table A.2 –continued from last page Benchmark FIXER HAFIX HCFI HECFI CET EXCEC sglib 6.85 8.55 6.75 9.84 0.13 0.80 slre7.35 9.32 7.35 9.85 0.12 0.15 st 5.86 7.00 5.86 7.38 0.07 0.09 statemate6.47 8.08 6.47 8.63 0.09 0.13 ud 6.41 8.17 6.41 8.67 0.10 0.13 wikisort4.89 5.77 4.76 6.04 0.21 0.24 dhrystone 9.12 11.96 9.12 12.00 0.12 0.16 median 8.03 10.28 8.03 11.03 0.14 0.18 multiply 7.32 9.37 7.32 10.06 0.12 0.17 qsort 7.77 9.92 7.77 10.66 0.13 0.18 rsort 7.68 9.86 7.68 10.52 0.13 0.17 sort7.71 9.80 7.71 10.52 0.13 0.17 spmv 0.54 0.67 0.54 0.71 0.01 0.01 towers 7.87 10.04 7.87 10.82 0.13 0.17 vvadd 8.01 10.26 8.01 11.01 0.14 0.18 basicmatch 2.98 3.39 2.98 3.54 0.02 0.03 bitcount 8.40 10.23 8.15 10.59 0.42 0.50 dijkstra 1.49 1.89 1.49 2.06 0.02 0.03 fft 4.20 4.97 4.20 5.32 0.04 0.05 qsort 6.90 8.79 6.90 9.55 0.11 0.15 stringsearch 2.92 3.72 2.92 4.00 0.05 0.07 susan 3.59 4.41 3.59 4.78 0.05 0.06 factorial 8.25 10.53 8.25 11.26 0.14 0.18 nqueens 8.07 10.32 8.07 11.13 0.14 0.18 tak 8.23 10.49 8.23 11.33 0.14 0.18

93

ListofFigures

1.1 RISC-V stacklayout during afunctioncall [PH] ...... 3 1.2 Attackmodel with different kinds of exploits[SPWS13] ...... 4 1.3 Stacklayout forthe CcodeofListing1.5 ...... 7 1.4 Example of agadgetchain for aROP attack[ZQQQ18] ...... 8 1.5 WorkingsofaJOP attack[SAS15] ...... 9 1.6 Functionalityofadispatchergadget [BJFL11] ...... 9

2.1 Abstract CFGexample ...... 14 2.2 CFI example with imprecise CFG[ABEL09] ...... 16 2.3 Example with imprecise CFG[GABP14] ...... 16 2.4 Exemplary flowfor breaking out of theCFG [DDE+12] ...... 19

3.1 Example for statefulCFI ...... 24 3.2 Active-setexample ...... 25 3.3 Shadowstackexample ...... 25 3.5 Operation of atrampoline for indirect functioncalls [SAD+16] ...... 27 3.6 Policy matrix example ...... 28 3.7 Simplified execution flowwith aC++ exception[DZL16] ...... 30 3.12 IntelControl-FlowEnforcement Technology [Gar20] ...... 38

4.1 RISC-V baseinstruction formats[AW17] ...... 43 4.2 Instrumented code for direct function call andreturn in FIXER [DBGJ19]46 4.3 Part of addressused for shadow stackinEXCEC ...... 50 4.4 Workingsofatrampoline for indirect functioncalls in EXCEC ...... 51 4.5 Valid (CFI)instruction sequences in EXCEC ...... 53 4.6 CV32E40P pipeline with EXCEC(modified from [pul])...... 54

5.1 Stepsfor binaryrewriting [WMUW19] ...... 58

6.1 Numbers of functions in our benchmark applications...... 70 6.2 Relativesharesofcontrol flowtransfers typesinbenchmarkapplications .72 6.3 Comparison of performance overhead introducedbyCFI schemes ..... 75 6.4 Comparison of performance overhead for selected applications ...... 76

95 6.5 Comparison of performance overhead for programs onlycontaining CFI- protected backward edges with applications alsocontaining indirectcalls and/or jumps...... 77 6.6Comparison of measured and specified performance overhead ...... 78 6.7Comparison of code size overhead for various CFI schemes ...... 79 6.8Comparison of code size overhead (log scale) ...... 79 6.9Comparison of code size and run-time overhead...... 80 6.10 Comparison of code size and run-time overheads for specific programs .. 80 6.11 Comparison of additionally required hardware resources ...... 82 6.12 PartitionofFFs usage forvarious CFI schemes ...... 83 6.13 Visual device utilization comparison for theZedBoard ...... 84 6.14 Comparison of runtime,codesize and hardware utilization overheads ...85

96 List of Tables

1.1 Some unsafefunctions in theStandard CLibrary [BST+00] ...... 2

2.1 CFI instructions proposedbyAbadietal. [ABEL09]...... 15

4.1 Common implementation parameters ...... 48 4.2 CFI instructions forEXCEC ...... 52 4.3 EXCECinterface...... 55

6.1 Listofused benchmark applications ...... 71 6.2 Overview of thesupported CFI protectionscope...... 74 6.3 PULPissimo’sbase utilization without CFI enforcement...... 81

A.1 Relativerun-time overhead in termsofadditionally executed instructions .91 A.2 Relativecodesize overhead ...... 92

97

List of Listings

1.1 Potential bufferoverflow...... 2 1.2 ROPgadgetfor addingtwo register values...... 5 1.3 Twovalidx86 instructions[Sha07] ...... 6 1.4 Sequence without thefirst byte hasadifferentmeaning on x86 [Sha07] 6 1.5 SimpleROP attack example ...... 7 2.1 CFI instrumentation of avaliddestination (x86) [ABEL09] ...... 17 2.2 CFI instrumentation of an indirect functioncall(x86)[ABEL09] ... 17 2.3 CFI instrumentation of afunctionreturn (x86) [ABEL09] ...... 18 2.4 Example for problems with static source code analysis[BCN+17] ...20 3.1 Settinglabel in caller ...... 26 3.2 Label checkincallee ...... 26 3.3 FIXER: Directfunction call [DBGJ19] ...... 32 3.4 FIXER: Function return [DBGJ19]...... 32 3.5 HAFIX: Function call ...... 33 3.6 HAFIX: Called function...... 33 3.7 HCFI: Indirect function call ...... 35 3.8 HCFI: Called function...... 35 3.9 Recursive factorialfunction...... 36 3.10 Simplified recursive towers function...... 36 3.11 HECFI: Indirect call ...... 37 3.12 HECFI: Callee function ...... 37 4.1 Examplesfor custom instruction format ...... 43 4.2 Examplesfor using custom instructions ...... 44 4.3 Caller instrumentation ...... 46 4.4 Callee instrumentation ...... 46 5.1 HCFI: Instrumentation of adirectfunctioncall ...... 60 5.2 HCFI: Instrumentation of adirectly called function ...... 60 5.3 HCFI: Instrumentation of indirect function call...... 61 5.4 HCFI: Instrumentation of indirectlycalled function ...... 61 5.5 Exemplary registration of aGCC plugin...... 63 5.6 Exemplary execute function foraGCCplugin ...... 63 5.7 Debug representation of areturnstatement...... 64 5.8 Example for aCFG configuration file...... 66

99

Acronyms

AIR Average Indirect target Reduction. 73, 74

API Application ProgrammingInterface. 62

ASIC Application-specific integrated circuit. 81

ASLR Address Space Layout Randomization. 4, 10, 11, 18

BR BranchRegulation. 20, 28, 29, 31, 38

BRAM BlockRAM. 31, 34, 45, 50, 81

CET Control-FlowEnforcement Technology.20–22, 38, 39, 41, 44, 47, 56, 68, 70, 74–78, 82, 84, 85, 88

CFB Control-Flow Bending. 16, 17

CFG Control FlowGuard.21, 38

CFG Control Flow Graph. xi, xiii, xv, 11–20, 23–28, 30, 31, 33, 36, 37, 50, 58, 59, 61, 65, 66, 79, 85, 87–89, 95, 99

CFI Control Flow Integrity.xi, xiii, xv, 1, 2, 11–39, 41–57, 59–63, 65–70, 73–79, 81–85, 87–89, 95–97

COP Call-Oriented Programming. 5, 10

COTS commercial-off-the-shelf. 62

CPU Central Processing Unit. 1, 9–11,22, 24, 30, 31, 38, 42, 47, 55, 56, 67, 74, 81, 88

CRA Code-Reuse Attack.xi, xiii, xv, 4, 5, 7, 9, 10, 13, 18, 28, 56, 73, 88

CSR Control and StatusRegister.69, 73

CU Compilation Unit.30, 39, 46, 61–64

CWE Common Weakness Enumeration.1

101 DEP Data ExecutionPrevention. 4, 10, 18

DIFT Dynamic InformationFlowTracking.11

EDA Electronic DesignAutomation. 84

EXCEC EXtensiveCFI Enforcement Concept.xiii, xv, 12, 41, 49–56, 67, 74, 77, 82, 85, 87, 95, 97

FF Flip Flop. 31, 42, 45, 50, 54, 81–84, 96

FPGA Field Programmable Gate Array.32, 34, 42, 73, 81, 84

FPU Floating PointUnit. 81

FSM Finite State Machine.29, 35, 38, 50, 53

GCC GNU Compiler Collection. 12, 21, 29, 42, 44, 48, 57, 62–65, 68, 70, 87, 89, 99

HDL hardware description language. 42

IBT Indirect BranchTracking.38

IoT Internet of Things. 1

IP Intellectual Property. 31

IPA Inter-Procedural Optimization. 62

IRM InlinedReference Monitor.18

ISA Instruction SetArchitecture.xv, 5, 6, 10, 17, 19, 22, 31, 34, 36, 38, 41–45, 47, 50, 52, 56–60, 69, 70, 89

ISR interrupt serviceroutine. 12, 49, 56, 74

JOP Jump-Oriented Programming. 5, 8–10, 73, 95

LIFO lastin-first out. 26

LSB least significant bit. 50

LSS Label StateStack. 37

LTO Link Time Optimization. 21, 39, 48, 61, 64, 65

LUT LookupTable. 42, 81, 84

102 MMU Memory Management Unit.18

MPU Memory Protection Unit. 10

MSB most significantbit. 50

OS operatingsystem.1,6,10, 21, 22, 88, 89

PAC Pointer Authentification Code.21, 22

PC Program Counter. 9, 31, 32, 35, 37, 45, 55, 56

PL programmable logic. 42

PUF physical unclonablefunction. 28

RILC return-into-libc. 5

RIPE Runtime Intrusion Prevention Evaluator.73

RoCC Rocket CustomCoprocessor.32, 33

ROP Return-OrientedProgramming. 5–8, 10, 73, 88, 95, 99

RTL RegisterTransfer Level.62, 64

SDK software developmentkit. 42, 50, 68

SFI Software Fault Isolation. 18

SMAC Software MemoryAccess Control. 18

SoC System-on-a-Chip. 32, 34, 41, 42, 45, 67, 69, 83, 84, 89

SSA Static Single Assignment.62

UART UniversalAsynchronous ReceiverTransmitter. 73

103

Bibliography

[AAB+16] Krste Asanovic, Rimas Avizienis,Jonathan Bachrach, Scott Beamer,David Biancolin,Christopher Celio, Henry Cook, DanielDabbelt, John Hauser, Adam Izraelevitz,etal. Rocket. EECSDepartment, University of California, Berkeley,Tech. Rep. UCB/EECS-2016-17,2016.

[ABEL05] Martín Abadi, Mihai Budiu,ÚlfarErlingsson, and JayLigatti. Control-flow integrity. In Proceedings of the 12th ACMConferenceonComputer and Communications Security (CCS),CCS ’05, page 340–353. Association for ComputingMachinery,2005.

[ABEL09] Martín Abadi, Mihai Budiu,ÚlfarErlingsson, and JayLigatti. Control-flow integrityprinciples, implementations, and applications. ACMTransactions on Information and System Security (TISSEC),13(1):1–40, 2009.

[and20] Control flowintegrity|android. https://source.android.com/ devices/tech/debug/cfi,September 2020. (Accessed on 10/14/2020).

[ARM15] ARM. D4.4.1 memory access control ARM architecture referenceman- ual forARMv8-A. https://developer.arm.com/documentation/ ddi0487/ag/,72015. (Accessed on 12/13/2020).

[Att] Attribute Syntax -Usingthe GNUCompilerCollection (GCC). https: //gcc.gnu.org/onlinedocs/gcc-4.2.4/gcc/Attribute- Syntax.html#Attribute-Syntax.(Accessed on 11/09/2020).

[AW17] Krste Asanovic Andrew Waterman. The RISC-V Instruction SetMan- ualVolume I: User-Level ISA. https://riscv.org//wp-content/ uploads/2017/05/riscv-spec-v2.2.pdf,May 2017. (Accessed on 09/06/2020).

[BCN+17] Nathan Burow,ScottACarr, JosephNash,Per Larsen,Michael Franz, StefanBrunthaler,and MathiasPayer. Control-Flow Integrity:Precision, Security, and Performance. ACMComputing Surveys(CSUR),50(1):1–33, 2017.

105 [Ben20] Marco Benatto. Fighting exploitswith control-flow integrity(CFI) in clang. https://www.redhat.com/en/blog/fighting-exploits- control-flow-integrity-cfi-clang,May 2020. (Accessed on 10/14/2020).

[BJF11] Tyler Bletsch, Xuxian Jiang, and Vince Freeh.Mitigating code-reuse attacks with control-flow locking.InProceedings of the 27th AnnualComputer Security Applications Conference(ACSAC),pages 353–362, 2011.

[BJFL11] Tyler Bletsch, Xuxian Jiang, Vince WFreeh, and Zhenkai Liang. Jump- orientedprogramming: anew class of code-reuse attack. In Proceedings of the 6th ACMSymposium on Information, Computer and Communications Security (ASIA-CCS),pages 30–40, 2011.

[Bra16] David Brash.Armv8-a:2016 additions. https://community.arm. com/developer/ip-products/processors/b/processors-ip- blog/posts/armv8-a-architecture-2016-additions,October 2016. (Accessed on 10/14/2020).

[BRSS08] Erik Buchanan, RyanRoemer,Hovav Shacham, and Stefan Savage. When good instructions go bad:Generalizingreturn-orientedprogramming to RISC. In Proceedings of the 15th ACMconferenceonComputer and CommunicationsSecurity(CCS),pages 27–38, 2008.

[BST+00] Arash Baratloo, Navjot Singh, Timothy KTsai,etal. Transparentrun-time defense against stack-smashingattacks. In USENIXAnnual Technical Conference, General Track (USENIXATC),pages 251–262, 2000.

[CAS+17] Abraham AClements, Naif Saleh Almakhdhub, Khaled SSaab, Prashast Srivastava, Jinkyu Koo, SaurabhBagchi,and Mathias Payer.Protecting bare-metalembeddedsystemswith privilegeoverlays. In 2017 IEEE Symposium on Security and Privacy (S&P),pages 289–303. IEEE, 2017.

[CBP+15] Nicholas Carlini, Antonio Barresi, Mathias Payer, David Wagner,and Thomas RGross.Control-flow bending: On the effectiveness of control-flow integrity. In 24th USENIXSecuritySymposium (USENIXSecurity15), pages 161–176, 2015.

[CCAI16] NickChristoulakis, GeorgeChristou, EliasAthanasopoulos, and Sotiris Ioannidis. HCFI: Hardware-enforcedcontrol-flow integrity. In Proceedings of the Sixth ACMConferenceonData and Application Securityand Privacy (CODASPY),pages 38–49, 2016.

[CH01] Tzi-ckerChiuehand Fu-Hau Hsu.RAD:Acompile-time solution to buffer overflowattacks. In Proceedings 21st International ConferenceonDis- tributedComputing Systems (ICDCS),pages 409–417. IEEE,2001.

106 [claa] Control flowintegrity—Clang12documentation. https:// clang.llvm.org/docs/ControlFlowIntegrity.html.(Accessed on 10/14/2020).

[clab] Control flowintegritydesign documentation —Clang 12 documentation. https://clang.llvm.org/docs/ ControlFlowIntegrityDesign.html.(Accessed on 10/14/2020).

[Cor19] JonathanCorbet.ComparingGCC and clang securityfeatures. https: //lwn.net/Articles/798913/,September 2019. (Accessed on 10/14/2020).

[CPM+98] Crispan Cowan, Calton Pu, Dave Maier, JonathanWalpole, Peat Bakke, SteveBeattie, Aaron Grier, Perry Wagle, QianZhang, and Heather Hinton. Stackguard: Automatic adaptivedetection and prevention of buffer-overflow attacks.InProceedingsofthe 7th USENIX SecuritySymposium (USENIX Security 1998),volume98, pages 63–78. San Antonio, TX, 1998.

[cv3] OpenHWGroupCORE-V CV32E40P RISC-V IP. https://github. com/openhwgroup/cv32e40p.(Accessed on 11/07/2020).

[CW14] Nicholas Carliniand David Wagner.ROP is still dangerous: Breaking modern defenses. In 23rdUSENIXSecuritySymposium (USENIXSecurity 14),pages 385–399, 2014.

[CXS+09] Ping Chen, Hai Xiao, Xiaobin Shen, Xinchun Yin,Bing Mao, and Li Xie. DROP: Detecting return-oriented programming malicious code. In Interna- tional ConferenceonInformation SystemsSecurity (ICISS),pages 163–177. Springer,2009.

[CZM+14] Yueqiang Cheng, Zongwei Zhou, Yu Miao, Xuhua Ding,and Robert H Deng.ROPecker: Ageneric and practical approachfor defendingagainst rop attack. Proceedingsofthe 21st Networkand DistributedSystem Security Symposium (NDSS),2014.

[DBGJ19] AsmitDe, Aditya Basu,Swaroop Ghosh,and TrentJaeger. FIXER: Flow integrityextensions forembeddedRISC-V. In 2019 Design, Automation &Test in Europe Conference&Exhibition (DATE),pages 348–353. IEEE, 2019.

[DCGÜ+16] Ruan De Clercq,Johannes Götzfried, DavidÜbler, Pieter Maene, and Ingrid Verbauwhede. SOFIA: Software and control flowintegrityarchitecture. 2016 Design, Automation &Test in Europe Conference&Exhibition (DATE), 68:16–35, 2016.

[dCV17] Ruan de Clercq and Ingrid Verbauwhede. Asurvey of hardware-based control flowintegrity(CFI). arXivpreprint arXiv:1706.07257,2017.

107 [DDE+12] Lucas Davi,Alexandra Dmitrienko, ManuelEgele, Thomas Fischer, Thorsten Holz,Ralf Hund,StefanNürnberger,and Ahmad-Reza Sadeghi. MoCFI: Aframework to mitigatecontrol-flow attacks on smartphones.In Proceedings of the 19th AnnualNetworkand DistributedSystem Security Symposium (NDSS),volume26, pages 27–40, 2012.

[DHP+15] Lucas Davi,Matthias Hanreich, DebayanPaul,Ahmad-Reza Sadeghi, PatrickKoeberl,Dean Sullivan,Orlando Arias, and YierJin. HAFIX: Hardware-assisted flowintegrityextension. In 2015 52nd ACM/EDAC/IEEE Design Automation Conference(DAC),pages 1–6. IEEE, 2015.

[DZL16] Sanjeev Das, WeiZhang,and Yang Liu.Afine-grainedcontrol flowintegrity approachagainst runtime memory attacks for embeddedsystems. IEEE Transactions on Very LargeScale Integration (VLSI) Systems,24(11):3193– 3207, 2016.

[ES00] Ulfar Erlingsson and Fred BSchneider.IRM enforcementofjavastack inspection.InProceeding 2000 IEEESymposium on Securityand Privacy (S&P),pages 246–255. IEEE,2000.

[FC08] Aurélien Francillon and ClaudeCastelluccia. Code injection attacks on harvard-architecture devices. In Proceedings of the 15th ACMconference on Computer and Communications Security (CCS),pages 15–26, 2008.

[GABP14] EnesGöktas, Elias Athanasopoulos, Herbert Bos,and Georgios Portokalidis. Out of control: Overcomingcontrol-flowintegrity. In 2014 IEEESymposium on Security and Privacy (S&P),pages 575–589. IEEE,2014.

[Gar20] TomGarrison. IntelCET answers call to protectagainstcommon malware threats. https://newsroom.intel.com/editorials/intel- cet-answers-call-protect-common-malware-threats,June 2020. (Accessed on 10/13/2020).

[grs16] On the Effectiveness of Intel’s CETAgainst Code Reuse At- tacks. https://grsecurity.net/effectiveness_of_intel_ cet_against_code_reuse_attacks,June 2016. (Accessed on 10/15/2020).

[Int19] Intel. Control-flow enforcementtechnology specification. https: //software.intel.com/sites/default/files/managed/4d/ 2a/control-flow-enforcement-technology-preview.pdf, May2019. (Accessed on 10/15/2020).

[JTC18] JTC 1/SC 22/WG14. Programming languages —C.Std 9899:2018, ISO/IEC, July 2018.

108 [KOGP12] Mehmet Kayaalp, Meltem Ozsoy, Nael Abu Ghazaleh,and DmitryPono- marev.Efficientlysecuring systems from code reuse attacks. IEEE Trans- actions on Computers (TC),63(5):1144–1156, 2012.

[LH19] Kangjie Lu and Hong Hu. Wheredoesitgo? refiningindirect-call targets with multi-layertypeanalysis. In Proceedingsofthe 2019 ACMSIGSAC ConferenceonComputer and Communications Security(CCS),pages 1867– 1881, 2019.

[MIT20] MITRE. CWE -2020 CWE Top25Most Dangerous Soft- ware Weaknesses. http://cwe.mitre.org/top25/archive/2020/ 2020_cwe_top25.html,2020. (Accessed on 10/16/2020).

[msc18] Control flowguard. https://docs.microsoft.com/en-us/ windows/win32/secbp/control-flow-guard,May 2018. (Accessed on 10/14/2020).

[NS05] JamesNewsome and Dawn XiaodongSong. Dynamic taint analysisfor automaticdetection,analysis, andsignaturegeneration of exploitson commoditysoftware. In Proceedings of the 12th AnnualNetworkand DistributedSystem Security Symposium (NDSS),volume 5, pages3–4, 2005.

[NT14] Ben Niu and GangTan. Modularcontrol-flow integrity. In Proceedings of the 35th ACMSIGPLAN ConferenceonProgrammingLanguage Design and Implementation (PLDI),pages 577–587, 2014.

[PH] David APatterson and JohnLHennessy.Computerorganization and design RISC-V edition:The hardware software interface (the morgan kaufmann seriesincomputer architecture anddesign).

[Pro02] Mark Probst. Dynamic Binary Translation. In UKUUGLinux Developer’s Conference,volume 2002, 2002.

[pul] PULPissimo Github page. https://github.com/pulp-platform/ pulpissimo.(Accessed on 11/07/2020).

[PW17] David Patterson and Andrew Waterman. TheRISC-V Reader: an open architectureAtlas.StrawberryCanyon, 2017.

[QT17] Inc.Qualcomm Technologies. Pointer authentication on armv8.3: Designand analysisofthe new software securityinstructions. https: //www.qualcomm.com/media/documents/files/whitepaper- pointer-authentication-on-armv8-3.pdf,January2017. (Ac- cessed on 10/14/2020).

109 [RKHB12] Mehryar Rahmatian, Hessam Kooti,Ian GHarris,and Elaheh Bozorgzadeh. Hardware-assisted detectionofmalicious software in embeddedsystems. IEEEEmbeddedSystems Letters,4(4):94–97, 2012.

[SAD+16] DeanSullivan, Orlando Arias, Lucas Davi,Per Larsen,Ahmad-Reza Sadeghi, and YierJin. Strategywithout tactics: Policy-agnostic hardware- enhanced control-flow integrity. In 2016 53nd ACM/EDAC/IEEE Design Automation Conference(DAC),pages 1–6. IEEE, 2016.

[SAS15] AliAkbar Sadeghi,FarzaneAminmansour,and HamidRezaShahriari. Tiny jump-orientedprogramming attack(aclass of code reuseattacks). In 2015 12th International Iranian Society of CryptologyConferenceonInformation Security and Cryptology (ISCISC),pages 52–57. IEEE,2015.

[SBPV12] Konstantin Serebryany, DerekBruening,Alexander Potapenko, and Dmitriy Vyukov.Addresssanitizer: Afast addresssanity checker. In Presentedas part of the2012 USENIXAnnual Technical Conference(USENIXATC 12),pages 309–318, 2012.

[Sha07] HovavShacham. The geometryofinnocentflesh on thebone: Return- into-libc without function calls(on thex86). In Proceedings of the 14th ACMconferenceonComputer and Communications Security(CCS),pages 552–561, 2007.

[SPP+04] HovavShacham, Matthew Page, Ben Pfaff, Eu-Jin Goh,Nagendra Modadugu, and Dan Boneh. On theeffectiveness of address-spaceran- domization.InProceedings of the 11th ACMconferenceonComputer and Communications Security(CCS),pages 298–307, 2004.

[SPWS13] LaszloSzekeres,Mathias Payer, TaoWei,and Dawn Song.Sok: Eternal warinmemory. In 2013 IEEESymposium on Securityand Privacy (S&P), pages 48–62. IEEE, 2013.

[StGDC20] RichardM.Stallmanand the GCC DeveloperCommunity.GNU compiler collection internals. Free SoftwareFoundation,2020.

[SWK+16] Takamichi Saito,Ryohei Watanabe,Shuta Kondo,Shota Sugawara,and Masahiro Yokoyama. Asurveyofprevention/mitigation against memory corruption attacks.In2016 19th InternationalConferenceonNetwork- BasedInformation Systems (NBiS),pages 500–505. IEEE,2016.

[tea20] The PULP team. PULPissimo: Datasheet. https://github.com/ pulp-platform/pulpissimo/blob/master/doc/datasheet/ datasheet.pdf,May 2020. (Accessed on 09/08/2020).

[TEB+11] Minh Tran, MarkEtheridge, Tyler Bletsch, Xuxian Jiang, Vincent Freeh, and Peng Ning. On theexpressiveness of return-into-libcattacks. In

110 InternationalWorkshop on Recent Advances in Intrusion Detection(RAID), pages 121–141. Springer,2011.

[TRC+14] Caroline Tice, TomRoeder,Peter Collingbourne,Stephen Checkoway, Úlfar Erlingsson, LuisLozano, and Geoff Pike.Enforcingforward-edge control-flow integrityingcc & llvm.In23rdUSENIXSecuritySymposium (USENIX Security 14),pages 941–955, 2014.

[TT21] Mario Telesklavand StefanTauner. Comparativeanalysis and enhance- mentofCFG-based hardware-assisted CFI schemes. arXiv preprint arXiv:2103.04456 [cs.AR],2021.

[TW17] Michael Theodorides and DavidWagner. Breaking active-setbackward- edge CFI. In 2017 IEEEInternational Symposium on HardwareOriented Security and Trust (HOST),pages 85–89. IEEE, 2017.

[Van20] Jens Vankeirsbilck. Advancing Control Flow Error DetectionTechniques for EmbeddedSoftwareusingAutomated Implementationand Fault Injection. PhDthesis, 2020.

[VTVW+17] Jens Vankeirsbilck, Venu BabuThati, Jonas VanWaes, Hans Hallez, and Jeroen Boydens.Control flowaware software-implemented fault injection forembeddedCPUs. In 2017 XXVIInternational Scientific Conference Electronics (ET),pages 1–4. IEEE, 2017.

[WJ10] Zhi Wang and Xuxian Jiang. Hypersafe: Alightweightapproach to provide lifetime hypervisorcontrol-flow integrity. In 2010 IEEESymposium on Security and Privacy (S&P),pages 380–395. IEEE, 2010.

[WMUW19] Matthias Wenzl,GeorgMerzdovnik, JohannaUllrich, and Edgar Weippl. From hacktoelaborate technique—asurvey on binaryrewriting. ACM Computing Surveys (CSUR),52(3):1–37, 2019.

[WNY+11] John Wilander, NickNikiforakis,YvesYounan, MariamKamkar,and Wouter Joosen.RIPE:runtime intrusion prevention evaluator. In Pro- ceedings of the 27th AnnualComputer Security Applications Conference (ACSAC),pages 41–50, 2011.

[ZDN20] ZDNet.Chrome:70% of all securitybugsare memory safety issues. https://www.zdnet.com/article/chrome-70-of-all- security-bugs-are-memory-safety-issues,May 2020. (Ac- cessed on 10/16/2020).

[ZQQQ18] Jiliang Zhang, Binhang Qi,Zheng Qin, and GangQu. HCIC: Hardware- assisted control-flow integritychecking. IEEE InternetofThings Journal (IoT-J),6(1):458–471, 2018.

111 [ZS13] MingweiZhang and RSekar. Control flowintegrityfor COTS binaries. In 22nd USENIX Security Symposium (USENIX Security 13),pages337–352, 2013.

[ZWC+13] Chao Zhang, TaoWei, Zhaofeng Chen, Lei Duan, LaszloSzekeres,Stephen McCamant, Dawn Song, and WeiZou. Practical control flowintegrity andrandomization for binary executables. In 2013IEEE Symposiumon Security and Privacy (S&P),pages 559–573. IEEE, 2013.

[zyn19] 7series FPGAs memory resources. https://www.xilinx.com/ support/documentation/user_guides/ug473_7Series_ Memory_Resources.pdf,July 2019. (Accessedon09/08/2020).

112