POLITECNICO DI TORINO

III Facoltà di Ingegneria
Corso di Laurea in Ingegneria Elettronica

Tesi di Laurea

LISARM: embedded ARM platform design and optimization

Relatori: Prof. Guido Masera, Ing. Maurizio Martina, Ing. Fabrizio Vacca
Candidato: Carlo Ceriani

April 2007

To my mother, to my father and . . .

to those who had faith in me

Acknowledgements

My first and greatest thanks go to my mother, for the fundamental support she has given me during these long years of study, for never letting me lack her trust, and for knowing how to give me the right encouragement, especially in the most difficult moments. In these lines I cannot fail to remember my father, in particular for teaching me that, by rolling up one's sleeves and trusting one's own abilities, one can always push further and broaden one's horizons. I thank my supervisor, Prof. Guido Masera, and my co-supervisors, Maurizio Martina and Fabrizio Vacca, for their essential advice, for guiding me through the crucial turning points of my work, and for making available the resources I needed. I thank the other members of the VLSILab, with whom I had the pleasure of sharing this experience, for always being ready to solve the multitude of everyday problems that arose. A special thanks goes to Federico Quaglio, for the help he gave me both in the research and development phase of the project and in the writing of this dissertation. As this is the concluding act of a long course of study, but also and above all to seal an important stretch of my life, I thank all those who along this path have enriched my life with knowledge, with experience, or simply with pleasant moments of leisure.

Summary

The spread of electronic devices through many aspects of everyday life has deeply changed not only the constraints of industrial production but also the technologies underlying the applications the market demands. Although System-on-Chip technology makes it possible to place heterogeneous components on the same die, the development time of hardwired solutions and the significant constraints imposed by return-on-investment considerations have driven the search for new approaches. Hardware-software partitioning is one of the most widely applied techniques; it divides the complexity of the target application across two levels: the design of a powerful, flexible programmable system and the implementation of the complex algorithms that satisfy market demand. The development phase must be carried out in a coordinated way by the design groups, so that this approach can ensure shorter product implementation times. Other constraints, on power consumption and occupied area, are also important, particularly for mobile devices, which must offer long battery life together with higher performance than preceding applications, as customers require. In this branch of technology, microprocessor-based platforms are the most widespread, and the ARM7TDMI processor represents a successful product thanks to its noteworthy performance and low-power characteristics. The use of embedded processors is not the only solution: although the architectures available on the market provide many of the characteristics manufacturers request, they are sometimes not tailored to critical applications, or their structure is too complex, with dramatic effects on power consumption and area occupation. A different solution is represented by ASIPs, i.e. processors specifically designed for target applications, which provide a dedicated instruction set built around the software algorithms to be executed on them.
Programmable architectures are designed with dedicated software environments that allow the instruction set to be described in a flexible manner, enabling code reuse by writing it in an Architecture Description Language such as LISA 2.0. The LISATek Toolsuite and the Language for Instruction-Set Architecture allow the processor's behavior to be described in all its aspects, including the temporal one, integrating technologies such as pipelining and caching, and make it possible to obtain a hardware description in HDL, a powerful simulator and all the dedicated tools for software development. The aim of this thesis work is to explore the possibilities offered by this software environment in the development of a programmable platform based on the ARM7 processor, whose extensive documentation, the result of its many applications, makes it possible to analyse in depth the characteristics to be transferred to the model.

Chapter 1 Contains a brief review of all the topics treated in this thesis and an extended summary in Italian, as required by the university rules for theses written in a foreign language.

Chapter 2 Introduces some concepts about computer architectures, with some historical notes on the evolution of computers and microprocessors.

Chapter 3 Describes the ARM7TDMI processor from its programmer's model to an in-depth analysis of its architecture, covering its instruction set and the interfacing of the core with external systems.

Chapter 4 Introduces the LISATek toolsuite, a powerful software environment for ASIP modeling and the principal instrument used for the development and verification of LISARM.

Chapter 5 Describes the LISARM processor model, reporting the guidelines followed in the development of its various parts and the architectural solutions adopted to obtain a model consistent with the ARM7 in both behavior and internal structure.

Chapter 6 Describes the tools obtained from the model description using the LISATek automatic generation tools, together with some external solutions for compatibility issues such as memory wrapping and toolchain adaptation.

Chapter 7 Contains some concluding considerations on the thesis work and outlines some hypotheses about future applications of the material produced.

Contents

Acknowledgements I

Summary II

1 Sintesi 1
1.1 Introduzione ...... 1
1.2 L’architettura dei processori RISC ...... 2
1.3 Architettura del microprocessore ARM7 ...... 8
1.4 L’ambiente di sviluppo LISATek ...... 14
1.5 Il modello LISA dell’ARM7 ...... 19
1.6 Strumenti di sviluppo per ARM7 ...... 26
1.7 Conclusioni e sviluppi futuri ...... 29

2 The RISC microprocessor architecture 31
2.1 The ...... 31
2.2 ...... 33
2.3 The increased processor complexity ...... 34
2.4 The RISC architecture ...... 36
2.5 Pipelining and cache technology ...... 41
2.6 RISC vs CISC architecture ...... 45

3 The ARM microprocessor architecture 49
3.1 The ARM processor family ...... 50
3.2 The Thumb concept ...... 51
3.3 The programmer model ...... 53
3.3.1 Operating states and state switching ...... 53
3.3.2 Memory formats and data types ...... 53
3.3.3 Operating modes ...... 54
3.3.4 Processor resources ...... 55
3.3.5 The Processor Status Registers (PSRs) ...... 56
3.4 The exception handling ...... 57

3.4.1 Processor reset ...... 60
3.4.2 and fast interrupt requests ...... 60
3.4.3 Abort conditions ...... 61
3.4.4 Software and supervisor mode ...... 62
3.4.5 Undefined instruction ...... 62
3.4.6 Exception priorities ...... 63
3.5 ARM instruction set ...... 63
3.5.1 Conditional execution ...... 63
3.5.2 Branch and exchange (BX) ...... 64
3.5.3 Branch and branch with link (B-BL) ...... 66
3.5.4 Data processing instructions ...... 67
3.5.5 PSR transfer instructions ...... 71
3.5.6 Multiply and multiply and accumulate (MUL-MLA) ...... 73
3.5.7 Multiply and multiply and accumulate long (MULL-MLAL) ...... 75
3.5.8 Single data transfer operations (LDR-STR) ...... 77
3.5.9 Halfword and signed data transfer operations ...... 79
3.5.10 Block data transfer operations (LDM-STM) ...... 80
3.5.11 Single data swap (SWP) ...... 82
3.5.12 Software interrupt ...... 83
3.5.13 Coprocessor instructions ...... 83
3.5.14 Undefined instruction ...... 84
3.6 Thumb instruction set ...... 85
3.7 The memory interface ...... 86
3.8 The coprocessor interface ...... 89
3.9 The debugging system ...... 90

4 LISATek toolsuite 94
4.1 The ASIP design flow ...... 95
4.2 Architecture exploration ...... 97
4.3 The architecture description: the LISA language ...... 99
4.3.1 Memory model ...... 99
4.3.2 Resource model ...... 101
4.3.3 Instruction-set model ...... 102
4.3.4 Behavioral model ...... 103
4.3.5 Timing model ...... 104
4.3.6 Microarchitecture model ...... 105
4.4 The LISATek model development tools ...... 105
4.4.1 The Processor Designer ...... 105
4.4.2 The Instruction-set Designer ...... 106
4.4.3 The Syntax Debugger ...... 107
4.5 The architecture implementation ...... 108

4.6 The application software design ...... 110
4.6.1 Assembler and linker ...... 110
4.6.2 Disassembler ...... 110
4.6.3 Simulator: the “Processor Debugger” ...... 111
4.6.4 The C-Compiler ...... 113
4.7 The system integration and verification ...... 114

5 The LISARM model 116
5.1 The model structure ...... 116
5.1.1 Processor resources, interface, internal units ...... 117
5.1.2 The main LISA operation ...... 120
5.1.3 The coding tree and the decoding mechanism ...... 123
5.2 The processor datapath ...... 124
5.2.1 The barrel shifter unit ...... 124
5.2.2 The ...... 127
5.2.3 The 32x8 bit multiplier ...... 128
5.3 Other LISA operations ...... 130
5.4 The branch instructions ...... 131
5.5 Data processing instructions ...... 133
5.6 PSR transfer instructions ...... 136
5.7 Multiplication instructions ...... 138
5.8 Single data transfer instructions ...... 140
5.9 Block data transfer instructions ...... 145
5.10 The data swap instruction ...... 146
5.11 Software interrupt and undefined instructions ...... 147

6 LISARM support tools 149
6.1 The ARM LISA simulator ...... 149
6.2 The memory wrapping ...... 150
6.3 ARM commercial toolchains ...... 152
6.4 ARM model toolchain adaptation ...... 153
6.5 HDL generation and tests ...... 155

7 Conclusions and possible future applications 158
7.1 Conclusions ...... 158
7.2 Possible future applications ...... 159

A Model LISA operations summary 161

Bibliography 166

Chapter 1

Sintesi

1.1 Introduzione

The spread of electronic devices through many aspects of everyday life has deeply changed both the structure of industrial production and the technologies underlying the applications the market demands. Although System-on-Chip technologies make it possible to integrate even highly heterogeneous electronic components on a single die, the development times of hardwired solutions and the tight constraints imposed by return-on-investment considerations have driven the search for new approaches. The most widely used technique is probably hardware-software partitioning, which splits the complexity of the application to be developed across two levels: the design of a powerful, flexible programmable integrated system and the production of complex algorithms able to satisfy market requirements. The study of the application must be carried out in a coordinated way by the design teams involved, and this approach can guarantee shorter product development times. Other specifications, no less important, concern the low power consumption and small footprint of portable devices, which must offer long battery life in the face of the ever higher performance users demand. In this technological field, microprocessor-based programmable platforms are among the most widespread, and the ARM7TDMI is one of the most successful processors, thanks to its remarkable performance and its low-power characteristics. The use of embedded processors is not, however, the only solution currently in use; although the architectures available on the market provide many of the features manufacturers in this sector ask for, they are sometimes not sufficiently tailored to certain critical applications, or their structure is too complex, at the expense of area occupation and power consumption.

The alternative is represented by ASIPs, processors designed for specific applications that provide a dedicated instruction set, built around the needs of the software algorithms to be executed on them. Programmable architectures are designed with software development environments that allow the instruction set to be described in a flexible, reusable way, by writing code in an Architecture Description Language such as LISA 2.0. The LISATek Toolsuite and the Language for Instruction-Set Architecture make it possible to describe the operation of a processor in all its aspects, both behavioral and temporal, integrating current technologies such as pipelines and caches, and to obtain an HDL description of the hardware, a powerful simulator and the tools needed for the development of the dedicated software. The aim of this thesis work is to explore the possibilities offered by this software tool for the design of a programmable platform based on the ARM7 processor, whose extensive documentation, the fruit of the many applications based on it, makes it possible to analyse in depth the characteristics to be reproduced in the model.

1.2 L’architettura dei processori RISC

The Von Neumann architecture has been one of the most widely used architectures since the first pioneering designs of automatic computers. It uses a single structure to store both the data and the code to be executed and, despite its simplicity, it revolutionized the then dominant conception of automatic computation by introducing the instruction-set architecture (ISA). Before its formalization, computers operated on the basis of a predetermined program, executing a fixed sequence of operations on the input data. With the introduction of instruction-based architectures, programming models and a set of methods for accessing system resources, the operations to be performed during a computation came to be expressed as a sequence of operation codes written in machine language. The main advantage of this approach is, evidently, that the logic implementing the computing system no longer has to be redesigned whenever new computational needs arise. Although this architecture was of epochal importance in the computing world, it has a drawback known as the Von Neumann bottleneck: the mismatch between the instruction-execution speed (throughput) of the processor and the transfer speed of the memory, a difference due to the different technologies used to implement them. In some applications, when the processor executes a small number of instructions on a large amount of data, the memory-access bottleneck seriously reduces the processing speed of the system.

In contrast to the Von Neumann approach, the Harvard architecture uses a storage system for code that is independent of the one for data, including the control signals and communication buses required. In a Harvard architecture the characteristics of the two kinds of memory need not be identical: they may differ in implementation technology, data width, access time, addressing method and logic, and structure. The salient feature of this approach is that the CPU can read an instruction from code memory while a read or write access is being performed on a location of the data memory. This technique makes the Harvard structure on average faster than the previous one but, of course, the complexity of the system grows and with it the silicon area occupied. This architecture, too, suffers from the effects of the CPU speed relative to that of the memory: if a program must access memory on every clock cycle, the achievable instruction-execution rate is closer to the memory access speed than to the performance offered by the CPU itself.
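The difference between the two organizations can be made concrete with a toy cycle-count model. The following Python sketch is an illustration added here, not part of the thesis model; the function names and the one-port/one-access-per-instruction assumptions are ours.

```python
# Illustrative sketch (hypothetical names): a toy cycle-count model of the
# Von Neumann bottleneck versus a Harvard organization.

def cycles_von_neumann(n_instructions, data_accesses_per_instr):
    """One shared memory port: every instruction fetch and every data
    access must use it in turn, so they serialize."""
    return n_instructions * (1 + data_accesses_per_instr)

def cycles_harvard(n_instructions, data_accesses_per_instr):
    """Separate code and data memories: an instruction fetch can overlap
    with one data access in the same cycle."""
    return n_instructions * max(1, data_accesses_per_instr)

# A load-heavy loop: 1000 instructions, each touching data memory once.
print(cycles_von_neumann(1000, 1))  # 2000 cycles
print(cycles_harvard(1000, 1))      # 1000 cycles
```

With one data access per instruction the shared port doubles the cycle count, which is exactly the bottleneck described above.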

Figure 1.1. Examples of architectures: (a) Von Neumann, (b) Harvard

The problem of the speed difference between the processor and the system memory was one of the first factors motivating the considerable increase in the complexity of processor architectures, especially once CPU performance became an order of magnitude higher than that of the fastest storage systems. To reduce the effects of this imbalance, some high-level procedures that used to be executed through subroutines were integrated into the instruction sets, and microprogramming also became widespread, as integration technologies made it possible to surpass architectures whose instruction-execution logic was entirely hardwired. Among the other factors behind the ever greater complexity of commercial microprocessors is the need to maintain compatibility with earlier members of the same processor family, sometimes satisfied through emulation microprograms. The growing popularity of high-level programming languages also demanded higher CPU performance, in an attempt to close the semantic gap between the capabilities offered by existing instruction sets and those of ever more powerful languages. Again with the aim of giving programmers, especially those working directly in assembly languages, greater flexibility and power, orthogonal instruction sets were introduced, able to accept any memory addressing mode for every implemented instruction. The strong impulse that high-level programming languages gave to the development of application software led to the need for memories of ever greater capacity, with dramatic effects on the cost of whole systems. To reduce these effects, some manufacturers invested considerable effort in achieving a higher instruction-code density, although such instructions require more information to be encoded, and hence more bits, at the expense of the memory space required and of the complexity of the instruction decoding and execution logic. The use of complex instruction sets also increased the amount of information to be saved on every interrupt event, with a consequent growth in the number of registers (shadow registers) and in the microcode needed to manage them. This aspect is highly critical for applications such as Digital Signal Processors (DSPs) and microcontrollers for automation, where a guaranteed maximum latency is essential for system operation. In contrast to the growing complexity of the microprocessors of the time, at the beginning of the 1980s a new and simpler architectural approach was introduced, called Reduced Instruction Set Computer (RISC), which implemented a reduced instruction set compared with classical products.

Since these architectures use only two instructions to access data in memory, they are also called load-store architectures: all operations are performed between registers, never directly on operands held in memory. In antithesis to the above acronym, classical architectures came to be called Complex Instruction Set Computers (CISC). RISC machines are characterized by a uniform instruction encoding, with all instructions of the same length and operation codes always placed in the same position, to allow faster decoding. The addressing modes are also fewer and simpler, but other simple operations can be exploited to compute addresses. The set of native data types is reduced to a minimum, but their handling is made more versatile; in RISC machines there are no string-manipulation operations or other complex instructions, and individual instructions are often executed in a single clock cycle. RISC processors are generally provided with a set of general-purpose registers usable in any context, and the bits saved by the simpler encoding can be used to embed constants directly in the instruction encoding, reducing the number of memory accesses.
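The uniform encoding described above is what makes RISC decoding fast: every field lies at a fixed bit position, so decoding reduces to a few shifts and masks. The following Python sketch illustrates the idea on an invented 32-bit format; the field layout is hypothetical and is not the ARM encoding.

```python
# Illustrative sketch: decoding a fixed-length 32-bit instruction whose
# fields always sit at the same bit positions. The layout is invented.

def decode(word):
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26
        "rd":     (word >> 21) & 0x1F,   # bits 25..21, destination register
        "rn":     (word >> 16) & 0x1F,   # bits 20..16, source register
        "imm16":  word & 0xFFFF,         # bits 15..0, constant embedded
                                         # directly in the encoding
    }

# Example: opcode 0x05, rd = r3, rn = r1, immediate 42.
word = (0x05 << 26) | (3 << 21) | (1 << 16) | 42
print(decode(word))  # {'opcode': 5, 'rd': 3, 'rn': 1, 'imm16': 42}
```

Because no field position ever depends on the opcode, the decoder needs no sequential interpretation, in contrast with variable-length CISC encodings.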


Around the 1980s the theoretical limits of processor execution speed were thought to have been reached, and improvements in fabrication technology, with ever smaller transistors and interconnection lines, were seen as the only way to increase it. In that period the idea emerged of splitting the execution units into several stages and inserting registers between these stages, so that operations could be carried out on several instructions at the same time. In this way, while one instruction is being fetched from memory, another can be decoded and a third executed, in three different units all working together. Apart from the latency needed to fill the chain completely, this solution, called pipelining, speeds up instruction execution by fully exploiting the potential of all parts of the architecture. The structure sketched here is only an example, but the trend has been towards ever longer pipelines, in pursuit of ever better performance. This technology also has some drawbacks, due to the branch delay slot: the time cost of refilling the register chain in the presence of a conditional branch, which makes it impossible to know in advance which operation will be executed next (branch hazard or control hazard). To overcome these drawbacks, techniques such as branch target prediction and out-of-order execution are used, so as to predict the destination of a branch, or at least to execute other instructions while waiting to learn from which point program execution will continue, even at the cost of performing useless operations. Clearly this kind of approach can improve performance, but it is unusable in low-power applications.
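The fetch/decode/execute overlap just described can be visualized with a small simulation. This Python sketch is illustrative only (the stage names and the hazard-free, one-cycle-per-stage assumptions are ours): it shows which instruction occupies each stage on each cycle.

```python
# Illustrative sketch: cycle-by-cycle occupancy of a 3-stage pipeline.
# Instruction i enters FETCH at cycle i; after the initial fill latency,
# one instruction completes per cycle (no hazards modeled).

STAGES = ["FETCH", "DECODE", "EXECUTE"]

def pipeline_trace(n_instructions):
    """Return, for each cycle, a dict mapping busy stages to the index
    of the instruction they hold."""
    trace = []
    for cycle in range(n_instructions + len(STAGES) - 1):
        row = {}
        for s, stage in enumerate(STAGES):
            i = cycle - s                    # instruction index in this stage
            if 0 <= i < n_instructions:
                row[stage] = i
        trace.append(row)
    return trace

for cycle, row in enumerate(pipeline_trace(4)):
    print(cycle, row)
# From cycle 2 onwards all three stages are busy at once: 4 instructions
# finish in 6 cycles instead of the 12 a purely sequential unit would need.
```

The fill latency (the first two cycles with idle stages) is the behavior that a taken branch forces the pipeline to pay again, which is the branch-hazard cost discussed above.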
Another very widespread technology is that of superscalar processors, which integrate several execution units and try to execute consecutive instructions in parallel in them. This solution too has a drawback, due to data hazards: the need for data that is currently being processed in parallel inside another unit. Another problem of pipelined processors is structural hazards, events in which two or more instructions need to use the same hardware resources at the same time. Despite the issues discussed, all the above techniques are used successfully in many processors, and processor evolution rests precisely on explicit superscalar technologies: those implemented in hardware but already supported by the compilers, which select in advance the instructions that can be executed together and pack them into a multiple instruction. Another very common technique, which improves memory-access performance, is the use of caches: small amounts of very fast memory that map regions of system memory. Processors often have two caches at their disposal, one for instructions and one for data, in which they look for the desired resource: if it is present (cache hit) the access time is very short, whereas otherwise (cache miss) the data must first be loaded into the cache from main memory, with a consequent loss of performance and the usual memory access times.

Figure 1.2. Example of a Harvard architecture with a pipeline (DLX)

The RISC design approach makes it possible to exploit the above techniques to the full, allowing the occupied area to be filled with large amounts of cache and pipeline registers, but also with other general-purpose registers and integrated resources such as controllers and interfaces. In some circumstances RISC architectures offer significant advantages over CISC systems, and vice versa; in fact most current systems cannot be classified as purely RISC or purely CISC, and over time the two approaches have evolved towards each other, blurring the features that originally distinguished them. CISC processors are based on the principle of reducing the time spent locating the instructions to be executed in memory, by concentrating elementary operations inside complex instructions, even if these require several machine cycles to execute. A CISC processor has the following main characteristics:

• It uses microcode resident in internal ROMs to simplify the control unit, avoiding an excess of hardware implementation.

• By using internal ROMs for the instructions, which are faster to access than system memory, it improves execution performance.

• It allows richer instruction sets, with variable-length instructions of low or high complexity and, when the set is not orthogonal, with numerous addressing modes.

• It has a smaller number of internal registers, since instructions can operate directly on memory without the need to store their addresses.

Among the properties of RISC processors, the following stand out:

• It uses a simplified instruction set to improve performance with a simpler architecture.

• Most instructions are executed in a single machine cycle.

• It uses pipelining, pre-fetching and speculative-execution techniques.

• Memory interfacing uses only two instructions and therefore simpler mechanisms.

• It guarantees higher performance in floating-point computations.

• It has a reduced number of addressing modes.

• It has a large number of registers.

• It does not use microcode: instructions are executed by hardwired units.

• It entrusts the compiler with minimizing the complexity of software applications.

The small number of RISC instructions allows a faster and simpler decoding procedure than that of CISC machines; moreover, some statistics claim that compilers use only about 20% of the instructions of the latter. Having variable-length instructions that execute over several machine cycles is often a problem as well, whereas many RISC instructions are executed in a single machine cycle. Unfortunately RISC processors also have some defects: programmers must pay considerable attention to instruction scheduling, to prevent the processor from losing machine cycles while waiting for instructions to execute, and, again because of operation scheduling, debugging is more complex than on a CISC. From the point of view of area occupation, it is clear that a CISC can pose implementation problems compared with a simpler architecture, and development times can also favour RISC machines which, with shorter design cycles, can exploit more recent, more efficient and cheaper technological processes. A further enormous advantage of RISC processors is their low power consumption which, again a consequence of reduced internal complexity, has allowed these products to win a large share of the automotive and portable-equipment markets.


1.3 Architettura del microprocessore ARM7

The ARM7TDMI microprocessor is a member of the ARM processor family and uses a 32-bit RISC architecture whose internal structure follows the Von Neumann model. It integrates a three-stage pipeline, which keeps all of its internal units working continuously and optimally. Its instruction set and the associated decoding mechanism are very simple compared with microprogrammed systems such as CISC processors, and this lower complexity translates into remarkable instruction-execution and interrupt-response speed, making the processor suitable also for real-time applications such as DSP and automatic control. Thanks to its performance and its low-power characteristics, the ARM7TDMI processor has enjoyed wide diffusion in embedded applications and is integrated in many commercial portable devices, where these aspects are highly critical. Whole families of microcontrollers exploit the potential of the ARM7TDMI processor, whose low internal complexity translates into lower manufacturing costs than equivalent platforms. A distinctive feature of this version of the processor, although not implemented in the model built here, is the Thumb micro-architecture: a portion of the 32-bit architecture implementing an instruction set of only 16 bits whose behavior is equivalent to a subset of the full ARM instruction set. This solution achieves a high code density while retaining much of the processor's performance, and it has been a unique characteristic of the ARM processor family from this model onwards.
The decoding system of the Thumb instruction set performs an immediate dynamic translation into the full ARM instruction set, and both instruction sets can be used within the same source program: the processor executes them by changing its internal configuration. To switch from one internal state to the other, the two instruction sets provide a special instruction called branch and exchange; among other things, it also allows jumps to other parts of the running code. Besides the standard word (32 bits), the processor can operate on data of different sizes: single bytes (8 bits) or halfwords (16 bits) can be handled, and numeric data can be specified as signed (in two's-complement notation) or unsigned. Memory can be organized in either big-endian or little-endian format, and the processor is equipped with a memory interface able to manage both static and dynamic memories in a flexible way, even mixing resource types within the same system. Beyond the mode for ordinary execution of the user program, the processor can operate in different modes, so as to handle various events such as interrupts and error conditions that may occur during operation. To this end, seven different operating modes are provided, six of them privileged, including a supervisor mode dedicated to the operating system. The processor has no fewer than thirty-seven registers, thirty-one of which are general-purpose registers, while the others serve to store the internal state of the system (processor status registers). These registers cannot all be used at the same time: in the normal execution mode only sixteen of them (one of which is the program counter) are accessible, while a set of banked registers is visible only when the processor operates in the other modes discussed above. The ARM7 provides two different modes for servicing interrupts, one of them expressly created to allow rapid changes of the running process. To avoid having to save the contents of many registers to memory before switching to the Fast Interrupt Request (FIQ) mode, that mode can rely on a set of no fewer than seven dedicated registers. Besides the handling of interrupt and fast-interrupt events, there are modes for handling instruction and data aborts, raised on errors in memory operations, a mode for handling undefined instructions and coprocessor interfacing, and an interrupt mode that can be entered through a software instruction (SWI). For each of these there are reserved registers for saving the address held in the program counter, the address of the instruction to return to at the end of the exception-handling subroutine, and the processor status. To speed up operations that the processor would otherwise perform in sub-optimal time, an external computation unit can be used, connected to the same data bus the ARM uses to communicate with main memory. In this configuration the coprocessor can access memory indirectly, although it is the processor that must take care of addressing it.
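The register banking just described can be sketched as a mode-dependent mapping from visible register names to physical registers. The following Python fragment is an illustration added here, not the LISARM implementation: it is simplified to the FIQ bank only, and the physical-register naming is invented.

```python
# Illustrative sketch of ARM7 register banking (simplified): in FIQ mode
# seven dedicated physical registers shadow R8-R14, so a fast interrupt
# handler can run without first saving them to memory.

def visible_registers(mode):
    """Map each of the 16 visible register names to a physical register."""
    regs = {f"R{i}": f"R{i}" for i in range(16)}   # R15 is the program counter
    if mode == "FIQ":
        for i in range(8, 15):
            regs[f"R{i}"] = f"R{i}_fiq"            # banked FIQ copies of R8-R14
    return regs

print(visible_registers("USER")["R8"])  # R8
print(visible_registers("FIQ")["R8"])   # R8_fiq
print(visible_registers("FIQ")["R15"])  # R15: the PC is shared, not banked
```

Because the FIQ handler sees fresh physical registers under the same names, mode entry needs no memory traffic for R8-R14, which is what makes the interrupt "fast".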
If the processor does not recognize an instruction read from memory, it forwards it to the coprocessor network through three handshaking signals; if one of the coprocessors can recognize and execute it, the computation starts in parallel with other operations the processor performs while waiting for the coprocessor to deliver the desired results. Otherwise, if no external unit recognizes the instruction, it is undefined and the processor enters the undefined state, executing the corresponding exception handler. When the processor operates in the ARM state, it reads 32-bit instructions from memory and decodes them against its full instruction set. All instructions in the ARM set are executed conditionally: each of them contains a four-bit field expressing a condition on the values of the four flags held in the PSR, and only if that condition is met is the instruction actually executed. The PSR contains four flags in common processor use: the carry flag (C bit), the negative flag (N bit), the zero flag (Z bit) and a bit indicating the overflow condition (V bit). All these bits are set according to the last result produced by an operation executed in the ALU or, more generally, according to the outcome of the last instruction that requested their update. The instructions of the ARM set can be grouped as follows:


Figure 1.3. Diagram of the ARM7TDMI core

• Branch: perform conditional and unconditional jumps towards portions of code that implement subroutines; the BX instruction also allows switching the current instruction set.

• Data processing: perform the usual ALU operations on two operands, such as signed and unsigned additions and subtractions, bitwise boolean operations, masking and comparison operations.

• Multiply: compute the product of two 32-bit operands, honouring their sign, and deliver the result on 32 or 64 bits.

• Single data transfer: move values between memory and the internal registers, with optional sign extension for data types narrower than the 32-bit word.


• Block data transfer: support the management of stacks in memory, allowing a subset or the whole set of the processor's internal registers to be transferred.

• PSR transfer: provide facilities for modifying the status registers (PSR) and for transferring their contents back and forth; the ALU-related flags alone can also be modified.

• Coprocessor: support communication between the processor and the coprocessors attached to it, and let the coprocessors perform memory accesses.
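The conditional execution mentioned above can be made concrete with a small sketch in C: the four-bit condition field is checked against the N, Z, C and V flags before an instruction is allowed to execute. The condition encodings below follow the standard ARM condition codes; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* The four PSR condition flags described above. */
typedef struct { bool n, z, c, v; } psr_flags;

/* Evaluate the 4-bit condition field of an ARM instruction against the
 * PSR flags; the instruction executes only when this returns true. */
bool condition_passed(unsigned cond, psr_flags f)
{
    switch (cond & 0xF) {
    case 0x0: return f.z;                /* EQ: equal                  */
    case 0x1: return !f.z;               /* NE: not equal              */
    case 0x2: return f.c;                /* CS: carry set              */
    case 0x3: return !f.c;               /* CC: carry clear            */
    case 0x4: return f.n;                /* MI: negative               */
    case 0x5: return !f.n;               /* PL: positive or zero       */
    case 0x6: return f.v;                /* VS: overflow               */
    case 0x7: return !f.v;               /* VC: no overflow            */
    case 0x8: return f.c && !f.z;        /* HI: unsigned higher        */
    case 0x9: return !f.c || f.z;        /* LS: unsigned lower or same */
    case 0xA: return f.n == f.v;         /* GE: signed >=              */
    case 0xB: return f.n != f.v;         /* LT: signed <               */
    case 0xC: return !f.z && f.n == f.v; /* GT: signed >               */
    case 0xD: return f.z || f.n != f.v;  /* LE: signed <=              */
    case 0xE: return true;               /* AL: always                 */
    default:  return false;              /* NV: reserved, never        */
    }
}
```

In the model this check corresponds to the condition checker unit that gates the execute-stage operations.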

Besides the instructions listed above there are the aforementioned instruction for accessing software interrupts and an operation that exchanges a value held in memory with a register. The latter instruction (swap) is essential for managing the special protected variables called semaphores, needed to implement an operating system. It consists of a read from memory followed by a write, which are however consecutive and inseparable, to prevent other peripherals from accessing the memory and modifying its contents in between. To guarantee this, the memory management system is notified that a swap operation is in progress through a dedicated signal (LOCK). Branch instructions can express the destination address either absolutely, through a register that contains it, or relative to the value of the program counter (PC-relative) through an immediate offset value. The execution of a branch operation always implies flushing the pipeline registers that precede the execute stage, to avoid executing unwanted instructions before the jump to the subroutine actually takes place. Data processing instructions perform the operations listed in Table 1.1. The operations accept different syntaxes and numbers of operands depending on their type, and a 4-bit field in their encoding selects the desired operation. They take a register as first operand and, as second operand, either another register or an immediate value, which is stored in a peculiar way within the instruction encoding. The ARM assembler, in fact, accepts only those constants that can be obtained by rotating right an 8-bit immediate value by twice an amount stored on 4 bits.
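The rotated-immediate encoding just described can be sketched in C: the stored operand is an 8-bit value together with a 4-bit rotate field, and the assembler must reject any 32-bit constant that cannot be produced this way. Function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a data-processing immediate operand: an 8-bit value rotated
 * right by twice the 4-bit rotate field. */
uint32_t decode_imm(unsigned rot4, unsigned imm8)
{
    uint32_t v = imm8 & 0xFFu;
    unsigned r = (rot4 & 0xFu) * 2;
    return r ? (v >> r) | (v << (32 - r)) : v;
}

/* Check whether a 32-bit constant is representable this way, as the
 * assembler must do when it accepts or rejects an immediate. */
int is_encodable(uint32_t value)
{
    for (unsigned rot4 = 0; rot4 < 16; rot4++) {
        unsigned r = rot4 * 2;
        /* Undo the right rotation with a left rotation, then check
         * whether the result fits in 8 bits. */
        uint32_t un = r ? (value << r) | (value >> (32 - r)) : value;
        if (un <= 0xFFu)
            return 1;
    }
    return 0;
}
```

For instance, every power of two passes the check (any single set bit can be rotated into the low byte by an even amount), while a constant such as 0x101 spans nine bits and is rejected.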
In this way only certain 32-bit constants can be represented, among them all the powers of two. The conversion is carried out by the barrel shifter, which then feeds the operand value to the ALU for the execution of the designated instruction. The second operand can also be expressed as a barrel-shifter operation applied to a register value, and the number of bits by which to shift or rotate can be given either as an immediate value or through the contents of a register.

Table 1.1. Data processing operations

Mnemonic   Operation
AND        operand1 AND operand2
EOR        operand1 EOR operand2
SUB        operand1 - operand2
RSB        operand2 - operand1
ADD        operand1 + operand2
ADC        operand1 + operand2 + carry
SBC        operand1 - operand2 + carry - 1
RSC        operand2 - operand1 + carry - 1
TST        as AND, but result is not written
TEQ        as EOR, but result is not written
CMP        as SUB, but result is not written
CMN        as ADD, but result is not written
ORR        operand1 OR operand2
MOV        operand2 (operand1 is ignored)
BIC        operand1 AND NOT operand2 (bit clear)
MVN        NOT operand2 (operand1 is ignored)

The desired shift operation can be expressed through a mnemonic in the assembly syntax, and the permitted operations are:

• Logical or arithmetic shift left (LSL or ASL).

• Logical shift right (LSR).

• Arithmetic shift right (ASR).

• Rotate right (ROR).

When the shift amount is specified through a register, some special barrel-shifter operations can be carried out. They exploit redundant encodings, such as all those for a zero amount, and their selection depends on the value held in the 8 least significant bits of that register. Data processing instructions may or may not request the update of the related PSR flags, and whenever the destination register is the program counter (R15) a pipeline flush is performed, as if a branch operation had been executed. Multiply instructions exploit a fast 32x8-bit multiplier integrated in the architecture, which computes the product of two 32-bit operands, signed (in two's complement) or unsigned, delivering a 32- or 64-bit result within at most four machine cycles. The product is computed with the 8-bit Booth algorithm, and the partial sums are accumulated into one or two registers by the ALU. If the most significant bits of the multiplier are all ones or all zeros, the instruction takes fewer machine cycles to execute, and control is handed to the next instruction as soon as the result has been computed. Memory access operations transfer data of 32, 16 and 8 bits, with optional sign extension for data types narrower than the word. They accept the pre-indexing and post-indexing addressing modes, with an offset, either immediate or held in a register, that can be added to or subtracted from the base address contained in a register. The base address can be updated through the writeback operation, and the offset can be computed with the barrel-shifter operations already seen for data processing, except those that take the shift amount from a register. The operations that transfer several registers to memory (block data transfer) support the implementation of stacks and accept all the addressing modes just discussed, so that a stack can grow upwards or downwards in the address space. These operations take a list of the registers to be transferred, which can be expressed very flexibly in the assembly syntax, and they also allow writeback of the address into the register used as base. ARM instructions take different times to execute depending on their type and on the way their operands are expressed, while for block register transfer operations the number of machine cycles depends on the number of registers to be transferred.
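The four shift operations listed above can be modelled with a minimal C sketch. This is not the LISA description used in the model: the special encodings (zero amount, RRX, register-specified amounts) are left out, the amount is assumed to be in the range 1..31, and the function also returns the shifted-out bit, which the ARM uses as the shifter's carry-out.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { LSL, LSR, ASR, ROR } shift_op;

/* Barrel-shifter sketch for amounts 1..31; *carry_out receives the
 * last bit shifted out of the operand. */
uint32_t barrel_shift(shift_op op, uint32_t value, unsigned amount, int *carry_out)
{
    switch (op) {
    case LSL:
        *carry_out = (value >> (32 - amount)) & 1;
        return value << amount;
    case LSR:
        *carry_out = (value >> (amount - 1)) & 1;
        return value >> amount;
    case ASR:
        *carry_out = (value >> (amount - 1)) & 1;
        /* Cast to signed so the sign bit is replicated into the
         * vacated positions (arithmetic shift). */
        return (uint32_t)((int32_t)value >> amount);
    case ROR:
        *carry_out = (value >> (amount - 1)) & 1;
        return (value >> amount) | (value << (32 - amount));
    }
    return value;
}
```

Note that ROR has no direct C operator, which is why the model needs a dedicated functional description for it, as discussed later in the barrel-shifter section.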
The ARM processor features a debug interface based on the standard known as Boundary Scan (IEEE Std. 1149.1-1990), consisting of a chain of multi-function registers (scan cells) placed upstream of the core's inputs and downstream of its outputs. These registers allow logic levels to be forced on the inputs and the outputs to be sampled at programmed times, through the TAP controller and a set of externally accessible control signals, among them a serial input and a serial output. Another characteristic of the cells in the chain is that they can become transparent, allowing the system to operate normally as if the debug network did not exist. The debug network provides advanced features for monitoring and fixing errors during the development of applications, operating systems and integrated systems that include the ARM7 core. The hardware extensions make it possible to halt program execution upon the fetch of a specific instruction or of a particular data item in memory, but also asynchronously through a debug request signal. On entering debug mode the internal state of the core can be examined in depth through the JTAG interface, and normal program execution can then be resumed. In addition to the system just described, the ARM7TDMI processor is equipped with the EmbeddedICE module (or ICEBreaker), a further hardware extension for debugging consisting of a pair of real-time watch units that can access the various processor resources to check their correct operation. This module communicates with the outside through the same buses used by ordinary coprocessors, using an identifier (CP#) reserved for it.

1.4 The LISATek development environment

The spread of electronic devices into many aspects of everyday life has deeply changed many constraints of industrial production; in particular, the development time of a new product must be as short as possible, to guarantee the expected economic return. On the other hand, semiconductor technology has opened new horizons, and the demand for new and more powerful applications has led to ever greater complexity in integrated digital systems. System-on-Chip (SoC) technology has made it possible to produce systems composed of several intercommunicating cores on the same chip and, given that these techniques require intense development cycles, designer productivity has become a vitally important factor in obtaining successful products. For these reasons, the idea of implementing through powerful algorithms functions typically carried out by integrated systems and DSPs, reducing the intrinsic complexity of hardware development, has led to the transition from purely hardwired systems to the inclusion of programmable cores within SoCs. This strategy represents an innovative approach with respect to existing technologies, known as Application Specific Instruction-set Processor (ASIP) design. Among the software applications used for this purpose, the LISATek toolsuite introduces remarkable advantages in terms of time and resource savings, by automating a series of procedures carried out until now in an essentially manual fashion. The tools offered by LISATek, in fact, make it possible to obtain both a hardware implementation of the processor and the software development tools, such as the simulator and the complete toolchain with a compiler for the C language. The development flow of an ASIP comprises the following main phases:

• Architecture exploration.

• Architecture implementation.

• Toolchain creation and software application development.

• System integration and verification.


In the first phase the algorithms to be executed on the processor are analyzed, to establish which characteristics the architecture and its integrated execution units must have. In this phase it is necessary to have a software tool able to simulate the behaviour of the processor, together with applications that allow the right profile to be defined for both the software and the hardware needed to realize the desired application. LISATek provides no support tools for exploring the characteristics of a new architecture, but a good starting point is often a simple base architecture with a minimal instruction set, to be modified and improved step by step until the result approaches the desired one. Once the behaviour the architecture must exhibit is known, the implementation phase can begin, describing the functionality of the processor through hardware description languages such as Verilog or VHDL. In the classical approach this phase is often carried out manually, and a considerable problem is represented by the consistency checks between the behaviour of the simulator and that of the described hardware. The creation of the software development tools essentially consists in making available to programmers compilers for high-level languages, assemblers and linkers for generating programs executable on the implemented architecture: in some respects, tools analogous to those used in the initial exploration phase. In the system integration and co-simulation phase, instead, the system described in one of the HDL languages and the simulator are analyzed and tested together, to verify that their behaviour is identical in every case, i.e. that the model is consistent.
Because of the continuous refinements that may become necessary during each development phase, the simulator, the software production support tools and the HDL description must be revised repeatedly, in such a way as to respect the compatibility constraints between the various abstraction levels, with all the consequences in terms of time and effort spent. The LISATek toolsuite greatly reduces the amount of work needed to develop a new architecture, through an approach that allows rapid and efficient redesign and refinement, describing both the platform and the instruction set in a single solution. The LISA description language (Language for Instruction-Set Architectures) aims at the automatic generation of synthesizable HDL code and of all the software development tools. It allows a thorough description of the instruction set and of the functional and behavioural model of the processor, including all the sequential aspects of the logic that implements it. The collection of generated tools includes a C compiler, an assembler, a linker and a powerful simulator. As for the generation of the hardware description, LISATek can deliver both VHDL and Verilog code, as well as the configuration files for testing with the best-known HDL simulation and synthesis applications. The model of an architecture, described through the LISA language, is composed of the following parts:

• Memory model

• Resource model

• Instruction-set model

• Behavioural model

• Timing model

• Micro-architecture model

The memory model essentially describes the storage elements the processor is equipped with, i.e. general and pipeline registers, RAM, caches, internal signals, flags, buses and all the parameters related to these entities. The resource model describes the availability of resources with respect to the use the LISA operations make of the elements of the memory model. It is built by evaluating access times and modes against the scheduling of the LISA operations, and it is also used in the generation of the HDL description, to resolve the conflicts arising in the assignments of the various internal signals. Assembler, disassembler and instruction decoder can be generated through the instruction-set model; these characteristics of the model are described in two specific sections of the LISA operations, which bind the assembly syntax to the corresponding encoding. These sections define the various operation codes (opcodes), the operands, the notations for immediate values, and so on. The behavioural model describes the behaviour of the architecture through the microcode written inside the LISA operations; the code is written with statements in standard C, enriched with constructs typical of the LISA language. The timing model describes the behaviour of the architecture over time, especially as regards the operation of the pipeline and the scheduling of the LISA operations that execute the assembly instructions.
The LISA language provides specific functions that make the handling of pipeline events simple and effective. The micro-architecture model, finally, allows a hierarchical structure to be imposed on the HDL hardware description, through a user-defined grouping of the LISA operations that translates into the generation of separate files for the various parts of the architecture. To support the description phase in the LISA language, the toolsuite provides the following applications:


• Processor Designer

• Instruction-set Designer

• Syntax Debugger

The Processor Designer is essentially an editor for writing LISA code, although it also manages the various parts of the project, i.e. the files that compose it. Through its graphical interface it is possible to set various options for the generators of the software production tools and of the HDL hardware description. The Instruction-set Designer, instead, is a graphical tool dedicated to the description of the instruction set and represents an alternative to manually writing the code belonging to the relevant sections of the LISA files. It is very useful for a graphical view of the various fields of the instruction encoding and can help bring model inconsistencies to light. The Syntax Debugger, finally, allows small portions of assembly code to be executed to test the described instruction set, following step by step the decoding process carried out by the LISA instructions written in the various files. The hardware description of the architecture is generated by the Processor Generator, which analyzes the LISA files to define the objects to instantiate, such as processor resources, internal data and control signals, pipeline, memories, input and output ports. By analyzing the grouping of the operations within the individual units, it generates the instruction decoder and the combinational logic that implements the individual instruction execution stages, such as ALU, barrel shifter, multipliers, and so on. LISATek automatically generates the following tools for software application development:

• Assembler

• Linker

• Disassembler

• C Compiler

• Simulator

The assembler generated by LISATek processes the assembly code and turns it into object code to be passed to the linker. The tool is generated from the LISA description of the instruction set given in the various SYNTAX and CODING sections. Besides the operations specific to the modelled processor, the generated assembler accepts a series of pseudo-instructions, or directives, useful for controlling the assembly procedure and the initialization of the data and code the architecture will have to process. The tool also integrates some functionality typical of macro-assemblers. After the assembly phase, the object code contains a number of symbols, i.e. references to global and/or local routines stored non-sequentially in memory. To obtain a single executable file from the assembled files, these references must be resolved by tracking down the lines of code to be linked to the main part of the application. This task is carried out by the linker, which uses a specific configuration file in which the user must supply some information on the memory management of the processor. The disassembler does the inverse job of the assembler and the linker: it takes an executable file as input and produces an assembly file in which some differences with respect to the addresses and symbols of the source file can be observed, since the disassembled file contains absolute rather than relative references. To generate the disassembler, LISATek uses the same information used to create the assembler discussed above. The Processor Debugger is the foremost tool for inspecting the behaviour of the processor: an automatically generated C++ simulator which, through a GUI, allows various aspects of the internal state of the system to be monitored, such as registers, pipeline, internal signals, events and also the memories attached to it. The simulator can load the memory of the architecture with an object program selected by the user, displaying its assembly code, its disassembly and also the LISA microcode that describes the processor. It allows program execution to be monitored step by step, through debug commands analogous to those offered by the most common development environments.
The C compiler uses the CoSy Compiler Development System technology, which follows a modular approach based on a compilation engine that parses and semantically analyzes the input C files, optimizes the intermediate form of the compiled code and generates the executable code for the processor. The compiler accepts standard C code and automatically adapts the generated executable code to the characteristics and resources of the target architecture. For this reason, generating the C compiler requires an accurate definition of the processor resources, such as the usable registers, the data and stack layout specifications, the operation scheduling directives and other definitions and conventions of the programming language. This configuration phase can be performed with the Compiler Designer included in the toolsuite. The system integration and verification phase includes the important task of evaluating properties of the architecture such as execution timing, occupied silicon area and power consumption, to determine which parts of the application must be implemented in hardware and which in software (hardware-software partitioning). On the other hand, the quality of the obtained hardware description must be verified as well, so co-simulation techniques and interfaces prove widely useful for these tests too, allowing the hardware model and the software simulator to be integrated into a single test system, in which both models are stimulated with the same patterns or the same executable files. The LISATek toolsuite includes the System Integrator Platform, which provides verification and integration facilities for embedded processors, memories and various components in a single system, even a whole SoC. To allow the integration of both hardware and software components, the application accepts various languages, such as VHDL, Verilog and C/C++, and different descriptive formalisms, including those produced by other well-known tools.

1.5 The LISA model of the ARM7

The LISA model of the ARM7TDMI was built by writing the LISA-language files essentially by hand, without graphical support tools such as the Instruction-set Designer, although in the first phase of describing the assembly syntax and the instruction encoding this tool proved useful for an overall view of the individual fields of the binary encoding. The Syntax Debugger was also used to verify the correct decoding of the instructions, but the Processor Designer and the Processor Debugger were the main tools employed in building the model. The LISARM model is structured by collecting the LISA operations that describe the behaviour of the processor into several files, according to the specific function they implement. A reference file shared by all the operations contains a series of definitions of data, constants, masks for arithmetic and logic operations, exception handler addresses and standard C enumerated types, to make the use of characteristic values of the modelled processor clear and simple. The functional heart of the model is the main.lisa file, which contains the description of all the processor resources and the basic operations executed at every machine cycle. The file contains the definition of the main memory, the general purpose registers, the status registers (including their banked versions), the pipeline stages and their registers, the input and output ports, and the various internal signals and variables. For each element defined here the most suitable data type was chosen, making extensive use of the CXBit data types, i.e. data types predefined by the LISA language that support extraction and assignment of single bits or bit regions.
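Bit-region extraction and assignment of the kind CXBit provides can be mimicked in standard C with shifts and masks. The helper names below are hypothetical (they are not the LISA API), and the sketch assumes fields narrower than the full 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Read the bit region word[hi:lo], with hi > lo and hi - lo < 31. */
static inline uint32_t get_bits(uint32_t word, unsigned hi, unsigned lo)
{
    return (word >> lo) & ((1u << (hi - lo + 1)) - 1);
}

/* Return word with the region [hi:lo] replaced by v (same limits). */
static inline uint32_t set_bits(uint32_t word, unsigned hi, unsigned lo, uint32_t v)
{
    uint32_t mask = ((1u << (hi - lo + 1)) - 1) << lo;
    return (word & ~mask) | ((v << lo) & mask);
}
```

In the LISA model this style of access is what makes instruction fields and flag bits easy to manipulate and to convert towards other data types.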
CXBit data are particularly suitable for conversions towards other data types and make the handling of the values they contain flexible. The pipeline structure was defined by naming its stages: PF (prefetch), FE (fetch), DC (decode), EX (execute) and ED (execute-dummy). The first stage is implicit, given that the ARM performs the prefetch operation, while the last stage only exists to enable the use of polling operations. Polling operations are LISA operations able to reactivate themselves in the next machine cycle and are vital for describing instructions that execute over several successive clock cycles. For an operation to be able to reactivate itself, however, the pipeline stage following the one the operation belongs to must be stalled: hence the need to introduce this dummy stage, which is in any case not generated in hardware. The pipeline registers are defined in a single block for all stages, although the synthesis of the generated HDL automatically removes all the elements whose outputs are not connected to anything. They are essentially used to carry the binary encoding of the instruction read from memory to the decode stage, and to pass the execute stage a set of settings that allow the datapath to carry out the operations requested by the executing instruction. A series of signals for controlling the pipeline and the decode and execution units are then added to the model automatically, thanks to the capabilities of the LISATek tools. Inside the main file, the individual units are then defined for grouping the operations to be mapped into hardware, so as to maintain a certain hierarchy in the system. Each group of instructions created here corresponds to a separate HDL file, containing a module (Verilog) or an entity (VHDL) able to execute all the LISA operations defined for it. In the generated model, besides the fetch and prefetch units, the instruction decoder and the condition checker are defined, together with three separate units for the ALU, the barrel shifter and the multiplier.
All the instruction-execution operations are grouped according to the functions they carry out, so branch, data processing, memory access and multiply are the units in charge of implementing those functions in hardware. To allow correct initialization of the processor, at start-up and upon an event on the dedicated asynchronous reset input, a reset operation is provided, which clears all the registers and sets the status register as prescribed by the ARM7 specifications. The LISA operation named main executes a series of recurring operations at every machine cycle and handles all the events related to the pipeline, such as the scheduling of the prefetch, fetch and decode operations. It controls the execution of the scheduled LISA operations and the shifting of the contents of the pipeline registers, but only in the absence of events that require a stall. This operation also checks the state of the PSR flags for the conditional execution of operations, and directly drives some output signals for interfacing with the pipeline. A fundamental duty of this LISA operation is exception handling, performed by monitoring the ABORT, IRQ and FIQ inputs and all the other events that may require a change of operating mode, compatibly with the state of the pipeline and the possible execution of an instruction spanning several machine cycles. The prefetch operation, instead, takes care of loading instructions from memory and of handling jumps (branches), through functions the language provides both for memory access and for stalling the pipeline. The decode operation is the starting point of the instruction decoding process: it inspects the binary encoding of the instruction and triggers the traversal of a tree of possibilities which, step by step, activate a series of other LISA operations, whose code (BEHAVIOR section) performs the scheduling and setup for the execution of the desired instruction. A portion of the decoding tree of the model is shown in Figure 1.4.

[Figure 1.4 depicts the decoding tree: CODING ROOT branches into multiply_grp (MUL, MLA, MULL, MLAL), other_grp (SWI, SWP), PSR_access_grp (MRS, MSR), branch_grp (B, BL, BX), data_proc_grp (in turn split into cmp_grp: CMP, CMN, TST, TEQ; mov_grp: MOV, MVN; arith_logic_grp: ADD, ADC, SUB, SBC, AND, ORR, EOR, BIC) and mem_access_grp (split into std_data_grp: LDR, STR; su_data_grp: LDRH, STRH, LDRSH, LDRSB; block_data_grp: LDM, STM).]

Figure 1.4. Instruction decoding tree

The same tree is also traversed for the generation of the assembler, disassembler and C compiler tools and, to better distinguish the operations in charge of decoding the instructions from those describing their behaviour within the architecture, the former are marked with the dc suffix. They assign to the pipeline registers a series of pieces of information, such as the indices of the registers given in the instruction syntax, some flags and other values allowing the correct configuration of the ALU, the barrel shifter and the multiplier. These operations also activate the execution operations (marked with the ex suffix) which, compatibly with the pipeline events and with the condition possibly expressed in the syntax, are executed in the next machine cycle. For data processing the model uses a structure similar to that of the original processor, and the description style aims at obtaining the same stand-alone hardware components, namely the ALU, the barrel shifter and the 32x8-bit multiplier. They have their own control signals, and the network connecting them to the data buses mirrors that of the ARM7. The overall structure is shown in Figure 1.5.

Figura 1.5. Struttura dell'unità di esecuzione — Structure of the execution unit: the register file feeds the A_bus and B_bus, the 32x8 multiplier and the barrel shifter sit on the operand path, and the carry outputs (bs_carry_out and carry_out) drive the C_flag.

The barrel shifter is described by a pair of main LISA operations. The first operation (barrel shifter op dc) decodes the operations requested by the assembly syntax — arithmetic or logical shift right, shift left, rotate right — together with the shift amount in bits. The second operation (barrel shifter op ex) performs the requested operations by means of C-language statements. To this end some expedients had to be used to adapt the operations implemented by the ARM to those that standard C performs; moreover, since a "for" loop in LISA cannot be repeated a variable number of times, because it could not be mapped onto hardware, a redundant functional description of the ROR operation turns out to be necessary. The barrel shifter described also performs the operations with special encodings and those whose amount is stored in a register; dedicated flags have been provided inside the pipeline to configure their execution.

The arithmetic-logic unit (ALU) operates on two 32-bit operands, one coming from the register file and the other from the barrel shifter, and essentially performs the same operations required by the data-processing operations reported in Table 1.1. The operation is selected through a specific pipeline register and the result can be written back into the register file or into the memory addressing register, if the computation of an address for a data-transfer instruction is in progress. A series of LISA operations takes care of properly updating the PSR flags according to the type of operation performed and the result obtained, provided that their update is requested by the assembly instruction. Both the ALU and the barrel shifter use an internal 33-bit format that allows handling shift and rotate operations as well as additions and subtractions without losing important data such as carries and possible overflow conditions, which could not be obtained otherwise.
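Since a LISA "for" loop cannot iterate a variable number of times, the rotate-right behavior has to be spelled out with plain shift operators. A minimal sketch of a 32-bit ROR, given here in Python for brevity (the model itself describes this in the C-like BEHAVIOR sections of the LISA code):

```python
MASK32 = 0xFFFFFFFF

def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bit positions (0-31)."""
    amount &= 31                      # ARM rotate amounts wrap modulo 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32

# Rotating the LSB right by one brings it around to the MSB.
assert ror32(0x00000001, 1) == 0x80000000
```

The last bit rotated out is what the model forwards as bs_carry_out towards the C flag; that bookkeeping is omitted from this sketch.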
To multiply two 32-bit operands the processor exploits a method analogous to that of the ARM, delegating to the fast 32x8-bit multiplier the computation of the product of 8-bit blocks of the multiplier operand by the whole multiplicand. The LISA language does not allow describing in detail the work the multiplier has to carry out, so the C statements employed only express the product of two operands of the sizes reported above, producing the result on a 32-bit bus. How the hardware synthesizer might implement the multiplier is not known a priori, and these aspects belong to a development phase subsequent to the construction of the functional model. The multiplier output is connected to the barrel shifter which, at every cycle, introduces 8-bit shifts to let the ALU compute the partial sums. After four cycles these operations yield the result of the multiplication, either signed or unsigned.

To ease the writing and the readability of the code, a series of simple operations has been used that convert immediate values of different bit widths, register indices for addressing the register file, values contained in registers and also the register list for the stack-management operations. All these operations facilitate the reuse of LISA code and therefore of hardware resources. They are grouped in a single unit so that their functions can be identified in the generated HDL. Among them an important operation is also defined, which takes care of setting up the datapath for the conversion of the operands expressed in the immed8 r format, as discussed in section 1.3, together with the write result function, which writes the result supplied by the ALU into the register file or into the memory access register, depending on the type of instruction being executed.

The rest of the operations essentially describe the behavior of the instructions, complying with the ARM7TDMI data sheet specifications [16] as far as the accepted syntax, the encoding and the execution timing are concerned. As mentioned, for each group of operations there is a first operation that suitably interprets the binary encoding, saves the information for the configuration of the execution units into the pipeline and schedules the operations for the following clock cycle.

The branch operations (BX, B, BL) use the datapath in a different way, to compute the destination address of the jump and also, if required, to save the subroutine return address. They flush the pipeline to avoid undesired behaviors of the system and, if linking is requested, they exploit a polling operation to correct the saved memory address and write it back into the dedicated register.

The data-processing operations each perform a specific ALU operation, configuring it appropriately through a dc operation, activated to perform the decoding, and converting the second operand as required. If the second operand requires a barrel-shifter operation whose amount is expressed in a register, a polling operation is exploited to allow access to that register in a first machine cycle and the ALU operation in the following one.
Since the program counter (R15) is also accepted as destination register, if writeback targets it the pipeline flush is scheduled by the operations of the group.

As mentioned, the processor is able to perform multiplications between signed or unsigned operands, writing the result into one or two designated registers. The operation is carried out exploiting the 32x8 multiplier over a certain number of machine cycles, depending on how many groups of 8 bits all equal to zero or to one are present in the most significant part of the multiplier operand. The reason why cycles can be saved is evident: if there were one, two or three identical 8-bit groups, at most a simple addition would be needed to complete the computation. The LISA operations describing these instructions simply mask the most significant groups of bits, compare them to determine whether they all hold the same value and select the operation to be performed; they exploit the usual polling functionalities and stall the pipeline to extend the number of execution cycles of the operation.

The operations accessing and modifying the PSRs act on the current or on the saved processor status, according to the current execution mode of the processor, and allow changing even only part of their content through particularly flexible LISA functions. The dedicated LISA operations suitably convert the immediate values expressed in the encoding or in the source registers and perform the masking needed for the correct assignment of the single bits of the status registers.

The memory-access instructions for single data transfers use various LISA operations for decoding and for setting up the execution units, which in the first machine cycle must compute the address for the access to the external resource. For this purpose operations have been defined for decoding the PC-relative, pre-indexed and post-indexed addressing modes, as required by the ARM assembly. To convert the immediate offset values, expressed in the immed8 r format, the same operations used by the data-processing operations are employed, exploiting operation reuse. These operations necessarily span several consecutive machine cycles and exploit the polling mechanisms, scheduling a number of stall cycles suited to the type of operation to be performed, namely one cycle for writing and two for reading a datum. If writeback of the computed address onto the base register is requested, it is performed during the second cycle both for writes and for reads, as required by the ARM specifications. For write operations the necessary signals are set up in the first execution cycle and the memory access takes place in the machine cycle following the one in which the address is computed. To load a register, the address of the location to be accessed is issued in the second cycle, and the datum supplied by the memory is sampled and written into the register file in the third machine cycle.
If the destination register of a load (LDR) instruction is the program counter, a pipeline flush is performed as in the case of the branch operations. The operations of this group directly manage the memory-interface signals and, to request memory access, they use a LISA function analogous to the one used for instruction prefetch. For the instructions transferring byte- or half-word-sized data, a signal is added to indicate which byte or half-word has to be transferred, so as to adapt the model interface to the ARM specifications through an external unit described in the following section (1.6).

The stack-management instructions perform many of the operations already discussed for the memory-access instructions, but use a special 16-bit internal global register holding one flag for each general-purpose register of the processor. The decoding LISA operation assigns to this register the value of the register list contained in the encoding and, in the execution phase, as many stall cycles are performed as there are bits set to one, carrying out the transfers step by step. After a first cycle in which the address for the memory access is loaded from the base register, a single datum is transferred, the next register to be transferred is looked up in the flag register, and a further stall cycle is scheduled if the list is not yet empty.

The data swap instruction is described through a polling operation which essentially invokes the operations performed by a memory write followed by a read; the only difference is that the operation drives the LOCK signal to warn the memory management system that access to the resource must be denied to other peripherals until the processor has finished executing the instruction.

The undefined instruction triggers the handshaking procedure with the attached coprocessors, and its LISA operations directly control the dedicated signals, to establish whether to change the processor operating mode, save the current fetch address and store the address of the related handler into the program counter. Also in this case the operation is executed over several cycles, since it has to wait for a possible answer from a coprocessor recognizing an instruction destined to it; only if no coprocessor recognizes the instruction is the undefined trap activated.

The software interrupt instruction, instead, directly activates the supervisor mode and, like the previous one, saves the program counter, immediately assigning it the address of the exception handler.

Whenever an instruction involves a pipeline flush, two machine cycles elapse in which the processor executes no operation, to allow the automatic refilling of the registers and the arrival of a new instruction first into the decode and then into the execute stage.
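The early-termination test on the multiplier operand described above amounts to checking whether the bytes not yet consumed are all zeros or all ones. A Python sketch of that masking logic (the function name and the exact step counts are illustrative assumptions, not taken from the LISA model):

```python
def mul_steps(multiplier):
    """Return the number of 32x8 multiply steps needed for `multiplier`,
    terminating early when the remaining high-order bits are all zeros
    or all ones (a sketch of the masking test, not the actual LISA code)."""
    m = multiplier & 0xFFFFFFFF
    for steps, shift in ((1, 8), (2, 16), (3, 24)):
        top = m >> shift                          # bits above the bytes consumed so far
        if top == 0 or top == (1 << (32 - shift)) - 1:
            return steps                          # remaining bytes are all 0s or all 1s
    return 4

assert mul_steps(0x000000FF) == 1                 # high 24 bits all zero
assert mul_steps(0xFFFFFFFF) == 1                 # high 24 bits all one
```

The all-ones case covers small negative multipliers in two's complement, which is why the signed and unsigned variants can share the same shortcut.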

1.6 Development tools for the ARM7

The construction of the ARM7TDMI model went through two main phases: the creation of an instruction-accurate model and its extension to obtain a second, cycle-accurate model. In the first phase essentially the properties of the syntax and of the encoding of the instructions were described, as well as the operations needed to control and execute the computations, with no reference to the execution timing in the pipeline or over several machine cycles. Only in the second phase was the written code distributed among the various pipeline stages, describing a processor behavior compliant with the execution timing stated by the ARM specifications. In both phases the simulator (Processor Debugger) proved to be of vital importance in order to check the consistency and the correct operation of every single part of the processor. Thanks to the automatically generated assembler, linker and disassembler, and by writing suitable source files, it was possible to analyze in depth the resources and functionalities of the model, from the first steps up to the complete processor.


Because of a limitation of the LISA language and of the LISATek toolsuite, the memory interface does not correspond exactly to the ARM specifications, especially as far as the addressing of single bytes is concerned. Although the memory model has been described so as to load 32-bit data and instructions organized in 8-bit sub-blocks, for the processor to fetch instructions correctly the increment imposed on the program counter can only be one unit at a time, and this does not allow accessing 8- and 16-bit data. To overcome this problem an external wrapper is used, which interfaces correctly with the signals of the LISA model; in particular, a signal has been added for selecting the byte or half-word to be transferred (BS, byte select), which simply propagates to the outside the two least significant bits of the internal program counter.

Figura 1.6. Schema del wrapper per la memoria — Scheme of the memory wrapper: the Memory Wrapper sits between the processor signals (nMREQ, MCLK, nRW, SEQ, MAS, BS and the address bus) and the RAM data bus.

Exploiting this datum, the wrapper takes the address of the memory location to be accessed from the address bus and shifts it left by 2 bits to insert the bits of the BS signal; the result is an address that allows the memory management unit to point to the single byte. In order for the datum to be correctly replicated on the other slots of equal size during byte and half-word writes, as required by the ARM7, the wrapper connects the single lines of the data bus selectively, according to the values of the BS and MAS signals. In this way the instructions accessing byte- or half-word-sized data can also operate correctly, complying with the specifications of the ARM memory interface. The memory wrapper can also provide support for the big-endian memory organization, otherwise not handled by the LISA model, by suitably crossing the single bytes read from memory or placed on the processor data bus.

The diffusion of the ARM7TDMI processor in the market of portable devices and, above all, of microcontrollers has made available various software development tools, such as C/C++ compilers and assemblers, also from vendors other than ARM Ltd. Given the compatibility of the model built with the original core, these tools can be used to generate the executable files, thanks to the use of the same instruction encoding. Assembly files written for the ARM, instead, must be modified to be usable with the generated processor as well, because of some differences related to particular methods used by the ARM assembler to encode 32-bit constants and the lists of registers to be transferred by the stack-management instructions. To overcome these problems a tool has been written in C which analyzes the assembly file generated for the ARM, tracks down the instructions that make use of immediate values and the block data transfer instructions, and substitutes LISARM's own syntax. The pre-assembly tool reads from the source file the value of the constant to be converted and loads it internally into a 32-bit data type, on which it performs a series of maskings of groups of 8 contiguous bits to understand whether it can be expressed as an 8-bit immediate value rotated right by a certain amount. If this is possible, the number of bits by which the value has to be rotated is deduced from the position of the mask, and the two data are substituted into the original assembly file as two immediate values, one expressible on eight bits and the other on four. Considering that operation pairs such as ADD and SUB, ADC and SBC, AND and BIC, MOV and MVN, CMP and CMN operate on one's- or two's-complemented data, if the opposite of the given constant can be expressed in the immed8 r format the conversion is performed and the operations are swapped, as the original ARM assembler does [19].
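The immediate-conversion check performed by the pre-assembler can be sketched as follows. The real tool is written in C and the function name here is hypothetical; the fragment simply tries every even rotation allowed by the 4-bit rotate field of the ARM data-processing encoding:

```python
def encode_immed8_r(value):
    """Try to express a 32-bit constant as an 8-bit immediate rotated
    right by an even amount, as the ARM data-processing encoding requires.
    Returns (imm8, rot) such that value == imm8 ror (2*rot), or None."""
    value &= 0xFFFFFFFF
    for rot in range(16):                     # the rotate field is 4 bits wide
        k = 2 * rot
        # Rotating the constant LEFT by k undoes a right rotation by k.
        imm = ((value << k) | (value >> (32 - k))) & 0xFFFFFFFF
        if imm < 256:
            return imm, rot
    return None

assert encode_immed8_r(0xFF000000) == (0xFF, 4)   # 0xFF rotated right by 8
```

A constant such as 0x101 admits no such encoding and makes the function return None, mirroring the error the tool reports in that case.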
If the conversion cannot be performed, the tool returns an error. The LDM and STM instructions accept a list of registers to be transferred, expressed either explicitly, register by register, or through groups of registers included between two register specifiers separated by a dash "-". Since such a functionality cannot be implemented directly in the assembler generated by LISATek, the tool transforms a list containing register groups into an explicit list, with the names of the involved registers separated by commas.

The VHDL generated by the HDL Generator has been partially verified using the ModelSim simulator, for which the tool itself generates the configuration files and the file dump of the memory containing the test program to be loaded. Some preliminary tests made it possible to discover malfunctions of the described model and to perform the due revisions. The HDL generation tool, in turn, through numerous compilations of the LISA description, allowed the description style to be adapted to a view closer to that of the hardware architecture, thanks to the error and warning messages it provides; such inconsistencies stem from the different abstraction levels that coexist within the model described in the LISA language.
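The register-list rewriting mentioned above — turning a range such as "R0-R3" into the explicit "R0,R1,R2,R3" form — can be sketched like this (again in Python for brevity, while the actual pre-assembler is written in C; the helper name is ours):

```python
import re

def expand_reglist(reglist):
    """Expand register ranges like "R0-R3" into the explicit comma-separated
    form accepted by the generated assembler (illustrative sketch of the
    C pre-assembler pass)."""
    out = []
    for item in reglist.split(","):
        m = re.fullmatch(r"R(\d+)-R(\d+)", item.strip(), re.IGNORECASE)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            out.extend(f"R{i}" for i in range(lo, hi + 1))
        else:
            out.append(item.strip())       # already-explicit entries pass through
    return ",".join(out)

assert expand_reglist("R0-R3,LR") == "R0,R1,R2,R3,LR"
```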


Figura 1.7. Diagramma della toolchain completa di LISARM — Diagram of the complete LISARM toolchain: C files are compiled (C compiler plus C libraries) or assembled (ARM assembler) into assembly files, which pass through the LISARM pre-assembler and assembler to produce the binary file; the LISARM disassembler and post-disassembler produce the disassembly file.

1.7 Conclusions and future developments

The LISATek toolsuite and the LISA language demonstrated their great potential throughout all the development phases of the model, above all by making it possible to concentrate on different sets of aspects, first in the description of the instruction-accurate model and then in that of the cycle-accurate one. Remarkable are the mechanisms for describing the instruction set, which allowed the complexity of the problem to be partitioned and reduced through a number of sub-phases, also from the point of view of timing and pipelined execution. Although one's description style has to be adapted to aspects belonging to the hardware level and to HDL description languages, the development tools made it possible to obtain a hierarchy of the generated hardware that allows subsequent optimizations by the synthesizer, replacing parts of the datapath with efficient library architectures.

Keeping the structure of the model built, an architecture suited to other specific applications can also be obtained by simply modifying the already existing instruction set. Thanks to the flexibility of the LISA description and to the opportunities offered by LISATek, the synthesizable HDL description, an efficient simulator and the various software development tools for a new architecture can be produced with a considerable saving of time and effort. In the model described, the interfacing with coprocessors has been ignored, since the intention was not so much to create an ARM7 clone as to obtain an extensible processor, able to execute application-specific instructions internally through the incorporation of new functionalities. To improve code density, instead, the Thumb instruction set could also be implemented, exploiting both the capabilities the LISA language already offers and the idea used by the ARM7 processor of performing an on-the-fly decoding of the 16-bit instructions into instructions belonging to the complete set. Some limits of the current model, such as memory interfacing and addressing or the flexible management of communication buses, could be overcome by future versions of the toolsuite, making it fully compatible with the ARM specifications. Another worthwhile idea could be to transform the current Von Neumann architecture into a Harvard one, obtaining better execution performance without re-engineering the whole internal structure: with the LISA language, in fact, a pipeline stage can easily be added and the memory-access operations can be moved into it with great simplicity. An intensive test phase must still be carried out on the model produced, to guarantee the operation of every part of it under any operating condition; to this end the LISATek co-simulation tool allows the simulator and the HDL description to be verified at the same time, also reusing the patterns produced during the development of the model.

Chapter 2

The RISC microprocessor architecture

This chapter introduces some concepts about computer architecture and reports some historical outlines of the evolution of computers. Starting from fundamental approaches like the Von Neumann and Harvard architectures, the growing complexity of processors is examined, underlining the reasons for some manufacturing trends and marketing strategies, up to the revolution introduced by RISC architectures. Particular attention is devoted to the latter design approach, in order to understand some of the reasons that have made this architecture the leader in the embedded core and mobile device market. Moreover, some important improvements produced by the introduction of pipelines, cache technologies and parallel execution are discussed. At the end of the chapter some of the most recent projects are presented and a balance of the various modern design approaches is drawn.

2.1 The Von Neumann architecture

Since the first steps in computer modeling, the simplest structure used was the Von Neumann architecture [1], the so-called stored-program architecture; it is a very simple design that uses a unique storage structure to hold both data and code. The Von Neumann architecture revolutionized the computer concept by introducing the idea of the instruction set architecture (ISA). Before this formalization, the available computers worked with a fixed program, executing a sequence of unchangeable operations on the data, although these operations were not specified by opcodes1. With the introduction of instruction set architectures and of the processor programming model, opcodes, native data types, processor resource references and addressing

1an opcode is the portion of a machine language instruction that specifies the operation to be performed; the term stands for "operation code".

modes were formalized via instructions written in machine language. The sequence of instructions describes step by step the operations the processor has to perform to produce the desired computation; these instructions represent the program to be executed. The Von Neumann architecture is more flexible than its ancestors: the program is stored in a memory and can be modified to meet the user's needs without re-designing or re-structuring the processor architecture. Since instructions can be treated as data, the Von Neumann architecture can also modify the program itself. This characteristic was useful for early platforms, which did not support memory addressing via index registers or indirect addressing techniques. Self-modifying programs are deprecated today, because such applications are very difficult to debug and have low efficiency on pipelined and cached processors (the Lisp2 HLL represents an important exception, anyway). Self-modifying programs, moreover, can have harmful effects on the whole system, because in case of errors the program can damage itself, other programs stored in memory and even some operating system procedures. Another risk is represented by malware programs, which try to break some software structures in order to modify other programs and data, to compromise files stored on the system or to crash it. To avoid these effects, several memory protection techniques are implemented, so that access to the memory locations assigned to programs or operating system procedures is strictly controlled and unauthorized attempts to modify data or code are blocked. Another drawback of this architecture is the Von Neumann bottleneck, a term used to refer to the difference between the throughput of the CPU and the transfer rate of the memory system, due to the different technologies used for their implementation.
The separation between the CPU and the memory system is implicit in the Von Neumann architecture, and the progress of integration technologies allows CPUs to reach very high computation speed; as a consequence, data loads and stores from and to memory could in principle be performed at an equally high transfer rate. On the other hand, since the memory system has to store a huge amount of data and code, increasing its timing performance would increase the cost of the whole system. In certain applications, when the processor executes a limited number of instructions on a relevant amount of data, the memory access bottleneck introduces a serious reduction in elaboration speed. The increase of CPU speed and the requirement of big quantities of memory for programs and data have historically made the Von Neumann bottleneck a substantial problem. To alleviate it, caching techniques are used on most advanced architectures, as described in the following paragraphs.

2LISP stands for LISt Processor; it is a High Level Language often used in artificial intelligence projects.


Figure 2.1. Von Neumann architecture model (main memory connected to the control unit and to the ALU with its accumulator, plus input and output blocks)

2.2 Harvard architecture

In contrast with the Von Neumann architecture, the Harvard architecture uses separate storage and signal pathways for data and code. This type of architecture was introduced with the Harvard Mark I relay-based computer, which stored the instructions to be executed on punched tape and the data in relay latches. In a Harvard architecture there is no need for data and code to share the same memory, so the two memories can differ in word width, timing, implementation technology, addressing logic and structure. This peculiarity of the Harvard architecture can fit the specifications of systems in which code and data show noteworthy differences in word width, or in which the sizes of the two memories influence the addressing modes. Moreover, while the data memory is usually implemented with a random access memory, the code memory can be read-only. In a Harvard architecture the CPU can read an instruction from the code memory while reading or writing the data memory, and can fetch the next instruction while another operation completes. This behavior makes the Harvard architecture faster than the Von Neumann structure but, obviously, increases the cost in terms of area and architecture complexity. This architecture too suffers from the gap between CPU speed and main memory timings: if a program needs to access the memory at every clock cycle, the achievable throughput is bound by the memory speed, so the CPU performance cannot be fully exploited. The use of caching techniques improves the overall performance in this case as well, as reported in section 2.5.


Figure 2.2. Harvard architecture model

2.3 The increased processor complexity

Microprocessor architecture complexity increased more and more over many years; this allowed manufacturers to conquer the market by offering high-performance machines while maintaining, as much as possible, a cost-effective tradeoff.

One of the first reasons for the increased architecture complexity was the noteworthy speed difference between CPUs and the available memories [2]. When the throughput of CPUs became ten times higher than that of the main memory, the memory access issue was tackled by increasing the CPU capabilities. Many "higher-level" operations, such as floating-point subroutines, were included in the instruction set of many commercial processors, so some primitives previously implemented as subroutines became instructions, with dramatic gains in computation speed.

Another aspect that made the increase of CPU complexity more cost-effective was the evolution of integration technologies, which gave microprogramming an advantage over hardwired control logic. Microprogramming techniques are an alternative solution for reducing the complexity of the integrated logic. They consist in making use of internal instructions, called micro-instructions, stored directly in the control unit, so that a machine instruction is translated into a set of simpler operations executed by the architecture during some micro-steps. Small integrated memories allowed the microcode to be stored directly in the control unit and, because of some technology process aspects, this set of micro-instructions could often be expanded with zero or very little overhead and cost. This trend also caused the integration of some capabilities previously committed to external subroutines, such as string editing, integer-to-floating-point conversions and other data conversions.

Due to the expensiveness of memories, one of the main design constraints was to obtain very compact programs, and this is another reason for the growth of instruction set complexity. In fact, very complex instruction sets were considered the optimal solution to obtain a high code density. Attempting to obtain code density by increasing the complexity of the instruction set is an arduous task, because supporting a big number of instructions and addressing modes means more bits to represent them and thus more memory to store programs. For these reasons, code compaction could only be obtained by cleaning up the instruction set, instead of increasing the information to be encoded. The cost of incrementing the available memory, anyway, is often far lower than that of introducing architectural innovations in CPUs, and the use of larger PLA3 also reduces performance, due to decoding delays.

In terms of marketing strategies, the upward compatibility needed to guarantee software compatibility with newer machines has led to the introduction of more complexity. A new architecture has to be completely compatible with its predecessors' machine language, so the older instructions and addressing modes must not be removed from the instruction set. The concept of computer family was introduced as a method to guarantee the common feature of running the same software; on some processors this was implemented by using hardware or microprogram emulation. Another solution to improve the design characteristics was to add new features, which increased both the number of instructions and their complexity. The increasing popularity of high level languages (HLL) led to new complex instructions, trying to cover the semantic gap between the processor capabilities and the computation requirements of single HLL statements.
Due to the introduction of multiprogramming techniques and time sharing between many processes, processors needed not only the implementation of interrupts, halting the execution of processes and resuming it at a later time, but also different operating modes to protect process execution. Memory management and paging require particular functionalities to be added to processors; in fact an instruction can be halted before its completion and then restarted after some memory operations have been executed. The use of complex instruction sets and addressing modes increased the amount of information to be saved for every interrupt, and this enlarged both the number of shadow registers and the microcode to be executed for interrupt management. When building real-time computing systems, a worst-case response time ought to be guaranteed. This requirement can be fulfilled by reducing the CPU interrupt latency. This issue is critical not only for Digital Signal Processors (DSPs): automatic control applications, too, need to run real-time routines, so the maximum response times must be ensured. This is accomplished by freezing the processor internal state and satisfying interrupt requests as soon as possible. However, as detailed in the previous paragraph, with this approach the

3Programmable Logic Array: a programmable device used to implement combinational logic circuits, in particular the instruction decoder within most processors.

number of registers increases according to the operation complexity, particularly to store intermediate states of microcode execution.

Other reasons for the growth of instruction set complexity are furnished by computer programmers, who want a CPU to support a full-featured instruction set (e.g. an orthogonal4 instruction set).

Some further considerations can be made about technology aspects. Improving the performance of a complex processor by introducing architectural innovations takes much time, and market trends have to take into account the design time of a new product, especially the time-to-market and time-to-volume parameters, but also the possibilities offered by the most recent semiconductor technologies. Long design times could lead to a product which uses a target technology that is months or years old, instead of pioneering a new one.

2.4 The RISC architecture

RISC stands for Reduced Instruction Set Computer and refers to a simpler platform with respect to Complex Instruction Set Computer (CISC) processors. The term RISC is in antithesis with the term CISC, used since then to refer to more complex platforms implementing instruction sets with many operations and several addressing modes, as seen in paragraph 2.3. The fundamental feature of this approach is to execute all operations only between the processor registers, accessing the memory exclusively through a couple of operations, load and store, for loading and saving data to and from the registers. For this reason the equivalent name of load-store architecture is frequently used to refer to RISC processors [4]. Other features common to RISC architectures are:

• Uniform instruction encoding: all the instructions are expressed by the same number of bits and the opcode (the bit field representing a unique operation) is always in the same bit position in each instruction; this approach allows fast decoding.

• A homogeneous register set, allowing any register to be used in any context and simplifying compiler design (although there are almost always separate integer and floating point register files).

• Simple addressing modes (complex addressing modes are replaced by sequences of simple arithmetic instructions for the address calculation).

4A computer's instruction set is said to be orthogonal if any instruction can use data of any type via any addressing mode; this makes programming simpler, but complexity increases because every addressing mode must be supported by every instruction.

36 2 – The RISC microprocessor architecture

• Few data types supported in hardware (for example, some CISC machines had instructions for dealing with byte strings, others had support for polynomials and complex numbers); a RISC machine reduces the number of native data types as much as possible.

The RISC CPU design philosophy was inspired by some observations about the real usefulness of many features included in previous processors, which appeared to be quite overdesigned and so less cost-effective with respect to simpler approaches. While some of the initial assumptions about the throughput difference between the memory and the CPU have changed, due to the introduction of faster semiconductor memories and of cache memory technology, the increasing complexity introduced other side effects. On particular processors, the execution of some complex instructions required more time than the alternative sequence of simple instructions performing the same final set of operations, which exposed several architectural limits. Moreover, complex architectures have a debugging problem, particularly for the microprogram control. To allow corrections to parts of the microcode, some manufacturers implemented rewritable microprogram areas within the processors; this allows maintenance even when the processors are used in the field, by distributing firmware updates to customers. Another solution adopted by processor producers was to place, beside the rewritable microprogram storage, an FPGA5 to patch parts of the architecture as well, and not only the microcode.

In the late '70s some researchers from various computer companies demonstrated that the majority of the addressing modes implemented in the orthogonal instruction sets were unused by most programs. This was a side effect of the increasing use of HLLs and compilers to generate the programs, as opposed to writing them in assembly. In fact, compilers used at the time had only a limited ability to take advantage of the features provided by the complex instruction sets.
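The uniform, fixed-width encoding listed among the features above is what makes RISC decoding fast: every field sits at a fixed bit position, so decoding is a handful of shifts and masks. The 32-bit layout below (8-bit opcode, two 5-bit register fields, one flag bit, a 13-bit signed immediate) is invented purely for illustration and is not any real ISA.

```python
def decode(word):
    """Decode a hypothetical fixed-width 32-bit instruction word.
    Layout (MSB to LSB): opcode[8] | rd[5] | rn[5] | flag[1] | imm13[13]."""
    opcode = (word >> 24) & 0xFF
    rd     = (word >> 19) & 0x1F
    rn     = (word >> 14) & 0x1F
    flag   = (word >> 13) & 0x1
    imm    = word & 0x1FFF
    if imm & 0x1000:              # sign-extend the 13-bit immediate
        imm -= 0x2000
    return opcode, rd, rn, flag, imm

# encode opcode=0x12, rd=3, rn=7, flag=1, imm=-5, then decode it back
word = (0x12 << 24) | (3 << 19) | (7 << 14) | (1 << 13) | (-5 & 0x1FFF)
print(decode(word))   # (18, 3, 7, 1, -5)
```

An immediate field like this also shows how small constants can live inside the instruction word itself, instead of requiring a separate memory access.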
The market was clearly moving to even wider use of HLLs and compilers for software development, diluting the usefulness of the orthogonal instruction sets even more. Moreover, since the complex operations were rarely used, they tended to be slower than a sequence of smaller operations. Another part of RISC design came from practical measurements on real-world programs. Some researchers showed that 98% of all the constants in a program would fit in just 13 bits, yet almost every CPU design dedicated some multiple of 8 bits to store them, typically 8, 16 or an entire 32-bit word. Taking this fact into account suggests that a machine should allow constants to be stored in unused bits of the instruction itself, decreasing the number of memory accesses. Since real-world programs spend most of their time executing very simple operations, some researchers decided to focus on making those common operations as simple and fast as possible. Since the clock rate of the CPU is limited by the time

5Field Programmable Gate Array

it takes to execute the slowest instruction, speeding up that instruction (e.g. by reducing the number of addressing modes it supports) also speeds up the execution of every other instruction. The goal of RISC architectures was to make instructions so simple that each one could be executed in a single clock cycle. Code was implemented as sequences of these simple instructions, instead of single complex instructions. This also led to the possibility of inserting data within an instruction, reducing the need to use registers or memory. However, since a series of instructions is needed to complete even simple tasks, the total number of instructions read from memory is larger, and therefore it takes longer.

Most of the designs that made the history of RISC were the results of university research programs on VLSI technologies started in the early '80s. The first noteworthy study was Berkeley's RISC project, directed by David Patterson, followed by the MIPS project, started at Stanford University and directed by John L. Hennessy. The RISC project was based on gaining performance through the use of pipelining and an aggressive use of registers known as register windows. A normal CPU has a small number of registers, and a program can use any register at any time. In a CPU with register windows, there is a huge number of registers, but programs can only access a small number of them, according to certain rules. A program that employs few registers for any procedure can make very fast procedure calls: the call and the return simply move the window to the subset of few registers used by that procedure. In a normal CPU, most calls push the content of the registers to RAM to clear enough working space for the subroutine, and the return restores those values. The RISC project delivered the RISC-I processor in 1982.
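The register-window mechanism described above can be sketched as a window pointer sliding over a large physical register file on every call and return. The sizes below are illustrative only and are not Berkeley RISC's actual parameters.

```python
class WindowedRegisterFile:
    """Illustrative register windows: each call slides the window over a
    large physical register file instead of spilling registers to memory."""
    def __init__(self, physical_regs=128, window_size=16, step=8):
        self.regs = [0] * physical_regs
        self.window_size = window_size
        self.step = step      # partial overlap lets caller and callee share arguments
        self.base = 0         # start of the currently visible window

    def read(self, r):        # register r is relative to the current window
        return self.regs[self.base + r]

    def write(self, r, value):
        self.regs[self.base + r] = value

    def call(self):           # procedure call: just move the window pointer
        self.base += self.step

    def ret(self):            # return: move it back, no memory traffic at all
        self.base -= self.step

rf = WindowedRegisterFile()
rf.write(0, 42)      # caller's r0
rf.call()
rf.write(0, 7)       # callee's r0 is a different physical register
rf.ret()
print(rf.read(0))    # caller's r0 is intact: 42
```

The contrast with a conventional CPU is the absence of any push/pop of register contents to RAM around the call: the "save" and "restore" are a single pointer update each.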
Being made of only 44,420 transistors (compared to about 100,000 in newer CISC designs), RISC-I had only 32 instructions, and yet completely outperformed any other single-chip design. With 40,760 transistors and 39 instructions, the RISC-II was delivered in 1983: it ran over three times faster than RISC-I. The MIPS project focused almost entirely on the pipeline; although pipelining was already in use in other designs, several features of the MIPS chip made its pipeline far faster. The most important feature was the requirement that all instructions complete in one cycle. This requirement allowed the pipeline to run at higher speed and is responsible for much of the processor's performance. However, it also had the negative side effect of eliminating many potentially useful instructions, like multiply or divide, which necessarily require more clock cycles to execute.

The earliest attempt to create and manufacture a CPU based on the RISC philosophy was a project at IBM which started in 1975. The work led to the IBM 801 CPU family, which was widely used inside IBM hardware. In 1981 the 801 was eventually produced in single-chip form as the ROMP, the Research (Office Products Division) Mini Processor. The 801 inspired several research projects, including new ones at IBM that would eventually lead to their POWER system.


Berkeley's research was not directly commercialized, but some years later the RISC-II design was used by Sun Microsystems to develop the SPARC architecture, by others to develop mid-range multi-processor machines, and by almost every other company a few years later. It was Sun's use of a RISC architecture in their new chips that proved RISC's benefits were real; their architectures quickly outpaced the competition and essentially took over the entire workstation market.

MIPS, on the other hand, went on to become one of the most used RISC architectures when it was included in the PlayStation and Nintendo 64 game consoles. In the same years IBM went on to design new machines based on their new POWER architecture and also moved their well known AS/400 systems to POWER chips, discovering that the modified systems ran considerably faster than the highly complex instruction set systems used before. The POWER architecture was also fundamental to the development of the PowerPC design, which eliminated many of the "IBM only" instructions and created a single-chip implementation. Today the PowerPC is one of the most commonly used CPUs for embedded and automotive applications. It was also the CPU used in most Apple Macintosh machines sold until 2006, before Apple switched their PowerPC products to Intel processors.

In the late '80s Intel released the i860 and i960, whereas Motorola built a new design called the 88000, in homage to its famed CISC 68000, but eventually abandoned it and joined IBM to produce the PowerPC. AMD released their 29000, which would go on to become the most popular RISC design of the early '90s.

Today the vast majority of all CPUs in use are RISC CPUs, particularly microcontrollers. RISC architectures offer power in even small sizes, and thus have come to completely dominate the market for low-power embedded CPUs, which are by far the largest market for processors.
In fact, while a family may own one or two PCs, their cars, cell phones, and other devices contain a number of embedded processors. RISC architectures had also completely taken over the market for larger workstations for much of the '90s: after the release of the Sun SPARCstation, other vendors rushed to compete with RISC-based solutions of their own. Even the mainframe world is now completely RISC based. However, despite many successes, RISC has missed the desktop PC and commodity server markets, where Intel's x86 platform remains the dominant processor architecture (it must be considered that AMD's processors also implement the x86 platform, or a 64-bit superset known as x86-64). The primary reason for this incomplete revolution is that the large base of proprietary PC applications is written for x86, whereas no RISC platform has a similar installed base. The second reason is that, although RISC was indeed able to scale up in performance quite quickly and cheaply, Intel took advantage of its large market by spending vast amounts of money on processor development. The first x86 CPU to deploy RISC techniques was the NextGen Nx586, released in 1994, which expanded the majority of the CISC instructions into multiple simpler RISC operations. Internally the Nx586, Intel P6, AMD K5 and Cyrix 6x86 are RISC machines that emulate a


CISC architecture. In 2004, x86 chips were the fastest CPUs in SPECint, displacing all RISC CPUs, while the fastest CPU in SPECfp6 was the IBM POWER5 processor. Still, RISC designs have led to a number of successful platforms and architectures, some of the larger ones being:

• MIPS line, found in most SGI computers and the PlayStation, PlayStation 2, PlayStation Portable and Nintendo 64 game consoles.

• IBM’s and Freescale’s (formerly Motorola SPS) Power Architecture, used in all of IBM’s supercomputers, midrange servers and workstations, in Apple’s Power Macintosh computers, in Nintendo’s Gamecube and Wii, Microsoft’s Xbox 360 and Sony’s PlayStation 3 game consoles, and in many embedded applications like printers and automotive applications.

• Sun’s SPARC and UltraSPARC, found in all of their later machines.

• Hewlett-Packard’s PA-RISC, also known as HP/PA.

• DEC Alpha, still used in some of HP's workstations and servers.

• XAP processor, used in many chips.

• ARM: Palm, Inc. originally used the (CISC) Motorola 680x0 processors in its early PDAs, but now uses ARM processors in its latest PDAs; Apple Computer uses the ARM7TDMI in its iPod products; Nintendo uses an ARM7 CPU in the Game Boy Advance and both an ARM7 and an ARM9 in the Nintendo DS handheld game systems; the small Korean company Game Park also markets the GP32, which uses an ARM9 CPU; many cell phones, like Nokia products, are based on ARM designs.

• Hitachi’s SuperH, originally in wide use in the Sega Super 32X, Saturn and Dreamcast, now at the heart of many consumer electronics devices; the SuperH is the base platform for the Mitsubishi - Hitachi joint semiconductor group.

6SPECint and SPECfp are computer benchmark specifications for a CPU's integer and floating point performance; they are maintained by the Standard Performance Evaluation Corporation (SPEC).


2.5 Pipelining and cache technology

In the early '80s it was thought that existing CPUs were reaching their theoretical limits. Future improvements in speed would be primarily achieved through an improved semiconductor "process", that is, smaller technology process features (transistors and wires). The complexity of the chip would remain nearly the same, but the smaller size would allow it to run at higher clock rates.

A crucial structure introduced in CPU design was the pipeline, i.e. a chain of registers, which breaks down instructions into steps and works on one step of several different instructions at the same time. A very simple processor might read an instruction, decode it, fetch from memory the data it asks for, perform the operation, and then write the results in registers or memory. The key to pipelining is the observation that the processor can start reading the next instruction as soon as it finishes reading the last, meaning that there are now two instructions being worked on (one is being read, the next is being decoded), and after another cycle there will be three, because the previously decoded instruction can be executed in a third stage. While no single instruction is completed any faster, the next instruction completes right after the previous one. The result is a much more efficient utilization of processor resources, and these techniques are ordinarily used both on RISC and CISC processors. The goal of a pipelined architecture is to keep the pipeline full of instructions at all times, and recent processors have very complex pipeline structures, with many stages.

The use of the pipeline was primarily a characteristic of RISC designs, which shared a not-so-nice feature referred to as the branch delay slot. The branch delay slot is a side effect of pipelined architectures due to the branch hazard7, i.e. the fact that the branch is not resolved until the instruction has crossed several stages of the pipeline, reaching the execution stage to update the program counter.
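The stage overlap described above can be traced cycle by cycle for a classic three-stage fetch/decode/execute pipeline. This is a minimal model that ignores hazards and stalls; stage names and instruction labels are illustrative.

```python
def pipeline_trace(instructions, stages=("FETCH", "DECODE", "EXECUTE")):
    """Return, for each clock cycle, which stage each instruction occupies."""
    n_cycles = len(instructions) + len(stages) - 1
    trace = []
    for cycle in range(n_cycles):
        active = {}
        for i, instr in enumerate(instructions):
            stage_index = cycle - i          # instruction i enters the pipe at cycle i
            if 0 <= stage_index < len(stages):
                active[instr] = stages[stage_index]
        trace.append(active)
    return trace

for cycle, active in enumerate(pipeline_trace(["I1", "I2", "I3"])):
    print(cycle, active)
# by cycle 2, three instructions are in flight at once:
# I1 executing, I2 decoding, I3 being fetched
```

No instruction finishes faster than in the unpipelined machine, but after the pipeline fills, one instruction completes every cycle: three instructions take 5 cycles here instead of 9.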
The location of such an instruction in the pipeline is called a branch delay slot. Since the fetch and decode operations on the following instructions must be executed depending on the branch instruction, the pipeline needs to be flushed if the program execution continues at a new memory location. The consequence of a branch hazard can be unwanted actions performed by the processor, so appropriate measures must be taken to avoid this behavior. A simple design would insert stalls into the pipeline after a branch instruction until the new branch target address is computed and loaded into the program counter; in this case each cycle where a stall is inserted is considered one branch delay slot. A more sophisticated design would execute the instructions which do not depend on the result of the branch instruction. This optimization must be performed in software at compile time and consists in moving branch-independent

7Branch hazards are also known as control hazards.

instructions into branch delay slots. Modern techniques address this problem by using branch prediction algorithms and speculative execution, so that many branch delay slots are efficiently exploited, reducing the performance penalty. Branch prediction is based on algorithms which allow the processor to establish whether a conditional branch in the instruction flow of a program is likely to be taken or not, so that it is possible to fetch and decode the right instructions without waiting for the branch to be resolved. Another prediction method is implemented by the branch target predictor, which attempts to guess the target of a branch or unconditional jump before it is computed, by parsing the instruction itself. Branch predictors are crucial in today's processors for achieving high performance. When a conditional branch instruction is encountered, the processor guesses which way the branch is most likely to go (this is called branch prediction) and immediately starts executing instructions from that point. If the guess later proves to be incorrect, all computation past the branch point is discarded. The early execution is relatively cheap, because the pipeline stages involved would otherwise be frozen until the next instruction is known. However, wasted instructions consume CPU cycles that could otherwise have delivered performance and, on a mobile device, consume battery: this is the penalty for a mispredicted branch. The execution of code whose results may turn out to be useless is called speculative execution and, together with out-of-order execution, can improve processor performance.

Another solution to improve instruction throughput was to use several processing elements inside the processor and run them in parallel.
Instead of working on one instruction at a time to perform an ALU operation, these superscalar processors would look at the next instruction in the pipeline and attempt to run it at the same time in an identical unit. However, this can be difficult to do, as many instructions depend on the results of previous instructions. Most modern processors have more than one execution unit for integer numbers, a separate unit for floating point numbers, and sometimes circuitry for independent memory address calculations. All these units can be used at the same time if there are no data hazards, i.e. an operation executed in a unit that depends on a previous instruction not yet completely executed. Both of these techniques relied on increasing speed by adding complexity to the basic CPU architecture, as opposed to the instructions running on it.

Another problem with a pipelined processor occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A situation like this is called a structural hazard and might occur, for instance, if a program wants to execute a branch instruction followed by a computation instruction. Because they are executed in parallel, and because branching is typically slow (requiring a comparison, program counter-related computation, and writing to registers), it is quite possible (depending on the architecture) that the computation instruction and the branch instruction will both require the ALU at the same time. The simplest

solution to a structural hazard is the insertion of one or more pipeline stall cycles, or a sophisticated algorithm which may allow a different scheduling of the instructions, without losses in instruction throughput.
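The branch prediction idea discussed above is often introduced with the classic two-bit saturating counter, which must be wrong twice in a row before it changes its prediction. This is an illustrative textbook scheme, not the predictor of any particular processor.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self):
        self.counter = 2                       # start in "weakly taken"

    def predict(self):
        return self.counter >= 2               # True means "predict taken"

    def update(self, taken):
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]     # e.g. a loop branch, taken most times
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "correct out of", len(outcomes))   # 4 correct out of 5
```

The saturation is the point of the scheme: a loop branch that is almost always taken costs only one misprediction per loop exit, because a single not-taken outcome does not flip the prediction.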

Figure 2.3. Pipelined processor example

Using a small amount of fast memory between the CPU and the main memory to store a copy of the most frequently used data, it is possible to reduce the average time needed to access resources. This kind of memory is named cache memory and can be defined as "a temporary storage area where frequently accessed data can be stored for rapid access". In modern computers the main memory needs much room for data and code, and so must have a low cost per megabyte, with dramatic effects on latency and bandwidth; it is therefore implemented with dynamic memories. Cache memories, on the other hand, are very small (a few megabytes) and must be as efficient as possible, with high performance and very low latency, so they are manufactured using static memories. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.

The structure of the access to the caches is often different. In most modern architectures, in fact, the CPU accesses two different portions of cache, one for the instructions and one for the data (as in the Harvard architecture), but in case of a cache miss (a required datum which is not available in the portion of memory copied into the cache), the data is retrieved from the main memory, which is unique for code and data. For this reason the off-chip memory resources are managed as in the Von Neumann architecture, so

the global platform inherits the structure of both the models described in the previous paragraphs. RISC designs are also more likely to feature a Harvard memory model, where the instruction stream and the data stream are conceptually separated; this means that modifying the addresses where code is held might not have any effect on the instructions executed by the processor (because the CPU has separate instruction and data caches), at least until a special synchronization instruction is issued. On the upside, this allows both caches to be accessed simultaneously, which can often improve performance.
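The claim that the average latency stays close to the cache latency follows from the average memory access time formula, AMAT = hit time + miss rate × miss penalty. The figures below are purely illustrative, not measurements of any specific system.

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time for a single cache level, in nanoseconds."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# illustrative figures: 1 ns cache hit, 100 ns main-memory miss penalty
print(amat(1.0, 0.02, 100.0))   # 98% hit rate -> 3.0 ns on average
print(amat(1.0, 0.10, 100.0))   # 90% hit rate -> 11.0 ns on average
```

With a 98% hit rate the processor sees an average latency of 3 ns rather than the 100 ns of main memory, which is why even a small static-memory cache in front of a large dynamic memory pays off.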


Figure 2.4. Cached processor example

RISC was tailor-made to take advantage of pipelining and caching techniques, because the core logic of a RISC CPU is considerably simpler than in CISC designs. Although the first RISC designs had marginal performance, they were able to quickly add these new design features, and by the late '80s they were significantly outperforming their CISC counterparts. In time this would be recognized as an optimal structure, and improvements in technology processes eventually made it possible to add all of this to a CISC design and still fit on a single chip, but this took most of the late '80s and early '90s. Generally, for any given level of performance, a RISC chip will have far fewer transistors dedicated to the core logic than a CISC one. This allows the designers considerable flexibility, as they can:

• Increase the size of the register set.

• Implement measures to increase internal parallelism.

• Increase the size of caches.

• Add other functionality, like I/O and timers for microcontrollers.

• Build the chips on older fabrication lines, which would otherwise go unused.

• Offer the chip itself for battery-constrained or size-limited applications.


2.6 RISC vs CISC architecture

In certain circumstances RISC architectures offer significant advantages over CISC platforms and vice versa, so that the majority of today's processors cannot rightfully be identified as completely RISC or CISC [5]. The two architectures have evolved towards each other, so that there is no longer a clear distinction between their respective approaches to increasing performance and efficiency. As already discussed in the previous paragraphs, CISC architectures are based on reducing the amount of time spent retrieving instructions from memory by concentrating elementary machine operations in more complex instructions, although these instructions need multiple clock cycles to execute. A typical CISC processor has most of the following properties:

• Uses microcode to simplify the control unit's architecture: the microcode is read from a resident ROM instead of implementing everything in hardware.

• Has improved performance, since instructions could be retrieved up to ten times faster from ROM than from main memory.

• Instructions of variable size.

• Rich instruction set, including simple and complex instructions.

• Has large number of addressing modes.

• Has a small number of general-purpose registers, typically about 8 registers; this is a result of having instructions which can operate directly on memory (which means no address storing).

• Instructions interface with memory in multiple ways, through complex addressing modes.

• Instructions generally take more than one clock cycle to execute.

• Orthogonal instruction set.

A RISC architecture has most of the characteristics listed here:

• Makes use of a small, simplified instruction set in an attempt to improve performance via a simpler architecture.

• Instructions execute in only one clock cycle.

• Uses pre-fetching techniques coupled with speculative execution (out-of-order execution).


• Pipelining.

• Instructions interface with memory via fixed mechanisms (load/store).

• Fast floating point performance.

• Few addressing modes.

• Large number of registers.

• Hardwired design (no microcode).

• Heavy reliance on the compiler.

CISC machines have a variety of instruction formats for a large number of instructions and instruction groups; this makes decoding more difficult and more time intensive. RISC greatly simplifies the instruction format for easy and fast decoding. Although the CISC architecture improves computer performance, it still has some drawbacks:

• The instruction set and chip hardware became more complex with each generation of computers, since earlier generations of a processor family were contained as a subset in every new version.

• Different instructions take different amounts of time to execute, due to their variable length.

• Many instructions are not used frequently; approximately 20% of the available instructions are used in a typical program.

On the other hand, RISC architectures suffer from some drawbacks:

• Programmers must pay close attention to instruction scheduling so that the processor does not spend a large amount of time waiting for an instruction to execute.

• Debugging can be difficult due to the instruction scheduling.

• Require very fast memory systems to feed instructions.

The primary reason for the rise of RISC is essentially the fact that, when the RISC philosophy was introduced, CISC processors were manufactured using more than one SSI8 chip. Although VLSI technologies made the above problems even more critical, several factors indicated RISC architectures as a reasonable design alternative.

8Small Scale Integration

The first factor is implementation feasibility: a great deal depends on being able to fit an entire CPU design on a single chip. A complex architecture is exposed to more realization problems in a given technology than a simple one, so improvements in VLSI technology make a single-chip version feasible much earlier for a simple architecture. RISC computers, therefore, benefit from a shorter time-to-market than CISC ones. Design complexity is a crucial factor in the growth of RISC architectures: if VLSI technology continues to almost double chip density roughly every two years, a design that takes only two years can potentially use a much superior technology, and hence be more effective, than a design that takes four years. RISC architectures also demonstrated a better use of chip area: the area gained back by designing a RISC architecture rather than a CISC one can be used to improve the RISC capabilities. For example, the entire system performance might improve if the silicon area is used for on-chip caches, registers or pipelining.

CISC also suffers from the fact that its intrinsic complexity often makes advanced techniques even harder to implement. The ultimate test for the cost-effectiveness and efficiency of a processor is the speed at which an implementation executes a given algorithm. Better use of chip area, and the availability of newer technology through reduced debugging time, contribute to the speed of the chip; a RISC potentially gains in speed merely from its simpler design. Many of today's RISC cores support just as many instructions as yesterday's CISC chips: the PowerPC 601, for example, supports more instructions than the Pentium. Furthermore, today's CISC CPUs use many techniques formerly associated with RISC chips. In conclusion, the difference between the RISC and CISC approaches is getting smaller and smaller.
At present RISC processors can take advantage over CISC for reasons other than silicon area, like power consumption, environmental prescriptions, interrupt structure and cost, particularly in the automotive and portable device markets. A RISC-based system can consume far less power than a CISC one. Environmental aspects are influenced by these parameters, because a RISC-based device can work in high-temperature places and often does not require the low EM emission certifications that a CISC does; the latter also needs fans and efficient cooling systems in particular environments. Costs per chip also favor RISC: a few dollars, against hundreds of dollars for CISC implementations. Nor is this all, because a system which has to mount a more complex processor carries additional costs, implying many strict requirements. However, due to the competition on x86 processor prices, and even though RISC prices are dropping, a workstation using the CISC x86 PII architecture is less expensive than an equally performing Sun UltraSPARC machine.

The biggest threat for CISC and RISC might not be each other, but a new technology called EPIC, which stands for Explicitly Parallel Instruction Computing. As the acronym says, the EPIC project aims to execute many instructions in parallel; it is an Intel and Hewlett-Packard 64-bit architecture project (also

identified as IA-64) which led to the development of the Itanium processors. The roots of EPIC can be found in the Instruction Level Parallelism (ILP) philosophy, which uses the compiler to identify and leverage opportunities for parallel execution of instructions. By exploiting ILP techniques it is possible to eliminate complex on-die scheduling circuitry in the CPU, freeing up space and power for other functions, including additional parallel execution resources. On the other hand, this approach can be exploited in a more explicit manner: VLIW architectures encode multiple operations in every instruction and then process these operations on the same multiple execution units discussed above. The goal of the EPIC philosophy is to produce a “post-RISC era” architecture that addresses some of the key challenges faced by older RISC and CISC architectures, enabling more efficient performance scaling in future processor designs.

Chapter 3

The ARM microprocessor architecture

The ARM architecture is a 32-bit RISC architecture widely used in embedded designs and, thanks to its power saving features, ARM CPUs are dominant in the mobile electronics market, where low power consumption is a critical design goal. ARM is the acronym of Advanced RISC Machine and, prior to that, Acorn RISC Machine; ARM Ltd. is a company spun off from the Acorn Computer Company in 1990, with the aim of developing this family of cores in collaboration with Apple Computers Inc.. The processor modelled in this thesis work is an ARM7TDMI, a member of the ARM general purpose microprocessor family. This processor offers high performance with low power consumption, making it one of the most widespread embedded microprocessors for mobile products like PDAs, mobile phones, media players, handheld gaming units and calculators. The ARM processor family is based on RISC principles and the ARM7TDMI implements a Von Neumann architecture; the instruction set and the related decoding mechanism are much simpler than those of microprogrammed complex instruction set machines. This simplicity results in a high instruction throughput and impressive real-time interrupt response for a small and cost-effective chip. Pipelining is exploited so that all parts of the processing and memory systems can operate continuously. The ARM7TDMI has a three-stage pipeline, so that while one instruction is being executed, its successor is being decoded and a third instruction is being fetched from memory. To ensure the feeding of instructions and data to the processor units, a prefetch mechanism is also provided, so that the pipeline stages are nominally four instead of three. The ARM memory interface has been designed to allow good performance while maintaining low costs for the memory system implementation. Speed-critical control signals are pipelined to allow system control functions to be implemented in standard low-power logic.
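The overlap of the three pipeline stages can be sketched in a few lines of Python. This is an illustrative toy model (not part of the VHDL design developed in this thesis, and the function name is mine): at cycle t, instruction t is fetched while instruction t-1 is decoded and instruction t-2 is executed.

```python
# Toy model of a three-stage (fetch/decode/execute) pipeline:
# at cycle t, instruction t is fetched, t-1 decoded and t-2 executed.
def pipeline_trace(n_instructions, n_cycles):
    """Return, per cycle, the instruction index held by each stage
    (fetch, decode, execute), or None when the stage is empty."""
    def stage(i):
        return i if 0 <= i < n_instructions else None
    return [(stage(t), stage(t - 1), stage(t - 2)) for t in range(n_cycles)]
```

With three instructions the pipeline is full at the third cycle: all three stages are busy at once, which is why one instruction can complete per cycle once the pipeline has been filled.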


Figure 3.1. ARM7TDMI core

3.1 The ARM processor family

The ARM design was started in 1983 as a development project at Acorn Computers Ltd. and the first samples, called ARM1, were available in 1985 [6]. The first noteworthy production of the ARM family processors reached the market in the following year with the ARM2. The ARM2 featured a 32-bit data bus, a 26-bit memory addressing space (64 Mbyte addressing range) and sixteen 32-bit wide registers. One of these registers served as the (word aligned) program counter, with its top 6 bits and

lowest 2 bits holding the processor status flags. The ARM2 was possibly the simplest useful 32-bit microprocessor ever designed, with only 30,000 transistors (compared with around 70,000 in Motorola's six-year-older 68000). Much of its simplicity came from not having microcode (which accounts for about one-fourth to one-third of the 68000 area occupation) and, like most CPUs of the day, from not including a cache. This simplicity led to its low power usage, while performing better than the Intel 286 processor. Its successor, the ARM3, was produced with a 4 KB cache which further improved the ARM2 performance. In the late '80s the ARM architecture was deeply revised, when Apple Computer Inc. started working with Acorn on newer versions of the core. The work was so important that Acorn spun off the design team in 1990 into a new company called Advanced RISC Machines Ltd., which became ARM Ltd when the parent holding company was listed on the London and New York Stock Exchanges. Meanwhile, the first models of the ARM6 were realized (1991); Apple used the ARM6-based ARM 610 as the basis for their Apple Newton PDA and, in 1994, Acorn used the ARM 610 as the main CPU in their Risc PCs. Through these evolutions the core has remained largely the same size: the ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. The idea is that the Original Design Manufacturer combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built on older semiconductor fabs and still deliver considerable performance at a low cost. While ARM's business has always been to sell IP cores, some of the licensees generated microcontrollers based on this core. The most successful implementation has been the ARM7TDMI, with hundreds of millions of units sold in mobile phones and handheld video game systems.
DEC licensed the architecture and produced the StrongARM, a 233 MHz CPU which drew only 1 watt of power (more recent versions draw far less). This work was later passed to Intel as part of a lawsuit settlement. Intel then developed its own high performance implementation known as XScale, a common architecture in Windows Mobile smartphones, Personal Digital Assistants and other handheld devices.

3.2 The Thumb concept

The ARM7TDMI processor employs a unique architectural strategy called Thumb, which makes it ideally suited to high-volume applications with memory restrictions, or applications where code density is an issue. Besides the 32-bit instruction set (usually called the ARM instruction set), the ARM7 processor also has a reduced 16-bit instruction set, named the Thumb instruction set. The 16-bit long instructions of the Thumb instruction set improve the density of standard ARM code while retaining most of the ARM performance advantage over traditional 16-bit processors using 16-bit registers. Thumb code, in fact, operates on


32-bit registers as ARM code does, but it is able to provide up to 65% of the code size of ARM, and 160% of the performance of an equivalent ARM processor connected to a 16-bit memory system ([16]). Thumb may be viewed as a compressed form of a subset of the ARM instruction set: Thumb instructions map onto ARM instructions, and the Thumb programmer model maps onto the ARM programmer model. Thumb is not a complete architecture; the Thumb instruction set therefore supports only common application functions, allowing recourse to the full ARM instruction set where necessary (for instance, all exceptions automatically enter ARM mode). An application can mix ARM and Thumb subroutines in a flexible manner to optimize both performance and code density. In some applications the use of the Thumb instruction set can improve power-efficiency, save cost and enhance performance all at once. Thumb instructions operate on a subset of the standard ARM register configuration, allowing excellent interoperability between ARM and THUMB states, even within the same program code. Each 16-bit Thumb instruction has a corresponding 32-bit ARM instruction with the same effect on the processor model; the implementation of the Thumb architecture uses dynamic decompression, so that Thumb instructions execute as standard ARM instructions within the processor. The major advantage of a 32-bit (ARM) architecture over a 16-bit one is its ability to manipulate 32-bit integers with single instructions: when processing 32-bit data, a 16-bit architecture takes at least two instructions to perform the same task as a single ARM instruction. If a 16-bit architecture only has 16-bit instructions, and a 32-bit architecture only has 32-bit instructions, then overall the 16-bit architecture will have better code density, and better than one half the performance of the 32-bit architecture.
Clearly, 32-bit performance comes at the cost of code density; the Thumb mode available on the ARM7 breaks this constraint by implementing a 16-bit instruction length on a 32-bit architecture, making the processing of 32-bit data efficient with a compact instruction coding. This provides far better performance than a 16-bit architecture, with better code density than a 32-bit architecture. A further key feature is the ability to switch back to full ARM code and execute at full speed: critical loops for applications such as fast interrupts and DSP algorithms can be coded using the full ARM instruction set and linked with Thumb code. The Thumb architecture, using 32-bit registers, can also address a large memory space efficiently. In this thesis work, however, the Thumb architecture is not treated, so the modelled architecture can only execute code which does not make use of processor state changes.


3.3 The programmer model

3.3.1 Operating states and state switching

The ARM processor can operate in one of two possible states: ARM state or THUMB state. When it operates in ARM state, it executes 32-bit word-aligned instructions, decoding them according to the ARM instruction set. When the processor is in THUMB state, instead, the instructions are halfword-aligned and bit 1 of the program counter indicates which of the two halfwords is selected for fetching; the decoding is done according to the 16-bit Thumb instruction set. Switching from one state to the other is possible by using the branch1 and exchange (BX) opcode; the target state is defined by the least significant bit of the value contained in the operand register. Entry into THUMB state can be achieved by executing a BX instruction with the state bit (bit 0) set in the operand register. Transition to THUMB state will also occur automatically on return from an exception, if the exception was entered with the processor in THUMB state. Entry into ARM state happens on execution of the BX instruction with the state bit clear in the operand register, or automatically whenever the processor takes an exception. In this second case, the PC is placed in the exception mode link register, and execution commences at the corresponding exception vector address. The exception handling is described in more detail in paragraph 3.4.
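The state-selection rule of BX described above can be sketched as follows. This is an illustrative Python model (the function names are mine, not from the ARM documentation): bit 0 of the operand register selects the target state, and the branch address is aligned accordingly.

```python
# Sketch of the BX state-selection rule: the least significant bit of
# the operand register value chooses between ARM and THUMB state.
ARM, THUMB = "ARM", "THUMB"

def bx_target_state(rn_value):
    """Target state after BX: THUMB if bit 0 is set, ARM otherwise."""
    return THUMB if rn_value & 1 else ARM

def bx_fetch_address(rn_value):
    """Branch target: halfword-aligned in THUMB state, word-aligned in ARM."""
    return rn_value & ~1 if rn_value & 1 else rn_value & ~3
```

For example, executing BX with 0x8001 in the operand register would enter THUMB state and fetch from 0x8000.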

3.3.2 Memory formats and data types

The processor views memory as a linear collection of bytes, numbered upwards starting from zero, so bytes 0 to 3 hold the first stored word (32-bit word), bytes 4 to 7 the second and so on. The processor supports words stored in memory both in big endian and little endian format. In big endian format, the most significant byte of a word is stored at the lowest numbered byte and the least significant byte at the highest numbered byte. Byte 0 of the memory system is therefore connected to data lines 31 through 24, byte 1 is connected to data lines 23 through 16 and so on. The memory scheme for this organization is shown in figure 3.2. In little endian format, the lowest numbered byte in a word is considered the least significant byte, and the highest numbered byte the most significant. Byte 0 of the memory system is therefore connected to data lines 7 through 0, byte 1 to data

1a branch instruction performs a conditional or unconditional jump to a subroutine, i.e. to a portion of code within a larger program which performs a specific task and is relatively independent of the remaining code; for these reasons the subroutine code is stored at a different address with respect to the main program code.


Figure 3.2. Big endian memory organization

lines 15 through 8 and so on. The memory scheme for this organization is shown in figure 3.3.

Figure 3.3. Little endian memory organization

The processor supports byte (8-bit), halfword (16-bit) and word (32-bit) data types. To grant correct access to the stored data, words must be aligned to four-byte boundaries and halfwords to two-byte boundaries; there are no restraints on single byte data alignment. The memory interface and other details about the processor data management are reported in section 3.7. Some specific instructions are also provided to obtain flexible sign extension capabilities for signed and unsigned data types, particularly for byte and halfword sized data.
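The two byte orders and the alignment rules described above can be illustrated with a short Python sketch, where `int.to_bytes` stands in for the byte-numbered memory (an illustrative example, not part of the modelled design):

```python
# Sketch: how the 32-bit word 0x11223344 maps onto byte addresses 0..3
# in the two memory formats described above.
word = 0x11223344

big = word.to_bytes(4, "big")        # byte 0 holds the most significant byte
little = word.to_bytes(4, "little")  # byte 0 holds the least significant byte

def is_aligned(address, size):
    """Words must sit on 4-byte boundaries, halfwords on 2-byte boundaries."""
    return address % size == 0
```

In big endian format byte 0 holds 0x11, while in little endian format it holds 0x44; an access to a word at address 0x1002 would violate the four-byte alignment rule.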

3.3.3 Operating modes

In order to support the normal program flow, but also different events like unrecoverable application errors and software or hardware interrupts, and in order to furnish the privileged functions of an operating system and to access some reserved resources, the processor can work in several operating modes:


• user (usr): the normal application program execution state;
• FIQ (fiq): designed to support a data transfer or debug process;
• IRQ (irq): used for general-purpose interrupt handling;
• supervisor (svc): protected mode for the operating system;
• abort mode (abt): entered after a data or instruction prefetch abort;
• system (sys): a privileged user mode for the operating system;
• undefined (und): entered when an undefined instruction is executed.

Application programs will normally execute in user mode. The other (non-user) modes, the so-called privileged modes, are entered in order to service interrupts or exceptions or to access protected resources. Mode changes can be performed via explicit program instructions or through dedicated input signals driven by peripherals connected to the processor. A detailed description of these operating modes is furnished in paragraph 3.4.

3.3.4 Processor resources

The processor has a total of thirty-seven 32-bit wide registers; thirty-one of these are general purpose registers and the other six are status registers. Not all the registers can be seen at once: five of the status registers and fifteen of the general purpose registers are banked registers and can not be seen in user mode. The processor state and operating mode dictate which registers are available to the programmer. Up to sixteen general purpose registers and one or two status registers are visible at once. In privileged modes, mode-specific banked registers are activated and can be accessed by the programmer. The ARM state register set contains sixteen directly accessible registers, named R0 to R15. All of these, except R15, are general-purpose and may be used to hold either data or address values. In addition to these, there is a seventeenth register used to store processor status information (see section 3.3.5). Register R15 holds the Program Counter (PC). In ARM state, bits [1:0] of R15 are zero and bits [31:2] contain the instruction fetch address, because of the alignment of 32-bit wide data in memory. In THUMB state, bit [0] is zero and bits [31:1] contain the fetch address; code words, in that state, are 16-bit wide and the halfword alignment has two alternative boundaries. Register R14 is used as subroutine link register (LR), so it receives a copy of the program counter when a branch with link (BL) instruction is executed. At all other times it may be treated as a general-purpose register. The corresponding banked registers R14 svc, R14 irq, R14 fiq, R14 abt and R14 und are similarly used to hold the return values of R15 when interrupts and exceptions arise. These registers are also used when branch with link instructions are executed within interrupt or exception routines. Non-user modes have some other banked registers which can be accessed; these

registers and their utilization are described in paragraph 3.4, where exception handling is discussed. THUMB state has a reduced register set with respect to ARM state. It can substantially be considered a subset of the ARM state register set: eight general purpose registers can be accessed (numbered from R0 to R7 and referred to as low registers in the programmer's model) and they map onto the respective registers in ARM state. It must be underlined that these registers are all 32-bit wide, being the same both in ARM and THUMB state. Moreover, in THUMB state, a stack pointer (SP), the link register (LR) and the program counter (PC) are available. These last three registers map respectively onto the R13, R14 and R15 ARM state registers. In some particular conditions, also the other reserved registers (numbered from R8 to R15 and referred to as high registers) can be accessed; this is useful after a branch and exchange operation, for example. These resources can be managed by using the high registers operations belonging to the Thumb instruction set.
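The register banking described above can be sketched as follows. This is an illustrative Python model (the class and method names are hypothetical): FIQ banks R8 to R14, the other privileged modes bank only R13 and R14, while R15 and the remaining registers are shared with user mode.

```python
# Sketch of ARM register banking: reads and writes go to a mode-private
# bank when the register number is banked for that mode, otherwise to
# the shared user-visible register file.
class RegisterFile:
    BANKED = {"fiq": range(8, 15), "irq": range(13, 15),
              "svc": range(13, 15), "abt": range(13, 15),
              "und": range(13, 15), "usr": range(0, 0)}

    def __init__(self):
        self.common = [0] * 16                      # user-visible R0-R15
        self.banks = {m: {r: 0 for r in rs}
                      for m, rs in self.BANKED.items()}

    def read(self, mode, n):
        bank = self.banks[mode]
        return bank[n] if n in bank else self.common[n]

    def write(self, mode, n, value):
        bank = self.banks[mode]
        if n in bank:
            bank[n] = value
        else:
            self.common[n] = value
```

Writing R14 in FIQ mode, for example, would not disturb the user-mode R14, while writing R0 in FIQ mode changes the register seen by every mode — which is why FIQ handlers can use R8 to R14 without saving them.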

3.3.5 The Processor Status Registers (PSRs)

To store the internal state the processor is furnished with a 32-bit register named Current Processor Status Register (CPSR). The individual bits of the CPSR can be grouped according to their function: the least significant bits are called control bits and store the processor operating mode, the state bit and two flags for interrupt enabling and disabling, while the most significant bits hold information about the most recent ALU operation performed. Not all 32 bits of the PSR are used to store information; some of them are free and reserved for future upgrades and new functionalities of the processor family, and hence can not be used. The arrangement of bits is shown in figure 3.4. Beside the CPSR, five other banked registers, one for each privileged mode, are available; they are called Saved Processor Status Registers (SPSRs). Five of the control bits define in which operating mode (section 3.3.3) the processor is working; they are indicated with M0 to M4 and named mode bits. The mode bits can assume the values reported in table 3.1. The subsequent bit is called T-bit and is a flag which indicates in which state, between ARM and THUMB, the processor is. An external signal (TBIT) reflects the T-bit state: if the bit is set the processor is in THUMB state, while if it is clear the processor works in ARM state. The I-bit and F-bit disable, respectively, standard interrupts and fast interrupts when set. The most significant bits of the CPSR are occupied by the condition code flags, used for the ALU operations and also for the evaluation of the condition which accompanies every processor opcode, determining whether the instruction must be executed or not.


Table 3.1. Mode bits possible values

    M[4:0]   mode
    10000    user
    10001    FIQ
    10010    IRQ
    10011    supervisor
    10111    abort
    11011    undefined
    11111    system

These flags are:

• N bit: negative or less than flag;
• Z bit: zero result flag;
• C bit: carry or borrow or extend flag;
• V bit: overflow flag.

All these flags can be modified by an arithmetic or logic operation performed by the ALU; the C bit is also used in arithmetic operations which consider a previous carry or borrow. The CPSR and the banked SPSRs are identical both in ARM and THUMB state.
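The CPSR layout just described can be summarized by a small decoder. This is an illustrative sketch (the function name is mine); the field positions are those of figure 3.4 and table 3.1: mode bits in M[4:0], T/F/I in bits 5 to 7, and the N/Z/C/V flags in bits 31 down to 28.

```python
# Sketch of a CPSR field decoder for the layout described above.
MODES = {0b10000: "user", 0b10001: "FIQ", 0b10010: "IRQ",
         0b10011: "supervisor", 0b10111: "abort",
         0b11011: "undefined", 0b11111: "system"}

def decode_cpsr(cpsr):
    return {
        "mode": MODES[cpsr & 0x1F],   # M[4:0], see table 3.1
        "T": bool(cpsr & (1 << 5)),   # THUMB state if set
        "F": bool(cpsr & (1 << 6)),   # fast interrupts disabled if set
        "I": bool(cpsr & (1 << 7)),   # standard interrupts disabled if set
        "N": bool(cpsr & (1 << 31)),  # negative / less than
        "Z": bool(cpsr & (1 << 30)),  # zero result
        "C": bool(cpsr & (1 << 29)),  # carry / borrow / extend
        "V": bool(cpsr & (1 << 28)),  # overflow
    }
```

For instance, a CPSR value of 0x600000D3 decodes to supervisor mode, ARM state, both interrupts disabled, with the Z and C flags set.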

Figure 3.4. Program status register format

3.4 The exception handling

Exception handling is a computer mechanism designed to manage the occurrence of operating conditions that change the normal flow of execution; a condition which causes this behavior is properly called an exception. Sometimes this term


Figure 3.5. ARM7TDMI register set in ARM/THUMB state

is used only to designate error conditions and not to refer to conditions that could be considered part of the normal flow of execution, including interrupt management. Following the ARM7 architecture development guidelines, interrupts are here also considered as exceptions, and not only problematic situations. In the presence of an exception the execution flow must be halted temporarily and the exception must be recognized and managed via a routine called handler; usually the processor state needs to be frozen by saving all the values stored within the resources either in memory or in the banked registers. Only in this way can the normal execution flow be resumed later, at the end of the exception handling. The processor, when entering an exception, performs the following operations:

• Preserves the address of the next instruction to be executed in the dedicated Link Register (LR); the address stored in the register, with respect to the program counter address, depends on the processor state (ARM or THUMB) and on the type of exception that arose, so that the program resumes from the


right place when the processor ends the exception handling, with a standard assembly instruction (MOVS PC, R14) and without storing further information.

• Copies the CPSR into the SPSR of the operating mode the processor is entering.

• Forces a new value into the CPSR, as determined by the exception entered.

• Modifies the Program Counter (PC) to fetch the instruction indicated by the exception vector.

Entering an exception from THUMB state implies that the processor switches into ARM state when the exception vector address is loaded into the program counter. The exception vectors contain the addresses of the various handler routines; the first eight memory words are reserved for these pointers, respecting the scheme reported in table 3.2.

Table 3.2. Exception vectors

    address      exception              mode on entry
    0x00000000   reset                  supervisor
    0x00000004   undefined instruction  undefined
    0x00000008   software interrupt     supervisor
    0x0000000C   abort (prefetch)       abort
    0x00000010   abort (data)           abort
    0x00000014   reserved               reserved
    0x00000018   IRQ                    IRQ
    0x0000001C   FIQ                    FIQ

The exception handler, after the execution of a set of proper instructions, must return control to the program which was in execution; to do so it has to:

• Move the address contained in the link register (suitably corrected via an offset which depends on the exception that occurred) into the program counter.

• Move the SPSR content into the CPSR to restore the original processor state; this operation also restores the ARM or THUMB state, so no explicit branch and exchange instruction is needed.

• Clear the interrupt disable flags, if they were set on exception entry.
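The entry steps and the return sequence above can be sketched as follows. This is an illustrative Python model (the `Cpu` class and its fields are hypothetical); the fixed +4 link-register offset is a simplification, since the real offset depends on the exception type and on the processor state.

```python
# Sketch of exception entry and return, following the steps listed above.
VECTORS = {"undefined": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
           "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C}
EXC_MODE = {"undefined": "und", "swi": "svc", "prefetch_abort": "abt",
            "data_abort": "abt", "irq": "irq", "fiq": "fiq"}
MODE_BITS = {"svc": 0b10011, "und": 0b11011, "abt": 0b10111,
             "irq": 0b10010, "fiq": 0b10001}

class Cpu:
    def __init__(self):
        self.pc, self.cpsr = 0x8000, 0b10000    # start in user mode
        self.lr, self.spsr = {}, {}             # banked per mode

    def enter_exception(self, exc):
        mode = EXC_MODE[exc]
        self.lr[mode] = self.pc + 4             # 1. preserve return address
        self.spsr[mode] = self.cpsr             # 2. save current CPSR
        self.cpsr = (self.cpsr & ~0x1F) | MODE_BITS[mode]  # 3. new mode
        self.pc = VECTORS[exc]                  # 4. jump to the vector
        return mode

    def return_from_exception(self, mode, offset=0):
        self.pc = self.lr[mode] - offset        # e.g. SUBS PC, R14_xxx, #4
        self.cpsr = self.spsr[mode]             # restore the saved state
```

Taking an IRQ, for example, would save the user CPSR into SPSR_irq, switch the mode bits to 10010 and jump to vector 0x18; the return instruction then restores both the program counter and the CPSR in one step.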


3.4.1 Processor reset

An asynchronous transition from high to low on the nRESET signal forces the processor to abandon the execution of the program or exception handler; when the signal returns to a high level the processor is reinitialized to a well defined state, i.e.:

• The M-bits of the CPSR (M[4:0]) are forced to “10011”, so that the execution resumes in supervisor mode.

• R14 svc and SPSR svc are overwritten by copying the current values of the program counter (PC) and status register (CPSR) into them; the value of the saved PC and SPSR is not defined, depending on the previous operating conditions.

• The I-bit and F-bit in the CPSR are set, so that interrupt requests are disabled.

• The T-bit within the CPSR is cleared, hence the execution resumes in ARM state.

• The program counter (PC) is reset, so that the next instruction is fetched from address 0x00000000.

3.4.2 Interrupt and fast interrupt requests

The Fast Interrupt Request (FIQ) is a particular exception of the ARM processor designed to support a data transfer or debug process with very low latency; in ARM state this is ensured by a group of seven banked registers reserved for the FIQ operating mode (numbered from R8 fiq to R14 fiq), so that register saving is not necessary. FIQ is not the only method to forward interrupt requests to the processor: besides it, a standard interrupt request (IRQ) can be exploited, but the priority of the latter is lower, so the FIQ mode can mask a concurrent IRQ. The IRQ mode benefits from only two banked registers, R13 irq and R14 irq; the latter is used as link register for the exception return. The FIQ exception is entered by taking the nFIQ input low; the standard interrupt request can be forwarded by taking the nIRQ input low. These inputs can accept either synchronous or asynchronous transitions, depending on the state of the ISYNC input signal. When ISYNC is low, interrupt requests on nFIQ and nIRQ are considered asynchronous and a cycle delay for synchronization is incurred before the interrupt can be analyzed by the processor. An interrupt handler should leave the interrupt by executing:

SUBS PC, R14 fiq, #4 or SUBS PC, R14 irq, #4


The interrupt requests may be ignored by setting the F-bit and I-bit in the CPSR, but these operations may be executed only in a privileged mode and not during the normal execution flow (user mode). If the F flag is clear, the ARM7TDMI checks for a low level on the output of the FIQ/IRQ synchronizer at the end of each instruction.

3.4.3 Abort conditions

The abort condition happens when a memory access can not be completed for any reason; this condition must be reported by the Memory Management Unit (MMU), which drives the ABORT input to a high level. Two different cases can occur:

• Prefetch abort, happens e.g. in the presence of an invalid fetch address.

• Data abort, happens during a memory access for data load or store.

If a prefetch abort occurs, the processor does not enter the exception immediately, but marks the instruction as invalid and switches to abort mode when that instruction reaches the execution stage. If a branch instruction comes before the invalid instruction, the processor does not take the abort exception. The management of data abort events depends on the instruction which is being executed. A data swap2 instruction (SWP) which generates an abort exception has no effects, as if the instruction had not been executed. The execution of a load (LDR) or store (STR) instruction, in the presence of the exception, writes back3 the modified base register, and this effect must be taken into account by the handler. If a block data transfer4 execution generates an abort exception, the operation is completely performed, but in a load operation the remaining registers are not updated. If the base register is not in the list and writeback is required, its value is modified. After the solution of the problem which caused the abort exception, the control should be returned to the normal flow by using:

SUBS PC, R14 abt, #4 for a prefetch abort, and SUBS PC, R14 abt, #8 for a data abort.

2A data swap instruction is a particular operation which swaps the values contained in a processor register and in a memory location; it is usually performed for the implementation of software semaphores and is therefore executed in locked mode.
3In the absence of an exception, the writeback operation is performed only if expected by the addressing mode or explicitly required by the instruction syntax.
4A block data transfer operation loads or stores a number of processor general purpose registers from/to the memory.


The difference in the offset is due to the processor prefetch mechanism. In abort mode the processor uses the registers normally accessible in ARM and THUMB state, excluding the link register and program counter, which are masked by the banked registers R14 abt and R15 abt respectively; these registers are the same in both operating states. Through the abort mechanism it is possible to implement a paged virtual memory system: in the presence of an unavailable data request, the memory management unit flags the abort to the processor, which activates the abort handling procedure. The MMU subsequently tries to find the required data in other memory pages, in order to supply the processor with the wanted data. The handler must be written so that the CPU is put in a wait state until the MMU furnishes the right data or instruction, and then the required instruction is performed again.

3.4.4 Software interrupts and supervisor mode

The processor also allows the handling of software interrupts by using a dedicated instruction (SWI). This feature is used for entering supervisor mode, usually to request a particular supervisor function, i.e. operating system operations. The instruction behavior is described in section 3.5.12 and the handler should return by using:

MOVS PC, R14 svc

which, irrespective of the processor state at the exception entry, restores the CPSR and returns to the instruction following the software interrupt instruction. A banked implementation of PC and LR is also provided for this operating mode.

3.4.5 Undefined instruction

When the processor decodes an instruction which can not be handled, it takes the undefined instruction trap. This exception prevents the system from entering an unrecoverable state, but it is also a useful mechanism that can be used to extend the processor instruction set by software emulation. After the instruction emulation and irrespective of the state, the trap handler should execute the following instruction:

MOVS PC, R14 und

This restores the CPSR and returns to the instruction following the undefined instruction. The undefined mode benefits only from the link register and the program counter in banked version, referred to as R14 und and R15 und.


3.4.6 Exception priorities

The exception handling must take into account some rules to establish whether a request or an event has priority over another. The exception priority follows a fixed scheme in the ARM7TDMI processor:

1. reset
2. data abort
3. FIQ
4. IRQ
5. prefetch abort

whereas undefined instruction and software interrupt have the same lowest priority but are, obviously, mutually exclusive. A particular case arises if a data abort occurs at the same time as a FIQ (with fast interrupt service enabled); in this situation the processor enters the data abort handler and then proceeds immediately to the FIQ vector. A normal return from FIQ will cause the data abort handler to resume execution; placing the data abort at a higher priority than FIQ is necessary to ensure that the transfer error does not escape detection.
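The fixed priority scheme above amounts to a simple selection rule: among the simultaneously pending exceptions, the core services the one that appears first in the priority order. A minimal sketch (the exception names in this list are mine):

```python
# Sketch of the ARM7TDMI fixed exception priority scheme: given a set of
# pending exceptions, return the one serviced first.
# "undefined" and "swi" share the lowest priority, but can never be
# pending at the same time (they are mutually exclusive).
PRIORITY = ["reset", "data_abort", "fiq", "irq",
            "prefetch_abort", "undefined", "swi"]

def next_exception(pending):
    for exc in PRIORITY:
        if exc in pending:
            return exc
    return None
```

For instance, with a data abort and a FIQ pending at once, the data abort is taken first, matching the special case described above.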

3.5 ARM instruction set

The following sections describe the most important characteristics of the ARM in- struction set, grouping them with respect to the functional and coding features. A first coding scheme is furnished in figure 3.6 where the instruction classes are evident.

3.5.1 Conditional execution

When the processor works in ARM state, all the instructions are conditionally executed, by checking the condition field reported within the instruction coding against the CPSR condition flags. Every instruction has a 4-bit field which expresses one of fifteen possible conditions; one of these conditions avoids the condition checking, so that the instruction is “always” executed. There is a reserved condition (“1111”) which must not be used. The condition is evaluated by parsing the ALU flags described in paragraph 3.3.5, which report the information about the result of the last ALU operation executed, indicating if it was negative or zero, if a carry or borrow arose or if an overflow condition occurred. For each condition a mnemonic two-character suffix is defined; it must be appended to the instruction mnemonic to enable the optional conditional execution. The mnemonics and their meaning are reported in table 3.3. A conditioned instruction is executed only if the condition is true; if no condition suffix is expressed in the assembly code, the “always” coding is inserted by default. Every time an instruction remains unexecuted because of the invalid condition, it consumes a clock cycle and control passes to the next instruction, entering a new processor cycle.

Figure 3.6. 32-bit instruction set format summary
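The condition evaluation of table 3.3 can be sketched as follows. This is an illustrative Python function (its name is mine); note the regularity of the encoding, where each odd condition code is the negation of the even code just below it.

```python
# Sketch of ARM condition-code evaluation against the N/Z/C/V flags
# (table 3.3): odd codes are the negation of the preceding even code.
def condition_passed(code, n, z, c, v):
    """Evaluate a 4-bit ARM condition code against the NZCV flags."""
    if code == 0b1111:
        raise ValueError("condition 1111 is reserved")
    if code & 1:                       # NE, CC, PL, VC, LS, LT, LE
        return not condition_passed(code & ~1, n, z, c, v)
    return {
        0b0000: z,                     # EQ: equal
        0b0010: c,                     # CS: unsigned higher or same
        0b0100: n,                     # MI: negative
        0b0110: v,                     # VS: overflow
        0b1000: c and not z,           # HI: unsigned higher
        0b1010: n == v,                # GE: signed greater or equal
        0b1100: (not z) and (n == v),  # GT: signed greater than
        0b1110: True,                  # AL: always
    }[code]
```

For example, after a compare that sets N without setting V, the LT condition (code 1011) passes because N differs from V, while GE (code 1010) fails.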

3.5.2 Branch and exchange (BX)

The branch and exchange instruction takes a processor register containing the address of the branch destination. During instruction execution this address is copied into the program counter, so that, at the subsequent clock (MCLK)


Table 3.3. Condition code summary

Code  Suffix  Flags                        Meaning
0000  EQ      Z set                        equal
0001  NE      Z clear                      not equal
0010  CS      C set                        unsigned higher or same
0011  CC      C clear                      unsigned lower
0100  MI      N set                        negative
0101  PL      N clear                      positive or zero
0110  VS      V set                        overflow
0111  VC      V clear                      no overflow
1000  HI      C set and Z clear            unsigned higher
1001  LS      C clear or Z set             unsigned lower or same
1010  GE      N equals V                   greater or equal
1011  LT      N not equal to V             less than
1100  GT      Z clear AND (N equals V)     greater than
1101  LE      Z set OR (N not equal to V)  less than or equal
1110  AL      (ignored)                    always

rising edge, the instruction contained at the branch destination is fetched. If the condition is valid, the branch operation causes a pipeline flush and refilling, so the instructions contained in the fetch and decode pipeline registers are removed (NOP5 are inserted). If the condition is not valid, the processor does not execute the branch and waits for the next clock cycle to execute the subsequent instruction. This instruction also permits the instruction set exchange: by inspecting the value of the least significant bit of Rn (Rn[0]), the processor determines whether it must switch to ARM or THUMB state. The assembly syntax is:

BX{cond} Rn

where the conditional (cond) mnemonic is optional. The coding format is reported in figure 3.6, where the 4-bit fields for condition and register are shown; all the other bits are used only for the instruction encoding. The instruction takes three clock cycles to execute: the first is non-sequential because of the jump to a new address, the remaining two are sequential cycles.
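The state-exchange rule can be sketched as follows (a simplified model, names mine; the real processor additionally requires bit 1 of the address to be clear when remaining in ARM state):

```python
def bx(rn: int):
    """Branch-and-exchange sketch: Rn[0] selects the target instruction set."""
    thumb = (rn & 1) == 1              # Rn[0] = 1 -> THUMB state, 0 -> ARM state
    pc = rn & ~1                       # bit 0 is not used as part of the address
    return pc, ("THUMB" if thumb else "ARM")
```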

5NOP stands for No OPeration: an instruction that performs nothing, leaving the processor state unchanged but consuming a clock cycle.


3.5.3 Branch and branch with link (B-BL)

Branch and branch with link perform the same operation, with the difference that the latter stores the address of the following instruction in the link register, so that execution can be resumed from the point where it was left when the jump occurred. The branch destination address is relative to the instruction address and is obtained by adding a properly calculated offset to the PC. The signed 2's complement 24-bit offset expressed in the instruction is shifted left by two bits and sign extended to 32 bits; this quantity is added to the program counter to obtain the next instruction fetch address. Through the immediate offset, the instruction can specify a branch of 32 Mbytes ahead or back with respect to the PC. Branches beyond this range must use an absolute destination previously loaded into a register; in this case the PC should be manually saved in the link register (R14) if a branch-with-link type operation is required. The branch offset must take into account the prefetch operation, which causes the PC to be two words ahead of the current instruction. Introducing the suffix "L" in the assembly instruction, the branch with link (BL) operation is obtained, so that the processor writes the old PC into the link register (R14). The PC value is written into R14 when the instruction enters the execution stage and must therefore be adjusted for the prefetch operation: at that point it does not contain the address of the instruction following the branch, but a value four bytes ahead. The correction is performed via a subtraction which uses the processor ALU, so the operation cannot be done immediately, because the ALU is engaged in calculating the branch destination. The operation causes a jump to a different fetch address, so it takes three clock cycles to complete.
During the first non-sequential memory cycle the new fetch address is determined; two sequential cycles for the pipeline refilling follow. If the link option is enabled, the processor stores the current PC in LR during the first clock cycle and corrects its value in the last cycle. It is important to underline that the CPSR is not saved by the instruction, so it can be necessary to insert a proper instruction before the branch is taken. To return from a routine called by branch with link, a move instruction can be used to copy the link register into the PC if its value is still valid, or a block data transfer instruction if the link register has been saved onto a stack pointed to by a register. The assembly syntax is:

B{L}{cond} <expression>

where the conditional mnemonic and the link label (L) are optional. The field reported as <expression> can be an immediate offset, expressed via a preceding "#" character, or a symbolic expression including assembly code labels; the right

offset calculation is committed to the assembler. The coding format is reported in figure 3.6, where the 4-bit condition field and the 24-bit offset are shown; all the other bits are used only for the instruction encoding.
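The offset arithmetic described above can be sketched with a pair of hypothetical helpers (names mine), assuming, as the text states, that the PC reads eight bytes (two words) ahead of the executing instruction:

```python
def encode_branch_offset(instr_addr: int, target: int) -> int:
    """24-bit signed word offset stored in a B/BL instruction."""
    delta = target - (instr_addr + 8)       # PC is two words ahead (prefetch)
    assert delta % 4 == 0, "branch targets must be word aligned"
    return (delta >> 2) & 0xFFFFFF          # two's complement, 24 bits

def decode_branch_target(instr_addr: int, off: int) -> int:
    """Destination address recovered from the 24-bit offset field."""
    if off & 0x800000:                      # sign extend the 24-bit field
        off -= 1 << 24
    return instr_addr + 8 + (off << 2)      # shift left by two, add to PC
```

A round trip, e.g. decoding the encoding of a backward branch, returns the original target, which is the property the assembler relies on.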

3.5.4 Data processing instructions

This class of operations produces a result by performing a specified arithmetic or logical operation on one or two operands. The coding scheme, reported in figure 3.7, helps in understanding the encoding and the functionality of these instructions. A 4-bit field within the coding selects the operation to be performed, as can be seen in table 3.4; the same table reports the action executed by the ALU to obtain the result. Beside the well-known operations, bit clear is worth noting: it is a useful logical operation which permits easy masking of operands. The MOV instruction is also counted among these operations; in fact it can involve a data modification performed by the barrel shifter, not only a copy operation.

Figure 3.7. Data processing instructions coding

Table 3.4. Data processing operations summary

Mnemonic  OpCode  Action
AND       0000    operand1 AND operand2
EOR       0001    operand1 EOR operand2
SUB       0010    operand1 - operand2
RSB       0011    operand2 - operand1
ADD       0100    operand1 + operand2
ADC       0101    operand1 + operand2 + carry
SBC       0110    operand1 - operand2 + carry - 1
RSC       0111    operand2 - operand1 + carry - 1
TST       1000    as AND, but result is not written
TEQ       1001    as EOR, but result is not written
CMP       1010    as SUB, but result is not written
CMN       1011    as ADD, but result is not written
ORR       1100    operand1 OR operand2
MOV       1101    operand2 (operand1 is ignored)
BIC       1110    operand1 AND NOT operand2 (Bit clear)
MVN       1111    NOT operand2 (operand1 is ignored)

The data processing operations can be classified as logical (AND, EOR, TST, TEQ, ORR, MOV, BIC, MVN) or arithmetic (ADD, ADC, SUB, SBC, RSB, RSC, CMP, CMN). The logical operations act on all the corresponding bits of the operand or operands to produce the result, and the comparison operations (TST, TEQ, CMP, CMN) do not write their result to a register. Unless the "S" label is expressed in the instruction, the data processing operations do not affect the CPSR flags; when it is present, the flags are modified as follows:

• The V-flag is unaffected by logical operations, but it is set if the ALU detects an overflow during an arithmetic operation.

• The C-flag is set to the carry out of bit 31 of the ALU during an arithmetic operation (for a subtraction this means it is cleared when a borrow occurs); it is equal to the last bit shifted out by the barrel shifter if a logical operation is performed.

• The Z-flag is set only if the result is all zeros.

• The N-flag is always equal to bit 31 of the result, representing the sign of the obtained value.
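A minimal sketch of how the four flags follow from a 32-bit addition (my own model of the rules above, not LISARM code; subtraction can be reduced to it via the 2's complement):

```python
def add_with_flags(a: int, b: int, carry_in: int = 0):
    """32-bit ADD/ADC result plus the (N, Z, C, V) flags it would produce."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    total = a + b + carry_in
    res = total & 0xFFFFFFFF
    n = res >> 31                         # N: sign bit (bit 31) of the result
    z = int(res == 0)                     # Z: result is all zeros
    c = int(total > 0xFFFFFFFF)           # C: unsigned carry out of bit 31
    # V: operands share a sign which the result does not have
    v = int((a >> 31) == (b >> 31) and (a >> 31) != (res >> 31))
    return res, (n, z, c, v)
```

For example, adding 1 to the largest positive signed value sets N and V but not C, while adding 1 to 0xFFFFFFFF sets Z and C but not V, matching the signed/unsigned distinction drawn above.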


For the operations which do not write a result, the CPSR flag update is implicit, even if not expressed in the instruction. The "operand 2" field can be expressed in two different ways: as an immediate operand or as an optionally shifted register value. The immediate operand is expressed by an 8-bit unsigned value and a 4-bit unsigned integer which specifies a rotation to apply to it. The 8-bit value is zero extended to 32 bits and then rotated right by twice the value in the rotate field. This operation is performed by the barrel shifter and enables many common constants to be generated, e.g. all powers of two. The other method takes the second operand from the specified register and applies a barrel shifter operation controlled by the "shift" field of the instruction. This field indicates the type of shift to be performed, i.e. logical shift left (LSL), logical shift right (LSR), arithmetic shift right6 (ASR) or rotate right (ROR). Arithmetic shift left (ASL) and logical shift left (LSL) represent the same operation (the assembler produces the same code). The amount by which the register should be shifted may be contained in an immediate field in the instruction or in the least significant byte of another register (see figure 3.8). When the shift amount is specified in the instruction, it is contained in a 5-bit

Figure 3.8. ARM shift operations coding

field, which accepts any value from 0 to 31; this method is referred to as instruction specified shift amount. Some special cases are provided, in order to encode particular barrel shifter operations: shift and rotate operations by a null amount (which would not modify the operand) are not encoded as such, and the freed combinations are exploited. The LSL#0 operation uses directly the contents of Rm as the second operand and the

6The difference between logical and arithmetic shift right is that a logical shift inserts only zeros from the MSB, whereas an arithmetic shift replicates the sign bit, in order to preserve the operand's 2's complement value.

shifter carry out is the old value of the CPSR C-flag (from the previous CPU cycle). The form of the shift field which might be expected to correspond to LSR#0 is used to encode LSR#32, which has a zero result with bit 31 of Rm as the carry output. Logical shift right by zero is redundant, as it is the same as logical shift left by zero, so the assembler converts LSR#0 (and ASR#0 and ROR#0) into LSL#0, and allows LSR#32 to be specified. The form of the shift field which might be expected to give ASR#0 is used to encode ASR#32: bit 31 of Rm is again used as the carry output, and each bit of the second operand is also equal to bit 31 of Rm; the result is therefore all ones or all zeros, according to the value of bit 31 of Rm. The form of the shift field which might be expected to give ROR#0 is used to encode a special function of the barrel shifter, called rotate right extended (RRX): a rotate right by one bit position of the 33-bit quantity formed by appending the CPSR C-flag to the most significant end of the contents of Rm. The last mode to express the shift type and amount within a data processing instruction is the register specified shift amount, which uses the least significant byte of Rs to store the shift amount. If this byte is zero, the unchanged contents of Rm will be used as the second operand, and the old value of the CPSR C-flag will be passed to the shifter carry output. If the byte contains a value between 1 and 31, the shifted result is obtained by a shift operation of that amount. If the byte value is 32 or more, the result is a logical extension of the shift operations described above, following these rules:

• LSL by 32 has result zero and carry out equal to bit 0 of Rm.

• LSL by more than 32 has result zero and carry out zero.

• LSR by 32 has result zero and carry out equal to bit 31 of Rm.

• LSR by more than 32 has result zero and carry out zero.

• ASR by 32 or more has the result filled with bit 31 of Rm and carry out equal to bit 31 of Rm.

• ROR by 32 has result equal to Rm and carry out equal to bit 31 of Rm.

• ROR by n, where n is greater than 32, gives the same result and carry out as ROR by n − 32; therefore 32 can be repeatedly subtracted from n until the amount is in the range 1 to 32 (a masking operation can simplify this calculation).
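These rules can be collected into a single sketch of the barrel shifter (my own model, names mine; the RRX special case and the immediate-form quirks LSR#0→LSR#32 and ASR#0→ASR#32 are left out, since they map onto the same behavior):

```python
def barrel_shift(value: int, shift_type: str, amount: int, carry_in: int):
    """Return (result, carry_out) of a 32-bit barrel shift, per the rules above."""
    value &= 0xFFFFFFFF
    if amount == 0:
        return value, carry_in                      # operand unchanged, old C-flag
    if shift_type == "LSL":
        if amount < 32:
            return (value << amount) & 0xFFFFFFFF, (value >> (32 - amount)) & 1
        return (0, value & 1) if amount == 32 else (0, 0)
    if shift_type == "LSR":
        if amount < 32:
            return value >> amount, (value >> (amount - 1)) & 1
        return (0, (value >> 31) & 1) if amount == 32 else (0, 0)
    if shift_type == "ASR":
        sign = (value >> 31) & 1
        if amount >= 32:                            # result filled with bit 31 of Rm
            return (0xFFFFFFFF if sign else 0), sign
        fill = ((0xFFFFFFFF << (32 - amount)) & 0xFFFFFFFF) if sign else 0
        return (value >> amount) | fill, (value >> (amount - 1)) & 1
    if shift_type == "ROR":
        amount %= 32                                # ROR by n > 32 == ROR by n - 32
        if amount == 0:                             # ROR by a multiple of 32
            return value, (value >> 31) & 1
        res = ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF
        return res, (value >> (amount - 1)) & 1
    raise ValueError(shift_type)
```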

If the operation performed is in the logical class and the "S" flag is set, all the shift operations (rotate included) save the last bit shifted out in the CPSR C-flag. The instruction cycle times must take into account some hardware implementation aspects. A normal data processing operation, which involves neither the program counter nor a third register for the second operand shift, takes only one sequential

memory cycle. Because of the pipeline structure, if the destination register is R15 (the program counter), the refill operation (and the preceding pipeline flush) must be taken into account, so the instruction needs two additional cycles to complete, one sequential and one non-sequential (a jump occurs). For reasons connected to the internal bus structure, the ALU cannot access three registers during the same CPU cycle, so a further internal cycle is needed if an operation has to read the amount of a shift operation from the register file. The instruction syntax is quite complex and is differentiated by groups [19]. MOV and MVN accept a single operand, which can be expressed in various manners:

<opcode>{cond}{S} Rd,<Op2>

where <opcode> is one of the mnemonics reported in table 3.4, cond is the optional conditional mnemonic reported in table 3.3, S is the optional label which forces the CPSR flags update, Rd is the destination register and <Op2> can be:

Rm{,<shift>} or <#immed8_r>

where Rm is an operand register, which can be optionally shifted using the assembly syntax <shiftname> Rs or <shiftname> <#expr>, or RRX. The <shiftname> field can be LSL (ASL), ASR, LSR or ROR, Rs represents the register containing the shift amount and <#expr> is an integer in the range {0, 31} which represents the immediate shift amount and must be expressed with the "#" symbol preceding the number. The <#immed8_r> field is another immediate expression, which the assembler will attempt to generate by using a rotated immediate 8-bit field, as reported above. CMP, CMN, TEQ and TST do not produce a result, so they do not accept a destination register; the assembly syntax is:

<opcode>{cond} Rn,<Op2>

where Rn is the first operand register. For these operations the CPSR flags update is implicit and the S label can be omitted in the instruction syntax.
The other operations (AND, EOR, SUB, RSB, ADD, ADC, SBC, RSC, ORR, BIC) accept the following syntax:

<opcode>{cond}{S} Rd,Rn,<Op2>

where both the destination and first operand registers are expressed.

3.5.5 PSR transfer instructions

These operations are used to directly access the processor status registers CPSR and SPSR, to modify their values or to save them into general purpose registers. The instructions are formed from a subset of the data processing operations and

are implemented using the data compare instructions without the "S" flag set. Their encoding is quite sophisticated and is shown in figure 3.9.

Figure 3.9. PSR tranfer instructions coding

The MRS instruction allows the contents of the CPSR or SPSR to be moved to a general register, while the MSR instruction allows the contents of a general register to be moved to the CPSR or SPSR register. The MSR instruction also

allows an immediate value or register contents to be transferred to the condition code flags (N, Z, C and V) of the CPSR or SPSR, without affecting the control bits7. In this case, the top four bits of the specified register contents or of a 32-bit immediate value are written to the top four bits of the relevant PSR. Some restrictions are imposed: in user mode the control bits of the CPSR are protected from change, hence only the condition flags can be changed; on the contrary, in all privileged modes the entire CPSR can be changed. It is important to underline that software must never change the state of the T-bit in the CPSR directly8, but only by using a branch and exchange instruction. The SPSR register accessed depends on the processor operating mode at the time of the instruction execution, and no SPSR is accessible in user mode, since no such register exists. As discussed in paragraph 3.3.5, and in order to ensure compatibility with future processors, all reserved bits should be preserved when changing the value of a PSR. This means that a read-modify-write strategy should be used to alter the control bits of any PSR register: the content of the PSR must be transferred to a general register using the MRS instruction, the relevant bits must be changed, and then the modified value must be written back to the PSR by using the MSR instruction. PSR transfer instructions take only a sequential memory cycle to complete, because they operate directly between internal registers and require no extra time for resource access. The assembly syntax is as follows:

MRS{cond} Rd,<psr> to transfer the PSR content to Rd;
MSR{cond} <psr>,Rm to transfer the Rm content to the PSR;
MSR{cond} <psrf>,Rm to transfer Rm to the PSR flags only;
MSR{cond} <psrf>,<#expression> to modify the PSR flags by an immediate value.

The conditional mnemonic is optional, Rd and Rm are expressions evaluating to a register number, <psr> can be CPSR (also CPSR_all) or SPSR (also SPSR_all) and <psrf> must be CPSR_flg or SPSR_flg. From the <#expression>, which must be expressed with a prefix "#" and a 32-bit representable value, the assembler will attempt to generate a rotated immediate 8-bit field (the immed8_r typical form) to match the expression, returning an error message if this is impossible.
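The read-modify-write strategy can be sketched in Python (names and constants are my assumptions: the standard ARM7TDMI PSR layout places the mode field in bits M[4:0], with 0x10 for user and 0x13 for supervisor mode):

```python
# Assumption: ARM7TDMI PSR layout, mode field in bits M[4:0]
MODE_MASK = 0x1F
USER_MODE = 0x10
SVC_MODE  = 0x13

def change_mode(cpsr: int, new_mode: int) -> int:
    """MRS / modify / MSR sequence: alter only the mode bits, preserve the rest
    (including the reserved bits, as the compatibility rule above requires)."""
    value = cpsr                          # MRS Rd,CPSR
    value &= ~MODE_MASK                   # clear only the field to change
    value |= new_mode & MODE_MASK         # insert the new mode
    return value & 0xFFFFFFFF             # MSR CPSR,Rd
```

Note that every bit outside the mode field, flags and reserved bits alike, passes through unchanged, which is exactly why the detour through a general register is required.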

3.5.6 Multiply and multiply and accumulate (MUL-MLA)

The multiply (MUL) and multiply and accumulate (MLA) operations are performed internally by using Booth's algorithm on groups of eight bits per cycle. The

7i.e. the other bits, such as the mode bits, the T-bit and the interrupt disable flags. 8If this happens, the processor will enter an unpredictable state.


ARM7 datapath provides a high speed 32x8 multiplier [7] which is depicted in figure 3.10. The instructions perform the multiplication between two 32-bit operands and

Figure 3.10. LISARM 32x8 multiplier

the result is stored in a 32-bit register. The method works on operands which may be considered either as signed (in 2's complement form) or as unsigned integers, since the results of a signed and an unsigned multiply of 32-bit operands differ only in the upper 32 bits: the low 32 bits of the two results are identical. The coding format is reported in figure 3.6. Both operations use Rm as the multiplicand and Rs as the multiplier; the result is stored in Rd. The multiply instruction gives Rm ∗ Rs as result, while the third operand Rn is ignored and should be set to zero for compatibility with possible future instruction set upgrades. The Rn register plays a different role in the multiply and accumulate instruction, which gives Rd = Rm ∗ Rs + Rn, using Rn as accumulator. This instruction can save an explicit ADD instruction in some circumstances and is very useful in many

applications. As for the data processing operations, the CPSR flags update is optional for multiply instructions too; the update is controlled by the "S" bit in the instruction. The N-flag is made equal to bit 31 of the result and the Z-flag is set if and only if the result is zero; the carry flag is set to a meaningless value and the overflow flag is unaffected. The instruction syntaxes are:

MUL{cond}{S} Rd,Rm,Rs
MLA{cond}{S} Rd,Rm,Rs,Rn

where Rd is the destination register, Rm and Rs are the multiplicand and the multiplier respectively, and Rn is used as accumulator in the multiply and accumulate operation. The S flag, if present, enables the CPSR flags update. The algorithm used and the bit width of the multiplier (eight bits only) require a variable number of clock cycles to complete, depending on the number of 8-bit groups of the multiplier operand which are all zeros or all ones. Calling this number m, MUL takes one sequential and m internal cycles, and MLA takes one sequential and (m+1) internal cycles to execute. In detail, the number of internal cycles can be:

• One if bits [32:8] of the multiplier operand are all zero or all one.

• Two if bits [32:16] of the multiplier operand are all zero or all one.

• Three if bits [32:24] of the multiplier operand are all zero or all one.

• Four in all other cases.
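The early-termination rule can be modeled with a small helper (my own sketch; it inspects the 32 stored bits of the operand, a simplification of the 33-bit [32:8] view used in the bullets above):

```python
def mul_array_cycles(rs: int) -> int:
    """Number m of 8-bit multiplier array cycles for MUL/MLA.

    The array stops early when the remaining high part of the multiplier
    is all zeros or all ones (i.e. carries no further magnitude)."""
    rs &= 0xFFFFFFFF
    for m, low in ((1, 8), (2, 16), (3, 24)):
        top = rs >> low                       # bits above the groups consumed
        ones = (1 << (32 - low)) - 1          # "all one" pattern of that width
        if top == 0 or top == ones:
            return m
    return 4
```

Small positive or small negative multipliers thus cost a single array cycle, which is why the faster operand should be placed in Rs when the programmer knows its range.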

3.5.7 Multiply and multiply and accumulate long (MULL-MLAL)

Using the same algorithm and method discussed in section 3.5.6, the multiply and multiply and accumulate long instructions perform integer multiplication on two 32-bit operands and produce 64-bit results. Also in this case signed and unsigned operands are accepted, but the right instruction must be used to obtain the correct result. The UMULL and UMLAL instructions treat all of their operands as unsigned binary numbers and write an unsigned 64-bit result; the SMULL and SMLAL instructions treat all of their operands as 2's complement signed numbers and write a 2's complement signed 64-bit result. The multiply forms (UMULL and SMULL) take two 32-bit numbers and multiply them to produce a 64-bit result in the form (RdHi,RdLo) = Rm ∗ Rs. The lower 32 bits of the 64-bit result are written to RdLo, the upper 32 bits to RdHi.


The multiply-accumulate forms (UMLAL and SMLAL) take two 32-bit numbers, multiply them and add a 64-bit number to produce a 64-bit result in the form (RdHi,RdLo) = Rm ∗ Rs + (RdHi,RdLo). The lower 32 bits of the 64-bit number to add are read from RdLo and the upper 32 bits from RdHi; likewise, the lower 32 bits of the result are written to RdLo and the upper 32 bits to RdHi. As for the data processing operations, the CPSR flags update is optional for the long multiply instructions too; the update is controlled by the "S" bit in the instruction. The N-flag is made equal to bit 63 of the result and the Z-flag is set if and only if the result is zero; the carry flag is set to a meaningless value and the overflow flag is unaffected. The instruction syntaxes are:

UMULL{cond}{S} RdLo,RdHi,Rm,Rs
UMLAL{cond}{S} RdLo,RdHi,Rm,Rs
SMULL{cond}{S} RdLo,RdHi,Rm,Rs
SMLAL{cond}{S} RdLo,RdHi,Rm,Rs

The number of cycles necessary to perform the operations can be determined through the same considerations explained for the MUL and MLA instructions (paragraph 3.5.6). MULL takes one sequential and (m+1) internal cycles and MLAL takes one sequential and (m+2) internal cycles to execute, where m is the number of 8-bit multiplier array cycles required to complete the multiply, controlled by the value of the multiplier operand specified by Rs. The possible values are, for the signed instructions SMULL and SMLAL:

• One if bits [32:8] of the multiplier operand are all zero or all one.

• Two if bits [32:16] of the multiplier operand are all zero or all one.

• Three if bits [32:24] of the multiplier operand are all zero or all one.

• Four in all other cases.

and for the unsigned instructions UMULL and UMLAL:

• One if bits [31:8] of the multiplier operand are all zero.

• Two if bits [31:16] of the multiplier operand are all zero.

• Three if bits [31:24] of the multiplier operand are all zero.

• Four in all other cases.
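The split of the 64-bit product into the (RdHi, RdLo) pair can be sketched as follows (names mine; the sketch also exhibits the property, noted in section 3.5.6, that the low halves of the signed and unsigned products coincide):

```python
MASK32 = 0xFFFFFFFF

def umull(rm: int, rs: int):
    """UMULL: unsigned 64-bit product returned as (RdHi, RdLo)."""
    prod = (rm & MASK32) * (rs & MASK32)
    return (prod >> 32) & MASK32, prod & MASK32

def smull(rm: int, rs: int):
    """SMULL: signed 64-bit product returned as (RdHi, RdLo)."""
    def s32(x):                         # reinterpret a 32-bit pattern as signed
        x &= MASK32
        return x - (1 << 32) if x & 0x80000000 else x
    prod = s32(rm) * s32(rs)
    return (prod >> 32) & MASK32, prod & MASK32
```

For 0xFFFFFFFF ∗ 2, the unsigned form yields a high word of 1, while the signed form (−1 ∗ 2) yields a high word of all ones; the low words are identical.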


3.5.8 Single data transfer operations (LDR-STR)

The single data transfer operations are a group of powerful instructions which allow many addressing modes for memory data accessing. As the ARM7 implements a load/store architecture, these instructions, besides the block data transfer operations, are the only way to access data stored in memory. Both instructions transfer data only between a general purpose register and an external memory location, and the data can be word-sized or a single byte. The memory address used in the transfer is calculated by adding or subtracting an offset to or from a specified base register. If auto-indexing is required, the result of this calculation may be saved into the base register, performing a writeback operation. The coding scheme is shown in figure 3.11.

Figure 3.11. Single data transfer instructions coding

The offset can be expressed via a 12-bit unsigned immediate value or via an optionally shifted register value (in the same way as for the

data processing instructions, but a register-specified shift amount is not allowed). By default the offset is added to the base register Rn; to subtract it, a "-" must be introduced in the instruction syntax before the offset register or immediate value. The binary coding uses the "U" flag to establish whether the base register address must be incremented or decremented. The offset modification may be performed either before (pre-indexed) or after (post-indexed) the base is used as the transfer address, through a proper use of the square brackets in the assembly syntax; in the coding, the pre or post indexing condition is expressed by the "P" flag. The "W" bit gives optional auto increment and decrement addressing modes by activating the writeback operation; this bit is redundant in post-indexed data transfers, which always write back the modified base. In that case it assumes a particular meaning in privileged modes, because it forces non-privileged mode for the transfer, allowing the operating system to generate a user address in a system where the memory management hardware makes suitable use of it. To define the data size for the transfer, the "B" label must be appended to the assembly mnemonic, so that the processor can signal it to the memory system. In these operations the endianness configuration plays a fundamental role, because byte load operations expect the data on some specific lines, which differ between the big and little endian configurations. The store operation, on the other hand, repeats the same byte on all the 8-bit groups of lines and, by reading the dedicated memory interface signals, the memory management unit has to determine the right location for the transfer. Further details can be found in [16], where the complex mechanisms used in non-aligned data accesses are described.
Some information about the processor behavior is given in the memory interface paragraph (3.7) and the fundamental endianness schemes are reported in section 3.3.2. If an error occurs during a memory access, the MMU signals this situation to the processor by using the ABORT line and the abort exception handling procedure must be activated. The actions performed by the processor to avoid entering an unrecoverable state are discussed in paragraph 3.4, where the data abort trap is described. The store operation takes two non-sequential memory cycles to complete: in the first cycle the address calculation is performed by the ALU and in the subsequent clock cycle the data is stored. The load operation takes an internal, a non-sequential and a sequential memory cycle to complete: in the first cycle the address calculation is performed by the ALU, in the subsequent clock cycle the data address is supplied to the MMU and in the third cycle the data is sampled on the data bus and registered. Since the destination register can also be the PC, a sequential and a non-sequential memory cycle must be added in that case, in order to perform the pipeline refilling (a pipeline flush is also required). If the writeback operation is required, the base register is updated in the second cycle for both instructions.


The assembly syntax for these instructions is quite sophisticated, because of the various addressing modes available. The general form is the following:

<LDR|STR>{cond}{B}{T} Rd,<Address>

where cond is the conditional field, Rd is the destination or the source register, and B must be specified for a byte access. The label T, if present, forces non-privileged mode for the transfer cycle, but it is not allowed when a pre-indexed addressing mode is specified or implied. The <Address> part discriminates between the various types of addressing modes, which are described in depth in [19]. The target location can be PC-relative, in which case the <Address> field is an expression representing an offset that the assembler combines with the PC value to generate a pre-indexed address. A zero offset can be expressed by using [Rn] in the <Address> field, and the other pre-indexing modes use one of the following formats:

[Rn,<#expression>]{!} or [Rn,{+/-}Rm{,<shift>}]{!}

where <#expression> is an immediate offset and "!", if present, requires the writeback operation. The second form accepts an optional <shift> field to be applied to the register content, expressed with the notation discussed in the data processing section (3.5.4). The post-indexing modes can be specified by:

[Rn],<#expression> or [Rn],{+/-}Rm{,<shift>}

where the writeback operation is implicit and must not be requested in the assembly instruction. In all the previous notations the square brackets contain the base register indication and, for pre-indexing, the offset specification.
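The address generation described above, with its P, U and W bits, can be sketched as follows (a simplified model, names mine; the T-bit privilege forcing is not modeled):

```python
def transfer_address(base: int, offset: int, up: bool, pre: bool, writeback: bool):
    """LDR/STR address generation: pre/post indexing, U and W bits.

    Returns (address used for the transfer, new base register value)."""
    modified = (base + offset if up else base - offset) & 0xFFFFFFFF
    if pre:                                   # P = 1: index before the access
        address = modified
        new_base = modified if writeback else base
    else:                                     # P = 0: access, then index;
        address = base                        # post-indexing always writes back,
        new_base = modified                   # so the W bit is redundant here
    return address, new_base
```

For example, a pre-indexed access without writeback reads from base+offset and leaves the base untouched, while a post-indexed access reads from the base and then updates it.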

3.5.9 Halfword and signed data transfer operations

These instructions allow loading or storing halfwords of data, and also loading bytes or halfwords which represent signed values and thus need to be sign extended to 32 bits. The instructions accept the same addressing modes seen for the ordinary LDR and STR operations (section 3.5.8), with small differences in the assembly syntax. The encoding scheme is reported in figure 3.12, where some additional flags, with respect to the ordinary instructions, are evident. The "S" flag is used to discern between signed and unsigned data and the "H" flag tells the size of the data to transfer. The same letters must be expressed in the instruction syntax to allow the correct interpretation of signed values and data sizes. The result of the transfer operation is sign extended to a 32-bit value for a signed data type, or zero filled from the left for unsigned values. The assembly syntax is:

LDR|STR{cond}H|SH|SB Rd,<address>


Figure 3.12. Halfword and signed data transfer instructions coding

where H requests the transfer of a halfword quantity, SB loads a sign extended byte and SH loads a sign extended halfword. The last two suffixes are valid only for the LDR operation and all the other fields follow the standard LDR and STR assembly scheme. The considerations about the abort exception and the instruction cycle times are the same already seen in section 3.5.8.

3.5.10 Block data transfer operations (LDM-STM)

The block data transfer operations allow loading (LDM) and storing (STM) the complete set of the general purpose registers, or a user defined subset, to and from memory. The assembly syntax provides many ways to express the register list, which in the binary representation is encoded with a set of sixteen flags telling whether each register must be transferred or not. The instruction supports all the possible stacking modes: starting from the base register value, the memory address can be pre/post incremented/decremented, so that the stack can grow up or down in the memory space. The operation is useful in cases where the processor registers content must be saved on a stack, to pass control to a subroutine or to another process, saving many memory cycles with respect to single data transfer operations. In the coding format (figure 3.6) many flags are reported; their meaning is the same

discussed in the single data transfer section (3.5.8) for the addressing modes and the writeback operation. For these operations an immediate offset cannot be expressed and byte data transfer is not allowed. The "S" flag has a particular meaning, to be used in privileged modes. The instructions are usually performed on the register bank of the current state but, if register R15 is in the list and the "S" flag is active, the LDM operation transfers the SPSR to the CPSR at the same time the PC is transferred9 (because of this action a mode change happens). The STM instruction, in the same condition, transfers the user bank registers instead of the current mode bank registers. This behavior is useful for switching between processes, when freezing the user state is necessary. If R15 is not in the list and the "S" flag is active, both instructions transfer the user bank registers rather than the current mode registers (the behavior of STM does not change with respect to the previous condition). In presence of memory access errors the behavior is the same discussed for the single data transfer operations (section 3.5.8), with some differences: the STM instruction writes back the base register if required, hence the recovery from this situation must be performed by software. The LDM instruction, after the abort exception occurs, does not update the remaining registers and saves the PC content, ensuring recoverability; the base register content is restored, in order to retry the load operation under the abort handler control. The assembly syntax is:

<LDM|STM>{cond}<addressing mode> Rn{!},<Rlist>{^}

where Rn is the base register and <Rlist> is the register list, which can be expressed by separating register names with commas or by grouping them with "-", so that all the included registers are transferred (e.g. {R0,R2-R7,R10}). If the symbol "^" is present, it sets the "S" bit to load the CPSR along with the PC, or forces a user bank transfer when in privileged mode. The "!" symbol, if present, requires the write-back operation. The addressing modes have different mnemonics and names, which are reported in table 3.5.

Table 3.5. Block data transfer addressing mode names

    Name                    Stack    Other    L bit    P bit    U bit
    pre-increment load      LDMED    LDMIB    1        1        1
    post-increment load     LDMFD    LDMIA    1        0        1
    pre-decrement load      LDMEA    LDMDB    1        1        0
    post-decrement load     LDMFA    LDMDA    1        0        0
    pre-increment store     STMFA    STMIB    0        1        1
    post-increment store    STMEA    STMIA    0        0        1
    pre-decrement store     STMFD    STMDB    0        1        0
    post-decrement store    STMED    STMDA    0        0        0

The store operation takes two non-sequential and n−1 sequential memory cycles to complete: in the first cycle the address calculation is performed by the ALU and in the subsequent clock cycles the n listed registers are stored. The load operation takes respectively an internal, a non-sequential and n sequential memory cycles to complete: in the first cycle the address calculation is performed by the ALU and in the subsequent clock cycles the data addresses are supplied to the MMU for the loading of the n listed registers. After the first two cycles, the processor starts sampling the data bus and the read values are registered. Since the destination register can also be the PC, in that case a sequential and a non-sequential memory cycle must be added, in order to perform the pipeline flush and refill. If the write-back operation is required, the base register is updated in the second cycle for both instructions.

9registers are transferred in order from R0 to R15, so the PC is always the last of them.
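As an illustration of the encoding just described, the following Python sketch (not part of the thesis toolflow; the helper name is arbitrary) extracts the sixteen register-list flags and the P, U, S, W and L bits from a block data transfer instruction word:

```python
def decode_block_transfer(instr):
    """Decode the fields of a 32-bit ARM block data transfer (LDM/STM)
    instruction: bits 0-15 hold one flag per register, while the P, U
    and L bits select the addressing mode of table 3.5."""
    return {
        "regs": [r for r in range(16) if (instr >> r) & 1],
        "P": (instr >> 24) & 1,  # pre (1) / post (0) indexing
        "U": (instr >> 23) & 1,  # increment (1) / decrement (0)
        "S": (instr >> 22) & 1,  # PSR / force-user-bank flag
        "W": (instr >> 21) & 1,  # write-back
        "L": (instr >> 20) & 1,  # load (1) / store (0)
    }

# LDMFD R13!, {R0,R2-R7,R10}: post-increment load with write-back
fields = decode_block_transfer(0xE8BD04FD)
```

For example, 0xE8BD04FD decodes to a post-increment load (LDMFD/LDMIA) of registers R0, R2-R7 and R10 with write-back, matching the L, P and U columns of table 3.5.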

3.5.11 Single data swap (SWP)

The data swap instruction is used to swap a byte or word quantity between a register and external memory. It is implemented as a memory read followed by a memory write which are "locked" together: the processor cannot be interrupted until both operations have completed, which makes this instruction particularly useful for implementing software semaphores. A semaphore is a protected variable which represents the classic method for restricting access to shared resources (e.g. storage) in a multiprogramming environment. To prevent the memory content from being modified during the swap operation, the MMU is warned to treat the two accesses as inseparable, so access to the memory is not granted to other peripherals. The execution of a swap operation is signalled to the memory management unit through the LOCK processor output, which remains high until the operation completes. Via the "B" label expressed in the instruction, a byte quantity can be swapped. The SWP instruction is implemented as a LDR followed by a STR, whose actions are described in the relative section (3.5.8). If the address used for the operation is unacceptable to a memory management system, the memory manager can flag the problem by driving the ABORT signal high. This can happen during the read or the write cycle (or both) and, in either case, the data abort trap will be taken (paragraph 3.4). The system software is expected to resolve the cause of the problem, so that the instruction can be restarted and the program execution continued. The assembly syntax is:

SWP{cond}{B} Rd,Rm,[Rn]

where the conditional mnemonic and the byte transfer label (B) are optional, Rd is the destination register, Rm the source register (which can be the same as the destination register) and Rn is the base register, i.e. the register containing the address for the memory access. The swap instruction takes four clock cycles to complete: the first cycle is used to access the base register (internal cycle); the memory access is performed during the following two cycles, the first reading the memory location (non-sequential cycle) and the second writing the same location (sequential cycle); the last cycle, a non-sequential one, is needed to store the read data into the processor register.
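The locked read-then-write semantics can be sketched as follows (an illustrative Python model using a dictionary as memory; in the real processor the LOCK output keeps the two accesses inseparable):

```python
def swp(memory, regs, rd, rm, rn):
    """Model of the SWP word swap: an atomic read of [Rn] followed by
    a write of Rm to the same location, the old value ending up in Rd."""
    addr = regs[rn]
    old_value = memory[addr]     # read cycle (location now "locked")
    memory[addr] = regs[rm]      # write cycle to the same location
    regs[rd] = old_value         # read data stored into the register
```

A semaphore acquire, for instance, would swap a "busy" marker into the protected variable and then inspect the value swapped out to see whether the resource was free.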

3.5.12 Software interrupt

The instruction causes the software interrupt trap to be taken and is used to enter supervisor mode. The instruction saves the program counter into the banked link register R14_svc and then forces it to the value fixed by the exception vectors (section 3.2). In order to allow the present state of the processor to be restored, the CPSR content is saved into the SPSR_svc register. The return instruction, as seen in section 3.4, restores both the status register and the program counter in an atomic operation. The coding format is reported in figure 3.6, where the 24-bit comment field represents a group of ignored bits. The operation causes a jump to a different fetch address, so it takes three clock cycles to complete: a non-sequential memory cycle and two sequential cycles for the pipeline refill. After these cycles, control is passed to the exception handler.
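The state changes on software interrupt entry described above can be summarised in a simplified sketch (hypothetical field names; the setting of the interrupt disable bit on exception entry is omitted for brevity):

```python
SWI_VECTOR = 0x08        # software interrupt exception vector address
SVC_MODE_BITS = 0x13     # CPSR mode field value for supervisor mode

def swi_entry(state):
    """Simplified model of SWI entry: save the return address in the
    banked R14_svc, save the CPSR into SPSR_svc, switch the mode bits
    to supervisor and force the PC to the exception vector."""
    state["r14_svc"] = state["pc"]
    state["spsr_svc"] = state["cpsr"]
    state["cpsr"] = (state["cpsr"] & ~0x1F) | SVC_MODE_BITS
    state["pc"] = SWI_VECTOR
```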

3.5.13 Coprocessor instructions

The processor can be connected with up to sixteen coprocessors via the dedicated coprocessor interface, as explained in section 3.8. Three classes of coprocessor instructions are provided by the ARM instruction set:

• Coprocessor Data Operations (CDP)

• Coprocessor Data Transfers (LDC, STC)

• Coprocessor Register Transfers (MRC, MCR)

The coprocessor data operations (CDP) are used by the processor to request the execution of an operation by a given coprocessor. All the coprocessor instructions must provide the "CP#" field, which selects the target coprocessor by a unique number. No result is communicated back to the processor, and it will not wait for the operation to complete. The coprocessor could contain a queue of such instructions awaiting execution, and their execution can overlap other activity, allowing the coprocessor and ARM7TDMI to perform independent tasks in parallel. In this class of instructions only a few bits are meant for the processor, i.e. the condition field and a 4-bit field which identifies the instruction itself. The remaining bits are used by the coprocessors and some field names are fixed by convention; all of these fields, except the coprocessor identifier, may be redefined. The coprocessor data transfer operations are used to load (LDC) or store (STC) a subset of a coprocessor's registers directly to memory. ARM7TDMI is responsible for supplying the memory address, while the coprocessor supplies or accepts the data and controls the number of words transferred through a fixed handshaking protocol. The coprocessor reserved fields are only the "CP#" identifier and a 4-bit index which points to one of the internal registers. The processor has to decode all the other bits to identify the memory addressing information. The addressing modes available are a subset of those used in single data transfer instructions (section 3.5.8), but pre-indexed, post-indexed and also PC-relative addressing modes are provided. The coprocessor register transfer operations (MRC, MCR) are used to communicate information directly between the processor and a coprocessor, transferring data from register to register. An important use of these instructions is the transmission of control information from the coprocessor to the ARM7TDMI CPSR flags, to control the subsequent flow of execution after, e.g., an arithmetic coprocessor operation. As discussed for the data operation class, in this instruction group only a few bits are meaningful for the processor, i.e. the condition field and a 4-bit field which identifies the instruction itself.
The remaining bits are used by the coprocessors and their conventional meanings may be freely redefined, except for the coprocessor identifier field. The coprocessor instruction cycle times depend on the number of busy-wait cycles required for the execution of the internal operations and must be analyzed case by case, considering the coprocessor functionalities. The coprocessor instructions are not implemented in this thesis work, because the goal of the model built is to obtain an extensible processor description, which can also be integrated to perform other operations without the need of external coprocessors.
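All three instruction classes share the 4-bit "CP#" field, held in bits 11-8 of the instruction word; a minimal sketch of its extraction:

```python
def coproc_number(instr):
    """Return the CP# field (bits 11-8) addressing one of the up to
    sixteen coprocessors attached to the coprocessor interface."""
    return (instr >> 8) & 0xF
```

For instance, any coprocessor instruction whose bits 11-8 hold 0b1110 addresses coprocessor 14, the identifier reserved for the debug communications channel mentioned in section 3.9.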

3.5.14 Undefined instruction

The undefined instruction is conditionally executed, which makes it possible to establish at run time whether the undefined instruction trap is to be taken or not, so that the relative handling routine can be activated. The execution of the undefined instruction involves forwarding the unrecognized instruction to the connected coprocessors; if one of them accepts the instruction, the CPA and CPB signals are used for the handshaking and the coprocessor operations are activated. The instruction cycle time must account for the request forwarded to the coprocessors and their response; if all coprocessors are absent, control must be transferred to the exception handler, so the operation takes an internal cycle, a non-sequential cycle (jump to the location pointed to by the exception vector) and two sequential cycles (pipeline refill) to complete. Given its particular meaning and use, the operation has no assembler mnemonic, so it cannot be generated by a simple line of code. The coding format is reported in figure 3.6, where many "don't care" bits are evident.

3.6 Thumb instruction set

The Thumb instruction set provides fewer operations than the ARM instruction set and every instruction is only 16 bits wide. This property increases the code density, while all the operations are still performed on the standard 32-bit registers of the architecture. In order to have fewer bits to encode, only eight of the sixteen ARM state general purpose registers are accessible to most operations, so that a register index takes only three bits. Through special instructions, the high registers (from R8 to R15) can also be accessed, but for particular purposes only. As discussed in paragraph 3.2, every Thumb instruction is dynamically translated to an ARM instruction by the decoder and then executed. The translation guidelines are provided in [16]. It is obvious that a 16-bit encoded operation cannot include the same features as a 32-bit instruction, so a reduced number of logical-arithmetic operations is allowed. Many operations use the same register as source and destination, and immediate values are provided only for some instructions. Addition and subtraction are encoded in a dedicated instruction format, separated from the other ALU operations; fewer ALU operations are provided by the instruction set, and all the barrel shifter operations are explicitly included in this group. The branch instructions come in conditional and unconditional versions; long branch with immediate offset is also allowed and, obviously, the branch and exchange instruction. Memory access operations can manage every size of data, but with some limitations on the addressing modes. Stack pointer (SP or R13) relative load/store and push/pop register operations ease the use of stack structures, saving bits in the encoding. Coprocessor operations are not provided in THUMB state, so the use of an external unit requires a change of instruction set and processor state.
Interrupt and exception handling is also provided in Thumb state and forces a switch to the ARM state; the software interrupt instruction is also provided, so that supervisor mode can be accessed.
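The 3-bit register indices that restrict most Thumb operations to the low registers R0-R7 can be illustrated on the register-form ADD, whose 16-bit encoding packs three register fields after a 7-bit opcode (a sketch handling this one format only):

```python
def decode_thumb_add_reg(instr16):
    """Decode the Thumb ADD Rd, Rn, Rm format (0001100 Rm Rn Rd):
    three registers fit in nine bits because each index is 3 bits."""
    assert (instr16 >> 9) == 0b0001100, "not a register-form ADD"
    rm = (instr16 >> 6) & 0x7
    rn = (instr16 >> 3) & 0x7
    rd = instr16 & 0x7
    return rd, rn, rm

# 0x18D1 encodes ADD R1, R2, R3
```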


3.7 The memory interface

The processor memory interface is highly configurable, allowing the connection of both SRAM and DRAM systems, but also ROM. Besides the 32-bit address bus, a bidirectional 32-bit data bus (D) is provided, together with a pair of unidirectional data buses (DIN and DOUT) of the same width. A complete set of control signals is provided, in order to fully exploit DRAM page mode access. The processor accesses the memory system via four categories of memory transfer cycles:

• Non-sequential cycle, performed when the processor requests a transfer to or from an address which is unrelated to the address used in the preceding cycle.

• Sequential cycle, when the processor requests a transfer to or from an address which is either the same as the address in the preceding cycle, or is one word or halfword after the preceding address.

• Internal cycle, in which a memory transfer is not required because the proces- sor is performing an internal function and no useful prefetching can be done at the same time.

• Coprocessor register transfer cycle, which allows the processor to use the data bus for coprocessor communications, so that no action by the memory system is required.

To communicate to the memory system and to the coprocessors connected to the bus which type of memory cycle the processor is executing, two signals are provided. The nMREQ output signal indicates that the processor requires memory access during the following cycle (memory request). The SEQ output signal becomes high when the address of the next memory cycle is related to that of the last memory access: the new address is either the same as the previous one, or four greater in ARM state, or two greater in THUMB state. The SEQ signal is also used to discriminate between an internal cycle and a coprocessor transfer cycle, when a memory request is not issued by the processor. An example of the four types of memory access timings is reported in figure 3.13. To allow the correct use of dynamic memory systems, the processor must provide the address of the location to be accessed as early as possible, in order to permit the longer address decoding and the generation of the DRAM control signals. To do so, ARM7TDMI provides pipelined and de-pipelined access capabilities via an input signal (APE); through this input every single memory access can be configured, so that mixed DRAM/SRAM memory resources can also be accessed.

Figure 3.13. Memory cycle timings diagram

The processor can transfer data of different sizes; due to single byte addressing, words (four bytes), halfwords (two bytes) and single bytes can be accessed. The data size of the transaction taking place is signalled by the MAS[1:0] output, and the position of the sub-word sized data can be inferred from the address value. When a load operation on a halfword or byte is performed, the memory system can present the whole word on the data bus, since the processor selects the required part of the 32-bit data. The memory management unit has to give the right interpretation to the address communicated by the processor and to the MAS signal, ignoring the least significant bits in word and halfword accesses and presenting the correctly word-aligned data. To save power and keep the decoding logic simple, however, the MMU may also present only the required part of the data, without interfacing problems. During store operations of sub-word sized data, the ARM processor broadcasts the byte or the halfword across the whole data bus, so that a byte is repeated at every byte boundary and a 16-bit datum is replicated twice. In this case the MMU has to decode the least significant bits of the address to deduce the right location of the data to be stored. The instruction fetch must respect the processor state, so that in ARM state the whole word is read from the data bus and in THUMB state the right 16-bit instruction is selected. For the memory system there is no difference between the two operating states, because the MAS signal is driven during code segment accesses as well, and so is the fetch address. To allow the memory system to flag impossible data accesses to the processor, the ABORT input may be used; as already discussed in paragraph 3.4, this event occurs on memory page faults, and the data abort handler must manage the situation to permit the MMU to retrieve the required memory page.
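The broadcasting of sub-word store data described above can be sketched as follows, with the size argument following the MAS[1:0] encoding (0 = byte, 1 = halfword, 2 = word):

```python
def replicate_store_data(value, size):
    """Replicate store data across the 32-bit bus as the processor
    does: a byte appears in all four byte lanes, a halfword in both
    halfword lanes, a word is driven unchanged."""
    if size == 0:                              # byte store
        return (value & 0xFF) * 0x01010101
    if size == 1:                              # halfword store
        return (value & 0xFFFF) * 0x00010001
    return value & 0xFFFFFFFF                  # word store

# replicate_store_data(0xAB, 0)   -> 0xABABABAB
# replicate_store_data(0x1234, 1) -> 0x12341234
```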
Other signals like nRW are driven by the processor to access the memory in read or write mode, and the nTRANS output indicates whether the processor is in user or in a privileged mode; the latter information may be used to protect system pages from user access, or to support completely separate memory mappings for system and user modes. During the execution of a swap instruction (SWP), which allows the contents of a memory location to be swapped with the contents of a processor register, the memory controller must not give access to another device, in order to prevent the affected memory location from changing before the operation is completed. The reason is that the instruction is implemented as an uninterruptible pair of accesses to the memory: the first access reads the contents of the addressed memory location, the second writes the register content to the same memory cell. ARM7TDMI drives the LOCK signal high for the duration of the swap operation, to signal to the MMU that a swap instruction is being executed. To ease the connection of the processor to sub-word sized memory systems, input data and instructions may be latched on a byte by byte basis. This is achieved by use of the BL[3:0] input signals, where every bit controls the latching of one byte of the word present on the data bus. By using these signals, a word access to a halfword-wide memory can take place, obviously in two memory cycles: in the first cycle the first halfword is obtained from the memory and latched into the processor, when BL[1:0] are both high; in the second cycle the other halfword is latched into the processor, when BL[3:2] are both high. Since two memory cycles are required, nWAIT is used to stretch the internal processor clock. When accessing slow peripherals, the processor can be made to wait for an integer number of MCLK cycles by driving nWAIT low. Internally, this signal is ANDed with MCLK, extending the clock cycle for as long as it is tied low. Since the processor is provided with a pair of unidirectional data buses, besides the bidirectional one, it is possible to interface the processor with all kinds of external memory systems. In order to meet some ASIC specifications for embedded systems, however, the bidirectional data bus cannot be used and the communication is allowed only on the unidirectional buses. The timings of the two types of buses are identical, but they cannot be enabled at the same time, and the input signal BUSEN selects which one is to be used. To use the bidirectional data bus, moreover, it is necessary to establish when the processor uses it for a write cycle and when for a read operation. Besides the nRW signal, the nENOUT output is provided: it is tied low to indicate that the data bus is driven by the processor. Whenever the ARM does not need to send data on the bidirectional bus, the signal is driven high. Another signal, nENIN, permits the bus to be driven in three-state mode; the external unit ties this input high to signal that no data is supplied to the processor input, so the bus can be put in the high impedance state.
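The way nMREQ and SEQ jointly identify the four memory cycle categories of this section can be summarised as a small decode table (a sketch; nMREQ is active low):

```python
def cycle_type(nMREQ, SEQ):
    """Map the two control outputs onto the four memory cycle types."""
    return {
        (0, 0): "non-sequential",
        (0, 1): "sequential",
        (1, 0): "internal",
        (1, 1): "coprocessor register transfer",
    }[(nMREQ, SEQ)]
```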


3.8 The coprocessor interface

The processor can be connected to a number of external coprocessors (up to sixteen) in order to extend the functionalities of the native instruction set. Every coprocessor is selected by using the dedicated 4-bit field (CP#) within the coprocessor instruction; when the addressed coprocessor is not present, instructions intended for it will trap. In this case, to avoid abnormal program termination, suitable software may be installed to emulate the functions of the missing external processor. The interfacing with the coprocessors is obtained via three handshaking signals (nCPI, CPA, CPB) and by using the same data bus connected to the memory system. The processor takes nCPI low whenever it starts to execute a coprocessor (or undefined) instruction and, since they are connected to the data bus, each coprocessor receives a copy of the instruction. Every coprocessor inspects the "CP#" field, in order to establish whether it matches its own number; if so, it should drive the CPA signal line low. If no coprocessor has a number which matches the CP# field, CPA and CPB will remain high, and the processor will take the undefined instruction trap. Otherwise the processor observes the CPA line going low, and waits until the same coprocessor also ties the CPB signal low. Only the coprocessor which is driving CPA low is allowed to drive CPB low, and it should do so when it is ready to execute the instruction. The processor will busy-wait while CPB remains high, i.e. it does not execute other operations until the coprocessor has performed the requested operation, unless an enabled interrupt occurs. In that case it will break off from the coprocessor handshake to process the interrupt, retrying the coprocessor instruction later and repeating the operations described above. When CPB goes low for the second time, the instruction continues to completion.
This will involve data transfers taking place between the coprocessor and either ARM7TDMI or memory, until the coprocessor ceases to be busy and the CPB signal indicates the instruction is completed. Since the coprocessors are all connected to the ARM data bus and the instructions are fetched by the processor via the same bus, the coprocessors can also load their internal pipelines without requiring the subsequent communication reported above. To allow the coprocessors to discern between a data load and an instruction fetch, the nOPC signal is provided. To activate the communication between the processor and a coprocessor, a register transfer cycle must be performed, so the nMREQ signal is tied high to exclude the memory system from the transaction. To load or store coprocessor internal registers directly from/to memory, via the coprocessor data transfer instructions, a particular method is used. The coprocessor controls the number of registers to be loaded/stored and performs the memory access via the data bus, directly with the memory system. The memory addressing and the driving of the memory control signals are performed by the processor. The coprocessor is responsible for the number of words to be transferred and, once the transaction begins, ARM increments the starting address for each subsequent memory access. By driving the CPA and CPB lines high, the coprocessor signals the termination condition to the processor. For particular uses, like the activation of certain units, the coprocessor can restrict the execution of some instructions to privileged modes only. To do so, the coprocessor has to track the nTRANS processor output as well. Since undefined instructions are treated by ARM as coprocessor instructions, all coprocessors must appear absent (i.e. CPA and CPB must be high) when a real undefined instruction is presented, so that the processor will take the undefined instruction trap. To differentiate an undefined instruction from a coprocessor one, a coprocessor needs only to look at bit 27 of the instruction (it is "0" for undefined, "1" otherwise). To avoid incorrect behavior in THUMB operating state (in which coprocessor instructions are not supported but undefined instructions are), all coprocessors must monitor the TBIT output, drive the CPA and CPB inputs high, appearing absent, and ignore the data bus. In this way, coprocessors will not erroneously execute Thumb instructions, and all undefined instructions will be handled correctly.
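The bit-27 test a coprocessor uses to tell a genuine undefined instruction from a coprocessor one reduces to a single mask:

```python
def is_coprocessor_instruction(instr):
    """Inspect bit 27 of the instruction word, as described above:
    '0' means a genuine undefined instruction, '1' a coprocessor one."""
    return (instr >> 27) & 1 == 1
```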

3.9 The debugging system

The ARM processor is provided with a debug interface based on the Boundary Scan standard (IEEE Std. 1149.1/1990), which represents a hardware extension with advanced debugging features; it is intended to ease the development of application software, operating systems and hardware embedding the core. A typical debug system is made up of three parts: a host (usually a PC running debugging software), the ARM7TDMI core and, in order to manage the communications between the two parts, a protocol converter. The debug extensions allow the core to be stopped either on a given instruction fetch (breakpoint) or data access (watchpoint), but also asynchronously by a debug-request signal. On entering the debug state, the core internal state and the system external state may be examined in depth; then the core and system state may be restored and program execution resumed. The core internal state can be examined via the JTAG serial interface (TAP controller10). The JTAG system includes a set of multifunctional registers (figure 3.15) connected to all the inputs and outputs of the processor; the registers are also serially connected to each other, so that they create a chain whose ends are connected to the JTAG controller. The scan cells can force the input signals and sample the output signals of the processor, in order to inspect its internal state, and, when transparent, they also allow the normal execution flow. To do so, the necessary instructions must be loaded into the TAP controller. By using the scan chain, instructions can be serially inserted into the core pipeline, without using the external data bus (figure 3.14). For example, when in debug state, a store-multiple (STM) could be inserted into the instruction pipeline, dumping the contents of the ARM7TDMI registers. This data can be serially shifted out without affecting the other parts of the system. The debug system is not fully JTAG compliant; nevertheless, it supports all the mandatory instructions and also other typical standard operations.

10Test Access Port, its functionalities are defined by the Boundary Scan standard.
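The serial insertion through the scan chain can be pictured with a toy shift-register model (purely illustrative, not the real scan cell behavior): on each clock, one bit enters one end of the chain and one bit leaves the other.

```python
def shift_scan_chain(chain, bits_in):
    """Shift a bit pattern through a boundary-scan chain model:
    returns the bits shifted out and the new chain contents."""
    out = []
    for b in bits_in:
        out.append(chain[-1])        # bit leaving the far end
        chain = [b] + chain[:-1]     # the pattern moves one cell along
    return out, chain
```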

Figure 3.14. ARM7TDMI Boundary Scan scan chain

The ARM7TDMI processor is provided with the EmbeddedICE (or ICEBreaker) core extension as an on-chip debug resource. It consists of two real-time watchpoint units, with associated control and status registers, as well as a set of registers implementing a communication channel with the debugger, referred to as the Debug Communications Channel (DCC). The communication on the channel is obtained via the standard method used for coprocessors; for this reason the coprocessor identifier 14 is reserved and not available for a normal external coprocessor. The watchpoint units can be programmed to halt the execution of instructions by the ARM core. Execution is halted when a match occurs between the values programmed into the EmbeddedICE macrocell and the values currently appearing on the address and/or data buses. Either watchpoint unit can be configured to be a data watchpoint (monitoring data accesses) or an instruction breakpoint. The ICEBreaker is programmed in a serial manner, by using the same TAP controller discussed above. Via the EmbeddedICE interface all the internal resources of the processor can be accessed and modified, so that complete debugging of the software can be obtained and the correct behavior of many internal parts of the processor can be tested.

Figure 3.15. ARM7TDMI Boundary Scan input scan cell


Figure 3.16. ARM7TDMI EmbeddedICE block diagram

Chapter 4

LISATek toolsuite

The diffusion of electronic devices in various aspects of common life has deeply changed many constraints of industrial production. The electronics and telecommunications markets impose very short life cycles on products, so that time-to-market and time-to-volume have become fundamental parameters to guarantee the economic return on the introduction of a new device. On the other hand, semiconductor technology has opened the horizons to the integration of great capabilities on a single chip and this, together with increasing consumer needs and the demand for new powerful applications, has led to an enormous growth in the complexity of digital designs. The distributed hardware approach has been abandoned in favour of SoC1 designs, which allow heterogeneous components and intercommunicating multicore systems to be manufactured on a single die. This powerful technique, however, requires quite long development cycles, so that the designer's productivity has become a vital factor for successful products. For this reason, the idea of implementing powerful algorithms for system functions and signal processing in software, reducing the complexity of hardware development, has led to the shift from purely hardwired digital systems to the inclusion of programmable cores in SoC designs [11]. This strategy reduces the burden on the hardware designer, who has to optimize fewer aspects of the target architecture with respect to the purely hardwired approach, and represents a new project methodology that leads to the design of Application Specific Instruction-set Processors (ASIPs). In this environment the LISATek toolsuite [8] represents an innovative approach, because it introduces automation in a development sector in which most of the steps are executed manually, allowing the implementation of both the processor and the toolchain for software design.

1System-on-Chip or System-on-a-Chip


4.1 The ASIP design flow

The design flow of an ASIP has some main phases:

• Architecture exploration.

• Architecture implementation.

• Application software design or toolchain creation.

• System integration and verification.

During the first phase many tools are required because, starting from the application intended to run on the processor, an HLL2 compiler, an assembler, a linker and a cycle-accurate model of the ASIP are needed. This phase is an iterative process which allows the model to be profiled and benchmarked, so that some optimizations of the structure can be carried out to fit the requirements of the target application. It is clear that modifying a part of the model requires a revision of all the software tools used, and this represents a complication due to the absence of automatic tools. Once a valid cycle-accurate model is obtained, the architecture implementation can start: the functional description, often written in an HLL like C, has to be converted into a synthesizable description via an HDL3 such as VHDL or Verilog. This is another manual phase, and this approach exposes the final result to consistency problems between the HLL model and the hardware model. In the software application design phase, software designers need a set of software development tools, substantially the same instruments used in the exploration phase, but enhanced to guarantee a short duration for this step. The demands of the software application designers and of the hardware processor designers are different: the latter need a cycle-accurate simulator for hardware and software analysis, which is inevitably slow; the former require simulation speed more than accuracy. These tools need two different approaches and optimizations, and these reimplemented versions must also be produced manually. The system integration and verification phase needs the realization of cosimulation interfaces, required to integrate both the hardware description of the processor and the software simulator into a system simulation environment, so that both descriptions are stimulated via the same patterns or executable files. Also in this case, an architectural modification may require a revision of the written interfaces.
The effort of designing a new architecture can be reduced by using a retargetable approach based on a description of both the platform and the instruction set. The Language for Instruction-Set Architectures (LISA) was developed for the automatic

2High Level Language 3Hardware Description Language

generation of consistent software development tools and synthesizable HDL code [9]. A LISA processor description covers the instruction set and the behavioral and functional model, including the underlying hardware timing, and so provides all the essential information for the generation of a complete set of development tools, including C compiler, assembler, linker and simulator. Since it contains the definition of all the microarchitectural details, a LISA description also allows the generation of synthesizable HDL code for the modelled architecture, in either VHDL or Verilog. Another key point of this powerful language and toolkit is that changes to the architecture are easily transferred to the LISA model and are applied automatically to the toolchain and to the hardware implementation. The LISATek toolsuite can regenerate the whole toolchain automatically also when an upgraded version of the processor is produced, so there is no need to rewrite the tools manually. The LISA statements represent an unambiguous abstraction of the real hardware [15], so a LISA model description bridges the gap between hardware and software design: it provides the software developer with all the required information and enables the hardware designer to synthesize the architecture from the same specification the software tools are based on. The alternative approach to processor modeling and simulation uses HDL languages, but it is mainly oriented towards hardware optimization and brings many disadvantages for architecture exploration. The simulation of processor models written in HDL, in particular of cycle-accurate models, covers many hardware implementation details which are not required to evaluate processor performance in cycle-based simulations and software verification; the real problem, however, is that such in-depth descriptions have dramatic effects on simulation speed.
Many other machine description languages providing instruction-set modeling capabilities are available, and most of them exploit retargetable code generation and simulation approaches. Some of them allow the generation of the complete toolchain for software development and of synthesizable HDL code, but require complex descriptions with a mixed behavioral and structural approach; in any case, none of them allows a description of pipeline operations like flushes and stalls as simple and efficient as LISA does [12]. Another noteworthy and widespread approach is provided by Tensilica, which holds a good market share, especially in mobile phone applications. The Xtensa system [10] allows a RISC processor with a number of base instructions to be customized and extended via a retargetable tool suite; since it is based on an architecture template, this design method has the disadvantage of generating poorly optimized hardware for highly application-specific processors or for very simple platforms. Starting from the above considerations, we claim that the LISATek toolsuite represents a fully retargetable approach to architecture exploration, software tool development, architecture implementation and system verification and integration, for a wide range of processor architectures: from very essential processors to pipelined

processors, superscalar architectures, single instruction multiple data (SIMD) and also VLIW processors.

Figure 4.1. The ASIP design flow

4.2 Architecture exploration

The exploration of the processor architecture starts from the analysis of the algorithms which must be executed on the programmable platform [13]. The development of these algorithms is beyond the scope of the LISA platform and is usually done with application-specific tools which focus on system-level design. Often the result of this process is a pure functional specification, represented by an executable prototype written in a high-level language like C or C++ and accompanied by requirements such as the cost and performance parameters of the desired system. The following step is deriving the figures of the most computationally intensive blocks of the whole system, a task which can easily be performed with a standard profiling tool4. Such a tool makes it possible to extract some fundamental statistics during the simulation of the functional prototype. This procedure allows the designer

4a profiler is a performance analysis tool that measures the behavior of a program as it runs, particularly the frequency and duration of function calls; the output is a stream of recorded events (a trace) or a statistical summary of the events observed (a profile).

to focus on the performance-critical parts of the application code, and therefore supports a correct approach to defining those parts of the data path of the programmable architecture that need more care, working at the assembly instruction level. An easy way to begin the model development is to pick a simple LISA processor model (like one of the example projects furnished with the tool, or the tutorial project), which implements a basic instruction set, and then modify it by enhancing its resources and creating new special-purpose instructions, in order to improve the performance of the considered application. With this method the most complex and critical parts of the target application code are translated into assembly, making use of the specified special-purpose instructions. By using the assembler, the linker and the processor simulators derived from the LISA model, the designer can iteratively profile and modify the programmable architecture running the selected application, until it fulfills the performance requirements. When the analysis and optimization of the application-critical parts is completed, the instruction set needs to be extended in order to allow the execution of all the other parts. These parts usually have little effect on the overall performance, therefore it is very often feasible to employ the HLL C compiler derived from the LISA model and accept suboptimal assembly code quality, in return for a significant cut in design time. Other optimizations can be performed on the microarchitecture by improving microcode efficiency, not only with respect to software-related aspects, but also with regard to hardware behavior. For this purpose, the LISA language provides capabilities to model the cycle-accurate behavior of pipelined architectures.
The LISA model is then supplemented with the instruction pipeline and the execution of all instructions is assigned to the respective pipeline stages, so the designer is able to verify that the cycle-true processor model still satisfies the performance requirements. At the last stage of the design flow, the HDL generator makes it possible to generate synthesizable HDL code for the basic structure and the control path of the architecture, but also to implement the dedicated execution units of the data path. Further information on hardware cost and performance parameters (e.g. design size, power consumption, clock frequency, . . . ) can be derived by running the HDL processor model through the standard synthesis flow. At this high level of detail, the designer can tweak the computational efficiency of the architecture by applying different implementations of the data path execution units directly in the LISA model files, instead of looking for suboptimal solutions by acting on the HDL implementation produced by the automatic tool.


4.3 The architecture description: the LISA lan- guage

The LISA language is an Architecture Description Language (ADL) which inherits most of the characteristics of the C language and aims at the formalization and description of programmable architectures and their interfaces. The principal purpose of this description language is to close the gap between hardware description languages (HDLs) and the languages oriented towards instruction-set development. A LISA description is essentially made up of resource and operation descriptions. The resources represent the storage objects the processor can count on, i.e. general-purpose and dedicated registers, pipeline registers, memories and cache memories, which capture the system state. The other description elements are the operations, which are intended to describe the behavior of the architecture, from the decoding of the instructions to their step-by-step execution, but also the structure of all the processor parts. The LISA approach to modeling the various parts of the architecture divides the problem into the following conceptual parts:

• Memory model

• Resource model

• Instruction-set model

• Behavioral model

• Timing model

• Microarchitecture model

These can be considered as model components; for their realization a series of information and properties must be obtained from the target architecture specification and also from other components of the same model, as depicted in figure 4.2 and in figure 4.3. Each of the model components is described via dedicated LISA constructs, as briefly reported in the following subsections.

4.3.1 Memory model

The memory model is substantially a list of all the registers and memories the system is provided with. The description includes their respective bit widths and ranges; alternatively, these parameters are defined indirectly by using the built-in data types of the C language. A useful characteristic of the LISA language is the aliasing of resources, so that

99 4 – LISATek toolsuite

Figure 4.2. LISA processor model parts

Figure 4.3. LISA model parts and file sections

a storage object, or part of it, can be referred to by a common architectural name (e.g. the program counter, usually called “PC”). In this section the memory configuration must be provided, to allow correct object code linking. During simulation, the entirety of the storage elements represents the state of the processor, and the states of all memory elements can be displayed in the debugger. The HDL code generator derives the basic architecture structure from the definitions of this section, so that the

memory model can be built. The resources that can be declared in a memory model description are:

• Simple resources, such as registers and register files, signals, flags and ideal memory arrays.

• Pipeline structures for instruction and data paths.

• Pipeline registers, storing the data passed from one pipeline stage to the next.

• Non-ideal memories, such as caches, nonideal RAMs, buses (as part of the memory subsystem).

• Memory maps for processors which use more than one memory, to obtain the correct addressing of the single resources.

For the memory system description many parameters can be defined within the LISA files, ranging from the size, sub-block and endianness organization of simple memories to the accessibility of the resources (read, write and execute permissions), and from static or dynamic RAM timing parameters to cache access policies. Powerful operations are furnished for the management of the memories with respect to the abstraction level chosen for the model description, so that access to the memory resources can be done via a purely functional method or by using cycle-count accurate or cycle-based techniques. The cycle-count accurate memory simulation allows a very easy modeling of the memory hierarchy and can provide the user with statistics and profiling data about resource utilization. In cycle-based memory access, the read and write operations are implemented via requests to the respective memory (or bus) module, so that the operation is executed only when the module is not busy and within a certain number of clock cycles after the request is accepted. Also in this case, statistics and profiling information can be collected for performance analysis.

4.3.2 Resource model

The resource model describes the available hardware resources and is obtained by evaluating how the operations access the elements of the memory model. Resources reflect properties of hardware structures that can be accessed by only one operation at a time. The instruction scheduling of the simulation compiler depends on this information, and the HDL code generator uses it for resource conflict resolution. Besides the definition of all objects, the resource section in a LISA processor description provides information about the availability of hardware

resources. In older versions of the toolsuite, the behavior section within LISA operations was preceded by a header indicating which resources the operation needed to use, and whether each used resource was read, written, or both. The latest versions of the LISATek platform do not need this indication, because the various tools parse the behavior statements of the operations to trace every access to processor resources.

4.3.3 Instruction-set model

The instruction-set model identifies the valid combinations of operation codes (opcodes), register or immediate operands and other parameters which define the operations the architecture must be able to execute. It is expressed by the assembly syntax and by the instruction-word coding, and these specifications define the set of legal operands and addressing modes for each instruction. Via this model, compilers and assemblers can identify instructions, and the same information is used during the reverse process of decoding and disassembling. The instruction-set model is specified in two sections defined within the LISA operations:

• The CODING section describes the sequence of binary values which defines the instruction word.

• The SYNTAX section describes the notation of the mnemonics and the assembly syntax of instructions, operands and execution modes; it may also contain a field for conditional execution.

These two sections are deeply linked to each other by a number of fields and identifiers which refer to the same object in the two different domains, but also within the definition of the microcode (behavioral section). The DECLARE section contains local declarations of identifiers and other references for the immediate execution or activation of other LISA operations. The LISA language supports a very useful mechanism which enables the hierarchical structuring of the operations via a series of cross references between operations; a LISA code example is reported in figure 4.4. Through these references it is possible to create a LISA operation using other, already defined operations, and this approach is optimal for two different reasons: first, the mechanism simplifies the behavioral description of an operation, by breaking down the problem and spreading the necessary microcode over a number of operations; second, some pieces of microcode can be shared between various operations, and so can the hardware circuits which implement the same behavior. The hierarchical organization of the operations is realized using the two sections described above, so that parts of the instruction syntax and of the coding scheme call other operations, creating a tree structure of


OPERATION arith_logic_grp IN pipe.DC
{
    DECLARE
    {
        GROUP opcode = {AND_dc || EOR_dc || SUB_dc || RSB_dc ||
                        ADD_dc || ADC_dc || SBC_dc || RSC_dc ||
                        ORR_dc || BIC_dc};
        GROUP cond = {EQ_dc || NE_dc || CS_dc || CC_dc || MI_dc ||
                      PL_dc || VS_dc || VC_dc || HI_dc || LS_dc ||
                      GE_dc || LT_dc || GT_dc || LE_dc || AL_dc ||
                      unc_dc};
        GROUP S = {PSR_no_update_req || PSR_update_req};
        GROUP Rn, Rd = {reg_index};
        GROUP operand2 = {shifted_reg_operand || immediate_operand};
    }
    CODING {cond 0b00 operand2=[12..12] opcode S Rn Rd operand2=[0..11]}
    SYNTAX {opcode~ cond~ S~ " " Rd~"," Rn~ "," operand2}
    ...

}

Figure 4.4. Syntax and coding sections example.

references (in some cases cross-references). This technique is fundamental for the decoding and the scheduling of the operations and is referred to as the coding root operation; an example of this scheme is reported in fig. 4.5. Starting from the binary sequence fetched from memory and stored in the instruction register, the various fields are parsed and the operations matching their coding are subsequently activated or called. This section is used directly by the assembler, the disassembler and the simulator, and it is fundamental for the Processor Generator, which has to generate the instruction decoder.

4.3.4 Behavioral model

The behavioral model defines the set of operations the hardware has to perform to execute the selected instruction. The abstraction level of this model can range from the hardware implementation level up to the higher level of C statements. The BEHAVIOR and EXPRESSION sections within LISA operations are part of the behavioral model, and their C code is executed directly during simulation. The EXPRESSION section is useful to return operand values, register indexes or other


Figure 4.5. A coding root example

resource references and execution modes used in the context of operations. The acceptance of arbitrary C code makes it possible to perform function calls to external libraries that can be linked to the executable software simulator. Within the behavioral description of a LISA operation all the processor resources are visible and local variables can also be used, but there is no possibility of returning parameters as in a common C function.

4.3.5 Timing model

Focusing on the implementation of the architecture, the timing model specifies the activation sequence of hardware operations and units for the execution of the codeword loaded in the instruction register or of the instructions stored in the various pipeline stages. The instruction latency information lets the compiler find an appropriate schedule for the operations and provides timing relations for their execution during simulation and in the hardware implementation. Several parts of a LISA model contribute to the timing model: the declaration of the pipelines and their stages in the resource section, the assignment of the operations to pipeline stages and, more explicitly, the ACTIVATION section in the operation description, used to activate other operations in the context of the current instruction. The activated operations are launched as soon as the instruction enters the pipeline stage to which they are assigned. Non-assigned operations are executed in the pipeline stage of their activation. The predefined functions stall, shift, flush, insert and execute, which are automatically provided by the LISA environment for

104 4 – LISATek toolsuite each pipeline declared in the resource section, also have effects on the activation of the operation, introducing required delay and respecting all the pipeline stages status. All these pipeline control functions can be applied to single stages as well as whole pipelines. Using this very flexible mechanism, arbitrary pipelines, hazards and mechanisms like forwarding can be simply modelled in LISA.

4.3.6 Microarchitecture model

The microarchitecture model makes it possible to define groups of LISA operations, so that the HDL description of the architecture implements their functionalities in a single unit. Through this description the desired implementation of the microarchitecture can be obtained, and the generated HDL will contain different structural components, such as decoders, ALUs, barrel shifters or other units, defined in separate files. The grouping is done by listing, inside a UNIT section, the LISA operations which must be included in an HDL component; the name of the section will be the name of the corresponding HDL file. This method is useful to identify the hardware organization also in the post-generation phase, for data path optimizations or verification purposes.

4.4 The LISATek model development tools

Besides the LISA description language, the LISATek toolsuite provides a series of useful GUI5-based tools aimed at simplifying processor modelling and debugging from the first development phases. These tools are:

• Processor Designer

• Instruction-set Designer

• Syntax Debugger

and are briefly described in the following sections.

4.4.1 The Processor Designer

The Processor Designer is the principal instrument for the development of the processor model; it is essentially made up of a GUI which allows all the parts of the project, from LISA files to C-language headers and libraries, to be created and maintained. By using its intuitive interface, the various directly linked LISATek tools and several menus and commands, it is possible to build and configure the dependent parts of the toolsuite, like the HDL generator, the simulator and the generation processes of the other software tools. The

5Graphical User Interface

most important part of the Processor Designer is the LISA language compiler and debugger, which provides a global view of all the files belonging to the project and of the configuration files; a text editor with many functions oriented to file writing and maintenance is also provided. The generation flow can be controlled by dedicated buttons, from the implementation of single tools to the complete set of necessary instruments; a useful window reports all the messages about the operations performed by the application, and also compilation errors with detailed references. Access to the product documentation completes the functions of this fundamental component of the toolsuite.

Figure 4.6. Processor Designer screenshot

4.4.2 The Instruction-set Designer

The Instruction-set Designer is an alternative method for the description of the processor instruction set, which still uses the LISA language files, and also an optimal way to obtain a graphical scheme of the modelled instruction set. Via this tool, instruction-set inconsistencies can be shown and corrected by analysing all the fields composing a single instruction in a very flexible graphical window. ISA

instructions can be created, removed, modified and deeply examined using this tool, so it can be seen as a method complementary to the textual LISA description. The Instruction-set Designer can be started from the Processor Designer environment and operates on the corresponding project.

Figure 4.7. Instruction-set Designer screenshot

4.4.3 The Syntax Debugger

The Syntax Debugger can be started from the Processor Designer environment and allows single assembly instructions or pieces of code to be introduced, starting a step-by-step recognition of the tokens written. The functions performed by the tool are very similar to the assembling process, but the Syntax Debugger provides a window which explains the assembly instruction encoding via LISA operation tracing; in another window a sort of report is posted, with all the references to the recognized parts of the instruction and its binary coding. This part of the LISATek toolsuite is oriented to the debugging of the instruction set described in the model whose project is open in the Processor Designer.


Figure 4.8. Syntax Debugger screenshot

4.5 The architecture implementation

The principal guideline for the development of an ASIP is to obtain a highly optimized architecture for a specific application, so the automatically generated HDL code has to fulfill tight constraints in order to represent a valid alternative to totally hand-written HDL code and to avoid many refinement steps. Some critical aspects of the target hardware, such as power consumption, chip area and computation speed, represent the most difficult challenges for this class of architectures, and some of their parts need to be optimized manually, particularly the data paths. Nevertheless, the LISATek Processor Generator generates well-optimized parts of the processor, such as the register files, the pipeline structure and registers, the pipeline controller, the instruction decoder and all the control signals for the activation of the functional units. By using the LISA operation grouping capabilities, it is possible to obtain an organized set of HDL files representing the different parts of the architecture; these parts can be accessed and modified to improve and optimize the hardware description, in order to obtain the desired performance. The hardware description

of a functional unit, in fact, is generated taking into account the behavioral statements expressed in the related LISA operations, and the obtained HDL code is a behavioral-style translation of the C statements. The burden of creating optimized data paths, together with all the tests on their performance parameters, is therefore postponed to the HDL synthesis step. This represents a limitation of the approach based on the LISATek toolsuite, as discussed in the previous paragraphs. The LISATek Processor Generator produces the HDL description through the following main phases:

• Analysis of the resource section of the LISA description, to get information about the main structure of the architecture, such as pipeline stages and storage resources, and to instantiate the required components (registers, pipeline registers, memories, caches, intercommunication buses, input and output ports and all the controllers with the relative control signals for these parts).

• Analysis of the grouping of the operations into functional units within the LISA description, from which the global structure of the HDL description can be instantiated. This phase makes it possible to identify well defined blocks within the architecture for further manual optimizations, but units are also automatically grouped into the pipeline execution stages, such as the fetch, decode, execute and writeback stages. Since in LISA it is possible to assign hardware operations to pipeline stages, this information is sufficient to locate the functional units within the pipeline stages to which they are assigned, and this placement is done in any case.

• Generation of the instruction decoder, derived from the information in the LISA model reflecting the coding of the various instructions. Depending on the structure of the LISA architecture description, decoder processes are generated in several pipeline stages. The signal paths specified within the target architecture can be divided into data signals and control signals. The control signals are a straightforward derivation of the operation activation tree, which is part of the LISA timing model. The data signals are modelled explicitly by the designer, by writing values into pipeline registers, and fixed implicitly by the declaration of the used resources in the behavior sections of the LISA operations.

For simulation and verification purposes, a SystemC description of the architecture can also be obtained via a specific generation tool. SystemC is often thought of as a hardware description language like VHDL and Verilog, but is more aptly described as a system description language, since it exhibits its real power during transaction-level and behavioral modeling; in fact, SystemC is a set of library routines and macros implemented in C++ which make it possible to simulate concurrent processes, each described by ordinary C++ syntax.


4.6 The application software design

The possibility of automatically generating the HLL C compiler, the assembler, the linker and the ISA simulators from a LISA processor model allows the designer to explore all the aspects of the design very quickly, testing functionalities, discovering bugs or spotting operations that need further optimization. In this section, the peculiarities and requirements of these tools are discussed.

4.6.1 Assembler and linker

The LISA assembler, lasm, processes the assembly source code files and transforms them into linkable object code6 for the target architecture. The basis for the implementation of this tool is the instruction-set information defined within the LISA description of the processor. Besides the processor-specific instruction set, the generated assembler provides a set of pseudoinstructions, usually called directives, useful for controlling the assembling process and for initializing the data the program will work on. Some directives allow the grouping of the assembled code into sections which can be positioned separately in memory by the linker. Symbolic identifiers for numeric values and addresses are standard assembler features and are supported as well; moreover, macro-assembler functionalities are implemented, so macro definition and expansion are supported by the tool. After the assembling phase, the object file contains a number of symbols, i.e. references to local and global routines stored elsewhere in the memory space. To obtain a single object program it is necessary to resolve these references, retrieving the code lines to be linked to the main part of the application. The linking process, performed by the LISA linker llnk, is controlled by a linker command file that keeps a detailed model of the target memory environment and an assignment table of the module sections to their respective target memories. This file has extension ”cmd” and a reference to it must be put into the makefile7. The LISA linker also allows the use of external memories, i.e. memories separated from the architecture model, so that code can be linked directly there.

4.6.2 Disassembler

The disassembler (ldasm) performs the inverse of the work done by the assembler and the linker: it accepts the linked object file, i.e. the executable application, as input and transforms it into an assembly file. Also for the implementation of this tool the instruction-set specifications reported in the LISA model are used, and the

6code directly executed by a computer’s CPU, machine code 7The makefile contains all the information needed for the assembling and linking of the source assembly file

resulting disassembled file allows the correctness of the assembling and disassembling operations to be checked. The disassembled file reports the symbols and memory addresses defined in the source assembly file in a different form, because this representation is done from the point of view of the processor’s memory system, so all the relative references written in the source files are replaced by absolute addresses.

4.6.3 Simulator: the “Processor Debugger”

The simulator is the tool which shows how the described model works on the program given as input. The simulator is made up of a number of C++ files which describe the behavior of the processor model, and the simulation is controlled through the GUI of the LISATek Processor Debugger. The debugger accepts an object file as input and allows various aspects of the model to be monitored, like registers, memories, internal signals, pipeline behavior and events; by means of windows reporting the disassembled code of the object file in execution, the original assembly file and the microcode of the single LISA operations in execution, it is possible to check the effects of every statement on the state of the processor. As a standard debugger, the tool is furnished with a number of commands for microcode execution control, step-by-step execution, breakpoint insertion and so on. The pipeline view is particularly rich in information about timing and stage status, and allows the profiling of the related structure. Because of the different abstraction levels the designer can exploit in the development of the model, the LISA simulator can be generated using several techniques, to achieve more flexibility or higher simulation speed. Depending on aspects of the model description such as the simulation purpose (debugging, profiling, verification), the architectural characteristics (instruction-accurate, cycle-accurate) and the target software application (DSP kernel or operating system), the most suitable trade-off between performance and flexibility can be reached and the proper simulation technique chosen. The interpretive simulation offers more flexibility for testing operations and is the only one that can be performed on every kind of architecture a LISA description allows to obtain.
The disadvantage of this type of simulator is its speed: every aspect of the model behavior must be interpreted from the files that describe the architecture, and the corresponding C code for the simulation is generated step by step, without any compilation. An alternative technique, at the other extreme, is fully compiled simulation, which increases throughput, and therefore simulation speed, by executing the LISA operations through a multi-step method. The first of these steps is instruction decoding, in which machine instructions, operands and addressing modes are recognized for each instruction word of the input object file, with the particularity that

every repeated instruction is decoded only once, even for operations inside loops. In this way the decoding operation can be omitted at runtime, reducing simulation time. The subsequent phase is operation sequencing, which determines all the operations required to execute every instruction found in the application program. These operations are organized in a table, with an index used to call the operation at runtime; thus, when an operation is called, the behavioral part of the LISA operation is executed. The last phase performed by the simulator is operation instantiation and simulation loop unfolding, in which the operation scheduling, i.e. the operation execution timing, is determined. In this step the behavior of each LISA operation is executed by calling the functions identified in the previous phase, and the loops found in the code are unfolded to drive the simulation into the next state. This simulation method, however, can be applied only to instruction-accurate models, or to cycle-accurate models without an instruction pipeline and under the assumption of a constant program memory. Besides the fully compiled simulation described above, there are other simulation techniques that perform some of the required steps at different moments, possibly at runtime, reducing compilation time at the expense of global simulation time. The dynamic scheduling method reduces simulation time by executing instruction decoding and operation sequencing at compile time, while operation scheduling is performed during the simulation. For obvious reasons this technique cannot be used in the presence of external memories or self-modifying program code, so it is less flexible than fully compiled simulation, but it allows a simpler approach to the simulation of pipelined architectures.
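The decode-once idea behind compiled simulation can be contrasted with the interpretive loop in a few lines. The toy instruction encoding (opcode byte plus immediate byte) and the state layout below are invented for illustration and are not the LISATek format:

```python
def decode(word):
    """Map an instruction word to a behavior function (the decoding step)."""
    opcode, operand = word >> 8, word & 0xFF
    if opcode == 0x01:                      # hypothetical ADD-immediate
        return lambda st: st.__setitem__("acc", (st["acc"] + operand) & 0xFF)
    if opcode == 0x02:                      # hypothetical JUMP-if-nonzero
        return lambda st: st.__setitem__("pc", operand - 1) if st["acc"] else None
    return lambda st: None                  # anything else: NOP

def run_interpretive(program, steps):
    state = {"pc": 0, "acc": 0}
    for _ in range(steps):                  # decoding repeated on every cycle,
        decode(program[state["pc"]])(state) # even for instructions in loops
        state["pc"] += 1
    return state["acc"]

def run_compiled(program, steps):
    table = [decode(w) for w in program]    # each word decoded exactly once
    state = {"pc": 0, "acc": 0}
    for _ in range(steps):                  # runtime loop only looks up and
        table[state["pc"]](state)           # calls the pre-decoded behavior
        state["pc"] += 1
    return state["acc"]
```

Both runners produce identical architectural state; the compiled variant simply moves all decoding work before the simulation loop, which is exactly why it requires a constant program memory.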
The pipeline control is described by a series of dedicated operations in the LISA files, and the activation of these operations at runtime modifies the simulation state when data and control hazards occur. This behavior is very difficult to predict, so all the operations are inserted into and removed from the pipeline dynamically, with the disadvantage that the continuous maintenance of the pipeline becomes very expensive at runtime. To avoid this time-intensive step at runtime, a static scheduling approach can be applied. This is a very sophisticated method because, starting from the analysis of the present processor state, particularly the pipeline state, it is necessary to generate the behavioral code of all the LISA operations encountered in the program for every possible pipeline event, such as flushes, stalls, normal shifts, possible hazards, and operation activation and removal in all the pipeline stages. This technique requires considerable time before the simulation starts, to generate the large amount of code implementing the simulator, but less execution time than the dynamic scheduling method. The last variant is the Just-In-Time Cache Compiled ISA Simulator, which builds on some of the aspects already seen above. The

basic idea of the JIT-CCS is to store the information extracted while decoding an instruction, for later reuse in case of repeated execution. Although the JIT-CCS combines the benefits of both compiled and interpretive simulation, it does not reach the performance of statically scheduled simulation; it has, however, some advantages. The JIT-CCS incorporates dynamic scheduling exclusively for instruction decoding/compilation and, since instruction decoding is performed on the instruction register contents at simulator runtime, the use of external memories is supported and program memory changes are honoured.
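A minimal sketch of the JIT-CCS caching idea, with an invented two-byte instruction encoding; the real simulator caches decoded LISA operations, but the mechanism (decode on cache miss, reuse on hit, invalidate when the memory word changes) is the one described above:

```python
class JITCachedSimulator:
    """Decode on first execution, reuse the cached result afterwards; a
    store into program memory invalidates the cached entry, so
    self-modifying code is honoured, unlike in fully compiled simulation."""

    def __init__(self, memory):
        self.mem = memory          # unified program/data memory (list of words)
        self.cache = {}            # address -> (word, decoded behavior)
        self.decodes = 0           # counts how often real decoding happened

    def _decode(self, word):
        self.decodes += 1
        op, imm = word >> 8, word & 0xFF
        if op == 1:                             # hypothetical ADD-immediate
            return lambda st: st.update(acc=st["acc"] + imm)
        return lambda st: None                  # anything else: NOP

    def step(self, state):
        addr = state["pc"]
        word = self.mem[addr]
        hit = self.cache.get(addr)
        if hit is None or hit[0] != word:       # miss, or memory was modified
            hit = (word, self._decode(word))
            self.cache[addr] = hit
        hit[1](state)
        state["pc"] = (state["pc"] + 1) % len(self.mem)
        return state
```

On a loop the decoder runs once per distinct instruction word, while a write into program memory simply forces one re-decode of the affected address.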

Figure 4.9. Processor Debugger screenshot

4.6.4 The C-Compiler

A C compiler can also be generated from the LISA description of the processor. To do so, the CoSy Compiler Development System is used, which follows a modular, engine-based concept: parsing and semantic analysis of the input files are performed in the front end, while optimizations and transformations of the compiler's intermediate representation, together with code generation, are performed in the compiler backend.


The compiler's backend engines are generated by a tool that reads the so-called code generator description (CGD) files and generates the code selector, the scheduler and the register allocator. The creation of the CGD files from the LISA processor description is based on the LISA processor compiler: this tool reads the LISA processor description and generates all the software tools and the hardware model, and the parts of the CGD description can be generated automatically from it. For the generation of the C compiler the LISATek Compiler Designer tool is provided, which allows the designer to describe the processor resources the compiler can use for optimal generation of the object code. Before the compiler is generated, some information must be given to the Compiler Designer, such as how to use the processor registers, the specification of the data and stack layout, some scheduler directives and other specifications and language conventions. The obtained compiler, lcc, accepts C-language input files and generates object files directly executable on the processor model by the other simulation tools.

4.7 The system integration and verification

The system integration and verification phase includes the major task of evaluating the trade-offs between the various functionalities (e.g. speed, size and power consumption) to determine which part of the overall system functionality should be implemented in software and which must be implemented in hardware (hardware-software partitioning). On the other hand, the verification of the obtained hardware description must also be performed, and again cosimulation interfaces are required to integrate both the hardware description of the processor and the software simulator into a system simulation environment, so that the two descriptions are stimulated by the same patterns or executable files. Obviously, changes in the modelled architecture can affect the interfacing system, so all these parts need to be regenerated to guarantee the correct communication and functionality of the cosimulation system. The LISATek toolsuite includes the System Integrator Platform, which provides system integration and verification capabilities, together with the possibility of integrating the model into the context of a whole System-on-Chip (SoC) that includes a mixture of different embedded processors, memories and interconnect components. In fact, verification of the complete system, including software, has become the critical bottleneck in the SoC design process. Because embedded components come from diverse sources, hardware-software cosimulation exploits hardware and software design techniques that involve various languages (VHDL, Verilog, C/C++), formalisms and tools. In order to support system integration and verification, the LISATek System Integrator platform provides a well defined application programmer interface (API)

to interconnect the instruction-set simulator generated from the LISA specification with other simulators. The API allows the simulator to be controlled by stepping, running and setting breakpoints in the application code, and provides access to the processor resources in the same way as the LISATek Debugger.
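The control pattern such an API enables can be sketched as follows; the class and method names are hypothetical stand-ins, not the actual LISATek API, and the "HDL check" callback stands for the comparison against the hardware-side simulation:

```python
class InstructionSetSimulator:
    """Hypothetical stand-in for the ISS generated from a LISA model."""

    def __init__(self, program):
        self.program, self.pc = program, 0
        self.regs = {"r0": 0}
        self.breakpoints = set()

    def set_breakpoint(self, addr):
        self.breakpoints.add(addr)

    def step(self):
        self.regs["r0"] += self.program[self.pc]   # trivial "execute"
        self.pc += 1

    def run(self):
        while self.pc < len(self.program):
            if self.pc in self.breakpoints:
                return "breakpoint"                # hand control back
            self.step()
        return "done"

def cosimulate(iss, hdl_check):
    """Run to each breakpoint and compare ISS state with the HDL model."""
    while iss.run() == "breakpoint":
        if not hdl_check(iss.pc, iss.regs):        # states diverged
            return False
        iss.step()                 # step over the breakpoint and continue
    return True
```

The essential point is that the system simulator, not the ISS, owns the control loop: it steps, runs and inspects the processor resources through the API, exactly as the debugger GUI does interactively.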

Chapter 5

The LISARM model

This chapter describes how the ARM processor instruction set has been modelled using the LISATek tools and the LISA 2.0 language. The processor behavior is described using the LISA language only, and the code produced is organized in different files. The Processor Designer [20] and Debugger [22] tools were used for writing and testing the code, while the Syntax Debugger and the Instruction-set Designer played a reduced role in some consistency-checking steps. The automatically generated simulator and assembler were fundamental for exploring the model behavior and the LISA capabilities: by using them the model was constructed step by step, alternating code writing with testing. Many powerful features offered by the language were used, in order to enable LISA operation reuse and to obtain a noteworthy level of hardware optimization. Because some capabilities are supported by the LISA language and the simulator but not by the HDL Generator, problems occurred in the creation of the unidirectional and bidirectional data buses that ARM provides; this is due to some memory management operations not yet supported for hardware implementation. The described processor respects the instruction cycle times described in paragraph 3.5 and in [16], the memory interface features (section 3.7) are also compatible, and the datapath structure is quite similar to the original one. The whole model is described in the following paragraphs.

5.1 The model structure

The ARM instruction set and the behavior of all the processor parts are described using the LISA 2.0 language, and the model is organized by dividing the produced code into the following files:

• main.lisa


• arm.h

• conditionfield.lisa

• alu_operations.lisa

• barrel_shifter_operations.lisa

• multiplier.lisa

• branch_instructions.lisa

• data_proc_instructions.lisa

• mem_access_instructions.lisa

• multiply_instructions.lisa

• misc_ops.lisa

• other_instructions.lisa

The main.lisa file contains the definition of the fundamental processor resources and features, together with the principal LISA operations that describe the model behavior at every new clock cycle, in the presence of a reset signal, or when an interrupt request arises. Some of these parts are discussed in section 5.1.1, while the model's main operation is described in paragraph 5.1.2. The decoding mechanism and the instruction set description method are discussed in section 5.1.3, while the behavior of the various instructions is described in the subsequent paragraphs, group by group. The arm.h file is a standard C header file containing many definitions of enumerative types and labels, such as the operating state names, processor states, barrel shifter and ALU operations, and the ARM assembly condition mnemonics. A number of constant masks are also defined here, used for sign extensions, arithmetic and logical operations, and sub-word sized data types for memory access. Some further define directives are used to simplify access to the PSR flags and the assignment of external signals through LISA predefined methods. All the exception vectors are also defined in the header file (ref. section 3.4).

5.1.1 Processor resources, interface, internal units

The processor resources are defined in the LISA model using specific mechanisms and dedicated data types; all their definitions are reported in the main file (main.lisa), where many features such as the memory model, internal signals, interfaces,

registers and pipeline structure are defined. Starting with the memory model, a single memory is defined, since ARM adopts a Von Neumann architecture and therefore does not support more than one memory area or separate memories for code and data. The memory is implemented as a collection of 32-bit unsigned integer data, and a 32-bit address bus is defined. The memory model definition represents a problem for the ARM description, because fetching 16- or 32-bit instructions while maintaining single-byte addressing is not possible within a LISA processor model, where a new 32-bit instruction is fetched at every program counter increment. This problem is described in section 6.2, where some guidelines for the memory wrapping are given. The instruction set architecture adopts a little-endian organization that cannot be changed dynamically during the execution flow, because this feature is not supported by the LISATek tools. This is another issue that has to be resolved by an external interface if a big-endian organization is needed. To define the processor operating state and mode, two global variables are used: processor_state and processor_mode. The main operation reads their content at every execution cycle and updates the mode bits and the T-bit within the CPSR. Since the Thumb instruction set is not implemented yet, the T-bit is ignored by all the LISA operations, so execution is always performed in the ARM state. The processor_mode variable, instead, is used every time a banked register is read or written, in order to select the bank to be accessed. The variable values are assigned in the behavior sections of various operations using the definitions reported in the arm.h header file, where the ARM operating mode identifiers [16] are respected.
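The bank selection driven by processor_mode can be sketched as a simple lookup; the mode and register names follow the ARM7TDMI banking scheme (FIQ mode banks r8-r14; IRQ, supervisor, abort and undefined modes bank only r13-r14; user and system modes use the unbanked set), while the function itself is illustrative:

```python
# Which architectural register indexes are banked in each mode.
BANKED = {
    "fiq": range(8, 15),          # r8_fiq .. r14_fiq
    "irq": range(13, 15),         # r13_irq, r14_irq
    "svc": range(13, 15),
    "abt": range(13, 15),
    "und": range(13, 15),
}

def phys_register(mode, index):
    """Return the physical register selected by (mode, index)."""
    if index in BANKED.get(mode, ()):   # user/system: nothing banked
        return "r%d_%s" % (index, mode)
    return "r%d" % index
```

Every register file access in the model performs the equivalent of this lookup, reading processor_mode to choose the bank before touching the register contents.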
The current processor status register (CPSR) and the banked saved PSRs are implemented using a 32-bit CXBit data type, a LISA data type that allows extraction and modification of single bits through specific predefined methods. The register file and the other banked general purpose registers are defined as integer variables, but in the execution units their values are assigned to other CXBit variables to enable single-bit access. The LISA language performs type conversions through very flexible functions, while some casts are implicit and can be omitted. All PSR and GPR data types are defined as clocked registers, to ensure their values are updated according to the timing and behavior of the real architecture. The pipeline structure is described in a very simple manner with LISA, by listing the single stage identifiers, i.e. PF (prefetch), FE (fetch), DC (decode), EX (execute) and ED (execute dummy stage), as can be inferred from figure 5.1. The prefetch stage instantiated here is not explicitly defined in the ARM architecture [16], but a prefetch operation is executed and affects all the operations that involve the program counter contents. The last stage (execute dummy) is necessary to exploit the polling operation mechanism, which allows a LISA operation to reactivate itself in the following machine cycle by stalling the subsequent pipeline stage. For


[Pipeline diagram: PF (prefetch) -> FE (fetch) -> DC (decode) -> EX (execute) -> ED (execute dummy), with the PF/FE, FE/DC, DC/EX and EX/ED registers between adjacent stages.]

Figure 5.1. The LISARM pipeline structure

this reason it represents a dummy stage and, since it does not perform any operation, the processor generator does not implement it in HDL. Each pair of adjacent pipeline stages is separated by a register, which contains a number of signals used to transfer information from one stage to the next. The pipeline registers are created automatically for all stages and are referred to as PF/FE, FE/DC, DC/EX and EX/ED. All these registers are generated in the HDL files but, if their outputs are not used by other components, they are not implemented in the synthesized hardware. The pipeline structure is reported in figure 5.1. The registers used are:

• instruction register and the program counter pc.

• reg1_i, reg2_i, reg3_i, which store the operand register indexes.

• regd_i for the destination register indication.

• op2 for storing second-operand immediate values.

• registered_opd, which points out whether the second operand is a register or an immediate value.

• PSR_update_f, PSR_select, PSR_flag_assign and PSR_transfer_f for operations that involve one of the PSRs.

• bs_op, registered_bs_amount, bs_amount, bs_special_op_f and bs_amount32_f for the barrel shifter setup.

• alu_op for the ALU setup.

• write_result_f to enable the writing of the ALU result to the destination register.


• mem_access_f, writeback_f, pre_npost_indexed_f and byte_access_f for the memory access instructions.

• mul_acc_f for the multiply and accumulate instructions.

Most of them are used to transfer the register indexes for the register file management and the setup values for the datapath (barrel shifter, 32x8 multiplier, ALU) from the decoding to the execute stage. The instruction register has a meaning only for the first two stages, because the decoding stage splits the single instruction fields into the signals cited above. Many other variables, registers and signals are defined in the main.lisa file; they are all considered global resources, hence accessible by all the LISA operations defined in the model. Among these resources are all the wires defined for the ALU, multiplier and barrel shifter control signals and data assignment, some registers for the memory interface, the result and other auxiliary variables for the datapath, the register_list register for the block data transfer operations, and the cycle_c and branching_c counters. The main clock and reset inputs are generated automatically, as are the data and address buses; the memory interface signals, i.e. the MAS, nRW, LOCK, nMREQ and SEQ outputs and the ABORT input, are also defined in the file's resource section. Beside these signals, the FIQ and IRQ inputs for interrupt management are defined, together with the coprocessor handshaking signals CPI, CPA and CPB, although the coprocessor instructions are not implemented in the model. An output signal was added to the model: the BS 2-bit line, used to communicate to the memory management unit which byte is selected for the transfer within a word; some details about this signal are reported in section 5.8 and in paragraph 6.2. In order to maintain a certain hierarchy in the generation of the HDL files, all the LISA operations used in the model are grouped using the dedicated language constructs within the main.lisa file.
Beside the fetch and prefetch units, which contain only the homonymous operations, the decoder unit groups all the decoding LISA operations. Four separate units are dedicated to the ALU, barrel shifter, multiplier and condition checking operations; all the operations containing the ARM instruction behavior are grouped in branch_exec, data_processing_exec, memory_access_exec, multiply_exec and other_ops_exec. The HDL Generator respects the proposed grouping scheme and generates different files and units (components in VHDL or modules in Verilog) for every LISA unit defined.

5.1.2 The main LISA operation

The main operation, described within the main.lisa file, is not assigned to a pipeline stage, since it drives all the pipeline control signals and controls the single

stage behaviors. To do so, the main operation is automatically executed at every clock cycle, but before its first execution an initial processor status must be forced. To perform the processor initialization, the reset LISA operation is provided; this operation is executed automatically every time the reset input is driven low and before every simulation run within the Processor Debugger. It initializes all the internal registers and flags to zero, sets the processor state to ARM and the processor mode to "user", and assigns the CPSR accordingly. The real ARM processor, after a reset request, resumes execution in supervisor mode but, for debugging purposes, the user mode is assigned here. Since the program counter is reset, fetching starts with the first instruction at address 0x00: to avoid fetching the exception vectors, which follow in memory, an initial branch operation has to be executed in every program. When the main operation is executed, the condition global variable is evaluated and one of the operations described in section 5.1.3 is called directly, to check whether the condition expressed in the instruction is satisfied by the current PSR flags. The result of the evaluation is assigned to the condition_valid global flag and used by the other LISA operations when an instruction enters the execution stage, in order to establish whether the instruction itself must be executed or not. The main operation also contains two LISA-specific statements that enable the concurrent execution of all the operations activated in the pipeline stages and then execute the pipeline shift, in order to transfer all the pipeline register values from one stage to the next. The exception handling capabilities of the ARM processor are also implemented by the main operation, by monitoring some specific input signals. Unless the F-bit and the I-bit of the CPSR are set, i.e. unless the interrupts are disabled (ref.
section 3.3.5), the nFIQ and nIRQ inputs are checked, and an interrupt request on these lines causes the immediate assignment of the exception vector to the program counter and a jump to the handler, as defined in section 3.4. If a multicycle operation is being executed, however, the interrupt request cannot be satisfied immediately and the exception handling is postponed to the subsequent machine cycle, when the relevant input signals are checked again. To establish whether an operation requiring more than one machine cycle is being performed, the predefined LISA .stalled() method is used. When a memory access operation is performed, the ABORT input line can be driven high by the memory management unit (MMU). This occurs when the memory system is unable to retrieve data or instructions because of access or addressing problems (ref. sections 3.7 and 3.4). The main operation has to monitor the ABORT signal cycle by cycle, to handle the proper data abort or instruction abort exception when one of them arises. To do so, the exception vector value is assigned to the program counter, without checking whether a multicycle operation is being executed or the pipeline is stalled. The abort exception handler, in fact, has to manage the situation by itself, without particular hardware mechanisms, as described in section 3.4.
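The per-cycle priority check the main operation performs can be summarized as follows; the exception vector addresses are the architectural ARM ones, while the function and signal names are illustrative:

```python
# ARM exception vectors involved in this check (data abort, FIQ, IRQ).
VEC_DATA_ABORT, VEC_FIQ, VEC_IRQ = 0x10, 0x1C, 0x18

def exception_vector(abort, fiq, irq, f_bit, i_bit, stalled):
    """Return the vector to load into the PC this cycle, or None."""
    if abort:                      # aborts are taken even during a stall
        return VEC_DATA_ABORT
    if stalled:                    # multicycle op in flight: postpone FIQ/IRQ
        return None
    if fiq and not f_bit:          # F-bit set means FIQ disabled
        return VEC_FIQ
    if irq and not i_bit:          # I-bit set means IRQ disabled
        return VEC_IRQ
    return None
```

Note how the stall test (the .stalled() method in the LISA code) gates only the interrupt lines, while an abort bypasses it, exactly as described above.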


In the activation section the prefetch, fetch and decode operations are activated, but only if the pipeline is not stalled. The prefetch LISA operation is responsible for fetching the instruction from memory, so it has to increment the program counter or, if a branch operation is being executed, assign the branch destination address to the program counter. To do so the BPC global variable is used, where the branch operations store the address calculated by the ALU. Some flags and a counter are used to recognize the subsequent branching cycles, and these signals are used to exchange information between the prefetch operation and the LISA operations involved in the branch execution. Since byte addressing cannot be realized in the LISA model, an Internal Fetch Address (IFA) is calculated and used within the specific LISA function that performs the instruction fetch. Some other statements in the operation are present only for debugging purposes and are excluded from the HDL generation through a C precompiler pragma directive. The prefetch operation directly drives two of the memory interface signals, the nMREQ and SEQ outputs, which communicate to the MMU what type of memory cycle is performed (ref. section 3.7). In order to select the appropriate sub-word sized data, the BS signal is assigned cycle by cycle to the value of the least significant bits of the program counter for an instruction fetch, and of the mem_data_reg for a data fetch; the reasons for this choice are explained in section 6.2. To do so, the branching and mem_access_f flags are inspected at every cycle, together with some other pipeline registers. Since the prefetch operation is activated only if the pipeline is not stalled, these signals remain unchanged during stall cycles, but this behavior is coherent with the ARM processor features. The fetch operation does not execute any statement and simply activates the operation of the subsequent stage, i.e.
the decode operation. Since the main and prefetch operations assign only the program counter and the instruction register values, all the other pipeline flags and registers are automatically initialized to zero, so there is no need to reset them manually within the various LISA operations. To enable a particular execution stage function, therefore, only the specific flag or signal has to be assigned. For debugging purposes, many pipeline flags are used in the model, in order to have better control of the processor behavior during simulations. This redundancy could be reduced by optimizing the number of signals used, but this technique implies a sort of additional decoding phase to be performed in the execution stage, whose effects can heavily affect the stage critical paths. Evaluating this trade-off requires some parameters that only the HDL synthesis phase can provide, so these modeling aspects can then be analysed in depth.


5.1.3 The coding tree and the decoding mechanism

The decode operation performs the decoding of the instruction loaded into the instruction register; to do so, the entry point (coding root) of the step-by-step decoding process is defined in its behavior section. Within the operation's declare section the various instruction groups are defined, and the LISA decoding mechanism evaluates the instruction binary format and the operations' coding sections to establish which of them must be activated in the subsequent step. This coding tree is formalized in figure 5.2, and the assembler and disassembler generation follows a similar approach to associate the syntax sections and the coding sections of the LISA operations defined in the model.

[Coding tree:
CODING ROOT
- data_proc_grp
    - cmp_grp: CMP, CMN, TST, TEQ
    - mov_grp: MOV, MVN
    - arith_logic_grp: ADD, ADC, SUB, SBC, AND, ORR, EOR, BIC
- mem_access_grp
    - std_data_grp: LDR, STR
    - su_data_grp: LDRH, STRH, LDRSH, LDRSB
    - block_data_grp: LDM, STM
- multiply_grp: MUL, MLA, MULL, MLAL
- branch_grp: B, BL, BX
- PSR_access_grp: MRS, MSR
- other_grp: SWI, SWP]

Figure 5.2. LISARM coding tree diagram

The behavior of every group of instructions is described by two sets of LISA operations: the decoding operations (suffix _dc) and the execute stage main operations (suffix _ex). The former set inspects the instruction coding and assigns a set of pipeline registers and flags, defined in order to store all the configuration information for the datapath and the other execution stage parts. A list of all the LISA operations used for the model description is reported in appendix A. The execution stage operations, in the subsequent clock cycle, read

this information and assign to the multiplier, barrel shifter and ALU wires the values stored in the pipeline registers. Before the execution stage operations are performed, however, the conditional execution modifiers are evaluated. This conditional instruction execution is controlled by two sets of LISA operations: the condition decoding operations (_dc) are activated by the instruction decoding operations and set the condition pipeline register; the condition checking operations (_ex), belonging to the execute pipeline stage, verify the CPSR flags to establish whether the condition holds. If so, the condition_valid global flag is set, so that the execution of the instruction takes place. Here cond refers to one of the condition mnemonics reported in table 3.3, where the boolean expression used for the condition evaluation is also reported. The execution stage operations are called directly by the main LISA operation, where the pipeline stall events are monitored before the scheduled operations are executed. All the conditional LISA operations are described in the conditionfield.lisa file, which also contains an alias operation for the decoding of those assembly instructions that do not provide a conditional suffix.
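The condition evaluation the execute-stage operations perform maps each mnemonic of table 3.3 to a boolean expression over the CPSR N, Z, C and V flags; a direct transcription (the function name is illustrative, the flag expressions are the architectural ones):

```python
def condition_passed(cond, n, z, c, v):
    """Evaluate an ARM condition mnemonic against the CPSR flags."""
    return {
        "EQ": z,            "NE": not z,            # equal / not equal
        "CS": c,            "CC": not c,            # carry set / clear
        "MI": n,            "PL": not n,            # negative / positive
        "VS": v,            "VC": not v,            # overflow / no overflow
        "HI": c and not z,  "LS": (not c) or z,     # unsigned higher / lower-same
        "GE": n == v,       "LT": n != v,           # signed >= / <
        "GT": (not z) and n == v,                   # signed >
        "LE": z or n != v,                          # signed <=
        "AL": True,                                 # always
    }[cond]
```

In the model this result is what ends up in the condition_valid flag that gates the execute-stage behavior.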

5.2 The processor datapath

The ARM processor datapath is made up of a 32x8 multiplier, a 32-bit barrel shifter and the arithmetic logic unit (ALU) [18], interconnected with the register file and some other registers as reported in figure 5.3. The behavior of these blocks is described in the following subsections, and their port maps reflect most of the original core characteristics. Since their behavior is described using the LISA language, it is important to underline that the units are activated in sequence, i.e. the barrel shifter is activated by the multiplier and the shifter activates the ALU unit. To do so the LISA activation mechanism is exploited; the write_result operation, described in section 5.3, also belongs to this activation chain, since it is activated by the ALU unit operation. By setting a particular Processor Designer option, every subsequent operation scheduling, even within the same pipeline stage, is performed only after the behavior section statements have been executed.

5.2.1 The barrel shifter unit

The barrel shifter operations are described in a separate file (barrel_shifter.lisa), and the modeling style allows an independent HDL unit to be obtained, which receives an operand and some control signals as input and furnishes the expected second


[Datapath block diagram: the register file feeds the 32x8 multiplier and, through the A_bus and B_bus, the barrel shifter and the ALU; the barrel shifter carry out (bs_carry_out) and the CPSR C_flag are both available as carry sources for the ALU.]

Figure 5.3. LISARM datapath scheme

operand for the ALU unit as output. The description is made by two main operations, barrel_shifter_op_dc and barrel_shifter_op_ex, the former belonging to the decoding stage, the latter to the execute stage. When the barrel_shifter_op_dc operation is executed, it simply activates the expected suboperation among ASL, LSL, LSR, ASR and ROR. The meaning of these mnemonics was discussed in section 3.5.4 and, apart from the assembly syntax purposes, the behavior sections of these LISA operations set the pipelined control signals for the barrel shifter operation to be performed in the following machine cycle. Two alias operations are used: since logical and arithmetic shift left are the same operation with different assembly mnemonics, ASL is aliased by LSL; the other alias operation is RRX, whose encoding uses the ROR#32 combination and sets up a dedicated control signal. The barrel shifter functions that must be implemented in hardware are described in the barrel_shifter_op_ex operation. Here the required operation is

selected by reading the pipelined control signals: the 2-bit bs op line is inspected, in order to establish what type of shift must be performed. The shift amount is supplied by the 5-bit wide bs amount pipeline register, allowing all values between 0 and 31 to be expressed. bs special op and bs amount32 are two flags used to identify whether particular encoded shifter operations are expected. The former tells that a special operation, among those discussed at page 70, is necessary; the latter specifies that a 32-bit shift is required. By using a combination of these two signals, the output operand and the bs carry out can be set as defined by the ARM data sheet [16]. Figure 5.4 reports the complete barrel shifter port map, where bs opd is the 32-bit input, alu opd2 is the output directly destined to the ALU second operand input and bs carry out contains the last bit shifted out during the operation.

Figure 5.4. LISARM barrel shifter (ports: bs_opd and C_flag data inputs; 2-bit bs_op, 5-bit bs_amount, bs_amount32 and bs_special_op control inputs; alu_opd2 and bs_carry_out outputs)

The shift operation description (left and right, arithmetic or logical) uses C language shift operators, and a particular trick is used to perform the logical shift right operation1. Instead of executing a shift operation by a specified amount, many single

1The C “>>” operator performs an arithmetic shift right on signed operands and no dedicated logical right shift operator is defined.

bit shifts are executed and, at every new step, the most significant bit is reset. The rotate right operation is described by using a switch statement which selects one of the thirty-two possible values for the rotate amount. In every branch case a for loop is used to perform a shift right operation by a single bit, in order to save the shifted out bit and reinsert it at the most significant position. This description approach seems redundant, but it is the only one possible, because the LISATek HDL Generator does not accept variable indexes in “for” loops. To extract and modify single bits within an operand, the input value is loaded into a 33-bit CXBit variable and some of the predefined LISA language methods are exploited. By using the LISA casting functions the result is then converted into a suitable 32-bit data type for the barrel shifter output and a single bit for the carry out. The activation section then cedes control to the alu operation, for the selected computation.
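The workaround described above can be sketched in plain C. This is our own illustration, not the model's LISA code: with only an arithmetic “>>” available, LSR is built from single-bit shifts that clear the most significant bit, and ROR re-inserts each shifted-out bit at the top. The function names and the use of a variable loop bound are ours (in the actual LISA description the bound is fixed by the switch branch).

```c
#include <stdint.h>

/* Logical shift right built from arithmetic single-bit shifts:
 * after each shift the MSB is forcibly reset. */
static int32_t lsr(int32_t value, int amount) {
    for (int i = 0; i < amount; i++) {
        value >>= 1;                           /* arithmetic single-bit shift */
        value &= 0x7FFFFFFF;                   /* reset the most significant bit */
    }
    return value;
}

/* Rotate right built on single-bit shifts: the bit shifted out
 * is saved and reinserted at the most significant position. */
static int32_t ror(int32_t value, int amount) {
    for (int i = 0; i < amount; i++) {
        int lsb = value & 1;                   /* save the bit shifted out */
        value = lsr(value, 1);
        if (lsb) value |= (int32_t)0x80000000; /* reinsert it at the MSB */
    }
    return value;
}
```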

5.2.2 The arithmetic logic unit

The ALU functionality is described using the same method seen in the previous paragraph (5.2.1), so that the HDL generator will produce a block which respects the scheme reported in figure 5.5.

Figure 5.5. LISARM arithmetic logic unit (ports: alu_opd1 and alu_opd2 operand inputs, 4-bit alu_op and C_flag control inputs; result and carry_out outputs)

The port map is the same as in the ARM architecture and all the input signals, except the second operand alu opd2, which comes from the barrel shifter, belong to the DC/EX

pipeline register output. The carry in signal is selected between the barrel shifter carry out and the CPSR C-flag, depending on the operation to perform. All the ALU operations are defined within the alu operations file and the fundamental LISA operation is alu operation, which reads the selected operation on the 4-bit alu op pipelined signal and selects the logical or arithmetic operation to call. There is a set of sixteen LISA suboperations which can be called and correspond to the operations provided by the ARM data processing instructions. These suboperations can be used both by data processing instructions and by other ones, e.g. by the branch and memory access instructions to perform address calculations. The LISA suboperations are identified by the alu opcode naming scheme, where opcode is the mnemonic defined in table 3.4. The action performed by each single operation is also reported in the same table and the result is assigned to a global variable to ensure it can be accessed by every other operation within the model. For the basic ALU operation description a C-like syntax is used, so it can be considered a behavioral style of modeling. In order to update the CPSR N-flag, bit 31 of the result is assigned to a single bit variable; this information is also useful to establish whether an overflow has occurred during a sum or subtraction operation2. The CPSR flags update is executed only if the pipelined PSR update f flag is set, hence the alu operation activates some further operations besides the ALU calculation performed. The ADD/ADC, SUB/SBC and RSB/RSC operations are executed in different manners and the overflow condition checking is made by using different techniques; for this reason three LISA operations are provided: add PSR update, sub PSR update and rsb PSR update. A fourth operation, logop PSR update, is activated directly by alu operation for logical instructions and by the three listed operations for arithmetic instructions.
Not all the ALU operations need the result to be stored in a general purpose register: an address calculation, for example, requires it to be stored in the memory access register. The main operation establishes whether the result must be written and where, by checking the write result f and mem access f pipelined flags respectively, activating one of the write result and mem access operations. These operations are described within the other operations section (5.3).
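The sign-based overflow check mentioned above (instead of inspecting the carry into bit 31, see footnote 2) can be sketched as follows. The function names are ours; this is only an illustration of the technique, not the model's code.

```c
#include <stdint.h>

/* Overflow on a sum: both operands share a sign and the result sign differs. */
static int add_overflow(uint32_t a, uint32_t b, uint32_t result) {
    int sa = (a >> 31) & 1, sb = (b >> 31) & 1, sr = (result >> 31) & 1;
    return (sa == sb) && (sr != sa);
}

/* Overflow on a subtraction a - b: operand signs differ
 * and the result sign no longer matches the minuend. */
static int sub_overflow(uint32_t a, uint32_t b, uint32_t result) {
    int sa = (a >> 31) & 1, sb = (b >> 31) & 1, sr = (result >> 31) & 1;
    return (sa != sb) && (sr != sa);
}
```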

5.2.3 The 32x8 bit multiplier

The ARM processor architecture provides a 32x8 bit multiplier, which exploits the 8-bit Booth’s algorithm to perform both signed and unsigned multiplications. The internal ARM hardware uses an extension of the radix-4 Booth’s algorithm to implement a high speed multiplier [7], by using four carry save adder layers and some

2Checking the overflow condition by a control on the most significant bits carry is expensive in LISA, so the operand and result signs are taken into account.

other combinational logic to perform 2-bit shifts. This complex structure is quite difficult to describe by using the LISA language mechanisms, so the description of this part of the model uses a behavioral style, leaving to the HDL synthesizer the burden of the structural implementation. The signed operands must be in 2’s complement format and by using this unit it is possible to obtain a result which respects either the signed or the unsigned convention. To obtain the product of two 32-bit operands it is necessary to execute a multistep operation, which at every step shifts left the value furnished by the multiplier by eight bits or multiples thereof and then sums the partial results. The multiplier op LISA operation describes the multiplier behavior by a C language statement which performs the multiplication of the 32-bit extended 8-bit multiplier and the 32-bit multiplicand, without describing the operation in depth. The result is put on the output wire mul result w and this signal is assigned to the barrel shifter input within the multiplying operations described in section 5.7.

Figure 5.6. LISARM 32x8 multiplier (ports: multiplicand_w, 8-bit multiplier_w, 2-bit mul_sel and mul_ctrl inputs; mul_result output)
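The multistep scheme described above can be sketched as follows, for the unsigned case. This is our own illustration under stated assumptions: a behavioral 32x8 block is reused four times, each partial product being shifted left by a multiple of eight bits and accumulated; the function names are hypothetical.

```c
#include <stdint.h>

/* Behavioral 32x8 multiplier block, as in the model's C-style description. */
static uint32_t mul32x8(uint32_t multiplicand, uint8_t multiplier_byte) {
    return multiplicand * (uint32_t)multiplier_byte;
}

/* Four-step 32x32 product: one byte of the multiplier per step,
 * each partial product shifted by 8-bit multiples and summed. */
static uint32_t mul32(uint32_t multiplicand, uint32_t multiplier) {
    uint32_t acc = 0;
    for (int step = 0; step < 4; step++) {
        uint8_t byte = (uint8_t)((multiplier >> (8 * step)) & 0xFF);
        acc += mul32x8(multiplicand, byte) << (8 * step);
    }
    return acc;   /* low 32 bits of the product */
}
```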


5.3 Other LISA operations

The model LISA operations use some common suboperations to convert numeric values expressed in the assembly syntax and also to access the register content directly. These operations use some specific LISA language capabilities to return an opportunely converted numeric value to the calling LISA operation. Moreover, no overhead is generated in the model, because the same logic is shared among all the units that require the conversion. The reg index operation converts the unsigned numeric index used in the assembly syntax to express a source or target register and returns to the calling operation a 4-bit unsigned integer. The return value is used every time the register file must be accessed, in the same manner an array element is accessed in C. On the other hand, when the content of a register must be read, another similar operation can be used. In such a case the conversion is a bit different and includes also the register identifier (its name) and not only its index. The instruction also sets a global flag named pc f, which is useful to signal to the write result operation (described at the end of the paragraph) that a program counter (R15) access has been requested. A particular decoding operation is dedicated to the mechanism implemented by the ARM processor for the generation of some immediate values. The ARM assembler, in fact, accepts 32-bit immediate values as operands only if they can be expressed by using an 8-bit wide immediate value and a 4-bit wide right rotation semi-amount. Starting from these two data, the barrel shifter recreates the original value, introducing the immediate value into the barrel shifter and performing a right rotation by a number of bits equal to twice the amount value. It is obvious that only some particular integer values can be transformed into the so-called immed8 r form (all powers of two, for example). The ARM assembler is able to check whether an expressed immediate value can be transformed as described.
If the assembly instruction is an arithmetic, move or compare operation, an alternative form can be exploited: since each of these operations provides an opposite operation, the 1’s complement form of the operand is used and the operation mnemonic which appears in the assembly code is changed, i.e. ADD becomes SUB, MOV becomes MVN, CMN substitutes CMP and so on. The LISATek toolsuite does not allow the assembler to be developed in this manner but, to ensure the compatibility with ARM compilers and assemblers, some tools were added to the generated toolchain (ref. section 6.4). In order to allow the barrel shifter to extract the original value, the immed8 r LISA operation performs the decoding of both the immediate 8-bit value and the 4-bit amount, properly sets the op2 and bs amount pipeline registers and assigns the BS ROR operation to the bs op pipelined signal. The immediate values are recognized by calling two dedicated operations, discussed in section 5.3. Because the operand is put into the pipeline register, the registered opd flag must also be set.
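The immed8 r mechanism described above can be sketched in C. The function names are ours; the encoding rule itself (an 8-bit value rotated right by twice a 4-bit amount) is the standard ARM data processing immediate format.

```c
#include <stdint.h>

/* 32-bit rotate right. */
static uint32_t ror32(uint32_t v, unsigned r) {
    r &= 31;
    return r ? (v >> r) | (v << (32 - r)) : v;
}

/* Rebuild the immediate from the 8-bit value and the 4-bit semi-amount:
 * the barrel shifter rotates right by twice the amount. */
static uint32_t decode_immed8_r(uint8_t value, uint8_t amount4) {
    return ror32(value, 2u * amount4);
}

/* Check whether a 32-bit value fits the immed8_r form, as the
 * assembler does: some even rotation must leave only 8 low bits. */
static int is_immed8_r(uint32_t value) {
    for (unsigned amount = 0; amount < 16; amount++)
        if ((ror32(value, 2u * amount) & ~0xFFu) == 0)
            return 1;
    return 0;
}
```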


Other operations, referred to as immediate value <size>, are used to convert unsigned immediate values expressed in the assembly syntax, which represent operands. Their behavior is quite similar to the reg index operation, but some of them find and remove the typical “#” symbol used in most assembly languages. These operations convert unsigned integer values of different bit sizes: the 4-bit rotate amount and the 8-bit immediate value for the immed8 r format constants, the 5-bit amount for the barrel shifter operations, the 12-bit immediate offset for the memory access instructions and a particular 16-bit integer value for the block data transfer register list coding. Since the immediate offset addressing mode accepts positive or negative integer values, the sign expressed in the syntax is not treated directly by the immediate value 12 operation, which converts only the numeric value. The write result operation is used every time the result of an ALU operation has to be stored in a register and is activated directly by the alu operation (ref. section 5.2.2). By checking the branching and mem access f flags, this operation establishes whether the ALU result must be stored into the mem address reg for a memory access, into the BPC register for performing a branch or into the register file. In the last case, the destination register is selected by using the index stored into the reg2 i pipeline register and, if the PC (R15) is selected, a flush operation is scheduled3. Since some operations require content transfer between the CPSR and a banked SPSR, the PSR flag assign pipelined flag is checked and the processor mode register is used to select the proper register bank. The same statements are used to perform the transfers required by the MSR and MRS instructions (section 5.6).

5.4 The branch instructions

The ARM instruction set provides two types of branch instructions: a standard branch operation, with or without link (B or BL respectively), and a particular branch and exchange operation (BX), which provides the capability to switch between ARM and THUMB state. The BX instruction accepts the indication of the register containing the branch destination address, which must be an absolute value. On the other hand, the B and BL instructions use a branch offset and perform a PC relative jump to another address, requiring an ALU operation to determine the branch destination. When the decoding operation enters the branch operation group (branch grp), it has to inspect the coding of both possible branch operations to select the right subsequent LISA operation to activate, i.e. B dc or BX dc. The BX dc operation resolves the source register index and stores it in the output pipeline register, in

3Two clock cycles for the pipeline refilling are required.

order to allow the execution stage to retrieve the branch destination address. The BX ex operation, belonging to the EX pipeline stage, is activated, so that the other actions for performing the branch are scheduled for the subsequent machine cycle. Entering the execution stage, the operation schedules the predefined pipeline stall operation4, to avoid the execution of invalid and unnecessary instructions already loaded in the pipeline. Furthermore it copies the branch destination address, read from the operand register, into the branch program counter (BPC), which is then used by the prefetch operation. By using the predefined SetRegion5 method, the CPSR T flag is set to the operand least significant bit value (Rn[0]) and this can cause a state switch, from ARM to THUMB or vice versa. The standard branch instruction decoding is performed by the B dc LISA operation, which is far more complicated than the BX one. Here the branch destination address must be determined by using the PC content, and the offset can be expressed by using an immediate offset, an assembly label or another symbolic name. To do so, an ALU operation must be scheduled and the related pipelined signals need to be set. The decoding operation selects the right operation to activate, between imm branch offset and symb branch offset. The former operation reads and converts a 24-bit immediate offset, expressed in the assembly code (section 3.5.3), shifts it left by two bits and sign extends6 the result to 32 bits. This value is then saved into the bs opd pipeline register, whose output is directly connected to the barrel shifter input port. The symb branch offset operation performs the same operations seen above, but also accepts an assembly symbol, hence it has to convert this information in a different manner.
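The offset decoding just described can be sketched as follows: the 24-bit immediate is shifted left by two bits and then sign-extended to 32 bits by masking, mirroring the ad hoc mask approach of footnote 6. The function name is ours.

```c
#include <stdint.h>

/* Decode a 24-bit B/BL branch offset: word-align it (<< 2),
 * then sign-extend the 26-bit result to 32 bits with a mask. */
static int32_t branch_offset(uint32_t imm24) {
    uint32_t shifted = (imm24 & 0x00FFFFFFu) << 2;  /* word-align the offset */
    if (shifted & 0x02000000u)                      /* sign bit after shifting */
        shifted |= 0xFC000000u;                     /* ad hoc sign extension mask */
    return (int32_t)shifted;
}
```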
The LISA language provide some mechanisms to simplify these operations, which involve the same instruction mem- ory address and so the symbol table used both by the assembler and linker. The B dc instruction, finally, inspects the link bit used in the coding to establish whether the relative next instruction address must be saved in the link register. The link bit value depends on the presence of the“L”label in the assembly syntax and it enables further operations if the linkflag is set during the decoding step. The next stage instruction is activated. The B ex operation, like the other operation, schedules the pipeline flush and determines the branch destination address by adding the offset stored in the bs opd pipeline register to the program counter address, hence R15 is used as the first operand and assigned to the alu opd1 wire. The offset is assigned to the bs opd w wire and the other barrel shifter signals are properly modified, in order to avoid any shift operation7. The write result f flag is set and PSR update f is

4The flush method allows single pipeline registers or whole stages to be cleared by expressing their names in a C++-like syntax.
5The SetRegion method is a predefined LISA function which is able to modify groups of bits within a CXBit variable or register.
6The sign extension is executed by masking the value with an ad hoc mask.
7LSL#0 is performed, so that the barrel shifter output is identical to its input.

reset to avoid ALU flags modifications. The link flag is checked to establish whether the PC has to be saved into the link register (R14). The barrel shifter op operation is then activated and, if the branch is with link, the BL ex poll operation is also scheduled. The former implies the sequential activation of the alu operation and of the write result operation, so that the ALU result is saved into the branch program counter, used by the prefetch operation to perform the jump. The link register value correction is performed by the BL ex poll operation in three consecutive machine cycles; for this reason, after the first activation through the B ex operation, it needs to reactivate itself exploiting the polling mechanism. To do so, the EX/ED pipeline register is stalled during the first and second operation executions and, to respect the BL instruction cycle times, the correction is executed in the last cycle in which the operation is activated. The operation uses the branching counter variable to establish which step of the branch instruction is running and in the third cycle sets the input signals for the barrel shifter and ALU units. Here the subtraction is selected by setting alu op to the ALU SUB value and selecting R14 as the first operand; the constant value “4” is used as second operand and must not be modified when passing through the shifter (LSL#0 is performed). The destination register is set to R14 and, also in this case, the CPSR flags update is not required. The barrel shifter op operation is activated and so are the alu operation and write result operations; they perform the ALU operation and store the result in the link register. Both dc LISA operations set the branching global flag, so that the prefetch operation can execute the statements dedicated to performing the jump and some other instructions introduced for model debugging. The branching flag is reset after some machine cycles by an ad hoc operation, activated by the prefetch operation.

5.5 Data processing instructions

The ARM instruction set provides sixteen data processing instructions, which can be grouped in:

• Arithmetic operations: ADD, ADC, SUB, SBC, RSB, RSC.

• Logical operations: ORR, AND, EOR, BIC.

• Compare operations: CMP, CMN, TST, TEQ.

• Register move operations: MOV, MVN.

The first group provides the sum, subtraction and reverse subtraction operations in two versions, with or without the input carry or borrow affecting the calculation

result. The compare operations are based on arithmetic and logical operations, but their result only affects the CPSR ALU flags and is not written. The details on the actions performed by the single operations are discussed in paragraph 3.5.4 and reported in table 3.4. For assembly syntax and coding aspects, the logical operations are grouped with the arithmetic instructions in the arith logic grp. When the decoding operation enters the data processing operations group (data proc grp), it inspects the coding of the provided operations and activates the proper subgroup operation, which can be cmp grp, mov grp or arith logic grp. These groups have many differences in accepted argument number, type and assembly syntax, but expect the same coding for the second operand, as discussed in section 3.5.4. The second operand decoding operation can be immediate operand or shifted reg operand and it is activated by each of the group decoding operations reported above. The discrimination among the required operations is done by inspecting the 4-bit opcode field and, using this information, the subgroup operations can activate the right LISA operation. The cmp grp LISA operation sets the PSR update f flag in the output pipeline register without checking any coding field, because for these instructions the CPSR update is implicit and the “S” suffix can be omitted in the assembly syntax. The behavior section stores the first operand index in the reg1 i pipeline register and then the activation section selects the proper opcode dc operation, where opcode can be one of the CMP, CMN, TST or TEQ mnemonics. All the operations whose names are in the opcode dc form execute the same actions: in fact they set the alu op pipeline register to select the ALU unit operation which will be performed when the instruction enters the execution stage. The last section of these operations activates the data proc setup operation, i.e.
a set of statements which sets up many of the signals for the execution step control, like the ALU behavior and the register file writeback operation. The mov grp LISA operation sets the PSR update f flag only if the S-bit is set in the coding; this option is selected by expressing the “S” suffix in the assembly syntax and causes the CPSR flags update at the end of the instruction execution. The destination register index is stored and then the activation section selects the subsequent decoding operation between MOV dc and MVN dc. The former sets the ALU so that the second operand is transferred to the destination register as it comes from the barrel shifter output, the latter requires the bitwise complement of that output. The ALU setup is done by using the usual alu op pipelined signal. The arith logic grp LISA operation also sets the PSR update f flag only if the S-bit is set in the coding, so that the CPSR flags will be updated. The first operand and destination register indexes are stored in pipeline registers and then the activation section selects the subsequent opcode dc decoding operation, where opcode is a mnemonic among ADD, ADC, SUB, SBC, RSB and RSC. Their meanings are reported in table 3.4 and the ALU operation selection is done by using the alu op pipelined signal.
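The subgroup selection on the 4-bit opcode field can be sketched as a simple switch. The enum names mirror the groups above; the opcode values follow the standard ARM data processing encoding (AND=0 through MVN=0xF). This is an illustration of the decoding split, not the generated decoder.

```c
/* Data processing subgroups, as in the decoding operations above. */
enum dp_group { ARITH_LOGIC_GRP, CMP_GRP, MOV_GRP };

static enum dp_group dp_decode(unsigned opcode4) {
    switch (opcode4 & 0xF) {
        case 0x8: case 0x9: case 0xA: case 0xB:  /* TST, TEQ, CMP, CMN */
            return CMP_GRP;
        case 0xD: case 0xF:                      /* MOV, MVN */
            return MOV_GRP;
        default:                                 /* AND..RSC, ORR, BIC */
            return ARITH_LOGIC_GRP;
    }
}
```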


The second operand decoding operation is activated by every data processing instruction, in order to set the control signals for the barrel shifter unit, through which the operand passes before entering the ALU. To do so, the selected operation can be immediate operand or shifted reg operand. The first LISA operation directly activates the immed8 r operation without executing other statements, but its coding section is necessary to add the I-bit value (fig. 3.7), which discerns an immediate operand from a registered one. The immed8 r operation was already described in section 5.3. If the second operand is a registered value, the I-bit value in the coding is zero and the shifted reg operand operation is activated. The operation stores the second operand register index in the reg2 i pipeline register and the registered opd flag is then set. By inspecting the instruction coding, a further operation among shifted reg, non shifted reg, RRX and reg amount shifted reg is activated, in order to set the barrel shifter control signals into the pipeline registers. The shifted reg LISA operation assigns the amount for the barrel shifter operation to the dedicated pipeline register; to do so, a 5-bit immediate conversion function is used. All the other settings are executed by the barrel shifter op dc operation (section 5.2.1), which is then activated. If the registered operand value must not be shifted in any manner, the operand has to pass through the barrel shifter, reaching the ALU second operand input without alterations. To do so, the barrel shifter control signals must be set to perform an LSL#0 operation. The non shifted reg operation is provided to assign the BS LSL value to bs op and to reset the bs amount pipeline register and the other special barrel shifter operation flags.
The RRX LISA operation is activated if the homonymous barrel shifter operation is required; it acts as the non shifted reg operation discussed above, with the only difference that it assigns the BS ROR value to the bs op pipelined signal. This is because the RRX operation exploits the coding of the ROR#0 instruction, as described at page 70. Since the shift amount can be stored in a register, the reg amount shifted reg operation sets the registered bs amount flag and stores the source register index in the bs amount reg pipeline register. This situation implies a particular execution unit behavior, because the datapath must be used to read the source register in a first cycle and to perform the barrel shifter operation in the subsequent one. To do so, the data proc setup operation is activated, so that it is scheduled for the subsequent clock cycle (belonging to the EX stage). The register file access is done by the shift amount reg access LISA operation under the control of the scheduled operation; its statements execute a set of assignments to the bs special op f and bs amount32 f pipelined flags and to the bs amount register, setting up the barrel shifter unit for the next cycle operation. The values assigned depend on the shift amount register content and respect the ARM behavior described in the data sheet [16] and at page 70.
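A possible shape for these assignments is sketched below. The struct and flag names are ours, chosen to echo the pipelined flags above; the mapping of the amount register onto the flags is an assumption about how the model partitions the ARM rules (only the bottom byte of Rs is used, exact 32-bit shifts are a special case, larger amounts another).

```c
#include <stdint.h>

/* Hypothetical barrel shifter setup produced from the amount register. */
struct bs_setup {
    unsigned bs_amount;     /* 5-bit amount for the ordinary cases */
    int bs_amount32_f;      /* set when an exact 32-bit shift is encoded */
    int bs_special_op_f;    /* set for the remaining special cases (> 32) */
};

static struct bs_setup shift_amount_setup(uint32_t rs) {
    struct bs_setup s = {0, 0, 0};
    unsigned amount = rs & 0xFF;   /* only the least significant byte matters */
    if (amount < 32)
        s.bs_amount = amount;      /* ordinary shift, zero included */
    else if (amount == 32)
        s.bs_amount32_f = 1;       /* e.g. LSR#32: carry out = bit 31 */
    else
        s.bs_special_op_f = 1;     /* amounts above 32 */
    return s;
}
```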


As discussed above, when a data processing instruction must be executed, the data proc setup operation is activated in the execution pipeline stage, so that it is executed when the instruction enters the stage. Since the instruction execution may take more than one clock cycle, this operation exploits the LISA polling capabilities and, due to the registered bs amount value, it can stall the EX/ED stage to allow its reactivation in the subsequent machine cycle. After checking the condition valid flag, in fact, the operation evaluates whether the barrel shifter operation amount must be read from a register and, if required, the whole pipeline is stalled for one clock cycle. The polling activation variable is reset to avoid a new activation in the subsequent clock cycle and, as discussed above, the shift amount reg access operation is called. At the reactivation (second clock cycle) the setup operation activates the barrel shifter unit and assigns the registered control signals and operands to the internal wires connected to the ALU, so that the data processing instruction can be performed. If the instruction belongs to the compare group the write result f flag is reset, otherwise it is set to ensure the writeback operation is performed (section 5.2.2). The barrel shifter ex operation is finally activated, so that the datapath components can operate. When the processor executes a data processing instruction and the destination register is the program counter (R15), two further clock cycles are needed for the execution. This behavior is due to the fact that a pipeline flush is necessary and the pipeline refilling needs two machine cycles before the newly addressed instruction can be executed. To do so, the LISA operation schedules a flush operation on the PF/FE and FE/DC pipeline registers for the subsequent clock cycle.

5.6 PSR transfer instructions

The PSR transfer instructions are used to directly access the processor status registers CPSR and SPSR, modify their values or save them to general purpose registers. The MRS instruction copies the CPSR or SPSR content into the destination register expressed in the assembly syntax. On the other hand, the MSR instruction moves a register content into the CPSR or into the banked SPSR, respecting the processor operating mode. The latter instruction also allows the modification of the CPSR or SPSR ALU flags alone (ref. section 3.3.5), by using a registered quantity or an immediate value which can be expressed in the immed8 r format. All the PSR transfer instructions execute in a single machine cycle, hence no polling operations are used for the behavior modeling. The decoding starts with the PSR access grp LISA operation, which directly activates another operation among MRS dc, MSR dc and MSR flg dc. The MRS dc operation converts the Rd destination register index and selects the PSR source register

by checking the P-bit in the coding. The PSR selection allows different assembly formats, hence many suboperations are used to define the various syntax sections, even if the resulting coding is the same. The PSR transfer involves the datapath for the register file access, so the operation sets all the necessary signals to avoid data modifications; the write result f flag is also set. The PSR select pipelined flag is used to signal to the execution unit whether to transfer the current status register (CPSR) or the saved status register (SPSR), and the MRS ex operation is then activated. Entering the execution stage, the MRS ex operation takes control, checks the condition satisfaction and evaluates the PSR select pipeline flag to establish which PSR must be assigned to the barrel shifter input bs opd. If the SPSR has to be transferred, the processor mode is evaluated to perform the right register bank access. The datapath operations are enabled by activating the barrel shifter op operation and consequently the write result operation is activated at the end of the ALU operation. If the opposite operation must be performed, the MSR dc operation is activated for the decoding step; the operation converts the Rm source register index and selects the PSR destination register by checking the P-bit. For the PSR selection the previous considerations remain valid, and the datapath and write result f flag setup is also the same. The PSR transfer f pipelined flag is used to signal to the write result operation that a PSR is selected for the transfer operation, while the PSR select flag is used to signal to the execution unit whether to transfer the current status register (CPSR) or the saved status register (SPSR). The MSR ex operation is activated before exiting. Entering the execution stage, the MSR ex operation takes control and evaluates the PSR select pipeline flag to establish which PSR must be assigned to the barrel shifter input.
If the SPSR has to be transferred, the processor mode is evaluated to perform the register bank selection. The datapath operations are enabled by activating the barrel shifter op operation and consequently the write result operation is activated at the end of the ALU operation. To modify only the ALU flags within the current or saved PSR, the MSR flg dc operation is activated. The operation works as MSR dc for the recognition of the selected PSR, but activates an operation between MRS reg op and immediate operand to decode the source operand. The former operation converts the register index of the Rm source operand and assigns it to the reg2 i pipeline register; the registered opd flag is also set for the datapath configuration. The immediate operand operation directly activates the immed8 r operation, without executing other statements (its coding section is necessary to add a bit which allows an immediate operand to be discerned from a registered one). The immed8 r operation is described in section 5.3. A dedicated pipelined flag (PSR flag assign) is set to signal to the write result operation that only the PSR most significant bits are affected by the data transfer, so a masking operation must be executed. When the MSR instruction enters the execution pipeline stage, the MSR flg ex

operation, activated by the MSR flg dc operation, takes control and sets up the datapath according to the relative pipelined signals; the usual barrel shifter op operation is activated before exiting. The ALU result is used by the write result operation, which masks it opportunely and executes a bitwise “or” operation with the unaffected bit values read from the selected PSR; the obtained 32-bit value is then written back to the same PSR.
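The masking step just described amounts to the following, assuming the ARM convention that the N, Z, C and V flags occupy the top four PSR bits; the function name is ours.

```c
#include <stdint.h>

#define PSR_FLAG_MASK 0xF0000000u   /* N, Z, C, V in bits 31..28 */

/* Flag-only PSR write: keep the unaffected low bits of the old PSR
 * and take only the flag bits from the ALU result. */
static uint32_t msr_flg_write(uint32_t old_psr, uint32_t alu_result) {
    return (old_psr & ~PSR_FLAG_MASK) | (alu_result & PSR_FLAG_MASK);
}
```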

5.7 Multiplication instructions

The ARM processor provides two multiplication instructions for 32-bit operands which can be either signed or unsigned. The signed operands must be expressed in 2’s complement notation and one of the instructions (MUL) performs the multiplication returning a 32-bit result only, so the sign of the operands does not matter and can be omitted in the instruction syntax. The MULL instruction, instead, returns a 64-bit result by storing it into two registers (a high and a low register) expressed in the instruction syntax. To perform the operation the ARM processor uses the 8-bit Booth’s algorithm and accumulates the partial sums along a variable number of cycles. In this model, as discussed in section 5.2.3, the 32x8 multiplier block is not described in depth but only in behavioral style. Due to the multiplying method, the instructions can save some machine cycles: the most significant bits are inspected in 8-bit groups and, if they are all ones or all zeros, some multiplication steps are not performed (refer to section 3.5.6). The discussed instructions also provide the accumulate option, so that a previously registered result can be added to the multiplication result, saving an ADD instruction. All the instructions accept four registers in their coding: Rs and Rm are always the multiplier and multiplicand respectively, Rd represents the destination register in the standard instruction version and the high register in the long multiply instruction, while Rn is the accumulator in MUL/MLA and the low register in the long version. The long multiply instruction uses the destination register also as accumulator, so a previous value must be stored there. The S-bit is inspected to establish whether the CPSR flags must be updated and this information is stored into the PSR update f pipelined flag. The multiply grp decoding operation inspects the condition field and converts the involved register indexes, assigning the pipeline signals for the execution stage.
As discussed above, the register usage changes between the multiply instructions and their long versions but, in the coding scheme, their positions are the same. Since a fourth register is used, its index is assigned to the reg3 i pipeline register. The other coding bits are evaluated to select the subsequent LISA operation to be activated. If the MUL or the MLA form of the instruction has to be executed, MUL dc is selected. The operation must give Rd = Rm ∗ Rs if the A-bit is zero and

138 5 – The LISARM model

Rd = Rm ∗ Rs + Rn if it is set; in order to signal this condition to the EX stage the mul acc f pipeline flag is set. The multiply and accumulate form of the instruction gives Rd = Rm ∗ Rs + Rn, which can save an explicit ADD instruction in some circumstances. The datapath signals are set up to perform the destination register initialization during the first execution cycle, setting all its bits to zero if the accumulate flag is reset or assigning the Rn register value otherwise. To do so the ALU MOV operation is assigned to the ALU alu op input. The BS LSL value is assigned to the bs op pipeline register and remains the same for all the steps; the bs amount shift amount register, instead, starts from zero in the second step and is incremented by eight units at every subsequent reactivation of the operation8. The write result f pipelined flag is also set, in order to save the partial multiplication result in the destination register. The MUL ex operation is then activated, so it takes control when the instruction enters the execution stage. The operation exploits the LISA polling mechanism to ensure its reactivation during subsequent cycles, until all the multiplication steps have been performed. The first execution cycle only initializes the destination register and schedules a pipeline stall for the next execution cycle. At the first reactivation, the least significant bits of the multiplier are selected and assigned to the mul opd8 8-bit input of the 32x8 multiplier. The multiplicand value is assigned to mul opd32 and does not change any more. The 32x8 multiplier result is assigned to bs opd at every cycle, shifted by the needed number of bits and sent to the ALU, which adds it to the destination register content. To activate the multiplier, the multiplier op operation described in section 5.2.3 is activated; the barrel shifter op operation is activated by the same operation.
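The cycle-saving check on the remaining multiplier bits can be sketched as follows; this is a behavioral approximation (the function name is illustrative), counting how many 32x8 steps would actually run for a given multiplier value:

```python
def mul_steps(multiplier):
    """After each 8-bit step, inspect the remaining high bits of the
    multiplier: if they are all zeros or all ones, skip the rest."""
    steps = 1
    for low in (8, 16, 24):            # groups [31:8], [31:16], [31:24]
        width = 32 - low
        rest = (multiplier >> low) & ((1 << width) - 1)
        if rest == 0 or rest == (1 << width) - 1:
            break
        steps += 1
    return steps
```

Small positive or sign-extended negative multipliers thus finish in a single step, while a full 32-bit pattern needs four.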
For the reasons explained above, at every cycle the most significant bits of the multiplier are inspected to establish whether the subsequent step has to be performed or not. At the first reactivation of the operation the bits [31:8] are masked and evaluated: if they are all zeros or all ones no stall cycle is scheduled, otherwise one is. To establish which step is in execution, the cycle c counter is used and incremented at every cycle; this information is also used to perform the left shift of the partial 32x8 multiplication result, which must be added to the partial result stored in the destination register at every cycle. The various datapath configuration signals are simply transferred from the pipeline to the units and the barrel shifter op operation is activated. At the second reactivation the multiplier bits to be inspected are [31:16] and at the third [31:24]. If any of these groups is all ones or all zeros the last cycle is executed immediately, otherwise the 32x8 multiplication operation is activated and executed, selecting the right group of bits to send to the 8-bit multiplier input. The MULL and MLAL instructions behavior is modeled by the MULL dc and MULL ex operations, respectively in the decoding and execution phase. The first

8 The multiplication mechanism is described in section 5.2.3.

operation assigns the same pipeline signals as the MUL dc operation and, in order to establish whether the sign of the operands must be considered or not, the signed f flag is set or reset according to the U-bit defined in the coding. The MULL ex operation is then activated, so it takes control when the instruction enters the execution stage. The operation must give {RdHi,RdLo} = Rm ∗ Rs if the A-bit is zero and {RdHi,RdLo} = Rm ∗ Rs + {RdHi,RdLo} if it is set; to signal this condition to the EX stage the mul acc f pipeline flag is set. The datapath signals are set up to perform the initialization of the two destination registers during the first execution cycle, setting all their bits to zero if the accumulate flag is reset. To do so the ALU MOV operation is assigned to the ALU alu op input. The BS LSL value is assigned to the bs op pipeline register and remains the same for all the steps; the bs amount shift amount register, instead, starts from zero in the third step and is incremented by eight units at every subsequent reactivation of the operation. All the other operations performed are similar to those seen for the ordinary versions, but the result is stored in two registers instead of one. To do so, some local variables and masks are used within the LISA operations, in order to ensure the correct multiplication algorithm implementation. All the multiply instructions, if the PSR update f pipeline flag is set, have to update some of the CPSR flags with respect to the result obtained. Only the N (Negative) and Z (Zero) flags are correctly set with respect to the result (N is made equal to bit 31 or 63 of the result, and Z is set if and only if the result is zero). The C (Carry) and V (oVerflow) flags are unaffected, respecting the ARM processor behavior which assigns them meaningless values.
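The flag update just described amounts to the following sketch (the width argument distinguishes the 32-bit MUL result from the 64-bit MULL one):

```python
def multiply_nz(result, width=32):
    """N copies the top bit of the (32- or 64-bit) result,
    Z is set iff the whole result is zero; C and V are untouched."""
    n = (result >> (width - 1)) & 1
    z = int(result == 0)
    return n, z
```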

5.8 Single data transfer instructions

The processor furnishes two fundamental instructions for memory access; the transfer involves a general purpose register (program counter included) and a memory location. A "load" and a "store" instruction are provided and both word and byte sized data transfers can be performed. A subtle variation of these first operations allows the sign extension of byte and half-word sized data, in order to support signed and unsigned sub-word sized data types. Many addressing modes are provided, also in PC-relative version; the ordinary operations also accept a shifted register offset, while the signed/unsigned versions of the instructions do not allow expressing a barrel shifter modification of the registered offset. Immediate offsets are also accepted; the difference between ordinary and signed/unsigned instructions is in the bit width, 12-bit for the former and only 8-bit for the latter. The complete set of signals for the memory management unit is defined in the model, in order to respect the memory model boundaries. Big endian memory format management

is not yet implemented, for reasons connected to the development environment. The description of the LISA operations which implement the signed and unsigned data types is given at the end of the paragraph, recalling what was discussed for the ordinary ones. The main decoding operation is mem access grp, which directly activates one of three possible suboperations: std data grp includes the LDR and STR instructions, su data grp includes the signed and unsigned data versions and block data grp provides the support for the stacking multiple data transfer operations described in paragraph 5.9. The std data grp operation sets all the pipeline signals for the address calculation, which will be performed by the execution stage in the subsequent clock cycle. The operations can access the memory in user mode or in privileged mode, so the privileged mode access f pipeline flag must be set or reset to allow the relative actions to be performed by the memory management unit. The source or destination register is selected by storing its index in the regd i pipeline register and the transfer data size is selected by the byte access f pipeline flag. The operation activates the LDR ex or the STR ex operation, belonging to the execute pipeline stage, by inspecting the load/store bit in the coding. By using the LISA coding inspection mechanism, one of the address decoding operations is activated; the ARM processor, in fact, accepts three addressing modes, PC-relative, pre-indexed and post-indexed, also with no base displacement specification. If the offset is not specified, the zero offset LISA operation is activated, so that only the base register index is stored in the reg2 i pipeline register and, to signal that to the execution unit, the registered opd pipeline flag is set.
By default this addressing mode is considered a pre-indexed access, but the writeback operation is not scheduled, so the pre npost indexed f and writeback f pipeline flags are set and reset respectively. The actions performed during the first clock cycle aim to transfer the base register content into the mem address reg, which is connected to the address bus during data load or store operations. To inhibit any barrel shifter operation the BS LSL value is assigned to the bs op pipelined signal and the shift amount register is set to zero. The ALU must also be crossed without modifications in the base register value, so the alu op pipelined signal is set to ALU MOV. If the assembly instruction does not contain a register indication, the addressing mode is PC-relative and the decoding operation program relative converts the 12-bit wide immediate offset by calling the immediate value 12 operation. The converted value is then assigned to the op2 pipeline register. With respect to the zero offset operation, the pipeline signals setup differs only in the ALU operation selected, because the fetch address contained in R15 must be used to calculate the memory access address. By inspecting the up/down bit value, either ALU ADD or ALU SUB is selected, respecting the immediate offset sign expressed

in the instruction syntax. Since the program counter is used as the base register, the writeback operation is not permitted in this addressing mode, so the relative flag and the pre npost indexed f signal are assigned in the same way. Some more details need to be discussed for the other LISA operations, which decode the pre-indexed and post-indexed memory access modes. The pre indexed operation assigns the base register index to the reg1 i pipeline register and, by inspecting the W-bit, sets the writeback f pipeline flag if the writeback operation is requested by the assembly syntax. The operation activates the right operation, between immediate offset and shifted reg offset, in order to resolve the flexible offset encoding [19]. The immediate offset operation executes the same operations seen for the PC-relative mode, converting the 12-bit immediate offset and storing its value in a pipeline register. Depending on the sign of that value, the up/down bit is set in the instruction coding and this information affects the ALU operation selected through the dedicated pipeline signal: it will be ALU ADD for a positive offset and ALU SUB otherwise. The behavior of the shifted reg offset operation, activated when a register shift operation is needed to perform the offset calculation, is quite different. Here all the barrel shifter operations for the second operand modification, discussed in section 5.5, are allowed, except registered shift amount transformations. The up/down bit in the coding depends on the assembly syntax and selects the ALU operation to be performed in the execution step. The operation stores the Rn index in the op2 pipeline register and a bit of the coding is set to select this operation instead of the immediate offset one. Through the reg shift label, a barrel shifter operation between shifted reg and RRX is selected and activated. Their barrel shifter signal setup is discussed in the data processing section 5.5.
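The pre/post-indexed semantics decoded by these operations can be summarized with a small sketch (the function name and flags are illustrative, not model identifiers):

```python
def ldr_str_address(base, offset, up, pre_indexed, writeback):
    """Return (memory address, updated base register value).
    Pre-indexed: offset applied before the access, base updated on W-bit.
    Post-indexed: access at the base, then the base is always updated."""
    modified = (base + offset if up else base - offset) & 0xFFFFFFFF
    if pre_indexed:
        return modified, (modified if writeback else base)
    return base, modified
```

For instance, a pre-indexed access with a positive offset and no writeback uses base+offset as the address but leaves the base register untouched; a post-indexed access always uses the plain base and updates it afterwards.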
The post indexed operation accepts the flexible offset syntax and coding, as the pre-indexed dedicated operation, and also the base register indication seen above, but it does not accept the writeback request, because the base register is implicitly updated before the address is transferred to the memory management unit. Some differences must be underlined in the assembly syntax, because of the square brackets position and use, as reported in section 3.5.8. For these reasons the pipelined signal setup for the execution units does not differ from the pre indexed LISA operation. When the memory access operation enters the execution stage, the scheduled operation takes control. If the instruction register contains a store instruction, STR ex is activated and, at the first execution cycle, it reschedules itself by using the LISA polling mechanism. Before accessing the memory, in fact, the memory address for the access has to be calculated, so the barrel shifter ex is activated when exiting from the operation. The memory access is performed in the subsequent machine cycle and the cycle c

counter9 is used to discriminate between the cycles. The pipeline is completely stalled to allow the described execution steps to be performed. The barrel shifter ex operation activates the ALU operation and this one activates the write result operation, which checks the mem access cycle f flag to establish if the ALU result must be stored to the mem address reg instead of affecting the register file. At the reactivation, the operation executes the setup of the memory interface signals, assigning BS, nRW and MAS and taking into account the state of the byte access f pipeline flag. If the writeback f pipeline flag is set, the writeback op operation is activated, so that the mem address reg value is written back to the Rn register, by setting the datapath in the usual manner and using the reg1 i pipeline register value for register file indexing. nMREQ and SEQ are defined step by step by the prefetch operation, because they play their role at every clock cycle and at every memory access, i.e. an instruction fetch or a data transfer operation. The address bus (A) and the data bus (D or DOUT) value assignments are performed by a specific LISA operation, which also has the effect of reading the source register value, by using the regd i pipeline register content as register file index. The data load instruction LDR behavior is modeled by the LDR ex operation, which performs a set of operations similar to the store STR ex operation but takes one more machine cycle to complete. In the first cycle the address is calculated in the same way seen above, during the second cycle the setup of the memory interface signals is executed and in the third cycle the data bus is sampled to store the value supplied by the memory management unit. To discern which step of the multicycle operation is in execution the cycle c counter is used and in the first two steps a stall operation is scheduled, in order to allow the LDR ex operation reactivation.
At the first reactivation (second execution cycle), the operation activates the writeback op operation if needed and sets up the memory interface signals, assigning BS, nRW and MAS for the access. The address bus (A) value is assigned by another specific LISA function, designed expressly for memory read operations. In the third step the data bus (D or DIN) is accessed and the sampled value is saved to the mem data reg global register. The same value is written to the Rd register by setting the datapath signals appropriately, so that no data modifications are performed. If R15 is selected, two further clock cycles are needed for the execution. This behavior is due to the fact that a pipeline flush is necessary and the pipeline refilling needs two machine cycles before the new addressed instruction can be executed. To ensure this behavior the write result LISA operation schedules a flush operation on the PF/FE and FE/DC pipeline registers for the subsequent clock cycle. If a load or store instruction involves a halfword sized data or when a load instruction requires the sign extension of the transferred data, the su data grp

9 cycle c is a 2-bit counter used by all the memory access LISA operations that require more than one clock cycle to complete.

operation is directly activated by the mem access grp one. Here different suffixes are used to express the data size and the signed version of the instruction, and two bits of the coding scheme report this information, the S-bit and the H-bit. A new pipelined signal, signed f, is used to perform the setup of the execution stage and is asserted if the S-bit is active. The coding of these four operations has only a few differences with respect to the standard STR/LDR operations and the addressing modes are substantially the same described above. The accepted immediate offset is only 8-bit wide and no shift operation is allowed for registered offset values. The coding of the unsigned immediate value is divided into two nibbles (4-bit groups), but the bit splitting capabilities of the LISA language permit the correct conversion of the information and its storage into a pipeline register reserved for ALU operands. All the other bits, like the up/down (U-bit), writeback flag (W-bit), load/store bit (L-bit) and pre/post indexing bit (P-bit), maintain the same position. The decoding operation behavior is similar to the analogous section in the std data grp and activates one operation among zero offset, program relative, pre indexed and post indexed in order to recognize the addressing mode and set up the ALU unit pipelined signals. These operations are the same discussed above but the last two of them, checking signed f, set the barrel shifter to perform a BS LSL with zero amount, to avoid registered offset modifications. When an instruction among LDRH, STRH, LDRSB and LDRSH enters the execution stage, the su memory access operation, activated by su data grp in the decoding stage, takes control and performs the same operations described for LDR ex and STR ex, with some differences due to the subword sized data masking and addressing prescriptions.
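The masking and sign extension performed for these loads can be illustrated as follows (the function is only a sketch; the mask values match what a header file of constants would define):

```python
def load_mask_extend(bus_value, bits, signed):
    """Keep only the transferred byte/halfword and, for LDRSB/LDRSH,
    extend the sign by OR-ing in the complementary all-ones mask."""
    mask = (1 << bits) - 1
    value = bus_value & mask
    if signed and value & (1 << (bits - 1)):
        value |= 0xFFFFFFFF ^ mask      # sign extension mask
    return value
```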
Here the STRH store instruction has to drive only the sixteen data bus lines involved by the source value, so the other bits are reset by a masking operation applied directly to the mem data reg register. The same happens for the LDRH load operation, where the proper mask is applied to the mem data reg register to zero all the bits which are not significant, before the register file update. The signed data load operations, instead, have to check the most significant bit of the value loaded into the mem data reg register to choose the right mask to apply, in order to perform the sign extension expected by the instruction for the involved data size. For the memory interface signals there is no need to reset their values, because the nMREQ signal is active only when a memory access is needed and is directly managed by the main operation, which sets its value in the right manner cycle by cycle. The MAS signal is always assigned by using values defined in the model header file, i.e. wordsize, halfwordsize and bytesize. When a subword sized data is transferred, a masking operation is performed to avoid data overwriting. This is necessary for data sampled from the data bus and stored to the register file and is useful when the data bus is driven by the processor. In some respects this method reduces the system power consumption but, due to the ARM7 specification,

the replication of subword sized data must be performed. This problem is treated in paragraph 6.2, where the memory wrapping is discussed. All the masks used for single bytes and half-words are defined in the header file and, to provide the memory address expected by the processor specification, some tricks are used. These aspects are also discussed in a dedicated paragraph in chapter 6 (section 6.2).
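The replication required by the ARM7 bus specification (and delegated here to the wrapper, since LISARM masks instead of replicating) would behave like this sketch:

```python
def replicate_store_data(value, size):
    """ARM7 store behavior: a byte appears on all four byte lanes,
    a halfword on both halfword lanes; a word passes through unchanged."""
    if size == "byte":
        return (value & 0xFF) * 0x01010101
    if size == "half":
        return (value & 0xFFFF) * 0x00010001
    return value & 0xFFFFFFFF
```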

5.9 Block data transfer instructions

The block data transfer operations allow loading (LDM) and storing (STM) a set of general purpose registers from or to memory; they are designed to create and manage memory stacks in a very flexible manner. The instructions support all the possible stacking modes: starting from the base register value, the memory address can be pre or post incremented or decremented, so that the stack can grow up or down in the memory space. The decoding operation (block data grp), activated by the mem access grp operation, inspects some coding bits to understand which type of addressing mode is required and assigns the pipeline flags for the execution stage operations. Only one register index can be expressed here, to define the base register for the memory addressing. The base register content must be transferred to the mem address reg during the first execution cycle, so the datapath setup is performed by assigning the dedicated pipelined signals, particularly the write result f and mem access cycle f flags. The writeback operation is also allowed and another piece of information, about the banked registers access for the transfer operation, is defined within the coding by the S-bit. If the S-bit is set, in fact, both the load and store instructions require the transfer of the user bank registers, even if the operating mode differs from user mode. This behavior is useful in process switching mechanisms and has the particularity of allowing the processor mode change, by transferring also the SPSR to the CPSR, if the program counter is in the register list. The decoding of the register list is performed by calling a 16-bit immediate value conversion function and the returned value is stored into an ad hoc global 16-bit register. Some details about the register list conversion are furnished in section 6.4, where some LISA language and LISATek toolsuite limits are discussed and an external solution is proposed.
This register contains a flag for every possible register file index and, by setting or resetting a single bit, the pointed register can be included in the transfer or not. During the execution phase the block data transfer polling LISA operation takes control and, setting up the datapath according to the relative pipelined flags, transfers the base register value into the mem address reg. If the writeback operation is requested, the writeback op is activated during the first execution cycle, so that

the mem address reg value is written back to the base register in the following machine cycle. At every execution of the block data transfer operation a pipeline stall for the subsequent cycle is scheduled, in order to allow the operation to be reactivated automatically in the following machine cycle. No transfer operation between the memory and the processor is performed until the second execution cycle and subsequent ones, when the same operation is identically repeated for every register in the list. In order to perform the register transfers in numerically growing order (by index), the reg list global register is inspected bit by bit, starting from the least significant position (which contains the R0 relative flag) and growing until a non-zero bit is found. The pointed position represents the first register affected by the transfer operation and its index is used as the destination register for an LDM instruction or as the source register for an STM instruction. Here a check on the presence or absence of the PC in the register list is performed, by inspecting bit 15 of the reg list register. Since the byte access f pipeline flag is used to store the S-bit value, it is accessed to establish, in combination with the previous information, which register bank is involved in the transfer and if the CPSR must be overwritten with the banked SPSR. Then the memory interface signal setup is performed, in the same manner discussed in the previous paragraph (5.8); the address used for the memory access is incremented or decremented before or after the access itself, according to the relative pipeline flag. The store operation executes exactly n cycles, addressing and transferring the n registers that have their flag set high in reg list. The load operation, instead, starts the data bus sampling in the third execution cycle, so the datapath for storing the mem data reg content must be configured during the second step.
The load operation takes one more cycle with respect to the store operation, because of the required sampling of the values and their writeback into the register file. Both instructions do not schedule other stall cycles when no further set bits are found in the reg list register; if the last register to be transferred is R15, the SPSR content is also copied into the CPSR. In this case a pipeline flush is scheduled and two further clock cycles are required for the pipeline refilling.
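The bit-by-bit scan of the register list, lowest index first, behaves as in this short sketch (the function name is illustrative):

```python
def ldm_stm_order(reg_list):
    """Registers transferred by LDM/STM, in growing index order:
    bit i of the 16-bit list enables the transfer of Ri."""
    return [i for i in range(16) if (reg_list >> i) & 1]
```

For example, ldm_stm_order(0b1000000000000101) yields [0, 2, 15]; the presence of 15 (the PC) is what triggers the SPSR copy and the pipeline flush described above.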

5.10 The data swap instruction

The data swap instruction is used to swap a byte or word quantity between a register and the memory. This instruction is implemented as a memory read followed by a memory write operation, "locked" together. The processor cannot be interrupted until both operations have completed and, in order to avoid memory content modification, the memory management unit is warned to treat them as inseparable, refusing memory access to other peripherals. The execution of a swap operation is

signalled to the memory management unit using the LOCK processor output, which remains high during the operation execution. The instruction decoding is performed by the SWP dc LISA operation, which is activated by the other grp operation. By inspecting the instruction coding, the indexes of the source, destination and base registers are stored to dedicated pipeline registers and the barrel shifter and ALU setup signals are assigned to allow the base register content to be saved into the mem address reg. The B-bit in the coding is inspected to establish the transfer data size, which can be a word or a single byte; this information is stored into the byte access f pipelined flag. The decoding operation activates the SWP ex operation and here the usual LISA polling capabilities and the cycle c counter are used to schedule four subsequent machine cycles. During the first execution cycle the pipelined signals for the execution units (ALU and shifter) are transferred to their wires and the stall operation is scheduled for the subsequent machine cycle. At the first reactivation (second execution cycle), the operation sets up the memory interface signals, assigning BS, nRW and MAS for the memory read operation and the relative data size. The address bus (A) value is assigned by the specific memory access LISA operation and the LOCK signal is tied high. In the third step the data bus (D or DIN) is accessed and the sampled value is loaded into the mem data reg global register. The same value is written to the Rd register in the subsequent cycle, by setting the datapath signals so that no further data modifications are performed. In the same cycle the store operation is prepared, inverting the value of the nRW signal and keeping the LOCK signal high. Here the Rm register value is assigned to the data bus and the last stall operation is scheduled. During the fourth cycle the Rd register is written back and the LOCK control signal is tied low.
When a byte wide data is transferred, the masking operations already discussed for the ordinary load and store operations are also performed.
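At a purely functional level, ignoring the cycle-by-cycle bus protocol, the locked read-then-write sequence can be modeled as in this sketch (illustrative names, assuming a simple word-indexed memory dictionary):

```python
def swp(memory, address, rm, byte_access):
    """Atomic swap: the old memory content goes to Rd while Rm is
    stored; in hardware LOCK stays high across both accesses."""
    mask = 0xFF if byte_access else 0xFFFFFFFF
    old = memory.get(address, 0) & mask
    memory[address] = rm & mask
    return old                         # value written back to Rd
```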

5.11 Software interrupt and undefined instructions

The software interrupt instruction (SWI) and the undefined instruction (UND) share the behavior of changing the program counter content to execute a jump to the exception vector (section 3.4), allowing exception handling. Both instructions have to evaluate the condition field and the CPSR flags in the usual manner and, if the condition is true, the expected operations are performed. The difference between the two instructions is that the software interrupt is executed without checking any other signal, whereas the undefined instruction is not. This is due to the fact that, when the UND instruction is executed, the same instruction is passed to the

coprocessors connected on the data bus and only if the dedicated handshaking lines are tied low is the instruction really considered as undefined and the undefined instruction trap taken. Entering the decoding stage, the other grp operation is activated and, by inspecting the instruction coding, one operation among SWI dc, UND dc and SWP dc (see section 5.10) is selected and activated. The SWI dc operation ignores the 24-bit comment field, so it does not pass anything to the entered supervisor mode. In order to change the program counter value, assigning the software interrupt handler address (0x08), the datapath must be properly set to execute a MOV operation and the barrel shifter has to execute an LSL#0 operation, so that any data modification is avoided. The destination register index (15) is stored into the relative pipeline register and the write result flag is set. Then the operation activates the SWI ex operation, which is executed in the subsequent clock cycle. When the SWI instruction enters the execute pipeline stage the PC saving is executed, by assigning its value to the supervisor mode link register (R14 svc), and the CPSR is also saved into the SPSR svc register. The processor mode variable is set to the supervisor value (defined in the header file) and this information is transferred to the mode bits by the main LISA operation. Because of the jump operation, a pipeline flush operation is scheduled, so the refilling takes two further clock cycles to complete the instruction execution. The undefined instruction is decoded by the UND dc LISA operation, which activates the EX stage UND ex operation, doing nothing more. When the instruction enters the pipeline EX stage, the LISA polling mechanism is exploited to re-execute the operation for two consecutive cycles and the cycle c pipelined signal is used for step counting; the whole pipeline is obviously stalled.
In the first execution cycle the coprocessor interface signal nCPI is tied low to start the handshaking (section 3.8) and in the subsequent clock cycle the CPA and CPB signals are evaluated to establish if the instruction is accepted by a coprocessor or not. If both signals are tied low the instruction is really undefined, so the exception vector for the undefined instruction trap (0x04) must be stored into the program counter. To do so, the ALU and barrel shifter pipelined signals are assigned in the same way seen for the previous instruction and a pipeline flush is also scheduled, due to the jump operation. To perform the operating mode change, the program counter value is assigned to the undefined mode link register (R14 und) and the CPSR is saved into the SPSR und register. The processor mode content is set to undefined, so that the main operation can properly assign the mode bits.

Chapter 6

LISARM support tools

This chapter describes the most important aspects of the tools used, generated with the LISATek development environment and then adapted to other tools already available for ARM. Since the LISARM model and the generated toolchain are not completely compatible with the ARM7TDMI specifications, some solutions are proposed here, in order to allow the memory interfacing and to make commercial ARM family tools exploitable with the obtained model. Some further information about the generated VHDL and related tests and simulations is also reported in the last paragraph.

6.1 The ARM LISA simulator

In order to check the modelled processor behavior, the LISATek generated C++ simulator has been used intensively. The Processor Debugger GUI accepts an object file as input and allows monitoring various model resources, like registers, memories, internal signals, pipeline behavior and events which occur during the program execution. By inspecting the windows reporting the assembly, the disassembly and the LISA microcode, the LISA operations behavior can be checked in depth, step by step, verifying its effects on the processor state. The model development has undergone two main phases:

• The instruction-accurate model description.

• The cycle-accurate model description.

In the first phase the pipeline behavior has been ignored and the instruction set coding and syntax of every instruction has been described. In this early model all the operations were executed in a single clock cycle, in order to check the functionality of the various parts. In the second phase the behavior of the pipelined structure has been modelled and the LISA operations code has been distributed over the respective pipeline stages. In all these steps the Processor Debugger played a fundamental role, and a vast set of assembly files allowed many processor functionalities to be tested. All the instructions described in chapter 5 have been tested, by using the various operand syntax expressions, addressing modes and conditional execution mnemonics in the written assembly code. Assembler, linker, disassembler and the LISATek simulator allowed an intensive verification work, in which, step by step, the LISARM model grew while respecting the original ARM7TDMI core behavior.

6.2 The memory wrapping

The LISARM model memory management differs from the ARM processor behavior in some aspects, because the LISATek toolsuite and the LISA language do not allow the same sub-word addressing features of the original processor to be implemented. The LISA model resource section (ref. paragraph 5.1.1) defines a 32-bit memory organization, with 8-bit sub-blocks and a 32-bit address bus for the memory interfacing. Because every increment of the value which drives the address bus produces a 32-bit displacement, the address bus itself does not furnish enough information to the memory management unit for accessing sub-word sized data types. The Memory Access Size (MAS) output expresses only the data size, so the position of a byte or halfword which is not on a 32-bit boundary has to be communicated by using an additional signal, the 2-bit BS line cited in paragraph 5.1.2. This signal is assigned cycle by cycle as the two least significant bits of the program counter (for an instruction fetch) or of the mem address reg (for a data access). To obtain an ARM-compliant memory interface, the memory wrapper has to consider the thirty least significant bits of the address bus as the thirty most significant bits of the real address, appending the BS value as its two least significant bits. In any case, the BS value tells the real position of the data wanted by the system, hence other approaches are allowed, depending on the memory system specifications. The other problem presented by the LISATek model is the replication of sub-word sized data during store instruction execution (STRB or STRH). The ARM processor furnishes the address of the exact location to access and assigns the same data byte on all the byte boundaries, or the data halfword on the corresponding halfword boundaries. The byte-by-byte access can be performed also with LISARM, resolving the real address by using the BS output value as described above.
Moreover, if the memory system requires this feature, the byte or halfword data can be connected to the other data bus lines dynamically, by observing the MAS output value. This choice, anyway, has to consider power consumption aspects as well, which can advise against its implementation.


The sub-word load operation executed by the ARM processor also has particular features that can not be ignored, because it expects the data to be sampled in the right position with respect to the addressed memory location. The memory management unit has to evaluate the MAS signal to establish the size of the required data, and the wrapper must perform the real address calculation. The replication of the byte or halfword value on all the other lines is optional and, for power saving considerations, it can be avoided. All the other memory interface signals, such as nMREQ, SEQ, nRW and also the main clock input (MCLK), can be connected directly to the memory management unit. The nMREQ signal can be monitored in order to deactivate the wrapper when memory accesses are not required by the processor. The memory wrapper scheme is reported in figure 6.1.


Figure 6.1. Memory wrapper scheme
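To make the recombination concrete, the wrapper address arithmetic and the sub-word sampling can be sketched as a small software model. This is an illustrative sketch only, not part of the thesis tools: the function names are invented here, while the MAS encodings follow the ARM7TDMI data sheet (00 byte, 01 halfword, 10 word) and the little endian byte ordering assumed by the LISARM model.

```python
# Hypothetical software model of the memory wrapper address handling.
# MAS encodings as in the ARM7TDMI data sheet: 00 byte, 01 halfword, 10 word.
MAS_BYTE, MAS_HALFWORD, MAS_WORD = 0b00, 0b01, 0b10

def real_address(addr_bus: int, bs: int) -> int:
    """Use the 30 LSBs of the word-oriented address bus as the 30 MSBs of
    the real byte address, appending the 2-bit BS output as its LSBs."""
    return ((addr_bus & 0x3FFFFFFF) << 2) | (bs & 0b11)

def extract_load_data(word: int, bs: int, mas: int) -> int:
    """Sample the byte or halfword selected by BS out of a little-endian
    32-bit memory word, according to the MAS access size."""
    if mas == MAS_BYTE:
        return (word >> (8 * (bs & 0b11))) & 0xFF
    if mas == MAS_HALFWORD:
        return (word >> (8 * (bs & 0b10))) & 0xFFFF
    return word & 0xFFFFFFFF
```

For example, a word address of 0x10 with BS = 2 corresponds to the real byte address 0x42, and a byte load with BS = 1 samples the second byte of the fetched word.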

The ARM processor allows big-endian or little-endian configuration to be selected for data reads and writes in memory, by using a dedicated input (BIGEND). This signal can also be changed during program execution, and the processor has to manage the data transfer in an appropriate manner cycle by cycle, every time it accesses the memory resource. The LISARM model does not allow the dynamic change of the endianness configuration, because the LISATek tools and the LISA language do not support this feature. The described model supports only the little endian memory organization, as selected in the resource section of the main.lisa file. This represents another problem that a memory wrapper can solve, crossing the single bytes by using a set of multiplexers when the big endian configuration is selected. A final consideration has to be made: the insertion of a memory wrapper and its internal structure influence the access timings of the whole system, hitting a well known weak point, as the memory access is.
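In software terms, the byte crossing performed by such a wrapper reduces to a fixed permutation of the four bytes of the word. A minimal sketch follows (illustrative only, with an invented function name; it is not part of the thesis tools):

```python
def cross_endianness(word: int) -> int:
    """Swap the four bytes of a 32-bit word (little endian <-> big endian),
    mimicking the multiplexer crossing a big endian wrapper would apply."""
    return (((word & 0x000000FF) << 24) |
            ((word & 0x0000FF00) << 8) |
            ((word & 0x00FF0000) >> 8) |
            ((word & 0xFF000000) >> 24))
```

Applying the permutation twice returns the original word, which is why the same multiplexer network can serve both the read and the write direction.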

6.3 ARM commercial toolchains

The expansion of the ARM7TDMI processor in the mobile devices market and its embedding in micro-controllers and complex systems has led to the diffusion of toolsuites and development environments for the creation of software applications. Since ARM Ltd. does not sell microprocessors but processor cores, many manufacturers have produced proprietary chips based on these processors, together with the relative tools for software application development and optimization. Most of these tools accept C/C++ code and use a compiler whose target is the ARM7 instruction set; many commercial cross-compilers¹ are also diffused. Cross-compiling tools are generally used to generate executable code for embedded systems or multiple platforms where it is inconvenient or impossible to compile, e.g. micro-controllers that run with a minimal amount of memory for their own purpose. During the model development, some of these tools were used, like the GNU/Linux ARM Toolchain or WinARM. Both applications implement cross-compiling toolchains and use the GNU gcc C/C++ compiler to generate ARM7² executable binary files. The compilers accept many of the common gcc compiling options, and a complete set of tools like assembler, disassembler and linker is also furnished. In order to explore the ARM processor functionalities and to perform some comparisons between the original ARM and the LISA model behavior, other instruments have been used, like the SimIt-ARM tools. The SimIt-ARM suite contains an instruction-set emulator and a cycle-accurate simulator for the StrongARM architecture³, a 32-bit predecessor of the ARM7TDMI processor. The StrongARM processor is based on the ARMv4 architecture, the predecessor of the ARMv4T architecture, which adds the 16-bit Thumb ISA included in the ARM7TDMI processor.
For these reasons the SimIt-ARM tools are fully compatible with the 32-bit ARM instruction set modelled by LISARM, and both the instruction-set emulator and the cycle-accurate simulator represented a valid alternative to the analysis of the data sheet [16] specification alone.

¹ A cross-compiler is a compiler capable of creating executable code for a platform other than the one on which the cross-compiler is run.
² Other members of the ARM processor family can also be selected as the target architecture.
³ The StrongARM processor was a collaborative project between Digital Equipment Corporation (DEC) and ARM Ltd., to create a faster CPU based on the existing ARM line; the core was later sold to Intel, who continued to manufacture it before replacing it with the XScale.


6.4 ARM model toolchain adaptation

For some aspects introduced in the LISARM model description (chapter 5), the LISA language can not describe complex mechanisms for code assembling, and so the generated assembler does not implement all the original ARM assembler capabilities. The conversion of immediate values into the immed8 r format, used in single data transfer and data processing instructions, and the register list conversion used in block data transfer instructions present some aspects which are not implementable in the LISATek generated assembler. The obtained model accepts the same codings for these two sets of instructions, so that an application compiled with a commercial toolchain is compatible with the LISARM internal representation of immediate values and register lists, but assembly source code for a standard ARM processor can not be assembled by using the LISATek generated assembler. These problems can be surmounted by using a pre-assembler tool which parses the ARM assembly code, retrieves the instructions which use the unsupported arguments and transforms them into a format suitable for the LISARM processor⁴. As described in section 5.3, the ARM7 assembler accepts only particular immediate values in the assembly syntax, i.e. all those numbers representable by a 32-bit wide unsigned integer obtained by zero-extending an 8-bit unsigned value to 32 bits and rotating the result right by twice a 4-bit amount. It is obvious that only some particular integer values can be transformed into the so called immed8 r format; all powers of two, for example. The ARM assembler is able to check if an expressed immediate value can be transformed as described, and it returns an error if the conversion is not feasible.
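The same feasibility check can be reproduced by exhaustively trying the sixteen rotation amounts, which accepts exactly the same set of values as the sliding-mask procedure implemented by the pre-assembler. The sketch below is a hypothetical reimplementation, not the actual tool; it also includes the mnemonic substitution fallback discussed next, following the bitwise-not rule and the opposite-operation pairs described in this section.

```python
def ror32(value: int, amount: int) -> int:
    """32-bit rotate right."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def encode_immed8_r(value: int):
    """Return (imm8, rot4) with value == imm8 ROR (2 * rot4), or None."""
    value &= 0xFFFFFFFF
    for rot4 in range(16):
        imm8 = ror32(value, 32 - 2 * rot4)  # undo the right rotation
        if imm8 < 0x100:
            return imm8, rot4
    return None

# Opposite-operation pairs which allow the instruction substitution.
SUBSTITUTE = {"ADD": "SUB", "SUB": "ADD", "ADC": "SBC", "SBC": "ADC",
              "AND": "BIC", "BIC": "AND", "MOV": "MVN", "MVN": "MOV",
              "CMP": "CMN", "CMN": "CMP"}

def fix_immediate(mnemonic: str, value: int):
    """Encode the immediate, falling back to the bitwise-not form with the
    substituted mnemonic when the direct encoding is not feasible."""
    enc = encode_immed8_r(value)
    if enc is not None:
        return mnemonic, enc
    enc = encode_immed8_r(~value & 0xFFFFFFFF)
    if enc is not None and mnemonic in SUBSTITUTE:
        return SUBSTITUTE[mnemonic], enc
    raise ValueError("immediate not representable")
```

For instance, 0xFF000000 is encoded as the 8-bit value 0xFF with rotation semi-amount 4, while MOV with 0xFFFFFF00 is turned into MVN with the encodable complement 0xFF.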
If the assembly instruction is an arithmetic, move or compare operation, an alternative form can be exploited: since each of these operations has an opposite operation, the 1's complement form of the operand is used and the operation mnemonic contained in the assembly code is changed. The negation or logical inversion ensures the ALU operation equivalence, and the data processing instructions which allow the so called instruction substitution are: ADD and SUB, ADC and SBC, AND and BIC, MOV and MVN, CMP and CMN. A particular decoding operation described within the model is dedicated to the immed8 r conversion: it decodes the 8-bit wide immediate value and the 4-bit wide half of the right rotation amount, and drives the barrel shifter to recreate the original value, which is then supplied to the ALU. The pre-assembling tool takes the immediate value expressed in the instructions cited above and stores it in a 32-bit internal variable. By using an 8-bit mask and executing a number of steps in which the mask is moved over the binary representation of the value, groups of 8 bits are selected, to establish whether the other bits are all zeros or all ones. Only if the

⁴ The assembly syntax of these instructions correlates more closely with the binary coding fields.

other bits are all zeros is the value representable in the immed8 r format, and in that case the instruction substitution is not required. Otherwise, if the other bits are all ones, the instruction substitution is needed and a bitwise not operation must be performed to obtain the 1's complement of the value. The selected 8-bit group is then re-converted into a numeric value, and the amount for the right rotation is saved as well. These values are then opportunely converted into an assembly string and, in case an instruction substitution is required, the appropriate mnemonic is used. The instruction in the assembly file is substituted by the supported format, and this operation is executed for every other instruction contained in the source file before the control returns to the user. If an immediate value can not be transformed as wanted, an error message reporting the code line number is displayed and the program exits. The LISARM disassembler produces a code with the same syntax characteristics as the accepted format, since the same decoding LISA operation is used to generate both the assembler and the disassembler. To obtain an ARM compliant disassembly, the LISARM disassembled file has to be modified by another ad hoc tool, the post-disassembler, which calculates the immediate value as the processor datapath does, starting from the 8-bit and 4-bit values reported in the disassembly syntax and substituting the result in the file. The block data transfer instructions (5.9) accept a list of general purpose registers to be transferred, which can be expressed in various manners: the complete set or a subset of the sixteen general purpose registers can be selected, and all the combinations are allowed. The single registers can be expressed separating their names by commas, and the numerical order has to be respected.
Consecutive registers can be grouped by using the "-" symbol; the notation implies that all the registers included within the external identifiers must be transferred. The pre-assembling tool finds the LDM and STM opcodes expressed in the assembly file and parses the register list in their syntax. For every register index included in the list, the corresponding bit of an internal 16-bit variable is tied high; the variable bits are then inspected to recreate an explicit list, i.e. a list in which the single register identifiers appear in numerical order, separated by commas and without group notation. The block data transfer decoding LISA operation is built to accept the list of registers to be transferred as defined in the same way, so the tool generated syntax substitutes the instructions retrieved in the assembly file. The decoding LISA operation also influences the generated disassembler behavior, which produces the same list of registers by checking which flags are high in the corresponding binary format, so it furnishes the explicit list without grouping symbols. A variant of the described tool can also be used to obtain the converted format for the immediate value to be expressed within the assembly code, in order to execute simulations and tests on the model or to check if the internal format corresponds to the expected value. The same thing can be done with the register list for block data transfer operations.
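The expansion just described can be sketched as follows (an illustrative Python model, not the actual pre-assembler; the 16-bit mask plays the role of the internal variable whose bits are tied high):

```python
import re

def explicit_register_list(list_expr: str) -> str:
    """Expand an LDM/STM register list such as 'R0-R3,R5' into the
    explicit comma-separated form without group notation."""
    mask = 0
    for part in list_expr.replace(" ", "").split(","):
        m = re.fullmatch(r"R(\d+)(?:-R(\d+))?", part, re.IGNORECASE)
        if m is None:
            raise ValueError(f"bad register list item: {part!r}")
        lo = int(m.group(1))
        hi = int(m.group(2)) if m.group(2) else lo
        for index in range(lo, hi + 1):
            mask |= 1 << index          # tie the corresponding bit high
    # Re-inspect the bits in numerical order to build the explicit list.
    return ",".join(f"R{i}" for i in range(16) if mask & (1 << i))
```

The same bit inspection, applied to the sixteen flags of the binary instruction format, is what the generated disassembler performs to print the explicit list.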


The diagram of the complete toolchain is reported in figure 6.2.


Figure 6.2. Complete toolchain diagram

6.5 HDL generation and tests

The LISARM model hardware description has been generated by using the LISATek HDL Generator, selecting VHDL as the target language. Inspecting the generated files it is possible to find a single description file for each unit described in the model, hence instruction decoder, condition checker, pipeline registers and pipeline controller, ALU, multiplier and barrel shifter have been implemented as separate VHDL entities. The execution stage also contains the various units which implement the instruction groups like data processing, memory access and branch, while all the other operations are grouped in the fetch, prefetch, decode and execute entities. A LISA processor description, as discussed in chapter 4, can be used as a universal source for the generation of both software tools (assembler, linker, simulator) and RTL code (using HDL languages). While this universality is a fundamental strength of the LISA model, there is the challenge that the generated software tools and the RTL code represent different abstractions of the processor: a software model is executed sequentially, while an RTL simulator must emulate the parallelism which is inherent in hardware. As a consequence, a LISA description optimal for software tools creation can not be as optimized as needed for hardware description generation. Since HDL generation is only the last development phase, while the processor behavior is analysed step by step using the software tools, the LISA description style must consider target hardware aspects from the very beginning. During the model development the HDL compiler has been used many times to evaluate the correctness of the written LISA code. The first step to obtain a feasible hardware is the selection of the right LISA resources: local and global variables, used for value communication between LISA operations, have been chosen keeping in mind the mapping between LISA and RTL resources. An appropriate selection of TClocked data types has allowed the wanted behavior to be obtained from the processor, particularly in the cycle-accurate modeling phase. Signal scope and initialization represented another important aspect to take into account. As discussed in chapter 4, LISA operation reuse allows noteworthy hardware optimizations to be obtained; here the right signal selection plays a fundamental role for every read and assignment operation, because an RTL signal can be driven only by one unit at a time, while the other components can only read its value. In order to understand which modeling style was the most appropriate, many LISA code compilations and simulations were executed, so the HDL Generator represented another fundamental tool for the model development. The LISATek HDL Generator, besides the VHDL and Verilog descriptions, is able to provide the ModelSim configuration files for the RTL code simulation. The LISATek GUI allows a machine language file to be selected for the architecture simulation and, during the tools generation flow, its memory dump is automatically stored in a file. The LISATek tools generation flow also creates a ModelSim configuration file, which sets up the simulation environment for testing the generated HDL code.
Launching ModelSim with this configuration file, the selected memory dump file is loaded into the processor resources, and a step by step verification of the architecture behavior can be executed. The ModelSim simulator furnishes many valuable instruments for in-depth analyses of the hardware behavior, particularly through the wave viewer support. Since the first steps of the LISARM development, hardware simulations allowed model behavior inconsistencies to be discovered and some parts of the LISA description to be subsequently corrected.


Figure 6.3. A ModelSim simulator screenshot

Chapter 7

Conclusions and possible future applications

This chapter contains some conclusive considerations about the thesis work presented in this volume and sketches out some future applications of the realized LISARM model.

7.1 Conclusions

The LISATek toolsuite, and particularly the LISA description language, has proved to be an optimal instrument for the LISARM architecture development, thanks in the first place to its approach to complexity tackling. The choice of beginning the model development by implementing an instruction-accurate model, instead of starting with its cycle-accurate description, has allowed the efforts to be concentrated on the assembly syntax and the coding scheme imposed by the ARM instruction set. Here the LISA language features have simplified the ARM coding format description, allowing the grouping of sparse bits into blocks to be decoded separately, and the LISA decoding mechanisms were exploited to spread the operation complexity over many sub-steps. The subsequent phase, which concerned the shift from the instruction-accurate to the cycle-accurate model, has allowed the attention to be focused on instruction cycle timing aspects, exploiting the LISA language capabilities for pipeline structure description and its event management and scheduling, in a very simple manner. Some of the LISA language mechanisms for multi-cycle instruction description were not accepted by early LISATek HDL Generator versions, and the real processor behavior was achievable only for simulation purposes, by using a scheduling mechanism that was later improved. Since the latest LISATek toolsuite versions furnish the LISA operation polling capabilities, their rescheduling is allowed and the ARM processor behavior for multi-cycle operations can be respected. A big challenge with LISATek was the adaptation of the description style to the hardware structural aspects. Here a good HDL knowledge is required, in order to obtain an optimized architecture description and to preserve the hardware hierarchy, which allows a subsequent optimization by exploiting the HDL architectures and libraries used by synthesis tools. The LISATek toolsuite is a collection of software applications in uninterrupted growth and, even if some capabilities can be implemented in the LISA language description, they are not yet completely supported by all the tools belonging to the suite. The fundamental concept of obtaining a simulator, an HDL description and a toolchain (compiler, assembler, linker, disassembler) from a single description, and the possibility of co-simulating its hardware and software models, make LISATek an ideal development approach, where architecture refinements or wide modifications allow time and money to be saved, avoiding the separate maintenance of all the model parts and tools.

7.2 Possible future applications

Starting from the LISARM model, some future applications can be derived. Some of the model limits can be overcome by using a more complete and efficient version of the LISATek toolsuite, while maintaining most of the LISA description. Some features supported by the Processor Generator, like intercommunication bus definition and fully dynamic memory interfacing, are not yet supported by the HDL Generator, so they can not be used in the model description. Also the big endian memory organization is not supported by the current versions of the tools, although external hardware solutions can be adopted. The memory wrapper described in chapter 6 resolves another LISATek problem with memory management, but an efficient use of sub-word sized data can not be left to external units. The processor model developed in this work is not intended to be a fully ARM compliant clone, hence some ARM7TDMI architecture aspects were intentionally ignored since the first platform exploration steps. The coprocessor communication capabilities and the relative ARM instructions are not implemented in the model, because the LISA description allows the extension of the present instruction set without requiring external dedicated cores, so that a similar approach appeared quite superfluous. Besides ARM instruction set extensions for specific applications, the removal of unused instructions is also allowed, so that an optimal instruction set architecture can be obtained. The Thumb micro-architecture and the relative instruction set implemented by ARM, which allow 16-bit instructions to be executed on a full featured 32-bit architecture, seem to be a valid strategy to reach an increased code density, so they can be considered as an enhancement of the obtained model. The dynamic translation method, which transforms the Thumb 16-bit instructions into the ARM 32-bit ones, can be exploited also in the LISARM model; in fact the LISA language allows the implementation of complex mechanisms used also for VLIW processor description. Here the problem of sub-word data addressing arises another time but, for the considerations explained above, some solutions could be found in future toolsuite releases. Another deep revision of the obtained model can target the realisation of a Harvard architecture, with the instruction throughput advantages which can derive from it. Adding the appropriate pipeline stage and maintaining most of the LISA code which describes the single instruction behaviors, it is possible to move the memory access capabilities to the added stage, without the burden of re-engineering the whole system and the pipeline controller. Before the model is used in embedded applications, an intensive testing phase has to be executed, so that the LISARM model specification can be validated in depth and all the processor architecture parts can ensure full ARM instruction set compliance. Besides the simulator, the generated HDL description must also be verified; then its synthesis can be performed with commercial tools, to allow time and area parameters to be extracted. A subsequent functional verification phase can be executed also using the LISATek co-simulation tools, so that the real hardware description can be tested with the same patterns used for the simulator refinement, checking the response of both models at the same time.

Appendix A

Model LISA operations summary

This appendix contains a summary of all the LISA operations contained in the model. For each operation, the pipeline stage, the decoding group (see figure 5.2) and the file to which it is assigned are indicated.

Table A.1. LISA operations summary

LISA operation | Stage | Decoding group | File (.lisa)
ADC dc | DC | arith logic grp | data proc instructions
ADD dc | DC | arith logic grp | data proc instructions
add PSR update | EX | none | alu operations
AL dc | DC | condition | conditionfield
AL ex | EX | none | conditionfield
alu adc | EX | none | alu operations
alu add | EX | none | alu operations
alu and | EX | none | alu operations
alu bic | EX | none | alu operations
alu eor | EX | none | alu operations
alu mov | EX | none | alu operations
alu mvn | EX | none | alu operations
alu operation | EX | none | alu operations
alu orr | EX | none | alu operations
alu rsb | EX | none | alu operations
alu rsc | EX | none | alu operations
alu sbc | EX | none | alu operations
alu sub | EX | none | alu operations
AND dc | DC | arith logic grp | data proc instructions
arith logic grp | DC | data proc grp | data proc instructions
arith logic grp setup | EX | none | data proc instructions


ASL dc IN | DC | bs op | barrel shifter
ASR dc IN | DC | bs op | barrel shifter
B dc | DC | branch grp | branch instructions
B ex | EX | none | branch instructions
barrel shifter op | EX | none | barrel shifter
barrel shifter op dc | DC | none | barrel shifter
BIC dc | DC | arith logic grp | data proc instructions
BL dc | DC | branch grp | branch instructions
BL ex poll | EX | none | branch instructions
block data grp | DC | mem access grp | mem access instructions
block mem access ex poll | EX | none | mem access instructions
branch grp | DC | none | branch instructions
branching flag reset | FE | none | branch instructions
BX dc | DC | branch grp | branch instructions
BX ex | EX | none | branch instructions
CC dc | DC | condition | conditionfield
CC ex | EX | none | conditionfield
CMN dc | DC | cmp grp | data proc instructions
CMP dc | DC | cmp grp | data proc instructions
cmp grp | DC | data proc grp | data proc instructions
CPSR | DC | none | other instructions
CPSR all | DC | none | other instructions
CPSR selection | DC | none | other instructions
CS dc | DC | condition | conditionfield
CS ex | EX | none | conditionfield
data proc grp | DC | none | data proc instructions
data sample op | EX | none | mem access instructions
decode | DC | none | main
EOR dc | DC | arith logic grp | data proc instructions
EQ dc | DC | condition | conditionfield
EQ ex | EX | none | conditionfield
fetch | FE | none | main
GE dc | DC | condition | conditionfield
GE ex | EX | none | conditionfield
GT dc | DC | condition | conditionfield
GT ex | EX | none | conditionfield
HI dc | DC | condition | conditionfield
HI ex | EX | none | conditionfield
imm branch offset | DC | none | branch instructions


immed 8r | DC | none | data proc instructions
immediate offset | DC | none | mem access instructions
immediate operand | DC | none | data proc instructions
immediate value 12 | DC | none | misc ops
immediate value 4 | DC | none | misc ops
immediate value 5 | DC | none | misc ops
immediate value 8 | DC | none | misc ops
implicit PSR update req | DC | none | misc ops
LDM dc | DC | block data grp | mem access instructions
LDR dc | DC | std data grp | mem access instructions
LDRH dc | DC | su data grp | mem access instructions
LDRSB dc | DC | su data grp | mem access instructions
LDRSH dc | DC | su data grp | mem access instructions
LE dc | DC | condition | conditionfield
LE ex | EX | none | conditionfield
logop PSR update | EX | none | alu operations
LS dc | DC | condition | conditionfield
LS ex | EX | none | conditionfield
LSL dc IN | DC | bs op | barrel shifter
LSR dc IN | DC | bs op | barrel shifter
LT dc | DC | condition | conditionfield
LT ex | EX | none | conditionfield
main | none | none | main
mem access ex poll | EX | none | mem access instructions
mem access grp | DC | none | mem access instructions
mem access preset dc | DC | none | mem access instructions
MI dc | DC | condition | conditionfield
MI ex | EX | none | conditionfield
MLA ex poll | EX | none | multiply instructions
MLAL ex poll | EX | none | multiply instructions
MOV dc | DC | mov grp | data proc instructions
mov grp | DC | data proc grp | data proc instructions
MRS dc | DC | none | other instructions
MRS ex | EX | none | other instructions
MSR dc | DC | none | other instructions
MSR ex | EX | none | other instructions
MSR flg dc | DC | none | other instructions
MSR flg ex | EX | none | other instructions
MSR immed 8r | EX | none | other instructions


MSR reg op | DC | none | other instructions
MUL dc | DC | multiply grp | multiply instructions
MUL ex poll | EX | none | multiply instructions
MULL dc | DC | multiply grp | multiply instructions
MULL ex poll | EX | none | multiply instructions
multiply grp | DC | none | multiply instructions
MVN dc | DC | mov grp | data proc instructions
NE dc | DC | condition | conditionfield
NE ex | EX | none | conditionfield
non shifted reg | DC | none | data proc instructions
NOP dc | DC | none | misc ops
NOP ex | EX | none | misc ops
ORR dc | DC | arith logic grp | data proc instructions
other grp | DC | none | other instructions
PL dc | DC | condition | conditionfield
PL ex | EX | none | conditionfield
post indexed | DC | none | mem access instructions
post store op | EX | none | mem access instructions
pre indexed | DC | none | mem access instructions
prefetch | PF | none | main
program relative | DC | none | mem access instructions
PSR access grp | DC | none | other instructions
PSR no update req | DC | none | misc ops
PSR update req | DC | none | misc ops
reg amount shifted reg | DC | none | data proc instructions
reg index | DC | none | misc ops
register list conv | DC | none | misc ops
reset | none | none | main
ROR dc IN | DC | bs op | barrel shifter
RRX dc IN | DC | bs op | barrel shifter
RSB dc | DC | arith logic grp | data proc instructions
rsb PSR update | EX | none | alu operations
RSC dc | DC | arith logic grp | data proc instructions
SBC dc | DC | arith logic grp | data proc instructions
shift amount reg access | EX | none | data proc instructions
shifted reg | DC | none | data proc instructions
shifted reg offset | DC | none | mem access instructions
shifted reg operand | DC | none | data proc instructions
SPSR | DC | none | other instructions


SPSR all | DC | none | other instructions
SPSR selection | DC | none | other instructions
std data grp | DC | mem access grp | mem access instructions
STM dc | DC | block data grp | mem access instructions
STR dc | DC | std data grp | mem access instructions
STRH dc | DC | su data grp | mem access instructions
su data grp | DC | mem access grp | mem access instructions
SUB dc | DC | arith logic grp | data proc instructions
sub PSR update | EX | none | alu operations
SWI dc | DC | other grp | other instructions
SWI ex | EX | other grp | other instructions
SWP dc | DC | other grp | other instructions
SWP ex poll | EX | other grp | other instructions
symb branch offset | DC | none | branch instructions
TEQ dc | DC | cmp grp | data proc instructions
TST dc | DC | cmp grp | data proc instructions
unc dc | DC | condition | conditionfield
unc ex | EX | none | conditionfield
VS dc | DC | condition | conditionfield
VS ex | EX | none | conditionfield
write result | EX | none | alu operations
writeback op | EX | none | mem access instructions
zero offset | DC | none | mem access instructions

Bibliography

[1] Von Neumann architecture: http://en.wikipedia.org/wiki/Von_Neumann_architecture, Wikipedia, the free encyclopedia.
[2] D. A. Patterson, D. R. Ditzel, "The case for the Reduced Instruction Set Computer", ACM SIGARCH Computer Architecture News, Vol. 8, Issue 6, October 1980, pp. 25–33.
[3] CPU design: http://en.wikipedia.org/wiki/CPU_design, Wikipedia, the free encyclopedia.
[4] Reduced Instruction Set Computer: http://en.wikipedia.org/wiki/RISC, Wikipedia, the free encyclopedia.
[5] L. Xing, G. Fernandes, P. Kulkarni, S. R. Marupudi, S. P. Melacheruvu, M. Pragada, "RISC versus CISC - Project report for computer architecture", University of Massachusetts - Dartmouth.
[6] ARM architecture: http://en.wikipedia.org/wiki/ARM_architecture, Wikipedia, the free encyclopedia.
[7] I. J. Huang, Y. Liang Hung, C. S. Lai, "Cost-effective microarchitecture optimization for ARM7TDMI microprocessor", National Sun Yat-Sen University - Taiwan.
[8] CoWare website: www.coware.com.
[9] LISA Language Reference Manual - CoWare - Product Version V2005.2.1 - Feb 2006.
[10] Tensilica website: www.tensilica.com.
[11] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, H. Meyr, "A novel methodology for the design of Application-Specific Instruction-Set Processors (ASIPs) using a machine description language", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 20, No. 11, November 2001, pp. 1338–1354.


[12] O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, H. Meyr, "Architecture implementation using the machine description language LISA", Proceedings of the 15th International Conference on VLSI Design (VLSID) 2002, IEEE Computer Society, 2002.
[13] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, H. Meyr, "A methodology for the design of Application-Specific Instruction-Set Processors (ASIP) using the machine description language LISA", IEEE Computer Society, 2001.
[14] LISATek Release Informations - CoWare - Product Version V2005.2.1 - Feb 2006.
[15] LISATek Methodology Guidelines for the Processor Generator - Product Version V2005.2.1 - Feb 2006.
[16] ARM7TDMI Data Sheet, Advanced RISC Machines Ltd. (ARM), August 1995.
[17] ARM Developer Suite - Assembler Guide, ARM Ltd., 2001, version 1.2.
[18] S. B. Furber, ARM System-on-Chip Architecture, Addison Wesley Longman, March 2000, 2nd edition.
[19] P. Knaggs, S. Welsh, ARM: Assembly Language Programming, Bournemouth University School of Design, Engineering and Computing, August 2004.
[20] LISATek Processor Designer Manual - CoWare - Product Version V2005.2.0 - Dec 2005.
[21] LISATek Creation Manual - CoWare - Product Version V2005.2.0 - Dec 2005.
[22] LISATek Processor Debugger Manual - CoWare - Product Version V2005.2.0 - Dec 2005.
