Compiler Construction Lecture 4: Lexical Analysis in the Real World 2020-01-17 Michael Engel
Total Page:16
File Type:pdf, Size:1020Kb
Compiler Construction Lecture 4: Lexical analysis in the real world 2020-01-17 Michael Engel Includes material by Jan Christian Meyer Overview • NFA to DFA conversion • Subset construction algorithm • DFA state minimization: • Hopcroft's algorithm • Myhill-Nerode method • Using a scanner generator • lex syntax and usage • lex examples Compiler Construction 04: Lexical analysis in the real world !2 What have we achieved so far? • We know a method to convert a regular expression: (all | and) into a nondeterministic finite automaton (NFA): l a l a d n using the McNaughton, Thompson and Yamada algorithm Compiler Construction 04: Lexical analysis in the real world !3 Overhead of constructed NFAs Let’s look at another example: a(b|c)* • Construct the simple NFAs for a, b and c a b c s0 s1 s2 s3 s4 s5 • Construct the NFA for b|c b " s2 s3 " s6 s7 " s5 " s4 c Compiler Construction 04: Lexical analysis in the real world !4 Overhead of constructed NFAs • Now construct the NFA for (b|c)* " b " s2 s3 " " " s8 s6 s7 s9 " s5 " s4 c " • Looks pretty complex already? We're not even finished… Compiler Construction 04: Lexical analysis in the real world !5 Overhead of constructed NFAs • Finally, construct the NFA for a(b|c)* " b " s2 s3 " a " " " s0 s1 s8 s6 s7 s9 " s4 s5 " c " • This NFA has many more states than a minimal human-built DFA: b,c a s0 s1 Compiler Construction 04: Lexical analysis in the real world !6 From NFA to DFA • An NFA is not really helpful …since its implementation is not obvious • We know: every DFA is also an NFA (without "-transitions) • Every NFA can also be converted to an equivalent DFA (this can be proven by induction, we just show the construction) • The method to do this is called subset construction: The alphabet stays the same NFA: ( QN, �, �N, n0, FN ) � The set of states QN, transition function �N, start state qN0 DFA: ( QD, �, �D, d0, FD ) and set of accepting states FN are modified Compiler Construction 04: Lexical analysis in the real world !7 Subset construction algorithm q0 ← ε-closure({n0}); Idea of the algorithm: QD ← q0; Find sets of states that are WorkList ← {q0}; equivalent (due to "- transitions) and join these to while (WorkList != ∅) do form states of a DFA remove q from WorkList; for each character c∈�︎ do ε-closure: contains a set of states S and t ← ε-closure( N(q,c)); � any states in the NFA that can �D[q,c] ← t; be reached from one of the if t ∉ QD then states in S along paths that add t to QD and to WorkList; contain only "-transitions end; (these are identical to a state end; in S) Compiler Construction 04: Lexical analysis in the real world !8 Subset construction" example b q0 ← ε-closure({n0}); " n4 n5 " QD ← q0; a " " " WorkList ← {q0}; n0 n1 n2 n3 n8 n9 while (WorkList != ∅) do " n6 n7 " c remove q from WorkList; " for each character c∈�︎ do t ← ε-closure(�N(q,c)); �N a b c ε �D[q,c] ← t; n0 n1 – – – if t ∉ QD then n1 – – – n2 add t to QD and to WorkList; n2 – – – n3,n9 end; n3 – – – n4,n6 q0 ← {n0} end; n4 – n5 – – QD ← {n0}; n5 – – – n8 WorkList ← {n0}; n6 – – n7 – n7 – – – n8 n8 – – – n3,n9 n9 – – – – Compiler Construction 04: Lexical analysis in the real world !9 Subset construction" example b q0 ← ε-closure({n0}); " n4 n5 " QD ← q0; a " " " WorkList ← {q0}; n0 n1 n2 n3 n8 n9 while (WorkList != ∅) do " n6 n7 " c remove q from WorkList; " for each character c∈�︎ do t ← ε-closure(�N(q,c)); �N a b c ε while-loop Iteration 1 �D[q,c] ← t; n0 n1 – – – WorkList ← {{n0}}; if t ∉ QD then n1 – – – n2 ← q n0; add t to QD and to WorkList; n2 – – – n3,n9 c ← 'a': end; n3 – – – n4,n6 t ← ε-closure(�N(q,c)) end; n4 – n5 – – = ε-closure(�N(n0,’a')) n5 – – – n8 = ε-closure(n1) = {n1,n2,n3,n4,n6,n9} n6 – – n7 – D[n0,’a']←{n1,n2,n3,n4,n6,n9}; n7 – – – n8 � QD ←{{n0},{n1,n2,n3,n4,n6,n9}}; n8 – – – n3,n9 WorkList ← n9 – – – – {{n1,n2,n3,n4,n6,n9}}; Compiler Construction 04: Lexical analysis in the real world !10 Subset construction" example b q0 ← ε-closure({n0}); " n4 n5 " QD ← q0; a " " " WorkList ← {q0}; n0 n1 n2 n3 n8 n9 while (WorkList != ∅) do " n6 n7 " c remove q from WorkList; " for each character c∈�︎ do t ← ε-closure(�N(q,c)); �N a b c ε while-loop Iteration 1: �D[q,c] ← t; n0 n1 – – – if t ∉ QD then n1 – – – n2 WorkList ← {n0}; add t to QD and to WorkList; q ← n ; n2 – – – n3,n9 0 end; c ← 'b','c': n3 – – – n4,n6 end; t ← {} n4 – n5 – – no change to QD, Worklist n5 – – – n8 n6 – – n7 – n7 – – – n8 We will skip the iterations n8 – – – n3,n9 of the for loopD from that now do noton change Q n9 – – – – Compiler Construction 04: Lexical analysis in the real world !11 Subset construction" example b q0 ← ε-closure({n0}); " n4 n5 " QD ← q0; a " " " WorkList ← {q0}; n0 n1 n2 n3 n8 n9 while (WorkList != ∅) do " n6 n7 " c remove q from WorkList; " for each character c∈�︎ do t ← ε-closure(�N(q,c)); �N a b c ε while-loop Iteration 2 �D[q,c] ← t; n0 n1 – – – WorkList = {{n1,n2,n3,n4,n6,n9}}; if t ∉ QD then n1 – – – n2 ← q {n1,n2,n3,n4,n6,n9}; add t to QD and to WorkList; n2 – – – n3,n9 c ← 'b': end; n3 – – – n4,n6 t ← ε-closure(�N(q,c)) end; n4 – n5 – – = ε-closure(�N(q,’b’)) n5 – – – n8 = ε-closure(n5) = {n5,n8,n9,n3,n4,n6} n6 – – n7 – D[q,’a']←{n5,n8,n9,n3,n4,n6}; n7 – – – n8 � QD ←{{n0},{n1,n2,n3,n4,n6,n9}, n8 – – – n3,n9 {n5,n8,n9,n3,n4,n6}}; n9 – – – – WorkList ← {{n5,n8,n9,n3,n4,n6}}; Compiler Construction 04: Lexical analysis in the real world !12 Subset construction" example b q0 ← ε-closure({n0}); " n4 n5 " QD ← q0; a " " " WorkList ← {q0}; n0 n1 n2 n3 n8 n9 while (WorkList != ∅) do " n6 n7 " c remove q from WorkList; " for each character c∈�︎ do t ← ε-closure(�N(q,c)); �N a b c ε while-loop Iteration 2 �D[q,c] ← t; n0 n1 – – – WorkList = {{n1,n2,n3,n4,n6,n9}}; if t ∉ QD then n1 – – – n2 ← q {n1,n2,n3,n4,n6,n9}; add t to QD and to WorkList; n2 – – – n3,n9 c ← 'c': end; n3 – – – n4,n6 t ← ε-closure(�N(q,c)) end; n4 – n5 – – = ε-closure(�N(q,’c’)) n5 – – – n8 = ε-closure(n7) = {n7,n8,n9,n3,n4,n6} n6 – – n7 – D[q,’a’]←{n7,n8,n9,n3,n4,n6}; n7 – – – n8 � QD ←{{n0},{n1,n2,n3,n4,n6,n9}, n8 – – – n3,n9 {n5,n8,n9,n3,n4,n6}, n9 – – – – {n7,n8,n9,n3,n4,n6}}; WorkList ← {{n7,n8,n9,n3,n4,n6}}; Compiler Construction 04: Lexical analysis in the real world !13 Subset construction" example b q0 ← ε-closure({n0}); " n4 n5 " QD ← q0; a " " " WorkList ← {q0}; n0 n1 n2 n3 n8 n9 while (WorkList != ∅) do " n6 n7 " c remove q from WorkList; " for each character c∈�︎ do t ← ε-closure(�N(q,c)); �N a b c ε while-loop Iteration 3 �D[q,c] ← t; n0 n1 – – – WorkList = {{n7,n8,n9,n3,n4,n6}}; if t ∉ QD then n1 – – – n2 ← , q {n7 n8,n9,n3,n4,n6}; add t to QD and to WorkList; n2 – – – n3,n9 c ← 'b','c': end; n3 – – – n4,n6 t ← ε-closure(�N(q,c)) end; n4 – n5 – – = ε-closure(�N(q,’c’)) n5 – – – n8 = ε-closure(n5,n7) // we ran around the graph once! n6 – – n7 – n7 – – – n8 n8 – – – n3,n9 No new states are added D in this and the n9 – – – – to Q following iteration! Compiler Construction 04: Lexical analysis in the real world !14 Subset construction" example b b d2 " n4 n5 " b a " " " n0 n1 n2 n3 n8 n9 a d0 d1 c b " n6 n7 " c c d3 c " �N a b c ε n0 n1 – – – Set DFA NFA ε-closure( N(q,*)) n1 – – – n2 name states states � n2 – – – n3,n9 a b c { n1, n2, n3, n3 – – – n4,n6 q0 d0 n0 – – n4, n6, n9 } n4 – n5 – – { n1, n2, n3, { n5, n8, n9, { n7, n8, n9, q1 d1 – n5 – – – n8 n4, n6, n9 } n3, n4, n6 } n3, n4, n6 } { n5, n8, n9, n6 – – n7 – q2 d2 – q2 q3 n3, n4, n6 } n7 – – – n8 { n7, n8, n9, q3 d3 – q2 q3 n8 – – – n3,n9 n3, n4, n6 } n9 – – – – Compiler Construction 04: Lexical analysis in the real world !15 Subset construction: result " Our NFA for a(b|c)*: b " n4 n5 " a " " " n0 n1 n2 n3 n8 n9 " n6 n7 " c subset construction " algorithm b d2 b a still bigger than b,c d0 d1 c b c a s0 s1 d3 c constructed DFA minimal DFA Compiler Construction 04: Lexical analysis in the real world !16 Minimization of DFAs b d2 b,c b a a s0 s1 d0 d1 c b c d3 c • DFAs resulting from subset construction can have a large set of states • This does not increase the time needed to scan a string • It does increase the size of the recognizer in memory • On modern computers, the speed of memory accesses often governs the speed of computation • A smaller recognizer may fit better into the processor’s cache memory Compiler Construction 04: Lexical analysis in the real world !17 Minimization of DFAs b d2 b,c b a a s0 s1 d0 d1 c b c (states renumbered) d3 c • We need a technique to detect when two states are equivalent • i.e.