AN/N LOG N FOR MINIMIZING STATES IN kF I N ITE AUTOMATON

BY JOHN HOPCROFT

STAN-CS-71-190 January, 1971

COMPUTER SCIENCE DEPARTMENT School of Humanities and Sciences STANFORD UN IVERS ITY

AN N LOG N ALGORITHM FOR MINIMIZING STATES IN A FINITE AUTOMATON

John Hopcroft

Abstract

An algorithm is given for minimizing the number of states in a finite automaton or for

determining if two finite automata are equivalent. The asymptotic running time of the

algorithm is bounded by knlogn where k is some constant and n is the number of

states. The constant k depends linearly on the size of the input alphabet.

This research was supported by the National Science Foundation under grant number NSF-GJ-96, and the Office of Naval Research under grant number N-00014-67-A-0112-0057 NR 044-402. Reproduction in whole or in part is permitted for any purpose of the United States Government.

AN n log n ALGORITHM FOR MINIMIZING STATES IN A FINITE AUTOMATON

John Hopcroft

Stanford University

Introduction

Most basic texts on finite automata give for minimizing the number of states in a finite

automaton [l, 21. However, a worst case analysis of these algorithms indicate that they are n2 processes where n is the number of states. For finite automata with large numbers of states, these algorithms are grossly inefficient. Thus in this paper we describe an algorithm for minimizing the states in which the asymptotic running time in a worst case analysis grows as n log n . The constant of proportionality depends linearly on the number of input symbols. Clearly the same algorithm can be used to determine if two finite automata are equivalent.

The essence of previously published algorithms was to first partition the states according to their

outputs. The blocks of the partitions are then repeatedly refined by examining the successor state on a given input for each state in the block. States whose successor states on a given input are in different blocks are placed in separate blocks. When no further refinement is possible, all states in the same block of the partition can be shown to be equivalent. Consider the example in Figure 1. The initial partition is

(1,2,3,4,5)(6) . Since on input 0 , the successor Input states of states 1, 2, 3 and 4 are in the first block State 0 1 Output __ ---- of the partition and the successor of state 5 is in 1 2 1 0 2 the second block, the first iteration refines the 3 2 0 3 4 3 0 partition into the blocks (1,2,3,4)( 5) and (6) . 4. 5 4 0 Successive refinements yield (1,2,3)(4)(5) (6) ; 5 6 5 0 6 6 6 1 (1,2)(3)(4)(5)(6) and (l)(2)(3)(4)(5)(6) - Thus, I! ' Next in this example all pairs of states are inequivalent. State

For this example it is seen that as many as n Figure 1 iterations may be required and the total number of steps needed to execute the algorithm if implemented in a straightforward fashion on a digital computer is n2 .

The algorithm proposed in this paper may also require n iterations but the work per iteration summed over all iterations yields only n log n . We illustrate the algorithm by an example before specifying it in detail. Extensive use of list processing is employed to reduce the computation time. First the state table is inverted to obtain the table shown in Figure 2. The states are partitioned according to their outputs

($293,495) (6) . Next a block and an input symbol on which the partition is refined are selected. Assume the block (6) and input 0 are selected. The states in each block are further partitioned depending on whether on input 0 their next state is in block (6) or not. Thus the next partition is (1,2,3,4)(5)(6) .

Note that had we partitioned on the block (1,2,3,4,5) and input 0 we would have obtained the same result.

1 More generally, once we have partitioned on a block and an input symbol, we need never partition on that block and input symbol again until the block is

split and then we need only partition on one of 4 3 4 0 the two subblocks. Since the time needed to 5 4 5 0 6 .6 partition on a block is proportional to the 5,6 1 transitions into the block and since we can always previous state select the half with fewer transitions, the total Figure 2 number of steps in the algorithm is bounded by

n log n .

Formal description of the algorithm

Let A = (S,I,G,F) be a finite automaton where S is a finite set of states, I is a finite set of

inputs, 6 is a mapping from S x I into S and F c S is the set of final states. No initial state is

specified since it is of no importance in what follows. The mapping 6 is extended to SxI* in the usual * manner where I denotes the set of all finite length strings of symbols from I . States s and t are * said to be equivalent if for each x in I , 6(s,x) is in F if and only if 6(t,x) is in F .

The algorithm for finding the equivalence classes of S is described below.

1.Step trseS and ae1 construct

6-'(s,a) = {tlE(t,a) = s] .

Step 2. Construct B(1) = F , B(2) = S-F and for each a in I and l_< i 5 2 construct

a(i) = {s]ssB(i) and E-'(s,a) # @] .

3Step . Set k=3

Step 4. VasI construct

Cl3 if la(l) \ 5 142) I L(a) = . c23 otherwise

5Step . Select a in I and i in L(a) . The algorithm terminates when L(a) = $ for each a in I .

Step 6. Delete i from L(a) .

Step 7. Vj < k st 3% in B(j) with s(t,a) es(i) perform steps 7a, 7b,a7c, and 7d.

Step 7a. Partition B(j) into B'(j) = {tls(t,a) e&(i)] and B"(j) = B(j)-B'(j) .

7b.Step Replace B(j) by B'(j) and construct B(k) = B"(j) . Construct corresponding a(j)

and a(k) for each a in I . 7cStep . TagI modify L(a) as follows.

L(a) U Cjl if j{L(a) and 0 < la(j)1 _< la(k)\ L(a) = L(a) U ikj otherwise

Step 7d. Set k = k+l .

8Step . Return to Step 5.

Correctness of algorithm

The claim is made that on termination of the algorithm two states are equivalent if and only if they are

in the same block. The algorithm must terminate since the only times that an index is added to L(a) for

some a in I are in Step 4 which is executed only once and in Step 7c. An index is added at Step 7c only

after a refinement of a block of the partition. Each time Step 6 is executed, an index is removed from L(a)

for some a . Thus the algorithm must terminate.

It is easily shown by induction on the number of times Step 7a is executed that if s is in B(i) and

t is in B(j) , i # j , then s is not equivalent to t . Clearly, it is true the first time Step 7a is

executed since only two blocks exist, B(1) containing only final states and B(2) containing only nonfinal states. Blocks are refined at Step 7a only when successor states on a given input have previously been

shown to be inequivalent.

To see that two inequivalent states cannot be in the same block when the algorithm terminates, assume

that states s and t are in B(i) and that s and t are not equivalent. Without loss of generality,

assume 8( s, a) is in B(j) and s(t,a) is in B(k) where j f k . (If s(s,a) and s(t,a) are in the

same block then there exists a shortest x such that 6(s,x) and 6(t,x) are in distinct blocks. Clearly

an x exists and hence a shortest x since for same x one or the other of 6(s,x) and 6(t,x) but not

both is in a final state and each block consists solely of final or solely of nonfinal states. Let a be the

last symbol of x and write x = ya . Then 6(s,y) and F(t,y) are in the same block, ~(s,Y) and 6(t,y) are not equivalent and 6(6(s,y),a) and 6(6(t,y),a) are in different blocks. Replace s by 6(s,y) and

replace t by 6(t,y) .) Consider the point at which the block containing 6(s,a) and &(t,a) was

partitioned so that o(s,a) and s(t,a) first appeared in separate subblocks. At that point one of the

two subblocks was placed in L(a) . When this subblock is removed from L(a) , the block containing s and

t is partitioned with s and t going into separate subblocks. Thus s and t cannot both be in B(i) ,

a contradiction.

Analysis of the running time

The running time of the algorithm is clearly dependent on the implementation. The algorithm has been

programmed in AIGOL. Since the implementation consists of approximately 300 ALGOL statements we shall simply

indicate how the various steps were implemented and discuss informally their running time.

- 3 Tlie sets sui:h as 6 -'(s,a) , L(a) , etc. were represented by linked lists in such a way that an item could be added o-- deleted at the beginning of the l'st

B(i) and a(i) could be determined in a fixed number of steps.

Steps 1, I', and 4 are executed only once and eal:ire time proportional to the product of the number of states ti!?rs 'lf: number of input symbols. f,teps :, thr3';-‘tI 9 f orm a simple loop. The time necessary to traverse t?e l,;op for given a in I and i in I,(a) is proportional to the number of state transitions on input a 'erminating on states in B(i) . (To see tl.i,s, note that Steps 5, 6 and 8 are finite.

Step 7 WI do not need to examine B(j) for eacil J < 'I I-o see if there exists a t in B(j) with 8(t,a) in a(i) . !:ather we look at each state in a(i) and i,hen consult the inverse state table to find each t such t!,a: s(t,a) is in a(i) . Each time a new t ic found, the block containing t is located and t placed or- : list of states to be split off from the :jl~.:. The block is then placed on a list of blocks which na-,e becr~ refined if it is not already on the list. Finally we go down the list of blocks which have been refined and actually partition them. The number of blocks we must look at must be less than the number of state t-.a-,s'tians on input a terminatin,? on states in B(i) . The time necessary to actually partition a block is piorzrtional to the number of states to be split off. When the number is summed over all blocks which are part;t.ioned it adds up to the number of state transitions on input a terminating on states in

B(i) .) I; be the constant of proportionality.

Consider :,i,e time spent in the loop of Step j thro@~ 8 for a given input symbol a . Assume that at step > the blocks of the partition are B(l),B(Z),...,B(m) and that L(a) = {il,i2, . . ., ir] . Let

{ir+-l,ip32,...,im] = {1,2,...,m] -L(a) . The clati, is made that the total time spent on traversals of the loop for which input symbol a is selected in Step 5 until the program terminates is bounded by

m T=k(? ai log ai. + C ai. loe(ai ./T)! - j=l j J j=r+l J 3

Clearly the bound is valid if the algorithm has terminated. If the loop is traversed for an input symbol other than a , then the time spent is not included in T . However, since blocks get split and the set

L(a) modified we must show that the new value of T , call it !?' , is less than or equal to the old value of r . If a block whose index is in L(a) is partitioned, then a term of the form b log b is replaced by the expression c log c + (b-c) log(b-c) which decreases the value of T . If a block whose index is not in L(a) is partitioned, then a term of the form b log b/2 is replaced by the expression

-c log c + (b -c) log(b -c)/2 where c

c log c + {b -c) log@ -c)/2 < c log b/2 + (b -c) log b/2

_< blogb/? .

ither case $ is less than T . Finally, ass-Lme r # 0 and a and some B in L(a) has been

4 selected at Step 5. As we showed earlier, the time around the loop is bounded by kal . Thus by induction, the total time is bounded by m k[ap + a1 log(aJ?) + i a. log a. + ai W3(ai /2) 1 . lj c j=l lj j=r+l j j jb I

We must show that this eqression is less than or equal to T . That is, we need show

&P + aa log a!/2 _< a1 log ae .

Clearly al +a1 log a!/2 = ap(log a!/2 + log 2) = ap log aI . This completes the proof of the claim.

The first time Step 5 is executed the formula for T is bounded by knlog n . Multiply by the number of input symbols and adding in the time for Steps 1 through 4 yields a total bound proportional to n log n .

Experimental results and conclusions

In order to obtain timing information, the algorithm was applied to two classes of finite automata.

Automata in the first class are given by A(n) = ( fL2, . . ..n].{O,l],s,{l]) where 5(1,0) = 6(1,1) = 1 and

6(i,O) = i-l and 6(i,l) = i for 2 _< i _< n . Automata in the second class are given (for even n) by

B(n) = ([1,2,..., n3, CO,ll,~, Cill _< i,_< n/23) where 5(i,O) = 6(i,l) = n/2 + 2i-1 and 6(n/4+ i,O) = 6(n/4+ i,l) = 2i-1 for l_< i _< n/4 and S(n/2+i,O) = 6(n/2+i,l) = 2i-1 for n/2

n A(n) B(n)

100

1000

2003

Table 1. time in seconds

Note that A(n) is the example which required n2 steps for previous algorithms.

Our algorithm is particularly suited for A(n) and a detailed analysis shows that the running time

should grow linearly-with the number of states as the experimental evidence indicates. The worst case for our algorithm is typified by B(n) in which blocks are always partitioned equally. The running time for

B(n) should grow as n log n for both the current algorithm and for previously published algorithms. The results seem to indicate that the algorithm is practical for minimizing states in finite automata (or testing

equivalence of finite automata) of up to several thousand states.

References

1. Harrison, M. A., Introduction to Switching and , McGraw-Hill, New York, 1965.

2. McCluskey, E. J., Introduction to the Theory of Switching Circuits, McGraw-Hill, New York, 1965.

5

l- .-. r, -, i-/ : I

,. . - <:L‘! -- L:::t? -? 3,164 -- C^(C 3- :,:jr, -- oce7 -- *; *: f $3 - - 1; ,; t ‘1 - - $,;iC +a 1:71 -- 01: 7 2 - - ‘::73 -- :c74 -4 :c75 -- QC7h 4- :Ci7 -- :c73 -- cc79 -- !; r) I? fj - - ZCEl -- ,?CE2 -- ‘:Cf3 -4 ,CE4 -- >I(-,?5 -- LZEf? -- cce7 -- CC&?8 -- i;,;Eq -- :rco -- GL‘)n .c 1-- cc92 -- cc53 4- Ii794 -- cc<5 -- CO96 -4 CCC7 -- Cc58 -- LCCO -- )1,0 -- 01r1 4- ClCZ -- 0133 -- OlC4 -- ClC5 -- c!lCh -- OlC7 -- CLC8 -4 ClC3 -- OLLC’ -- 2111 -- CA12 -- 3113 4- Cl14 -- c115 5- @116 -- ?111 -- “11? e- Tll’> -- Cli j -- -2 1 L 1 - * CLLI -- -@123 h- :‘LL't -- 712s -- rllih -6 Cli7 -- )lZY -- :1is -5 J 1 3 ,I - - n131 :- c132 -- 1: 11-.J ’ -- Cl?4 t- 2135 -- ?L?F, -- :137 -6 313,; -- Cl?‘2 6- Cl443 -- 1:1t1 -- OlL.2 -5 Cl45 -- Gi44 -- Cl45 -5 0146 -4 Cl47 -- Cl48 4- 0143 -- c1c31 -- T'lC1 -/t Cl57 -- QlC3 4- E C C I h Cl54 -- @CC{ 3L’:CKLISTZEH(Jr3); Cl53 -- I:r,k!CCCKL ISTZE4C(2):=2; ClEh -4 ChS; (7157 -- !F (p~-ivc”~+:-( L)>=‘?l AhC 4 PfJF(Ch;_(l )<=‘,tJPr?‘4F(L) 1 TI-“_N

Cl75 -- J:=S~Tb(~LCCKtIST7~a~l~ 2170 -- ~F”CV~FI:lIY(i:LCCKLI~TZE~T:l; CL77 -- ChQLCCKLT~TZERC(Jl:=@: 8178 -- r,t: Tf-1 !?I\, ‘C~T~;LI~;TC~~I=~ TkEh 31Sl 3- P "C IN ala? -- ~:=C~TJ(?L~CKLISTC~F); 0183 -- RF~Cv~FR~V(eLCCKLISTCNF); Cl@4 -- ChCLCCKCI3TCNC(J):=3; 31&5 -- GC TODIVCNE; Clfb -3 EhC ClE7 -- ELSF CC TC FIhISkEC; Cl88 -- CIVZEFC: STATE:=ZCRObFXT(JI; c1e4 -- kHILE STATF-r=O DC 0190 3- f!EGIh 0151 -- CELL :=PREvChZERO(STATE-fob; Cl92 -- WkICE CELL.-=0 CO Cl53 4- YEGIh 0194 -- LASTcTATE :=CATA(CELL): 0195 -- K:=dLCCK(LASTSTATE) ; OlSb -- Arr>( $PLIT(KI ,LASTSTAT!l ; ClS7 -- if “LCCKSPL IT (K )=O THFY ACi)( 5PL ITf?Lr:CKS,K 1 ; OlSd -- hUMlhSPLIT(K1 :=hUCIhS”LIT(K)+l; 0159 -- HLCCKSPLIT(K) :=l; c2c3 -- CELL :=NEXT(CELLb; OZCl -4 ENP; 02c2 -- STATf :=tEFC\EXf(STbTE); 0223 -3 ErJC ; r)2C4 -- r,r TCFETUPh; 02C5 -- CILCAE: .iiTATE:=CIjEhFXT( J) ; G2Cb -- hl-ILE STATE-=0 DO zic7 3- SEGIh C2CH -- CELL :=PPEVCNGNE( STATE-N); 02C9 -- kt-ILE CFLL1=9 CC 3210 4- HEC IN 0211 -- LASTSTATE :=CATA(CELLJ; c/212 -- K:=HLCCK(tASTSTATE) ; c213 -- ACZ(SPLIT(totCASTSTATC); 0214 -- IFELCCKSPLIT(K)=O THEN ACII(SPLITBLCCKS,K~; 0215 -- hUYINSPLIT(K):=hU~INS~LIT(K)+~; C216 -- PLCCKSPLIT(Kl:=l; -3217 -- CELL :=NEXT(CELL) ; C2lR -4 EhC; 3214 -- STATE :=CNFhEXT( STATE) : c220 -3 EF\C; c221 -- CC~CEhT~*n*s4~~~4*~~d*****++**~~~**~~~~*~~~~~~*~~~~~**~~~~~$~~*~*~* 0222 -- * REcfhE BLCCKS 022.4 -- *~$1S$a9~**4****~$~~~~*~~~**~*~~~*~~~~~~*~~~~*~~~~~*~~~~~~**~~~ 0224 -- RfTURh : IF(Sf’LITt?LCCKS=Cl Tt-EN GU TCREPFAT; c225 -- RSPLIT:-CAT4(SPLITBLCCKS); 0226 -- PE?CVFF~Cb’(SPLITPLCCKSI; 0227 -- ~LCCK’PLIT(~SPLIT):=C; 022t3 -- IFhUNI~~SLOCK(B~PLIJ)=NU~I~~SPLIT@SPLT) TtIEh CZiQ 3- HECIh 023ll -- hL~IN’FLIT(BSPLIT):=0; C23L -- ht-ILF ISFL!T(SSPLIT)-.=O) C13 REFfOVF:~K~lY~SPCIT~R~PL!T~I; 0232 -- r,C TOF:rTLRy\‘;

c “233 -3 (32?L -- C’ 2 3 5 - - i’? 30 -- CL37 -- I;238 3- c23q -- c24u -- 2241 -- .)2LL -- :!24.$ -- c744 -- 2245 -- (:246 -- 3247 4- cc?43 -- c 2 4 '3 - - c25: -- ?LCl -- Cic2d -- 17c._ L , j -- CL54 -4 I?/';5 -- c2rf: 4- X%57 -- czj;j -- '325'; -- fl26Ci -- SZCl -- 2262 -- Tzt3 -4 :2t4 -3 OLt5 -- c2t5 -- 0267 -- 1269 -- ~oLt.9 -- 32 7 ': 3 - CLjl -- 2ILiZ -- t-273 -3 0274 -- CL75 ?- 3276 -- 2277 -- CL79 -? !)27c; -- d/F) -- 32&l 3- 0/"2 -- )2?+ -- ZiG/l -3 c2e5 -- CL56 3- ':')Fl -- C?[>j -- ,!Lc;'J -3 ‘,!';,- -- ST~:~FI:RC @LCoL w (VEQS IIXh z lP!OL6S 1 00 >ECFMRCR 1976 ci 12: 17 PAGF 6 c L b‘1 -- 32F2 -- 0293 -- ci94 -- 0245 -- c?i=t !- C2S7 -- CL58 -- c299 -- n3cc -- C3CL -- ll3c2 -- @3C3 4- 0304 -- 03c5 -- 03G6 -4 c3c7 -3 c3ce -2 c3c9 -1

I?,SZ.?3 SECCNCS IN CCMPILATICh, lCR32 3YTES CF CCDEGEhERATEC