Context-sensitive languages (CSL)

Phrase-structure grammars

A phrase-structure grammar is a 4-tuple G = (V, Σ, R, S), where
• V is the alphabet, which contains nonterminals and terminals,
• Σ (the set of terminals) is a subset of V,
• R (the set of rules) is a finite subset of V*(V − Σ)V* × V*, with each rule written as α → β,
• S (the start symbol) is an element of V − Σ.

A rule α → β is also called a production.

Phrase-structure grammars

Let G = (V, , R, S), be a If    is a production, we say    Stands for the reflexive transitive closure An example

S  ABC AB  0AD AB  1AE DC  B0C EC  B1C D0  0D D1  1D E0  0E E1  1E AB   C   0B  B0 1B  B1 L(G) = ? Context-sensitive grammars (CSG) Let G = (V, , R, S), be a phrase-structure grammar If all productions    of G satisfy the condition that  is at least as long as  then G is called a context-sensitive grammar L(G) is called a context-sensitive language Context-sensitive languages (CSL) • The class of languages generated by phrase-structure grammars are called recursively enumerable languages • The class of CSLs is a proper subset of recursively enumerable languages • Almost any language one can think of is context-sensitive • Proofs of languages are not CSLs are based on diagonalization An example

S  ACaB | aa | a Ca  aaC CB  DaaB aD  Da AD  ACa AC  Aa aB  Ba AB  aa

L(G) = ?

Normal form

Each production is of the form αAβ → αγβ, γ ≠ ε.
Variable A is replaced by γ in the context α _ β.
Written as: A → γ / α _ β

Theorem: Every CSL is generated by a grammar in which all productions are of the form αAβ → αγβ, where A is a variable, α, β, γ are strings of grammar symbols, and γ is not empty.

Normal form

Given a CSG, G = (V, , R, S),

Step1: construct a new grammar G1 = (V1, , R1, S), as follows: Let V1 = V + Add to R1 all productions    s.t.   (V-) if    is in R, let ` be the string replacing all terminal strings a of  by a non terminal a`. Add to V1 a` Add to R1 productions `   and a`  a Normal form

Step 2: Let w(G) = max {|β| | α → β ∈ G}. Construct a new grammar G2 = (V2, Σ, R2, S) from G1 such that w(G2) ≤ 2.

Given a rule A1 … Am → B1 … Bn, m ≤ n:
• If n ≤ 2, OK.
• If 2 ≤ m < n, create two rules:
  A1 … Am → B1 … Bm-1X
  X → Bm … Bn
• If m = 1 and n ≥ 3, create n-1 rules:
  A1 → B1X1
  X1 → B2X2
  …
  Xn-2 → Bn-1Bn
• If m = n and n ≥ 3, create n-1 rules:
  A1A2 → B1X1
  X1A3 → B2X2
  …
  Xn-2An → Bn-1Bn

Here X, X1, …, Xn-2 are new nonterminals. The cases are applied repeatedly until every rule has a right side of length at most 2.

Normal form

Step 3: Create G3 = (V3, Σ, R3, S) as follows:
• If A → β, OK.
• If AB → CD and A = C or B = D, OK.
• If AB → CD and A ≠ C and B ≠ D, replace it with the following four rules, where A1 and B1 are new nonterminals:
  AB → A1B
  A1B → A1B1
  A1B1 → CB1
  CB1 → CD

G3 is in normal form
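
Step 3 can be sketched in a few lines. The code below is only an illustration under stated assumptions: rules are (left side, right side) pairs of tuples of symbol names, the function name `to_one_sided` is hypothetical, and the fresh nonterminals are named by appending a counter to the old symbol, which assumes no such name is already in use in the grammar.

```python
from itertools import count


def to_one_sided(rules):
    """Replace each rule AB -> CD with A != C and B != D by four rules
    that each rewrite a single symbol, as in Step 3.

    Rules are (lhs, rhs) pairs of tuples of symbol names. The names of the
    new nonterminals are an assumption of this sketch.
    """
    fresh = count(1)
    out = []
    for lhs, rhs in rules:
        two_sided = (
            len(lhs) == 2 and len(rhs) == 2
            and lhs[0] != rhs[0] and lhs[1] != rhs[1]
        )
        if not two_sided:
            out.append((lhs, rhs))          # already of the required shape
            continue
        (a, b), (c, d) = lhs, rhs
        k = next(fresh)
        a1, b1 = f"{a}_{k}", f"{b}_{k}'"    # new symbols, used by this rule only
        out += [
            ((a, b), (a1, b)),              # AB    -> A1 B
            ((a1, b), (a1, b1)),            # A1 B  -> A1 B1
            ((a1, b1), (c, b1)),            # A1 B1 -> C B1
            ((c, b1), (c, d)),              # C B1  -> C D
        ]
    return out


if __name__ == "__main__":
    for lhs, rhs in to_one_sided([(("A", "B"), ("C", "D"))]):
        print(" ".join(lhs), "->", " ".join(rhs))
```

Each of the four new rules changes exactly one symbol and leaves its neighbour as context. Using symbols that belong to one rule only keeps the intermediate forms from being completed by the wrong rule.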

M = (Q, , , , q0, B, F) where Q: finite set of states : finite set of input symbols : finite set of tape symbols,    : the transition function : Q   → Q    {L, R} q0: the start state B: the blank symbol, special symbol in  and not in  F: set of final states, a subset of Q Instantaneous Description (ID) or Configuration

X1…Xi-1 q Xi…Xn, where q is the current state, the tape head is scanning Xi, and X1…Xn is the tape content between the leftmost and rightmost non-blank symbols. A move is a relation between IDs.

Let X1…Xi-1 q Xi…Xn be an ID.
If δ(q, Xi) = (p, Y, L), then X1…Xi-1 q Xi…Xn |— X1…Xi-2 p Xi-1 Y Xi+1…Xn.
If δ(q, Xi) = (p, Y, R), then X1…Xi-1 q Xi…Xn |— X1…Xi-1 Y p Xi+1…Xn.

The language accepted by Turing machine M, denoted by L(M), is
{w | w in Σ* and q0w |—* αpβ for some p in F and α and β in Γ*}

L(M) is called a recursively enumerable language
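
The definitions above can be made concrete with a small simulator. This is a sketch for a deterministic machine only. The function name `run_tm`, the `max_steps` cutoff and the dictionary encoding of δ are assumptions of the sketch, and the tape is stored as a dictionary, so it is unbounded in both directions. The sample machine is a standard textbook machine for {0^n 1^n | n ≥ 1} and does not come from these slides.

```python
def run_tm(delta, start, finals, w, blank="B", max_steps=10_000):
    """Simulate a deterministic TM on input w.

    delta maps (state, symbol) to (new state, written symbol, "L" or "R").
    Returns True if a final state is entered and False if the machine halts
    (no move defined) in a non-final state. Raises if the step limit is hit,
    because the machine might then run forever.
    """
    tape = dict(enumerate(w))           # position -> symbol; missing cells are blank
    state, head = start, 0
    for _ in range(max_steps):
        if state in finals:
            return True
        key = (state, tape.get(head, blank))
        if key not in delta:
            return False                # no move from this ID: the machine halts
        state, symbol, move = delta[key]
        tape[head] = symbol
        head += 1 if move == "R" else -1
    raise RuntimeError("step limit reached; the machine may not halt on this input")


# Sample machine: mark a 0 as X, find the matching 1 and mark it as Y, repeat.
DELTA = {
    ("q0", "0"): ("q1", "X", "R"),
    ("q1", "0"): ("q1", "0", "R"),
    ("q1", "Y"): ("q1", "Y", "R"),
    ("q1", "1"): ("q2", "Y", "L"),
    ("q2", "0"): ("q2", "0", "L"),
    ("q2", "Y"): ("q2", "Y", "L"),
    ("q2", "X"): ("q0", "X", "R"),
    ("q0", "Y"): ("q3", "Y", "R"),
    ("q3", "Y"): ("q3", "Y", "R"),
    ("q3", "B"): ("q4", "B", "R"),
}

if __name__ == "__main__":
    for word in ["01", "0011", "001", "10", ""]:
        print(repr(word), run_tm(DELTA, "q0", {"q4"}, word))
```

The step limit is a practical guard, not part of the definition: a string belongs to L(M) if some final state is ever reached, and for a string outside L(M) a Turing machine is allowed to run forever.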

A Turing machine halts if no move is defined for the current state and the symbol being scanned.

Theorem: If L is L(G) for an unrestricted grammar G = (V, T, P, S), then L is an r.e. language

Proof: construct a two-tape nondeterministic Turing machine M to recognize L.
Tape 1: input; the input string w is placed on tape 1
Tape 2: holds a sentential form α of G; initialize α to S
M repeatedly does the following:
1) Nondeterministically select a position i in α
2) Nondeterministically select a production β → γ of G
3) If β appears beginning in position i of α, replace β by γ, shifting as needed
4) Compare the result with the content of tape 1; if they match, accept; if not, go to step 1
Every sentential form of G can appear on tape 2 for some sequence of choices. Also, only sentential forms can appear on tape 2.

So L(M) = L.

Theorem: If L is an r.e. language, then L = L(G) for some unrestricted grammar G

Proof: let M = (Q, Σ, Γ, δ, q0, B, F). Construct G = (V, Σ, P, A1) as follows:

V = ((Σ ∪ {ε}) × Γ) ∪ {A1, A2, A3}, with the following productions:

1) A1 → q0A2

2) A2 → [a,a]A2 for each a in Σ

3) A2 → A3

4) A3 → [ε,B]A3

5) A3 → ε

6) q[a,X] → [a,Y]p for each a in Σ ∪ {ε}, q in Q, and X and Y in Γ such that δ(q,X) = (p,Y,R)

7) [b,Z]q[a,X] → p[b,Z][a,Y] for each X, Y, Z in Γ, q in Q, and a, b in Σ ∪ {ε} such that δ(q,X) = (p,Y,L)

8) [a,X]q → qaq, q[a,X] → qaq, and q → ε for each a in Σ ∪ {ε}, X in Γ, and q in F

Recursive sets are those languages accepted by a TM that halts on all inputs. The recursive sets are a proper subset of the class of recursively enumerable sets.

Linear Bounded Automata

A linear bounded automaton (LBA) is a nondeterministic TM satisfying the following conditions:
1. Its input alphabet includes two special symbols ¢ and $, the left and right endmarkers, respectively.
2. It has no move left from ¢ or right from $, nor may it write another symbol over ¢ or $.

CSLs and LBAs are related as follows:
1. If L is a CSL, then L is accepted by some LBA.
2. If L = L(M) for some LBA M, then L − {ε} is a CSL.
3. Every CSL is recursive.
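
The last statement can be illustrated directly on a grammar. Since no production of a CSG shortens the string, a derivation of w never passes through a sentential form longer than w, so a search over forms of length at most |w| always terminates. The sketch below assumes rules given as (left side, right side) string pairs. The name `is_member` is hypothetical, and the sample grammar `G` is the second example grammar of these slides, read with each rule as left side → right side.

```python
def is_member(rules, start_symbol, w):
    """Decide whether w is derivable from start_symbol.

    Assumes len(lhs) <= len(rhs) for every rule, so only sentential forms
    of length <= len(w) need to be explored. The search space is finite,
    hence the procedure halts on every input.
    """
    if any(len(lhs) > len(rhs) for lhs, rhs in rules):
        raise ValueError("a rule shortens the string; the search bound is not valid")
    seen = {start_symbol}
    stack = [start_symbol]
    while stack:
        form = stack.pop()
        if form == w:
            return True
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:
                nxt = form[:pos] + rhs + form[pos + len(lhs):]
                if len(nxt) <= len(w) and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
                pos = form.find(lhs, pos + 1)
    return False


# Sample grammar (upper-case letters are nonterminals).
G = [
    ("S", "ACaB"), ("S", "aa"), ("S", "a"),
    ("Ca", "aaC"), ("CB", "DaaB"), ("aD", "Da"), ("AD", "ACa"),
    ("AC", "Aa"), ("aB", "Ba"), ("AB", "aa"),
]

if __name__ == "__main__":
    for word in ["a", "aa", "aaa", "aaaa"]:
        print(word, is_member(G, "S", word))
```

For an unrestricted grammar the length bound is not available, which is why the corresponding search only gives a procedure that accepts members but may run forever on non-members.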

Hierarchy Theorem

The regular sets are properly contained in the CFLs, the CFLs not containing ε are properly contained in the CSLs, and the CSLs are properly contained in the r.e. sets.

Languages, Grammars and Machines

Language                 Grammar                      Machine (recognizer)
Regular                  Right-linear, left-linear    NFA, DFA
Context-free             Context-free                 PDA
Context-sensitive        Context-sensitive            LBA
Recursively enumerable   Unrestricted                 Turing machine