Formal Models
Language
- "T" is some alphabet for some set of languages (e.g., the character
set of a machine)
- T* is the set of all possible sequences ("strings", "words" or "tokens") that
can be formed from the alphabet
- a "language" is some subset of T*
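- a grammar is a 4-tuple:
G = (T, N, P, S) T = terminals N = non-terminals P = production rules S = start symbol
- the language L(G) is the set of strings over T that can be derived from S using the rules in P
In other words, a language is defined by a grammar.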
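As a rough illustration of how a grammar defines a language, the sketch below enumerates the short strings derivable from a start symbol. The grammar itself (S -> aSb | ab) is a made-up example, not one from these notes, and the names T, N, P, S and generate are chosen only for this sketch.

```python
from collections import deque

# Hypothetical example grammar for { a^n b^n : n >= 1 } (an assumption for illustration).
T = {"a", "b"}                 # terminals
N = {"S"}                      # non-terminals
P = {"S": ["aSb", "ab"]}       # production rules: S -> aSb | ab
S = "S"                        # start symbol

def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from S."""
    results = set()
    seen = {S}
    queue = deque([S])
    while queue:
        form = queue.popleft()
        # Locate the leftmost non-terminal in the current sentential form.
        idx = next((i for i, ch in enumerate(form) if ch in N), None)
        if idx is None:
            results.add(form)  # only terminals left: a string of the language
            continue
        for rhs in P[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # No rule here shrinks a form, so over-long forms can be pruned.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=len)

print(generate(6))
```

The pruning step relies on every right-hand side in this particular grammar being non-empty; a grammar with empty productions would need a different bound.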
Classes of Grammars
In the late 1950s Noam Chomsky (a noted linguist) described four classes of
grammars (listed here in order of increasing power).
- Type 3 - Regular languages (lex)
- Type 2 - Context free - Backus-Naur Form (yacc)
- restricted to rules of the form: A -> cBdE
(i.e., a single non-terminal must appear alone on the LHS of any
production rule, and the RHS can be any
string of terminals and non-terminals).
- The non-terminal can be replaced by its RHS regardless of the
context it appears in.
- Type 1 - Context sensitive
- restricted to rules of the form: a -> b where |a| <= |b| (i.e., a and b
can be any strings of terminals and non-terminals, but b must be the
same length as or longer than a)
- Type 0 - Unrestricted, recursively enumerable, phrase structure
- rules have the form: a -> b where a and b can be any
strings of terminals and non-terminals (a must be non-empty)
- Type 0 and type 1 are important in theoretical computer
science, but have little impact on programming language design.
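The rule forms above can be checked mechanically. The following is a minimal sketch, assuming upper-case letters stand for non-terminals and lower-case letters for terminals. The notes do not spell out the Type 3 rule form, so the sketch assumes the usual right-linear shape (terminals followed by at most one non-terminal); rule_type and grammar_type are names invented for this example.

```python
def rule_type(lhs, rhs):
    """Return the most restrictive type (3, 2, 1 or 0) whose rule form lhs -> rhs fits."""
    is_nt = str.isupper  # assumption: upper-case letters are non-terminals
    if len(lhs) == 1 and is_nt(lhs):
        # A single non-terminal on the LHS: at least context free (Type 2).
        # Assumed Type 3 form: terminals followed by at most one non-terminal.
        body = rhs[:-1] if rhs and is_nt(rhs[-1]) else rhs
        if rhs and not any(is_nt(c) for c in body):
            return 3
        return 2
    if len(lhs) <= len(rhs):
        return 1  # context sensitive: the RHS is at least as long as the LHS
    return 0      # unrestricted

def grammar_type(rules):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(lhs, rhs) for lhs, rhs in rules)

print(rule_type("A", "cBdE"))   # the context-free example rule from the notes
print(grammar_type([("S", "aSb"), ("S", "ab")]))
```

This only classifies the shape of the rules. A language may still belong to a more restricted class than the grammar used to write it down.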
Formal Machine Models
Finite State Automaton (FSA)
An FSA is a graph with directed, labeled arcs, two types of nodes
(final and non-final states), and a unique start state:
What strings start in state A and end up at state C?
FSA's can have more than one final state:
Non-deterministic FSA vs. Deterministic FSA
- Deterministic FSA (DFA) - For each state and for
each member of the alphabet, there is exactly one transition
- Non-deterministic FSA (NFA) - Remove the above restriction.
At each node there may be zero, one, or more than one transition for each alphabet
symbol.
- Important early result: NFAs and DFAs are equivalent (they accept exactly the same class of languages)
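Running a DFA is a simple table lookup per input symbol. The sketch below uses a small made-up machine (states A and B, accepting binary strings that end in 1). It is not the machine from the figures in these notes, and DELTA, START, FINAL and dfa_accepts are names chosen for the example.

```python
# Hypothetical DFA over the alphabet {0, 1}: accepts strings that end in 1.
DELTA = {
    ("A", "0"): "A", ("A", "1"): "B",
    ("B", "0"): "A", ("B", "1"): "B",
}
START = "A"
FINAL = {"B"}

def dfa_accepts(word):
    """Follow the single transition for each symbol; accept if we stop in a final state."""
    state = START
    for ch in word:
        # Deterministic: exactly one entry per (state, symbol) pair.
        state = DELTA[(state, ch)]
    return state in FINAL

print(dfa_accepts("0101"))
print(dfa_accepts("10"))
```

Because the table has exactly one entry for every (state, symbol) pair, the loop never has to choose between alternatives or backtrack.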
Outline of a Proof
Let subsets of NFA states be the states of the DFA. Keep track
of which subset of NFA states you could be in after each input symbol.
Any string from {A} to either {D} or {CD} represents a
path from A to D in the original NFA.
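The proof outline corresponds to the subset construction. Below is a minimal sketch on a made-up NFA (not the one from the missing figure) that accepts binary strings whose second-to-last symbol is 1. It assumes the NFA has no empty (epsilon) transitions, and all names are invented for the example.

```python
from collections import deque

# Hypothetical NFA: (state, symbol) -> set of possible next states.
NFA = {
    ("A", "0"): {"A"}, ("A", "1"): {"A", "B"},
    ("B", "0"): {"C"}, ("B", "1"): {"C"},
}
NFA_START = "A"
NFA_FINAL = {"C"}
ALPHABET = "01"

def subset_construction():
    """Build a DFA whose states are the reachable subsets of NFA states."""
    start = frozenset({NFA_START})
    delta, seen, queue = {}, {start}, deque([start])
    while queue:
        subset = queue.popleft()
        for sym in ALPHABET:
            # Union of everywhere the NFA could go from any state in the subset.
            target = frozenset(q for s in subset for q in NFA.get((s, sym), ()))
            delta[(subset, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A subset is final if it contains at least one final NFA state.
    finals = {s for s in seen if s & NFA_FINAL}
    return start, delta, finals

def accepts(word):
    start, delta, finals = subset_construction()
    state = start
    for ch in word:
        state = delta[(state, ch)]
    return state in finals

print(accepts("10"), accepts("01"))
```

Only subsets reachable from the start subset are built, so the resulting DFA is often much smaller than the worst case of 2^n states.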
Regular Expressions
You can write any regular language as a regular expression:
0*11*1 + 0*11*(0 + 100*1)1*
The operators used in forming regular expressions are:
- Concatenation (adjacency)
- Or (+, sometimes written as |)
- Kleene closure (* -- 0 or more instances)
- also parentheses for grouping
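The example expression can be tried out with Python's re module. In Python's regex syntax + means "one or more", so the "or" operator has to be written as |. The sketch assumes the whole string must match, which is why it uses re.fullmatch, and the name in_language is made up for this example.

```python
import re

# 0*11*1 + 0*11*(0 + 100*1)1*  with the notes' "+" rewritten as "|".
PATTERN = re.compile(r"0*11*1|0*11*(0|100*1)1*")

def in_language(word):
    """True if the entire word matches the regular expression."""
    return PATTERN.fullmatch(word) is not None

for w in ["011", "10", "11001", "0", "1"]:
    print(w, in_language(w))
```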
Regular expressions, regular grammars and FSA's
Theorem: Regular expressions, regular grammars and FSA's are all
equivalent: they can be used to define the same set of languages.
The proof is "constructive". That is, given either a grammar G
or an FSA M, you can construct the other.
To go from an FSA to a regular grammar, make the following
transformations:
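The list of transformations did not survive in these notes. One standard construction, sketched here as an assumption rather than as the notes' own method, uses each state as a non-terminal, adds a rule X -> aY for every arc from X to Y labeled a, and adds X -> (empty string) for every final state X. The function name and the example machine are made up.

```python
def fsa_to_grammar(transitions, start, finals):
    """Turn an FSA into right-linear productions.

    transitions maps (state, symbol) to a list of next states.
    Returns the start symbol and a dict: non-terminal -> list of right-hand sides.
    """
    productions = {}
    for (state, sym), targets in transitions.items():
        for target in targets:
            # An arc state --sym--> target becomes the rule state -> sym target.
            productions.setdefault(state, []).append(sym + target)
    for f in finals:
        # A final state may stop deriving: f -> empty string.
        productions.setdefault(f, []).append("")
    return start, productions

# Hypothetical FSA accepting binary strings that end in 1.
transitions = {
    ("A", "0"): ["A"], ("A", "1"): ["B"],
    ("B", "0"): ["A"], ("B", "1"): ["B"],
}
start_symbol, rules = fsa_to_grammar(transitions, "A", {"B"})
print(start_symbol, rules)
```

For this machine the result is A -> 0A | 1B and B -> 0A | 1B | (empty), which generates exactly the strings that drive the FSA from A to the final state B.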
Why do we care about regular languages?
Programs are composed of tokens:
- Identifier
- Number
- Keyword
- Special symbols
Each of these can be defined by regular grammars:
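To make this concrete, here is a minimal tokenizer sketch in which each token class is a regular expression. The particular patterns, the keyword list and the set of special symbols are assumptions made for the example, not definitions from these notes.

```python
import re

# Hypothetical token definitions, one regular expression per class.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SPECIAL", r"[-+*/=();]"),
    ("SKIP", r"\s+"),
]
KEYWORDS = {"if", "else", "while"}  # keywords look like identifiers, so check afterwards
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Split text into (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("while x1 = 42;"))
```

Tools such as lex generate this kind of scanner automatically from the regular expressions.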
Examples
- Example 1: An even number of 0's and an
even number of 1's
- Example 2: a(bb)*bc
- Example 3: Binary Odd Numbers
- Example 4: 00(1|0)*11
- Example 5: Even Number of b's
- Example 6: At Most Two Consecutive b's
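The examples above can be checked with a few lines of code. This is a sketch under stated assumptions: Examples 3, 5 and 6 are given only in words, so the regular expressions for them are one possible reading (binary strings ending in 1 for Example 3, the alphabet {a, b} for Examples 5 and 6). Example 1 is handled by a four-state DFA whose state is the pair of parities. All names are invented for this sketch.

```python
import re

# Regular expressions for Examples 2-6 (3, 5 and 6 are assumed readings).
EXAMPLES = {
    2: r"a(bb)*bc",
    3: r"(0|1)*1",               # binary odd numbers: last bit is 1
    4: r"00(1|0)*11",
    5: r"a*(ba*ba*)*",           # even number of b's over {a, b}
    6: r"(a|ba|bba)*(b|bb)?",    # never three b's in a row, over {a, b}
}

def matches(example, word):
    """True if the whole word matches the expression for the given example."""
    return re.fullmatch(EXAMPLES[example], word) is not None

def even_zeros_and_ones(word):
    """Example 1 as a DFA: the state is (parity of 0's, parity of 1's)."""
    zeros = ones = 0
    for ch in word:
        if ch == "0":
            zeros ^= 1
        elif ch == "1":
            ones ^= 1
        else:
            raise ValueError(f"symbol {ch!r} is not in the alphabet")
    return (zeros, ones) == (0, 0)  # the only final state is (even, even)

print(even_zeros_and_ones("0110"), matches(2, "abbbc"), matches(6, "abbb"))
```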