Formal Models

Language

"T" is some alphabet for some set of languages {e.g., character set for machine}
T* is the set of all possible sequences ("strings", "words" or "tokens") that can be formed from the alphabet.
A "language" is some subset of T*.
A grammar is a 4-tuple:
G = (T, N, P, S), where T = terminals, N = non-terminals, P = production rules, S = start symbol.
In other words, a language is defined by a grammar: it is the set of strings of terminals that can be derived from S using the rules in P.

In the 1950s Noam Chomsky (a noted linguist) described four classes of languages (listed here in order of increasing power).

Type 3 - Regular languages (lex)
- BNF rules that are restricted to the following form: A -> t N | t
  where: N = nonterminal, t = terminal
  Examples:
  Binary numbers:
  B -> 0 B | 1 B | 0 | 1
  Identifiers:
  I -> a L | b L | c L | ... | z L | a | ... | y | z
  L -> 1 L | 2 L | ... | 9 L | 0 L | 1 | ... | 9 | 0
     | a L | b L | c L | ... | z L | a | ... | y | z
Type 2 - Context free - Backus-Naur Form (yacc)
- restricted to rules of the form: A -> cBdE (i.e., a single non-terminal must appear alone on the LHS of every production rule, and the RHS can be any string of terminals and non-terminals).
- The non-terminal can be replaced by its RHS regardless of the context it appears in.
Type 1 - Context sensitive
Type 0 - Unrestricted, recursively enumerable, phrase structure
- rules have the form: a -> b, where a and b can be any strings of terminals and non-terminals (a must not be empty)
Type 0 and type 1 are important in theoretical computer science, but have little impact on programming language design.
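
To make the Type 3 rule form A -> t N | t concrete, here is a minimal Python sketch that checks whether a string can be derived from the binary-number grammar given above. The dictionary layout and the names GRAMMAR and derives are assumptions made for this illustration, not part of the original notes.

```python
# Right-linear (Type 3) grammar for binary numbers: B -> 0 B | 1 B | 0 | 1
# Each rule is a pair (terminal, next_nonterminal); None means the rule is just "t".
GRAMMAR = {
    "B": [("0", "B"), ("1", "B"), ("0", None), ("1", None)],
}

def derives(grammar, start, text):
    """Return True if `text` can be derived from the non-terminal `start`."""
    done = object()          # marker: the derivation ended with a rule of the form A -> t
    current = {start}        # non-terminals that may still be waiting to be expanded
    for ch in text:
        following = set()
        for nt in current:
            if nt is done:
                continue     # a finished derivation cannot consume more input
            for terminal, follow in grammar[nt]:
                if terminal == ch:
                    following.add(done if follow is None else follow)
        current = following
    return done in current

print(derives(GRAMMAR, "B", "0101"))  # a binary number
print(derives(GRAMMAR, "B", "012"))   # contains a symbol outside the grammar
```

Because every rule consumes exactly one terminal, the check only needs to track a set of pending non-terminals, which is the same bookkeeping a finite state machine does.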

A finite state automaton (FSA) is a graph with directed, labeled arcs, two types of nodes (final and non-final states), and a unique start state.

What strings start in state A and end up at state C?

FSAs can have more than one final state:

Deterministic FSA (DFA) - For each state and for each member of the alphabet, there is exactly one transition.
Non-deterministic FSA (NFA) - Remove the above restriction. At each node there can be zero, one, or more than one transition for each alphabet symbol.
Important early result: NFA = DFA, i.e., for every NFA there is a DFA that accepts exactly the same language.
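
Running a DFA is a table lookup per input symbol. The sketch below assumes a hypothetical two-state machine over {0, 1} that accepts strings ending in 1; it is not the machine from the original figures, and the names DELTA and run_dfa are invented for the example.

```python
# Hypothetical DFA: state "A" is the start state, "B" is the only final state.
# Exactly one transition per (state, symbol) pair, as the DFA definition requires.
DELTA = {
    ("A", "0"): "A",
    ("A", "1"): "B",
    ("B", "0"): "A",
    ("B", "1"): "B",
}

def run_dfa(delta, start, finals, text):
    """Follow one transition per symbol and report whether we stop in a final state."""
    state = start
    for ch in text:
        state = delta[(state, ch)]
    return state in finals

print(run_dfa(DELTA, "A", {"B"}, "0011"))
print(run_dfa(DELTA, "A", {"B"}, "10"))
```

An NFA cannot be run this way directly, because a (state, symbol) pair may lead to several states or to none.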

Outline of a Proof

Let subsets of the NFA's states be the states of the DFA. After each input symbol, keep track of which subset of NFA states you could be in.

Any string from {A} to either {D} or {CD} represents a path from A to D in the original NFA.
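
The subset idea can be sketched in a few lines of Python. The NFA used here is a hypothetical three-state machine (strings over {0, 1} ending in 01), not the machine from the original figure, and the function names are assumptions made for the illustration.

```python
from collections import deque

# Hypothetical NFA: start state "A", final state "C".
# A missing (state, symbol) entry means "no transition".
NFA = {
    ("A", "0"): {"A", "B"},
    ("A", "1"): {"A"},
    ("B", "1"): {"C"},
}

def subset_construction(nfa_delta, start, finals, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa_delta, dfa_finals = {}, set()
    queue, seen = deque([start_set]), {start_set}
    while queue:
        subset = queue.popleft()
        if subset & finals:              # final if it contains any final NFA state
            dfa_finals.add(subset)
        for sym in alphabet:
            target = frozenset(
                nxt for state in subset for nxt in nfa_delta.get((state, sym), ())
            )
            dfa_delta[(subset, sym)] = target
            if target not in seen:       # explore each reachable subset once
                seen.add(target)
                queue.append(target)
    return dfa_delta, start_set, dfa_finals

def dfa_accepts(dfa_delta, start_set, dfa_finals, text):
    state = start_set
    for ch in text:
        state = dfa_delta[(state, ch)]
    return state in dfa_finals

dfa_delta, start_set, dfa_finals = subset_construction(NFA, "A", {"C"}, "01")
print(sorted(sorted(s) for s in {k[0] for k in dfa_delta}))  # reachable DFA states
print(dfa_accepts(dfa_delta, start_set, dfa_finals, "1101"))
```

Only the subsets reachable from the start set are generated, so the resulting DFA is often much smaller than the worst case of 2^n states.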

You can write any regular language as a regular expression:

0*11*1 + 0*11*(0 + 100*1)1*
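
The expression above can be tried out with Python's re module. In Python's syntax alternation is written | rather than +, so the translation below is an assumption about how the notation maps over; the helper name matches is invented for the example.

```python
import re

# The notes' "+" (alternation) becomes "|" in Python's regex syntax;
# "*" and juxtaposition (concatenation) carry over unchanged.
PATTERN = re.compile(r"0*11*1|0*11*(0|100*1)1*")

def matches(text):
    """True if the whole string belongs to the language of PATTERN."""
    return PATTERN.fullmatch(text) is not None

for sample in ["11", "0011", "10", "1101", "1", "100"]:
    print(sample, matches(sample))
```

fullmatch is used instead of search so that the entire string, not just some substring, must be described by the expression.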

The operators used in forming regular expressions are:

Theorem: Regular expressions, regular grammars and FSAs are all equivalent; they can be used to define the same set of languages.

The proof is "constructive." That is, given either a grammar G or an FSA M, you can construct the other.

To go from an FSA to a regular grammar, make the following transformations:
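
One standard way to do this, shown here as a sketch and not necessarily the exact transformation list from the original notes: treat each state as a non-terminal, emit A -> t B for every arc from A to B labeled t, and also emit A -> t when B is a final state. The DFA below is a hypothetical example (strings over {0, 1} ending in 1), and fsa_to_grammar is a name invented for the illustration.

```python
# Hypothetical DFA: start state "A", final state "B".
DELTA = {
    ("A", "0"): "A",
    ("A", "1"): "B",
    ("B", "0"): "A",
    ("B", "1"): "B",
}

def fsa_to_grammar(delta, finals):
    """Return {non_terminal: [alternatives]} in the form A -> t N | t."""
    rules = {}
    for (state, sym), target in sorted(delta.items()):
        alts = rules.setdefault(state, [])
        alts.append(f"{sym} {target}")   # arc A --t--> B gives A -> t B
        if target in finals:
            alts.append(sym)             # B is final, so the derivation may stop: A -> t
    return rules

RULES = fsa_to_grammar(DELTA, {"B"})
for nt, alts in RULES.items():
    print(f"{nt} -> {' | '.join(alts)}")
```

Rules of this form always produce at least one terminal, so if the start state is itself final the empty string needs separate treatment.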

Programs are composed of tokens:

Each of these can be defined by regular grammars:
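
As an illustration of tokens defined by regular patterns, here is a small tokenizer sketch. The token classes (NUMBER, IDENT, OP) and their patterns are assumptions chosen for the example, with IDENT following the identifier grammar given earlier (a lowercase letter followed by lowercase letters or digits); they are not the token list from the original notes.

```python
import re

# Hypothetical token classes, each described by a regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[a-z][a-z0-9]*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),          # whitespace separates tokens and is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Split `text` into (token_class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("x1 = y + 42"))
```

This is roughly the job a tool such as lex automates: it compiles the token patterns into a single finite state machine instead of trying them through a general regex engine.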