CSCI 431: Formal Languages Continued---Context-Free Languages

Background (from the last class)

The syntax of a programming language describes the structure of programs without any consideration of their meaning.
Grammars are rewriting rules and may be used for both recognition and generation of programs.
Context-free grammars are used to describe the bulk of the language's structure; regular expressions are used to describe the lexical units (tokens).

Programming Languages

Programming languages require two levels of description. The lowest level is that of the token. The tokens of a programming language are the keywords, identifiers, constants, and other symbols appearing in the language. In the program
```
void main()
{
  printf("Hello World\n");
}
```
the tokens are
```
void, main, (, ), {, printf, (, "Hello World\n", ), ;, }
```
The alphabet for the language of the lexical tokens is the character set, while the alphabet for a programming language is the set of lexical tokens.
The ordering of symbols within a token is described by regular expressions. The ordering of symbols within a program is described by context-free grammars.
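The split into tokens can be sketched with a few regular expressions. This is only an illustration for the small program above, not a full C lexer; the token class names (STRING, IDENT, PUNCT, SKIP) and the function name tokenize are assumptions made for the example.

```python
import re

# Hypothetical token classes, just enough for the program shown above.
TOKEN_SPEC = [
    ("STRING", r'"(?:\\.|[^"\\])*"'),        # string constant with escapes
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),    # keywords and identifiers
    ("PUNCT", r"[(){};]"),                   # single-character symbols
    ("SKIP", r"\s+"),                        # white space separates tokens
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return the list of lexemes in source, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(match.group())
        pos = match.end()
    return tokens

program = 'void main()\n{\n  printf("Hello World\\n");\n}\n'
print(tokenize(program))
```

The alphabet of each regular expression is the character set; the list that tokenize returns is a string over the token alphabet, which is what a context-free grammar then describes.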

Context-free grammars

Context-free grammars describe how lexical units (tokens) are grouped into meaningful statements. The alphabet (the set of lexical units) consists of the keywords, identifiers, constants, punctuation symbols, and various operators.
While context-free grammars are sufficient to describe most programming language constructs, they cannot specify context-sensitive aspects of a language, such as the requirements that a name be declared before it is referenced, that the order and number of actual parameters in a procedure call match the order and number of formal parameters in the procedure declaration, and that types be compatible on both sides of an assignment operator.

An example

Below is a context-free grammar for a fragment of English.

The grammatical categories are: S, NP, VP, D, N, V.
The words are: a, the, cat, mouse, ball, boy, girl, ran, bounced, caught.
The grammar rules are:

S  --> NP VP
NP --> N
NP --> D N
VP --> V
VP --> V NP
V  --> ran | bounced | caught
D  --> a | the
N  --> cat | mouse | ball | boy | girl

Using the grammar above, the sentence "the cat caught the mouse" can be generated as follows:

S ==> NP VP
  ==> D N VP 
  ==> the N VP 
  ==> the cat VP 
  ==> the cat V NP 
  ==> the cat caught NP 
  ==> the cat caught D N
  ==> the cat caught the N
  ==> the cat caught the mouse

This derivation is performed in a leftmost manner. That is, in each step the leftmost variable in the sentential form is replaced.
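The leftmost derivation can be replayed mechanically. The sketch below stores the grammar as a dictionary and, at each step, replaces the leftmost variable with a chosen right-hand side. The names GRAMMAR and leftmost_step are made up for this illustration.

```python
# The English fragment grammar; each variable maps to its right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "V": [["ran"], ["bounced"], ["caught"]],
    "D": [["a"], ["the"]],
    "N": [["cat"], ["mouse"], ["ball"], ["boy"], ["girl"]],
}

def leftmost_step(form, rhs):
    """Replace the leftmost variable in the sentential form with rhs."""
    for i, symbol in enumerate(form):
        if symbol in GRAMMAR:
            if rhs not in GRAMMAR[symbol]:
                raise ValueError(f"{symbol} --> {' '.join(rhs)} is not a rule")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no variable left to replace")

# The rule choices made in the derivation above, in order.
choices = [["NP", "VP"], ["D", "N"], ["the"], ["cat"], ["V", "NP"],
           ["caught"], ["D", "N"], ["the"], ["mouse"]]

form = ["S"]
steps = [form]
for rhs in choices:
    form = leftmost_step(form, rhs)
    steps.append(form)

for step in steps:
    print(" ".join(step))
```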

Sometimes a derivation is more readable if it is displayed in the form of a derivation tree.

                S
              /    \
          NP          VP
         /  \        /   \
        D    N     V       NP
        |    |     |      /  \
       the  cat  caught  D    N
                         |    |
                        the mouse

Another example: An expression grammar

   V = { E } 
   T = { c, id, +, *, (, ) }
   P = {E --> c,
        E --> id,
        E --> (E),
        E --> E + E,
        E --> E * E }  
   S = E

The derivation tree for the expression id + (id * id) under this grammar is what we would expect:

                              E
                             /|\
                            E + E
                           /   /|\ 
                         id   ( E ) 
                               /|\ 
                              E * E 
                             /     \
                           id       id

Parsing

An algorithm which recognizes a language (or a program written in a programming language) is called a parser. A parser either implicitly or explicitly builds a derivation tree for the sentences in a language.
There are two approaches to parsing. The parser can begin with the start symbol of the grammar and attempt to generate the same sentence that it is attempting to recognize or it can try to match the input to the right-hand side of the productions building a derivation tree in reverse. The first approach is called top-down parsing and the second, bottom-up parsing.

A top-down parse

PARSE TREE                 UNRECOGNIZED INPUT

     S                     the cat caught the mouse
    /\
   NP VP                   the cat caught the mouse
  / \  \
 D   N  \                  the cat caught the mouse
 |   |   \
the  |    \                cat caught the mouse
     |     \
    cat     \              caught the mouse
            /\
           V  NP           caught the mouse
           |   \
        caught  \          the mouse
                /\
               D  N        the mouse
               |  |
              the |        mouse
                  |
                mouse

Each line in the figure represents a single step in the parse. Each nonterminal is replaced by the right-hand side defining it. Each time a terminal matches the input, the corresponding token is removed from the input.
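A top-down parse of this kind can be sketched as a recursive procedure that starts from S and tries each rule in turn, backtracking when a choice fails to match the input. This is one possible sketch, not the only way to organize a top-down parser; the function names parse, expand, and recognize are assumptions for the example. It works here because the grammar has no left-recursive rules.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "V": [["ran"], ["bounced"], ["caught"]],
    "D": [["a"], ["the"]],
    "N": [["cat"], ["mouse"], ["ball"], ["boy"], ["girl"]],
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way symbol derives a prefix of tokens[pos:]."""
    if symbol not in GRAMMAR:
        # A terminal must match the next input token, which is then consumed.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, pos, [])

def expand(symbol, rhs, tokens, pos, children):
    """Match the symbols of rhs left to right, collecting subtrees."""
    if not rhs:
        yield (symbol, children), pos
        return
    for child, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, next_pos, children + [child])

def recognize(sentence):
    """Return a derivation tree for the sentence, or None if it is rejected."""
    tokens = sentence.split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None

print(recognize("the cat caught the mouse"))
print(recognize("cat the"))
```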

A bottom-up parse

PARSE TREE              UNRECOGNIZED INPUT

                        the cat caught the mouse
  
the                     cat caught the mouse
 |
 D                      cat caught the mouse
 |
 |  cat                 caught the mouse
 |   |
 |   N                  caught the mouse
 \  /
  NP                    caught the mouse
   |
   | caught             the mouse
   |   |
   |   V                the mouse
   |   | 
   |   | the            mouse
   |   |  |
   |   |  D             mouse
   |   |  |
   |   |  |  mouse
   |   |  |   |
   |   |  |   N
   |   |  \  /
   |   |   NP
   |   \  /
   |    VP
   \   /
     S
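The bottom-up parse can be sketched as a shift-reduce search: tokens are shifted onto a stack, and whenever the top of the stack matches the right-hand side of a rule it may be reduced to the left-hand side. The figure shows only the successful sequence of moves; the sketch below finds it by trying reductions first and backtracking, which is a simplifying assumption rather than how practical bottom-up parsers choose their moves. The names RULES and shift_reduce are made up for the example.

```python
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("N",)), ("NP", ("D", "N")),
    ("VP", ("V",)), ("VP", ("V", "NP")),
    ("V", ("ran",)), ("V", ("bounced",)), ("V", ("caught",)),
    ("D", ("a",)), ("D", ("the",)),
    ("N", ("cat",)), ("N", ("mouse",)), ("N", ("ball",)), ("N", ("boy",)), ("N", ("girl",)),
]

def shift_reduce(tokens):
    """Return the list of moves that reduces tokens to S, or None if there is none."""
    def search(stack, pos, steps):
        if stack == ("S",) and pos == len(tokens):
            return steps
        # Try every reduction whose right-hand side sits on top of the stack.
        for head, rhs in RULES:
            if stack[-len(rhs):] == rhs:
                move = f"reduce {head} --> {' '.join(rhs)}"
                result = search(stack[:-len(rhs)] + (head,), pos, steps + [move])
                if result is not None:
                    return result
        # Otherwise shift the next input token onto the stack.
        if pos < len(tokens):
            return search(stack + (tokens[pos],), pos + 1, steps + [f"shift {tokens[pos]}"])
        return None
    return search((), 0, [])

moves = shift_reduce("the cat caught the mouse".split())
for move in moves:
    print(move)
```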

A top-down table-driven parser

A parser could be constructed using a driver routine (or control), a stack, and the grammar (usually stored in tabular form). The driver routine follows this algorithm:
1. Initialize the stack with the start symbol of the grammar.
2. Repeat until no further actions are possible:
  1. If the top of the stack and the next input symbol are the same, pop the top of the stack and consume the input symbol.
  2. If the top of the stack is a nonterminal symbol, pop the stack and push the right-hand side of a grammar rule for that nonterminal onto the stack.
3. If both the stack and the input are empty, accept the input; otherwise, reject the input.

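This approach is illustrated using the grammar for expressions and parsing the expression id+id*id. In the trace, the top of the stack is at the left, and ] marks the bottom of the stack and the end of the input.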

STACK        INPUT          RULE/ACTION

   E]        id+id*id]     pop & push using E --> E+E
 E+E]        id+id*id]     pop & push using E --> id
id+E]        id+id*id]     pop & consume
  +E]          +id*id]     pop & consume
   E]           id*id]     pop & push using E --> E*E
 E*E]           id*id]     pop & push using E --> id
id*E]           id*id]     pop & consume
  *E]             *id]     pop & consume
   E]              id]     pop & push using E --> id
  id]              id]     pop & consume
    ]                ]     accept
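The trace above always picks the rule that leads to success. A program has no such oracle, so the sketch below explores every rule choice and keeps the configurations (stack, input position) still to be tried. Because the expression grammar is left-recursive, the search needs a bound: every symbol of this grammar derives at least one token, so a stack longer than the remaining input can be discarded. The names RULES and accepts, and the pruning rule, are assumptions of this sketch and not part of the driver algorithm as stated.

```python
RULES = {"E": [["c"], ["id"], ["(", "E", ")"], ["E", "+", "E"], ["E", "*", "E"]]}

def accepts(tokens, start="E"):
    """Stack-driven top-down recognition; the stack top is at index 0."""
    pending = [((start,), 0)]          # configurations still to explore
    seen = set()
    while pending:
        stack, pos = pending.pop()
        if (stack, pos) in seen:
            continue
        seen.add((stack, pos))
        if not stack:
            if pos == len(tokens):     # stack and input both empty: accept
                return True
            continue
        # Prune: each stacked symbol needs at least one remaining token.
        if len(stack) > len(tokens) - pos:
            continue
        top, rest = stack[0], stack[1:]
        if top in RULES:
            # Nonterminal on top: pop it and push each possible right-hand side.
            for rhs in RULES[top]:
                pending.append((tuple(rhs) + rest, pos))
        elif pos < len(tokens) and tokens[pos] == top:
            # Terminal on top matches the input: pop and consume.
            pending.append((rest, pos + 1))
    return False

print(accepts(["id", "+", "id", "*", "id"]))
print(accepts(["id", "+"]))
```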

Pushdown automata (PDA)

Although a top-down table-driven parser is only one of the several parsing algorithms for context-free languages, all parsers must hold some information on a stack.
In the case of the top-down parser, it must:
1. pop variables off the stack and push the corresponding right-hand side on the stack, and
2. pop terminals off the stack when they match the input.
This observation leads us to the notion of an idealized machine called a pushdown automaton.
A pushdown automaton has the following components:
- an input that it scans from left to right,
- a stack, and
- a finite control to control the operations of reading the input and pushing data on and popping data off the stack.
The states in the control are similar to those in finite state automata:
- an initial state
- a set of final or accepting states
The transitions are a bit more complicated than those in FSA.

(q_i, a, A) --> (q_j, r)

This is read as: when in state q_i, reading (or consuming) a from the input and popping A from the stack, go to state q_j and push the string r onto the stack.
A string is accepted by a PDA when:
   (1) the entire string has been read;
   (2) the control is in some final/accepting state;
   (3) the stack is empty.

An Example

Example: the PDA for {aⁿbⁿ | n>0}

States: {q0, q1}  Initial state: q0; Final states: {q1}
Input Alphabet: {a, b}
Stack Alphabet: {A}
(Note: Stack symbols and input symbols may be from two different alphabets, although the stack symbols are usually chosen to record something about the input read so far.)
Transitions:
   (q0, a, e) --> (q0, A)
   (q0, b, A) --> (q1, e)
   (q1, b, A) --> (q1, e)
How are ab, aabb accepted and ba, aaab, abb rejected?
What role does the stack symbol A play in this automaton?
Answer: It is used as a counter to keep track of the number of a's in the initial part of the input string.
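The behavior of this automaton can be checked with a small simulator. The sketch below represents each transition (q_i, a, A) --> (q_j, r) as a pair of tuples, with the empty string standing for e, and applies the three acceptance conditions. It assumes q1 is the only accepting state, which is what n>0 requires (if q0 were also accepting, the empty string would be accepted). The name pda_accepts and the string representation of the stack are choices made for this illustration.

```python
# (state, input symbol, popped symbol) --> (new state, pushed string); "" stands for e.
TRANSITIONS = [
    (("q0", "a", ""), ("q0", "A")),
    (("q0", "b", "A"), ("q1", "")),
    (("q1", "b", "A"), ("q1", "")),
]

def pda_accepts(word, transitions=TRANSITIONS, start="q0", finals=("q1",)):
    """Explore all reachable configurations; the stack top is the first character."""
    pending = [(start, 0, "")]
    seen = set()
    while pending:
        config = pending.pop()
        if config in seen:
            continue
        seen.add(config)
        state, pos, stack = config
        # Accept: input fully read, accepting state, empty stack.
        if pos == len(word) and state in finals and not stack:
            return True
        for (q, a, top), (q2, push) in transitions:
            if q != state:
                continue
            if a and word[pos:pos + 1] != a:
                continue                      # next input symbol does not match
            if top and not stack.startswith(top):
                continue                      # required stack symbol is not on top
            pending.append((q2, pos + len(a), push + stack[len(top):]))
    return False

for w in ["ab", "aabb", "ba", "aaab", "abb", ""]:
    print(repr(w), pda_accepts(w))
```

Tracing aabb by hand: two A's are pushed while reading aa, and each b pops one, so the stack is empty exactly when the input ends.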

Deterministic vs. non-deterministic

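Deterministic: there is at most one move for any situation (current state, next input symbol to be read, and the topmost symbol in the stack). Formally put, a PDA is deterministic iff there are no two distinct instructions
- (qi, a, A) --> (qj, r) and
- (qi, a, A') --> (qk, s)
  such that A=A' or A=e or A'=e.
Non-deterministic PDAs are not equivalent to deterministic PDAs: every language accepted by a deterministic PDA is accepted by some non-deterministic PDA, but not the other way around. For example, the even-length palindromes over {a, b} are accepted by a non-deterministic PDA but by no deterministic one.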

Equivalence of context free languages and non-deterministic PDA languages

Conversion from CFG to NPDA
  (1) Initial symbol: (q0, e, e) --> (q1, S)
  (2) Rule: A --> x ==> (q1, e, A) --> (q1, x)
  (3) Terminals: (q1, a, a) --> (q1, e)
Example: G = (V_N, V_T, S, R) for the language {aⁿbⁿ | n>=0}
V_N = {S}, V_T = {a, b}, R = { S --> aSb, S --> e}

The corresponding NPDA:
   States: {q0, q1}  Initial state: q0; Final states: {q0, q1}
   Input Alphabet: {a, b}
   Stack Alphabet: {S, a, b}
   Transitions:
      (q0, e, e) --> (q1, S)
      (q1, e, S) --> (q1, aSb)
      (q1, e, S) --> (q1, e)
      (q1, a, a) --> (q1, e)
      (q1, b, b) --> (q1, e)
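The three conversion steps can be sketched in code and the resulting NPDA simulated. The sketch assumes single-character grammar symbols so that the stack can be kept as a string, and it uses the empty string for e. Because rule (2) can push without reading input, the simulator discards any configuration whose stack holds more terminals than there is input left; that bound is an addition for the sake of termination, not part of the construction. The names cfg_to_npda and npda_accepts are made up for the example.

```python
def cfg_to_npda(rules, start, terminals):
    """Build transitions ((state, input, pop), (state, push)) from a grammar."""
    transitions = [(("q0", "", ""), ("q1", start))]               # (1) push the start symbol
    for head, bodies in rules.items():
        for body in bodies:
            transitions.append((("q1", "", head), ("q1", body)))  # (2) one move per rule
    for a in terminals:
        transitions.append((("q1", a, a), ("q1", "")))            # (3) match a terminal
    return transitions

def npda_accepts(transitions, terminals, word, start="q0", finals=("q0", "q1")):
    """Search all configurations (state, position, stack); stack top is first."""
    pending = [(start, 0, "")]
    seen = set()
    while pending:
        config = pending.pop()
        if config in seen:
            continue
        seen.add(config)
        state, pos, stack = config
        if pos == len(word) and state in finals and not stack:
            return True
        # Bound: every terminal on the stack must still consume one input symbol.
        if sum(symbol in terminals for symbol in stack) > len(word) - pos:
            continue
        for (q, a, top), (q2, push) in transitions:
            if q != state:
                continue
            if a and word[pos:pos + 1] != a:
                continue
            if top and not stack.startswith(top):
                continue
            pending.append((q2, pos + len(a), push + stack[len(top):]))
    return False

rules = {"S": ["aSb", ""]}          # S --> aSb, S --> e
npda = cfg_to_npda(rules, "S", "ab")
for t in npda:
    print(t)
print(npda_accepts(npda, "ab", "aabb"), npda_accepts(npda, "ab", "aab"))
```

The five transitions printed are the ones listed above, in the same order.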

Regular Expressions Revisited

While CFGs can be used to describe the tokens of a programming language (why?), regular expressions are a more convenient notation for describing their simple structure.
The alphabet consists of the character set chosen for the language and the notation includes
- "·" to concatenate items (juxtaposition is used for the same purpose),
- "|" to separate alternatives (often "+" is used for the same purpose),
- "*" to indicate that the previous item may be repeated zero or more times, and
- "(" and ")" for grouping.

Regular Expressions and Egrep

The UNIX egrep command uses regular expressions to search for text.
Why look at egrep? It is one of the most useful commands, and mastery of egrep opens the gates to mastery of other tools such as awk, sed, and perl.
egrep basically searches. More precisely,
egrep foo file returns all the lines in the file "file" that contain a string matching the expression "foo".
The simplest expression is a plain string, so this command returns all lines that contain foo as a substring.
Another way of using egrep is to have it read data from STDIN instead of having it search a file. For example,
ls | egrep blah lists all files in the current directory whose names contain the string "blah".

The Basics: Wildcards for egrep

egrep uses regular expressions which, as we know, go beyond wildcards, but we'll start with wildcards.

The canonical wildcard character is the dot ".". Here is an example:


>cat file

big
bad bug
bag
bigger
boogy

>egrep 'b.g' file

big
bad bug
bag
bigger

Notice that boogy didn't match, since the "." matches exactly one character.
To match arbitrary strings, use the star (the Kleene star), which works in the following way:
the expression consisting of a character followed by a star matches any number (possibly zero) of repetitions of that character. In particular, .* matches any string, and hence acts as a "wildcard".

Examples:

The File for These Examples


>cat file
big
bad bug
bag
bigger
boogy

Wildcards #1


>egrep 'b.*g' file
big
bad bug
bag
bigger
boogy

Wildcards #2


>egrep 'b.*g.' file
bigger
boogy

Wildcards #3


>egrep 'ggg*' file
bigger
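These outputs can be checked without a UNIX shell. The sketch below imitates egrep with Python's re module: a line is kept if the pattern matches anywhere in it. Python's regular expression syntax is not identical to egrep's, but the operators used so far behave the same way in both. The helper name egrep and the list lines are assumptions for the example, and the file contents are taken to have no trailing spaces.

```python
import re

def egrep(pattern, lines):
    """Return the lines that contain a match for pattern, as egrep would print them."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

lines = ["big", "bad bug", "bag", "bigger", "boogy"]

print(egrep("b.g", lines))     # "." stands for exactly one character
print(egrep("b.*g", lines))    # ".*" stands for any string, including the empty one
print(egrep("b.*g.", lines))   # some character must follow a g
print(egrep("ggg*", lines))    # two g's followed by zero or more g's
```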

The "escape" character

The wildcards are a start, but regular expressions are more powerful.
For example, suppose we want an expression that matches Frederic Smith or Fred Smith. In other words, the letters eric are "optional".
First, we need the notion of an "escaped" character.
An escaped character is a character preceded by a backslash. The preceding backslash removes an implied special meaning from a character.
Example:
To search for a line containing the text hello.gif, the correct command is
egrep 'hello\.gif' file
since egrep 'hello.gif' file will also match lines containing hello-gif, hello1gif, helloagif, etc.

Grouping expressions

Now we move on to grouping expressions, in order to find a way of making an expression that matches Fred or Frederic.
We start with the ? operator.
An expression consisting of a character followed by a question mark matches one or zero instances of that character.
Example: bugg?y matches bugy and buggy but not bugggy.
Now for how to group expressions. In our example, we want to make the string "eric" following "Fred" optional; we don't just want one optional character.
An expression surrounded by parentheses is treated as a single character.
Example: 'Fred(eric)? Smith' matches Fred Smith or Frederic Smith.
Note that we have to be careful when our expressions contain white space. When this happens, we need to enclose them in quotes so that the shell does not misinterpret the command. So to use our example above, we would need to type
egrep 'Fred(eric)? Smith' file
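The escape, the ? operator, and grouping can be tried out the same way. The sketch below again imitates egrep with Python's re module, whose syntax agrees with egrep's for these patterns; the helper name egrep and the sample lines are invented for the illustration.

```python
import re

def egrep(pattern, lines):
    """Return the lines that contain a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

files = ["hello.gif", "hello-gif", "hello1gif"]
print(egrep(r"hello\.gif", files))   # the escaped dot matches only a literal "."
print(egrep(r"hello.gif", files))    # the bare dot matches any one character

words = ["bugy", "buggy", "bugggy"]
print(egrep(r"bugg?y", words))       # the second g is optional

names = ["Fred Smith", "Frederic Smith", "Freddy Smith"]
print(egrep(r"Fred(eric)? Smith", names))   # the group "eric" is optional as a whole
```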

Other useful operators

To match a selection of characters, use [].
Example
'[Hh]ello' matches lines containing hello or Hello
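Ranges of characters are also permitted; for example, [a-z] matches any lowercase letter.
The [] may be used to search for non-matches. This is done by putting a caret ^ as the first character inside the square brackets.
Example: egrep '[^aeiou]' file returns any line containing at least one character that is not a lowercase vowel (a consonant, but also a digit, a space, and so on), so it is not very useful.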

The Start of the Line and End of the Line

Suppose you want to search for lines consisting of white space, then the word hello, then the end of the line. Let us start with an example.
```
>cat file

	hello
hello world
	hhello

>egrep 'hello' file

	hello
hello world
	hhello
```
This is not what we wanted. So what went wrong?
The problem is that egrep searches for lines containing the string "hello", and all the lines in the file contain it. To get around this problem, we introduce the end-of-line and beginning-of-line characters.
The $ character matches the end of the line. The ^ character matches the beginning of the line.
Returning to our previous example,
egrep '^[ ]*hello[ ]*$' file
This does what we want (it returns only one line), provided the indentation consists of spaces. The expression [ ] matches only the space character; if lines may be indented with tabs, use the POSIX class [[:space:]] in place of [ ].
Another example: egrep '^[^aeiou]*$' file returns all lines that contain no lowercase vowels.
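The effect of the anchors can be seen by running the patterns with and without them. As before, this sketch imitates egrep with Python's re module, which treats ^ and $ the same way for single lines; the helper name egrep and the sample lines (indented with spaces) are assumptions for the example.

```python
import re

def egrep(pattern, lines):
    """Return the lines that contain a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

lines = ["  hello", "hello world", "  hhello", "rhythm"]

print(egrep("hello", lines))             # unanchored: matches anywhere in the line
print(egrep("^[ ]*hello[ ]*$", lines))   # anchored: the whole line must fit the pattern
print(egrep("^[^aeiou]*$", lines))       # every character is something other than a vowel
```

Without ^ and $, a pattern only has to match some substring of the line; with both, it has to account for the line from its first character to its last.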

Matching one of two strings

The expression consisting of two expressions separated by the or operator | matches lines containing either of those two expressions.

Note that you must enclose this inside single or double quotes.
Examples: egrep 'cat|dog' file matches lines containing the string "cat" or the string "dog".
egrep 'I am a (cat|dog)' file matches lines containing the string "I am a cat" or the string "I am a dog".

Special Characters

In egrep, the following characters are considered special:
```
?  \  .  [  ]  ^  $  *  +  |  (  )  {  }
```
A closing square bracket loses its special meaning if placed first in a list. For example, []12] matches ], 1, or 2.
A caret ^ loses its special meaning inside square brackets if it is not placed first.
Most special characters lose their meaning inside square brackets.

Quotes

Single quotes are the safest to use, because they protect your regular expression from the shell. For example, egrep "!" file will often produce an error (since the shell thinks that "!" is referring to the shell command history) while egrep '!' file will not.
When should you use double quotes?
The answer is this: if you want to use shell variables, you need double quotes. For example,
egrep "$HOME" file
searches file for the name of your home directory, while
egrep '$HOME' file
passes the characters $HOME to egrep unchanged. Since $ is also special to egrep, search for the literal string $HOME with egrep '\$HOME' file.

For another source of information on using egrep see An introduction to UNIX by Dean Brock