Ambiguity in natural language: "there are flying planes" - planes that are flying, or the activity of flying planes?
Languages are sets of strings of characters (called sentences) over some alphabet. A programming language is the set of all syntactically correct programs.
Syntax rules determine which sentences are legal (belong to the language).
Features that make a syntax readable and writable:
  natural formats
  mnemonic operator symbols
  embedded comments, unrestricted-length identifiers
  syntax differences that reflect underlying semantic differences
  concise and regular structures
Pitfall: unspecified (implicit) declarations or operations.
  Example: default typing in Fortran - a side effect is that a mistyped identifier silently becomes a new variable (an undetected typo).
Lisp - not easy to read or write, but easy to translate.
Cobol - easy to read, but difficult to translate because of noise words and many ways of doing the same thing.
Example of translation difficulty: the dangling else (if ... if ... else - which if owns the else?).
Languages are described in two ways:
  Recognition - like a filter: separates correct sentences from incorrect sentences.
  Generation - a grammar generates exactly the legal sentences.
There is a close connection between formal generation and recognition.
John Backus (1959) designed BNF (Backus-Naur Form), later modified by Peter Naur. It is the same as Chomsky's context-free grammars (a generative technique), and was used ca. 1960 in the formal description of Algol-60.
Metalanguage: a language used to describe another language. BNF is a metalanguage for programming languages. It is just a particular notation for grammars, in which nonterminals are written in angle brackets and rules are written with ::=.
BNF / CFG grammar:
  Nonterminals: <sentence>, <subject>, <predicate>, <verb>, <article>, <noun>
  Terminals:    the, boy, girl, ran, ate, cake
  Start symbol: <sentence>
  Rules:
    <sentence>  ::= <subject> <predicate>
    <subject>   ::= <article> <noun>
    <predicate> ::= <verb> <article> <noun>
    <verb>      ::= ran | ate
    <article>   ::= the
    <noun>      ::= boy | girl | cake
  Operator: replace any nonterminal by a right-hand side of any rule (written ==>).
  Sentence: a string of terminals derived from <sentence> by repeated application of the replacement operator.
Examples:
  <sentence> ==> <subject> <predicate>
             ==> <article> <noun> <predicate>
             ==> the <noun> <predicate>
         ... ==> the boy ate the cake
    also ... ==> the cake ate the boy
Syntax does not imply correct semantics.
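The replacement operator can be sketched directly in code. A minimal Python sketch, assuming the grammar is stored as a dict from each nonterminal to its list of alternatives, with random.choice deciding which rule to apply:

```python
import random

# The toy grammar above, one entry per nonterminal.
GRAMMAR = {
    "<sentence>":  [["<subject>", "<predicate>"]],
    "<subject>":   [["<article>", "<noun>"]],
    "<predicate>": [["<verb>", "<article>", "<noun>"]],
    "<verb>":      [["ran"], ["ate"]],
    "<article>":   [["the"]],
    "<noun>":      [["boy"], ["girl"], ["cake"]],
}

def derive(symbol):
    """Expand a symbol into a list of terminals (leftmost derivation)."""
    if symbol not in GRAMMAR:              # terminal: keep as-is
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])   # pick a rule to apply
    words = []
    for s in rhs:
        words.extend(derive(s))
    return words

print(" ".join(derive("<sentence>")))  # a random 5-word sentence, e.g. "the boy ate the cake"
```

Note that "the cake ate the boy" is a possible output: the generator enforces syntax only, not sense.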
Two notations for the same rule:
  < A > ::= < B > | < C > d
  A -> B | C d
Derivation trees:
  B -> 0B | 1B | 0 | 1
  B ==> 0B ==> 01B ==> 010
From a derivation we get a parse tree.
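A hypothetical recursive recognizer makes the tree shape concrete: each recursive call adds one level to the parse tree, mirroring the derivation step by step (nested tuples stand in for tree nodes).

```python
def parse_B(s):
    """Parse a binary string with B -> 0B | 1B | 0 | 1.

    Returns a nested tuple as the parse tree, or None on failure.
    """
    if s and all(c in "01" for c in s):
        if len(s) == 1:
            return ("B", s)                     # B -> 0  or  B -> 1
        return ("B", s[0], parse_B(s[1:]))      # B -> 0B or  B -> 1B
    return None

print(parse_B("010"))
# ('B', '0', ('B', '1', ('B', '0')))
```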
But the derivation may not be unique:
  S -> SS | (S) | ()
  S ==> SS ==> (S)S ==> (())S ==> (())()
  S ==> SS ==> S() ==> (S)() ==> (())()
Different derivations, but the same parse tree.
But we can also get 2 parse trees from the same string: ()()(). Each tree corresponds to its own leftmost derivation:
  S ==> SS ==> SSS ==> ()SS ==> ()()S ==> ()()()   (the left S expanded to SS)
  S ==> SS ==> ()S ==> ()SS ==> ()()S ==> ()()()   (the right S expanded to SS)
A grammar is ambiguous if some sentence has 2 distinct parse trees.
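One way to check the ambiguity claim is to count parse trees by brute force. A sketch that tries every production of S at every position (the function name and approach are illustrative, not a standard algorithm from the notes):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def trees(s):
    """Number of distinct parse trees for s under S -> SS | (S) | ()."""
    n = 0
    if s == "()":
        n += 1                                       # S -> ()
    if len(s) >= 4 and s[0] == "(" and s[-1] == ")":
        n += trees(s[1:-1])                          # S -> (S)
    for i in range(2, len(s) - 1):                   # S -> S S, every split
        n += trees(s[:i]) * trees(s[i:])
    return n

print(trees("(())()"))   # 1 -- unambiguous sentence
print(trees("()()()"))   # 2 -- the ambiguous one above
```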
The definition of a language does not require that its sentences mean anything ("the cake ate the boy" is a legal sentence).
Functions of BNF:
YACC - can build a syntax analyzer quickly from a context-free grammar.
EBNF operators:
  |       - choice
  ( )     - grouping
  { }*    - repetition (0 or more)
  [ ... ] - optional
Example: an identifier is a letter followed by 0 or more letters or digits.

EBNF:
  I -> L { L | D }*
  L -> a | b | ...
  D -> 0 | 1 | ...

Pure BNF:
  I -> L | L M
  M -> C M | C
  C -> L | D
  L -> a | b | ...
  D -> 0 | 1 | ...
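Both formulations describe the same set of strings. A quick sketch checks this, pairing a regular expression for the EBNF form with a literal transcription of the pure-BNF recursion (all names here are illustrative, and letters are restricted to ASCII to match the rules):

```python
import re

# EBNF form as a regular expression: I -> L { L | D }*
ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Pure-BNF form, followed rule by rule.
def is_C(ch):                    # C -> L | D
    return ch.isascii() and ch.isalnum()

def is_M(s):                     # M -> C M | C   (one or more C's)
    return bool(s) and is_C(s[0]) and (len(s) == 1 or is_M(s[1:]))

def is_I(s):                     # I -> L | L M
    return (bool(s) and s[0].isascii() and s[0].isalpha()
            and (len(s) == 1 or is_M(s[1:])))

for s in ("x", "temp9", "9lives", ""):
    print(s, bool(ID_RE.fullmatch(s)), is_I(s))   # the two columns agree
```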
Translators are grouped by the number of passes required. Often there are two passes: the first decomposes the source program, the second generates object code.
letter  := { 'A', ..., 'Z' }
digit   := { '0', '1', ..., '9' }
symbols := { ':=', '+', '-', '*', '/', '(', ')', $$$ }   ($$$ means end of file)
[ The alphabet is the union of the first three sets. ]
id      = letter ( letter | digit )*   [ except "read" and "write" ]
literal = digit digit*
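A sketch scanner for these token classes (the keyword check implements the "except read and write" side condition; the names and the findall-based approach are illustrative):

```python
import re

KEYWORDS = {"read", "write"}
# Longest/most specific alternatives first, so ':=' wins over nothing.
TOKEN_RE = re.compile(r":=|[+\-*/()]|[A-Za-z][A-Za-z0-9]*|[0-9]+")

def scan(src):
    """Return (kind, text) pairs for src, ending with the $$$ marker."""
    tokens = []
    for text in TOKEN_RE.findall(src):     # whitespace falls through
        if text in KEYWORDS:
            kind = text                    # read / write keywords
        elif text[0].isalpha():
            kind = "id"
        elif text[0].isdigit():
            kind = "literal"
        else:
            kind = text                    # operators and punctuation
        tokens.append((kind, text))
    tokens.append(("$$$", "$$$"))
    return tokens

print(scan("sum := A + B"))
```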
1. <pgm>       -> <stmt list> $$$
2. <stmt list> -> <stmt list> <stmt> | epsilon
3. <stmt>      -> id := <expr> | read id | write <expr>
4. <expr>      -> <term> | <expr> <add op> <term>
5. <term>      -> <factor> | <term> <mult op> <factor>
6. <factor>    -> ( <expr> ) | id | literal
7. <add op>    -> + | -
8. <mult op>   -> * | /
Alternative formulations of <expr>, and what goes wrong:
  <expr> -> <factor> | <expr> <op> <factor>    (one operator level: loses the precedence of * and / over + and -)
  <expr> -> <term> | <term> <add op> <expr>    (right recursion: makes - and / associate to the right)
  <expr> -> <factor> | <expr> <op> <expr>      (ambiguous)
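Rules 4-6 of the grammar give * and / higher precedence than + and -, and left recursion gives left associativity. A sketch evaluator shows the effect; since recursive descent cannot follow left-recursive rules directly, each rule is rewritten in the equivalent EBNF loop form (e.g. <expr> -> <term> { <add op> <term> }*), which preserves left associativity. Only literals and parentheses are handled, and integer division stands in for /:

```python
def evaluate(tokens):
    """Evaluate a token list using the precedence structure of rules 4-6."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "$$$"

    def eat():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():                 # <factor> -> ( <expr> ) | literal
        if peek() == "(":
            eat(); v = expr(); eat()       # assumes well-formed input
            return v
        return int(eat())

    def term():                   # <term> -> <factor> { <mult op> <factor> }*
        v = factor()
        while peek() in ("*", "/"):
            v = v * factor() if eat() == "*" else v // factor()
        return v

    def expr():                   # <expr> -> <term> { <add op> <term> }*
        v = term()
        while peek() in ("+", "-"):
            v = v + term() if eat() == "+" else v - term()
        return v

    return expr()

print(evaluate(["2", "+", "3", "*", "4"]))   # 14, not 20
```

With the single-level grammar above, the same input would evaluate left to right and give 20.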
Let's see what happens when we try to compile a simple program to print the sum and average of two numbers:
read A
read B
sum := A + B
write sum
write sum / 2
There are 15 tokens in this program, which the scanner will pass on to the parser. The parser will discover the structure of the program and build a parse tree.
Because it is such a simple example, there is very little semantic analysis. We can check to make sure no variable is used before it is given a value, and that no variable is given a value that is never used.
For each symbol, we keep track of whether it has (a) no value, (b) an unused value, or (c) a used value in the symbol table. At the end of the program, we scan the whole symbol table to see if anything has an unused value, for example. If so, we print a warning message.
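The three states can be tracked with a small table. A sketch, assuming the compiler has already reduced the program to (op, name) events: 'def' when a variable is assigned, 'use' when its value is read (the event encoding and function name are illustrative):

```python
NOVALUE, UNUSED, USED = "no value", "unused value", "used value"

def check(events):
    """Warn about use-before-def and values that are never used."""
    table, warnings = {}, []
    for op, name in events:
        state = table.get(name, NOVALUE)
        if op == "use":
            if state == NOVALUE:
                warnings.append(f"{name} used before given a value")
            else:
                table[name] = USED
        elif op == "def":
            if state == UNUSED:               # old value overwritten unread
                warnings.append(f"value of {name} never used")
            table[name] = UNUSED
    # end of program: scan the whole table for leftover unused values
    for name, state in table.items():
        if state == UNUSED:
            warnings.append(f"value of {name} never used")
    return warnings

# read A; read B; sum := A + B; write sum; write sum / 2
events = [("def", "A"), ("def", "B"), ("use", "A"), ("use", "B"),
          ("def", "sum"), ("use", "sum"), ("use", "sum")]
print(check(events))   # [] -- no warnings for the sample program
```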
In addition to performing static checks, semantic analysis often simplifies the structure of the parse tree, getting rid of useless nodes, reducing the number of different kinds of nodes (e.g. expr v. term v. factor).
For intermediate code generation, the compiler will again traverse the parse tree, making additional annotations. In a real compiler these might include the type, size, and storage location of each value.
Our intermediate code might end up looking like this:
read
pop A
read
pop B
push A
push B
add
pop sum
push sum
write
push sum
push 2
div
write
There are many other intermediate forms possible.
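The stack-based intermediate code is easy to execute directly, which is one way to test it before writing the target code generator. A sketch interpreter, with read drawing from a supplied list instead of the keyboard, write collecting output, and integer division standing in for div:

```python
def run(code, inputs):
    """Execute the stack-based intermediate code above."""
    stack, memory, out = [], {}, []
    inputs = iter(inputs)
    for line in code:
        op, *arg = line.split()
        if op == "read":
            stack.append(next(inputs))
        elif op == "pop":                       # store top of stack in memory
            memory[arg[0]] = stack.pop()
        elif op == "push":                      # variable name or literal
            a = arg[0]
            stack.append(int(a) if a.isdigit() else memory[a])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "div":                       # integer division
            b, a = stack.pop(), stack.pop()
            stack.append(a // b)
        elif op == "write":
            out.append(stack.pop())
    return out

CODE = ["read", "pop A", "read", "pop B", "push A", "push B", "add",
        "pop sum", "push sum", "write", "push sum", "push 2", "div", "write"]
print(run(CODE, [10, 4]))   # [14, 7]
```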
The target code generator will have to decide on how to use the resources of the target machine. Certain registers may be dedicated to special purposes, for example. The layout of main memory will have to be established.
For simplicity, let's assume: values are 32-bit integers; the top two stack slots are kept in registers d1 and d2; register d0 carries the result from read and the argument to write; and read and write are library routines reached with jsr.
Our code generator might produce the following 680x0 assembly:
.data
A: .long 0
B: .long 0
sum: .long 0
.text
main: jsr read
movl d0,d1
movl d1,A
jsr read
movl d0,d1
movl d1,B
movl A,d1
movl B,d2
addl d2,d1
movl d1,sum
movl sum,d1
movl d1,d0
jsr write
movl sum,d1
movl #2,d2
divsl d2,d1
movl d1,d0
jsr write
Compilers without optimizers tend to produce code that is awful. In our case, much of the awfulness comes from doing a literal macro expansion of the stack-based intermediate code.
To do a good job with this language, we would want at least to apply a "peephole" optimizer to the assembly code. Better yet, pick a different intermediate language on which to do our optimizations.
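A peephole optimizer slides a small window over the instruction stream and rewrites local patterns. A sketch for one pattern the macro expansion produces constantly: a "movl X,Y" immediately after "movl Y,X" reloads a value that is already in place (names and the single-pattern scope are illustrative; a real peephole pass handles many patterns):

```python
def peephole(code):
    """Delete 'movl X,Y' when the previous instruction was 'movl Y,X'."""
    out = []
    for instr in code:
        if out and instr.startswith("movl "):
            src, dst = instr[5:].split(",")
            if out[-1] == f"movl {dst},{src}":
                continue                      # redundant reload: drop it
        out.append(instr)
    return out

CODE = ["jsr read", "movl d0,d1", "movl d1,A",
        "movl A,d1",                 # redundant: A was just stored from d1
        "movl d1,d0", "jsr write"]
print(peephole(CODE))
```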