Languages are sets of strings of characters (called sentences) over some alphabet; a programming language is the set of all its syntactically correct programs.
Syntax rules determine which sentences are legal (in the language).
Ambiguous: The man saw the boy on the hill with a telescope.
natural formats
mnemonic operator symbols
embedded comments, unrestricted length identifiers
syntax differences reflect underlying semantic differences
concise and regular structures
unspecified declarations or operations
default typing in Fortran - side effect: a misspelled identifier is silently treated as a new variable instead of being flagged as a typo
Lisp - not easy to read or write, but easy to translate
Cobol - easy to read, but difficult to translate because of noise words and many ways of doing the same thing
Example: the dangling else (if ... if ... else)
Languages are described in two ways: by recognition (a device that accepts exactly the legal sentences) and by generation (a grammar that produces them):
John Backus (1959) designed BNF (Backus-Naur Form), later modified by Peter Naur. It is the same as Chomsky's context-free grammars (a generative technique). BNF was first used ca. 1960 in the formal description of Algol-60.
Metalanguage: a language used to describe another language. BNF is a metalanguage for programming languages: it is just a particular notation for context-free grammars.
BNF / CFG Grammar

    Nonterminals: <sentence>, <subject>, <predicate>, <verb>, <article>, <noun>
    Terminals:    the, boy, girl, ran, ate, cake
    Start symbol: <sentence>
    Rules:        <sentence>  ::= <subject> <predicate>
                  <subject>   ::= <article> <noun>
                  <predicate> ::= <verb> <article> <noun>
                  <verb>      ::= ran | ate
                  <article>   ::= the
                  <noun>      ::= boy | girl | cake
    Operator:     replace any nonterminal by a right-hand side of one of its rules (==>)
    Sentence:     a string of terminals derived from <sentence> by repeated application of the replacement operator

Examples:
    <sentence> ==> <subject> <predicate>
               ==> <article> <noun> <predicate>
               ==> the <noun> <predicate>
           ... ==> the boy ate the cake
      also ... ==> the cake ate the boy
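The replacement operator is easy to mimic in code. A minimal sketch in Python, assuming leftmost replacement (the names `rules` and `derive` are mine, not part of the notes):

```python
# The toy grammar above; each nonterminal maps to its list of alternatives.
rules = {
    "<sentence>":  [["<subject>", "<predicate>"]],
    "<subject>":   [["<article>", "<noun>"]],
    "<predicate>": [["<verb>", "<article>", "<noun>"]],
    "<verb>":      [["ran"], ["ate"]],
    "<article>":   [["the"]],
    "<noun>":      [["boy"], ["girl"], ["cake"]],
}

def derive(form, choices):
    """Repeatedly replace the leftmost nonterminal, taking alternative
    indices from the iterator `choices`, until only terminals remain."""
    form = list(form)
    while any(sym in rules for sym in form):
        i = next(j for j, sym in enumerate(form) if sym in rules)
        form[i:i+1] = rules[form[i]][next(choices)]
    return " ".join(form)

print(derive(["<sentence>"], iter([0, 0, 0, 0, 0, 1, 0, 2])))
# the boy ate the cake
```

A different sequence of choices derives the (legal but nonsensical) "the cake ate the boy" from the same grammar.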
Syntax does not imply correct semantics: the definition of a language does not require that its sentences mean anything ("the cake ate the boy" is perfectly legal).
Two notations:
    BNF: <A> ::= <B> | <C> d
    CFG: A -> B | C d
Derivation trees:
    B -> 0B | 1B | 0 | 1
    B ==> 0B ==> 01B ==> 010
From a derivation we get a parse tree.
But the derivation may not be unique:
    S -> SS | (S) | ()
    S ==> SS ==> (S)S ==> (())S ==> (())()
    S ==> SS ==> S() ==> (S)() ==> (())()
Both derivations give the same parse tree.
But we can get 2 distinct parse trees from the same string, ()()(); each corresponds to a unique derivation:
    S ==> SS ==> SSS ==> ()SS ==> ()()S ==> ()()()
A grammar is ambiguous if some sentence has 2 distinct parse trees.
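The ambiguity can be checked mechanically by counting parse trees. A sketch in Python for the grammar S -> SS | (S) | () (the helper `count_parses` is my own):

```python
from functools import lru_cache

def count_parses(s):
    """Count distinct parse trees of s for  S -> SS | (S) | () ."""
    @lru_cache(maxsize=None)
    def count(i, j):                        # parses of s[i:j] as S
        n = 0
        if j - i == 2 and s[i:j] == "()":
            n += 1                          # S -> ()
        if j - i >= 4 and s[i] == "(" and s[j-1] == ")":
            n += count(i + 1, j - 1)        # S -> (S)
        for k in range(i + 2, j - 1, 2):    # S -> S S, every split point
            n += count(i, k) * count(k, j)
        return n
    return count(0, len(s))

print(count_parses("()()()"))   # 2 -> the grammar is ambiguous
print(count_parses("(())()"))   # 1
```

The count of 2 for ()()() matches the two parse trees above; (())() has a unique tree even though it has several derivations.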
<statement>     ::= <unconditional> | <conditional>
<unconditional> ::= <assignment> | <for loop> |
                    begin {<statement>} end
<conditional>   ::= if <expression> then <statement> |
                    if <expression> then <statement> else <statement>

How do you parse: if exp1 then if exp2 then stat1 else stat2 ?
It could be parsed either way (the else attached to the inner if, or to the outer one): the grammar is ambiguous.
Pascal rule: the else is attached to the nearest if.
To get the second form, write:
    if exp1 then begin if exp2 then stat1 end else stat2
C has a similar ambiguity.
MODULA-2 and ALGOL 68 require an "end" to terminate the conditional.
Whether a grammar is ambiguous is, in general, undecidable.
YACC can build a syntax analyzer quickly from a context-free grammar.

Extended BNF notation:
    |        choice
    ( )      grouping
    { }*     repetition, 0 or more times
    [ ... ]  optional
Example: an identifier is a letter followed by 0 or more letters or digits:

    EBNF:                   Equivalent BNF:
    I -> L { L | D }*       I -> L | L M
    L -> a | b | ...        M -> C M | C
    D -> 0 | 1 | ...        C -> L | D
                            L -> a | b | ...
                            D -> 0 | 1 | ...
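The same definition is a one-line regular expression in a language that supports them; a quick sketch using Python's `re` module:

```python
import re

# EBNF  I -> L { L | D }*  : a letter, then zero or more letters or digits.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

print(bool(IDENT.fullmatch("sum2")))   # True
print(bool(IDENT.fullmatch("2sum")))   # False
```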
Translators are grouped by the number of passes required; often there are two (the first decomposes the source program, the second generates object code).
    letter  := { 'A', ..., 'Z' }
    digit   := { '0', '1', ..., '9' }
    symbols := { ':=', '+', '-', '*', '/', '(', ')', $$$ }   [$$$ means end of file]

[ The alphabet is the union of the first three sets. ]

    id      := letter { letter | digit }*   [except "read" and "write"]
    literal := digit { digit }*

Grammar:

    1. <pgm>       -> <statement list> $$$
    2. <stmt list> -> <stmt list> <stmt> | epsilon
    3. <stmt>      -> id := <expr> | read <id> | write <expr>
    4. <expr>      -> <term> | <expr> <add op> <term>
    5. <term>      -> <factor> | <term> <mult op> <factor>
    6. <factor>    -> ( <expr> ) | id | literal
    7. <add op>    -> + | -
    8. <mult op>   -> * | /
Alternative formulations of the expression rules each have a problem:
    <expr> -> <factor> | <expr> <op> <factor>    (loses the precedence of * and / over + and -)
    <expr> -> <term> | <term> <add op> <expr>    (right recursion makes the operators right-associative)
    <expr> -> <factor> | <expr> <op> <expr>      (ambiguous)
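To see why the shape of rules 4 and 5 matters, here is a sketch of a recursive-descent evaluator (my own names; the left recursion is rewritten as loops, and only literal operands are handled) that preserves the grammar's precedence and left associativity:

```python
def parse_expr(tokens):
    """Evaluate a token list per rules 4-6 (literals only, integer division)."""
    def expr(i):                            # <expr> -> <term> { <add op> <term> }*
        val, i = term(i)
        while i < len(tokens) and tokens[i] in ("+", "-"):
            op, (rhs, i) = tokens[i], term(i + 1)
            val = val + rhs if op == "+" else val - rhs
        return val, i
    def term(i):                            # <term> -> <factor> { <mult op> <factor> }*
        val, i = factor(i)
        while i < len(tokens) and tokens[i] in ("*", "/"):
            op, (rhs, i) = tokens[i], factor(i + 1)
            val = val * rhs if op == "*" else val // rhs
        return val, i
    def factor(i):                          # <factor> -> ( <expr> ) | literal
        if tokens[i] == "(":
            val, i = expr(i + 1)
            return val, i + 1               # skip ")"
        return int(tokens[i]), i + 1
    return expr(0)[0]

print(parse_expr(["8", "-", "4", "-", "2"]))   # 2  : (8-4)-2, left associative
print(parse_expr(["2", "+", "3", "*", "4"]))   # 14 : * binds tighter than +
```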
Let's see what happens when we try to compile a simple program to print the sum and average of two numbers:
read A
read B
sum := A + B
write sum
write sum / 2
There are 15 tokens in this program, which the scanner will pass on to the parser. The parser will discover the structure of the program and build a parse tree.
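A scanner for this language can be sketched in a few lines of Python (the names `scan` and `TOKEN_RE` are mine; a real scanner would also report lexical errors and append the $$$ end-of-file token):

```python
import re

KEYWORDS = {"read", "write"}
TOKEN_RE = re.compile(r"\s*(:=|[+\-*/()]|[A-Za-z][A-Za-z0-9]*|[0-9]+)")

def scan(src):
    """Split source text into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            break                   # only trailing whitespace remains
        text = m.group(1)
        if text in KEYWORDS:
            kind = text             # reserved words are their own kind
        elif text[0].isalpha():
            kind = "id"
        elif text[0].isdigit():
            kind = "literal"
        else:
            kind = text             # := + - * / ( )
        tokens.append((kind, text))
        pos = m.end()
    return tokens

program = "read A\nread B\nsum := A + B\nwrite sum\nwrite sum / 2\n"
print(len(scan(program)))   # 15
```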
Because it is such a simple example, there is very little semantic analysis. We can check to make sure no variable is used before it is given a value, and that no variable is given a value that is never used.
For each symbol, we keep track of whether it has (a) no value, (b) an unused value, or (c) a used value in the symbol table. At the end of the program, we scan the whole symbol table to see if anything has an unused value, for example. If so, we print a warning message.
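A sketch of this check in Python, assuming statements have already been reduced to simple tuples (the tuple representation and the name `check` are my inventions):

```python
# The three states from the notes: no value, unused value, used value.
NO_VALUE, UNUSED, USED = "no value", "unused value", "used value"

def check(statements):
    """Warn about use-before-set and set-but-never-used variables.
    Each statement is ('read', var), ('write', [vars]) or
    ('assign', target, [operand vars])."""
    table, warnings = {}, []

    def use(var):
        if table.get(var, NO_VALUE) == NO_VALUE:
            warnings.append(f"{var} used before it is given a value")
        else:
            table[var] = USED

    for op, *rest in statements:
        if op == "read":
            table[rest[0]] = UNUSED
        elif op == "write":
            for v in rest[0]:
                use(v)
        elif op == "assign":
            target, operands = rest
            for v in operands:
                use(v)
            table[target] = UNUSED

    # Final scan of the whole symbol table, as described above.
    for var, state in table.items():
        if state == UNUSED:
            warnings.append(f"{var} is given a value that is never used")
    return warnings

print(check([("read", "A"), ("read", "B"),
             ("assign", "sum", ["A", "B"]),
             ("write", ["sum"]), ("write", ["sum"])]))   # [] -> no warnings
```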
In addition to performing static checks, semantic analysis often simplifies the structure of the parse tree, getting rid of useless nodes, reducing the number of different kinds of nodes (e.g. expr v. term v. factor).
For intermediate code generation, the compiler will again traverse the parse tree, making additional annotations. Annotations for code generation might (in a real compiler) include:
The compiler then traverses the tree, via a recursive procedure starting with the leaves of the tree, and translates it into some intermediate form. Our intermediate code might end up looking like this:
read
pop A
read
pop B
push A
push B
add
pop sum
push sum
write
push sum
push 2
div
write
There are many other intermediate forms possible.
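The stack-based listing above can be executed directly by a tiny interpreter. A sketch in Python (integer division assumed; the name `run` is my own):

```python
def run(code, inputs):
    """Interpret the stack-based intermediate code above."""
    stack, mem, out = [], {}, []
    inputs = iter(inputs)
    for line in code:
        op, *arg = line.split()
        if op == "read":
            stack.append(next(inputs))
        elif op == "pop":
            mem[arg[0]] = stack.pop()
        elif op == "push":                  # a literal or a variable
            stack.append(int(arg[0]) if arg[0].isdigit() else mem[arg[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "div":                   # integer division assumed
            b, a = stack.pop(), stack.pop()
            stack.append(a // b)
        elif op == "write":
            out.append(stack.pop())
    return out

program = """read
pop A
read
pop B
push A
push B
add
pop sum
push sum
write
push sum
push 2
div
write""".splitlines()

print(run(program, [3, 4]))   # [7, 3] : sum = 7, sum / 2 = 3
```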
The target code generator will have to decide on how to use the resources of the target machine. Certain registers may be dedicated to special purposes, for example. The layout of main memory will have to be established.
For simplicity, let's assume: read returns its result in register d0, write expects its argument in d0, and each variable is statically allocated one longword in the data section.
Our code generator might produce the following in 68000 assembly code:
.data
A: .long 0
B: .long 0
sum: .long 0
.text
main: jsr read
movl d0,d1
movl d1,A
jsr read
movl d0,d1
movl d1,B
movl A,d1
movl B,d2
addl d2,d1
movl d1,sum
movl sum,d1
movl d1,d0
jsr write
movl sum,d1
movl #2,d2
divsl d2,d1
movl d1,d0
jsr write
Compilers without optimizers tend to produce code that is awful. In our case, much of the awfulness comes from doing a literal macro expansion of the stack-based intermediate code.
To do a good job with this language, we would want at least to apply a "peephole" optimizer to the assembly code. Better yet, pick a different intermediate language on which to do our optimizations.
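For instance, the listing contains `movl d1,sum` immediately followed by `movl sum,d1`. A peephole optimizer slides a small window over the assembly and deletes such patterns; a minimal sketch of that one rule (the name `peephole` is mine):

```python
def peephole(lines):
    """One peephole rule: a move immediately followed by the reverse
    move is redundant, so drop the second instruction."""
    out = []
    for line in lines:
        if out:
            prev, cur = out[-1].split(), line.split()
            if (len(prev) == 2 and len(cur) == 2
                    and prev[0] == cur[0] == "movl"
                    and prev[1].split(",") == cur[1].split(",")[::-1]):
                continue                     # skip the redundant reload
        out.append(line)
    return out

print(peephole(["movl d1,sum", "movl sum,d1", "movl d1,d0"]))
# ['movl d1,sum', 'movl d1,d0']
```

A real peephole pass would carry many such rules (dead moves, strength reduction, redundant loads across labels handled carefully), but the sliding-window structure is the same.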