CSCI 431 Chapter 3 --- Language Translation Issues

Language Specification

  • Syntax (Easy)
  • Semantics (Hard)

    Programming Language Syntax

    Ambiguous: they are flying planes.

    A language is a set of strings of characters (called sentences) over some alphabet. For a programming language, this is the set of all syntactically correct programs.

    Syntax rules determine which sentences are legal (in the language).

    Syntax Criteria

    1. readability: self-documenting - understandable without separate documentation

      natural formats

      mnemonic operator symbols

      embedded comments, unrestricted length identifiers

      syntax differences reflect underlying semantic differences

    2. writability:

      concise and regular structures

      unspecified declarations or operations

      default (implicit) typing in Fortran - a side effect is that a mistyped variable name goes undetected

    3. ease of verifiability: programs can be mathematically proven correct
    4. ease of translation: regularity of structure

      Lisp - not easy to read or write, but easy to translate

      Cobol - easy to read, but difficult to translate because of noise words and many ways of doing the same thing

    5. lack of ambiguity

      Example: the dangling else (if ... if ... else) - which if does the else belong to?

    Languages are described in two ways:

    (Formal) Grammars

    Chomsky Hierarchy:

    - Regular grammars (level 3)
    - Context-free grammars (level 2)
    - Context-sensitive grammars (level 1)
    - Phrase structure grammars (level 0)

    BNF (Backus-Naur Form)

    John Backus designed BNF (Backus-Naur Form) in 1959; it was modified by Peter Naur and used in the formal description of Algol-60. It is equivalent to Chomsky's context-free grammars and is a generative technique.

    Metalanguage: a language used to describe another language. BNF is a metalanguage for programming languages.

    It is just a particular notation for grammars, with the following parts:

    BNF / CFG Grammar

    Nonterminal    <sentence>, <subject>, <predicate>, <verb>, <article>, <noun>
    Terminal       the, boy, girl, ran, ate, cake
    Start symbol   <sentence>
    Rules          <sentence>  ::= <subject> <predicate>
                   <subject>   ::= <article> <noun>
                   <predicate> ::= <verb> <article> <noun>
                   <verb>      ::= ran | ate
                   <article>   ::= the
                   <noun>      ::= boy | girl | cake
    Operator       Replace any nonterminal by the right hand side of one of its rules (==>)
    Sentence       String of terminals derived from <sentence> by application of the replacement operator
    Examples       <sentence> ==> <subject> <predicate>
                              ==> <article> <noun> <predicate>
                              ==> the <noun> <predicate>
                          ... ==> the boy ate the cake
                         also ==> the cake ate the boy
                   Syntax does not imply correct semantics
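
    The replacement operator can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name expand are assumptions made for the example, not part of BNF. It enumerates every sentence the grammar generates, including the semantically odd ones.

```python
from itertools import product

# The grammar from the notes; the dictionary keys are the nonterminals.
GRAMMAR = {
    "<sentence>": [["<subject>", "<predicate>"]],
    "<subject>": [["<article>", "<noun>"]],
    "<predicate>": [["<verb>", "<article>", "<noun>"]],
    "<verb>": [["ran"], ["ate"]],
    "<article>": [["the"]],
    "<noun>": [["boy"], ["girl"], ["cake"]],
}

def expand(symbol):
    """Return every terminal string (as a list of words) derivable from symbol."""
    if symbol not in GRAMMAR:          # a terminal derives only itself
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:        # try each alternative of the rule
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results

sentences = [" ".join(words) for words in expand("<sentence>")]
```

    Exhaustive enumeration works here only because this grammar has no recursive rules, so the language is finite.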


    Two Notations:

    <A> ::= <B> | <C> d
    A -> B | C d

    Derivation trees:

    B -> 0B | 1B | 0 | 1
    B ==> 0B ==> 01B ==> 010

    From a derivation we get a parse tree
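
    As a rough sketch of how the derivation of 010 maps to a tree, the Python below builds the parse tree for B -> 0B | 1B | 0 | 1 as nested tuples. The names parse_B and show and the tuple representation are assumptions made for this illustration.

```python
def parse_B(s):
    """Parse tree for s under B -> 0B | 1B | 0 | 1, or None if s is not derivable."""
    if not s or s[0] not in "01":
        return None
    if len(s) == 1:                     # B -> 0  or  B -> 1
        return ("B", s)
    rest = parse_B(s[1:])               # B -> 0B  or  B -> 1B
    return None if rest is None else ("B", s[0], rest)

def show(tree, depth=0):
    """Render the nested-tuple tree as a list of indented text lines."""
    lines = ["  " * depth + tree[0]]
    for child in tree[1:]:
        if isinstance(child, tuple):    # a nonterminal child: recurse
            lines.extend(show(child, depth + 1))
        else:                           # a terminal child: a leaf
            lines.append("  " * (depth + 1) + child)
    return lines

tree = parse_B("010")
print("\n".join(show(tree)))
```

    Each nested tuple is one application of a rule, so reading the tree from the root downward replays the derivation B ==> 0B ==> 01B ==> 010.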
     
     





    But the derivation may not be unique:

    S -> SS | (S) | ()
    S ==> SS ==> (S)S ==> (())S ==> (())()
    S ==> SS ==> S() ==> (S)() ==> (())()

    Both derivations give the same parse tree
     
     





    But we can get 2 parse trees from the same string, ()()(); each corresponds to a unique leftmost derivation, for example:

    S ==> SS ==> SSS ==> ()SS ==> ()()S ==> ()()()
     
     

    A grammar is ambiguous if some sentence has 2 or more distinct parse trees.
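
    One way to check the claim for S -> SS | (S) | () is to count parse trees directly. The sketch below is an illustration only (the function name count_trees is made up for it): it tries every rule and every split point and adds up the possibilities.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(s):
    """Number of distinct parse trees for s under S -> SS | (S) | ()."""
    total = 0
    if s == "()":                                    # S -> ()
        total += 1
    if len(s) > 2 and s[0] == "(" and s[-1] == ")":  # S -> (S)
        total += count_trees(s[1:-1])
    # S -> SS: every S derives at least 2 characters, and always an even
    # number, so only even split points from 2 to len(s) - 2 can work.
    for k in range(2, len(s) - 1, 2):
        total += count_trees(s[:k]) * count_trees(s[k:])
    return total

print(count_trees("(())()"))   # one tree, although several derivations exist
print(count_trees("()()()"))   # two trees, so the grammar is ambiguous
```

    A count above 1 for any single sentence is enough to show the grammar is ambiguous; a count of 1 for one sentence says nothing about the grammar as a whole.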
     

    The definition of a language does not require that it mean anything. Examples:

    1. set of all C assignments
    2. set of all C programs
    3. set of all identifiers
    4. set of strings consisting of an even number of a's followed by a b
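
    The last example is small enough to test membership mechanically. A minimal sketch using Python's re module follows; it assumes that zero counts as an even number of a's, so the string "b" alone is in the language. The name in_language is hypothetical.

```python
import re

# (aa)* matches zero, two, four, ... a's; the pattern then requires one b.
EVEN_AS_THEN_B = re.compile(r"(aa)*b")

def in_language(s):
    """True if s is an even number of a's followed by a single b."""
    # fullmatch requires the whole string to match, not just a prefix.
    return EVEN_AS_THEN_B.fullmatch(s) is not None

for candidate in ["b", "aab", "ab", "aaab", "aabb"]:
    print(candidate, in_language(candidate))
```

    The recognizer decides only whether a string belongs to the set; it attaches no meaning to the string.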

    Functions of BNF

    YACC - can build a syntax analyzer quickly from the context free grammar.


    Extended BNF (EBNF)

    EBNF is (any) extension of BNF that provides a shorthand notation for BNF rules. It adds no power to the syntax, only a shorthand way to write productions. It usually includes these features:

        | - Choice

        ( ) - Grouping

        { } * - Repetition - 0 or more

        [ ... ] - Optional
     

    Example: Identifier - a letter followed by 0 or more letters or digits:
     
    EBNF                    BNF
    I -> L { L | D }*       I -> L | L M
    L -> a | b | ...        M -> C M | C
    D -> 0 | 1 | ...        C -> L | D
                            L -> a | b | ...
                            D -> 0 | 1 | ...
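
    The EBNF rule I -> L { L | D }* translates almost directly into a loop. The sketch below assumes that L means the ASCII letters and D the digits 0 to 9 (the notes elide both with "..."), and the name is_identifier is made up for the example.

```python
import string

LETTERS = set(string.ascii_letters)   # assumption: L is the ASCII letters
DIGITS = set(string.digits)           # assumption: D is the digits 0..9

def is_identifier(s):
    """Check s against I -> L { L | D }*."""
    if not s or s[0] not in LETTERS:  # the mandatory leading L
        return False
    i = 1
    # The repetition { L | D }* becomes a loop that consumes 0 or more symbols.
    while i < len(s) and (s[i] in LETTERS or s[i] in DIGITS):
        i += 1
    return i == len(s)                # accept only if all input was consumed

for candidate in ["x", "x1y2", "1x", "a_b"]:
    print(candidate, is_identifier(candidate))
```

    The plain BNF version would instead need a recursive helper for M -> C M | C, which is why the EBNF shorthand is convenient without being more powerful.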


    Stages in translation

    1. lexical analysis (produces tokens). Also called scanning
      • Scanning divides the program into "tokens", which are the smallest meaningful units.
      • You can design a parser to take characters instead of tokens as input, but it isn't pretty.
      • The scanner may assign type tags to tokens, e.g., keywords, operators, identifiers.
      • The scanner also typically:
        • removes comments
        • produces a listing if desired
        • saves text of strings, identifiers, numbers
        • evaluates numeric constants (maybe)
        • tags tokens with line numbers, for good diagnostics in later phases
    2. syntactic analysis (produces a parse tree). Also called parsing
      • Parsing discovers the "context free" structure of the program. There are many different kinds of parsers. We will look at a Recursive Descent Parser next class if time permits.
    3. semantic analysis
      • The compiler actually does what is called STATIC semantic analysis. That's the meaning that can be figured out at compile time.
      • Static semantics includes things like
        • insertion of implicit information
        • error detection: recognize and produce informative messages
        • making sure identifiers are declared before use
        • type checking for assignments and operators
        • checking types and numbers of parameters to subroutines
        • making sure functions contain return statements
        • making sure there are no repeats among switch statement labels
    4. Intermediate Code Generation
      • After semantic analysis (if the program passes all checks), the compiler generates code in some intermediate form (IF).
      • Intermediate forms are often chosen for machine independence, ease of optimization, or compactness (these are somewhat contradictory). They often resemble machine code for some imaginary idealized machine; e.g. a stack machine, or a machine with arbitrarily many registers.
      • Many compilers actually move the code through more than one IF.
    5. Optimization
      • Optimization takes an intermediate-code program and produces another one that does the same thing faster, or in less space.
      • The optimization phase is optional.
    6. Code Generation
      • The code generation phase produces assembly language or (sometimes) relocatable machine language.
      • Occasionally (especially in pedagogical environments, or for scripting languages) it produces load-and-go absolute machine language.
    7. Machine-Specific Optimization
      • Certain machine-specific optimizations (use of special instructions or addressing modes, etc.) may be performed during or after target code generation.
      • For example, register allocation and instruction scheduling happen after target code generation.
    8. Symbol table
      1. All phases rely on a symbol table that keeps track of all the identifiers in the program and what the compiler knows about them.
      2. entry for each identifier
        • type of identifier: array, variable, subprogram name, parameter
        • type of values: integer, real
      3. This symbol table may be retained (in some form) for use by a debugger, even after compilation has completed.
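
    The first stages can be sketched end to end for a toy language of integers, +, * and parentheses. Everything specific here is an assumption made for illustration: the expression grammar, the token tags "num" and "op", the PUSH/ADD/MUL stack-machine instructions, and all function names. It omits semantic analysis, optimization and the symbol table, since this toy language has no identifiers.

```python
def scan(text):
    """Lexical analysis: characters -> list of (tag, value) tokens."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("num", int(text[i:j])))  # evaluate the constant
            i = j
        elif ch in "+*()":
            tokens.append(("op", ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r}")
    return tokens

def parse(tokens):
    """Syntactic analysis by recursive descent: tokens -> tree of tuples.
    Assumed grammar: expr -> term { + term }*, term -> factor { * factor }*,
    factor -> num | ( expr )."""
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else ("eof", None)
    def advance():
        nonlocal pos
        pos += 1
    def expr():
        node = term()
        while peek() == ("op", "+"):
            advance()
            node = ("+", node, term())
        return node
    def term():
        node = factor()
        while peek() == ("op", "*"):
            advance()
            node = ("*", node, factor())
        return node
    def factor():
        tag, value = peek()
        if tag == "num":
            advance()
            return ("num", value)
        if (tag, value) == ("op", "("):
            advance()
            node = expr()
            if peek() != ("op", ")"):
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError(f"unexpected token {value!r}")
    tree = expr()
    if peek()[0] != "eof":
        raise SyntaxError("trailing input")
    return tree

def generate(tree):
    """Intermediate code generation: tree -> code for an idealized stack machine."""
    if tree[0] == "num":
        return [("PUSH", tree[1])]
    op, left, right = tree
    return generate(left) + generate(right) + [("ADD" if op == "+" else "MUL", None)]

def run(code):
    """Interpret the stack-machine code, standing in for a real target machine."""
    stack = []
    for instr, arg in code:
        if instr == "PUSH":
            stack.append(arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr == "ADD" else a * b)
    return stack.pop()

code = generate(parse(scan("2 + 3 * (4 + 1)")))
print(run(code))
```

    Keeping each stage as a separate function mirrors the phase structure above: each one consumes only the previous stage's output.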

    Translators are grouped by the number of passes required. Often there are two passes: the first decomposes the program, and the second generates object code.

    An Example: ``Desk Calculator Language''

    Definition of the Language