CSCI 431 Lecture Notes - Data Types

How do we describe a programming language?

Syntax:
The syntax of a language must be structured enough that it can be automatically recognized by an implementation. This requires describing:
- Lexical Level:
  - literals
  - keywords
  - identifiers
  - punctuation
- Grammar Level:
  - combinations of lexical elements
  - This is done with syntax diagrams in the form of trees
Semantics:
- The semantic descriptions you have seen so far have probably been verbal. The idea of semantics in programming is "what is the effect of 'executing' a particular program statement?" Assertions and pre- and postconditions are an attempt to reflect semantic information in the text of a program.
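
As a rough illustration of assertions carrying semantic information, the sketch below states a precondition and a postcondition directly in the code. The function int_sqrt and its contract are invented for this example and are not part of the notes.

```c
#include <assert.h>

/* Hypothetical example: integer square root whose meaning is stated
   as assertions.
   Precondition:  n >= 0
   Postcondition: r*r <= n < (r+1)*(r+1) */
int int_sqrt(int n) {
    assert(n >= 0);                               /* precondition */

    int r = 0;
    /* Linear search keeps the sketch short; long long avoids overflow. */
    while ((long long)(r + 1) * (r + 1) <= n)
        r++;

    assert((long long)r * r <= n);                /* postcondition, lower bound */
    assert(n < (long long)(r + 1) * (r + 1));     /* postcondition, upper bound */
    return r;
}
```

A call such as int_sqrt(17) returns 4, and a call that violates the precondition (a negative argument) stops the program at the first assertion.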

How do we implement a programming language?

A language implementation provides a couple of very standard capabilities:

syntax checking:
- A program is input (usually from a source file) and its form is checked against the syntax of the language.
code generation:
- If the syntax of the program is in compliance with the syntax of the language, then the text of the program is translated into the target language (usually machine code or assembly language) and the resulting translated form is saved in a file (the object file).

A language implementation can take two forms:

Compiler:
- A source program is translated to machine code (or assembler language) and this code can be executed directly on the host machine.
Interpreter:
- A source program is translated to a new language which is executed by another program (the interpreter). Interpreters are often interactive, with the programming activity carried out in an environment which is more flexible.

The structure of a translator.

The translation process of a compiler or interpreter can be seen in several intertwined steps.
- Tokenizer:
  - The tokenizer is usually implemented as a function. When called it reads the input stream and stops when it has read the next lexical element - i.e., token. The tokenizer is called as needed by the parser.
- Parser:
  - The parser looks at tokens as needed and compares their order to that required by the language (it's actually easier than this sounds). If it runs into an unexpected token, it reports an error and either continues on (in some error recovery mode) or quits.
  - The parser has two other required activities which are carried out in anticipation of error free parsing:
    - Symbol Table:
      - Every time the parser sees a token which is an identifier, it makes an entry in the Symbol Table. Depending on the language syntax, making the entry is only done if the identifier is not already there.
    - Code Generation:
      - As executable program structures are encountered, appropriate code is generated, either into a table for further processing, or into an object file.
- This process is a simplistic view of a real translator. Often the process takes several passes through the source file. One pass may be used to check tokens and create a symbol table. A second pass can fill in the symbol table with appropriate attributes taken from the source code. A third pass generates code, and a final pass does some optimization on that generated code.
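
The tokenizer described above can be sketched as a single C function that the parser calls whenever it needs another token. This is a minimal sketch for a toy language: the token kinds, the Token struct, and the name next_token are assumptions made for the example, and keywords are not distinguished from identifiers.

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds for a toy language (an assumption for this sketch). */
typedef enum { TOK_IDENTIFIER, TOK_NUMBER, TOK_PUNCTUATION, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* first character of the token in the input */
    size_t length;       /* number of characters in the token */
} Token;

/* Reads the next token starting at *cursor and advances *cursor past it. */
Token next_token(const char **cursor) {
    const char *p = *cursor;
    Token tok;

    while (isspace((unsigned char)*p))            /* skip white space */
        p++;

    tok.start = p;
    if (*p == '\0') {
        tok.kind = TOK_END;                       /* end of input */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENTIFIER;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        tok.kind = TOK_PUNCTUATION;               /* any other single character */
        p++;
    }

    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

Called repeatedly on the input "x1 = 42;", this yields an identifier, the punctuation "=", a number, the punctuation ";", and finally the end-of-input token.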

Identifiers and Binding

The most interesting tokens are identifiers.
Identifiers are abstractions representing language entities or constructs
An identifier is the name of a
- variable,
- constant,
- function,
- type,
- parameter,
- or other programmer defined constructs.

Names are for human readers of programs: good names help understanding, bad names hinder it.

  /* Can you guess what this function is doing? */
  int factorial(int x, int sum) {
    return x + sum * previous;
  }

  /* How about with these names? */
  int next_height(int height, int velocity) {
    return height + velocity * ELAPSED_TIME;
  }

Every identifier must somehow be defined, either explicitly or implicitly. (In Fortran, a variable name that begins with a letter from I through N is implicitly an integer variable unless declared otherwise.)

Name Spaces

Flat Name Space - in early computer languages such as COBOL, there is only one name space in a program (so all names behave like global variables in C).
Scope - portion of code in which the name is meaningful
- scope rules govern to which name space a name belongs
Lifetime - the lifetime of a variable is the time between its birth (allocation) and its death (deallocation)
Note: The lifetime and scope are not always the same (consider a local static variable in C)
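
The local static case can be made concrete with a short C sketch. The function name call_count is made up for this example.

```c
/* Returns how many times the function has been called so far. */
int call_count(void) {
    static int count = 0;   /* scope: this function only;
                               lifetime: the whole program run */
    count++;
    return count;
}
```

The name count is meaningless outside call_count, yet the variable survives between calls, so successive calls return 1, 2, 3, and so on.
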
Example: C has a block-structured name space

   +---------------------+----global name space
   |int x;               |
   |                     |
   |main (...) {         |
   |  +----------------+------main name space
   |  |int x;          | |
   |  |  {             | |
   |  |  +-----------+--------block inside main name space
   |  |  |int x;     | | |
   |  |  |           | | |
   |  |  +-----------+ | |
   |  |  }             | |
   |  |  ...           | |
   |  +----------------+ |
   |  }                  |
   |                     |
   |foo  (...) {         |
   |  +----------------+------foo name space
   |  |int x;          | |
   |  |  ...           | |
   |  +----------------+ |
   |  }                  |
   +---------------------+
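
The nesting in the diagram can be observed directly. The following is a small sketch, with the function visible_x invented for this example, that records which x is visible at each point.

```c
int x = 1;                      /* global name space */

/* Stores in seen[] the value of whichever x is visible at four points. */
void visible_x(int seen[4]) {
    seen[0] = x;                /* no local x declared yet: the global x */

    int x = 2;                  /* function name space, shadows the global x */
    seen[1] = x;

    {
        int x = 3;              /* block name space, shadows the function's x */
        seen[2] = x;
    }

    seen[3] = x;                /* block has ended: the function's x again */
}
```

After a call, seen holds 1, 2, 3, 2: each declaration takes effect from its own position to the end of its enclosing block.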

Scope of Declarations and Blocks

Typically, the scope of a declaration is from the end of the declaration to the end of the enclosing name space or block.
A block can be loosely defined as a program phrase that delimits the scope of declarations that it might contain.
For example, Pascal has blocks which are structured via begin-end pairs, and C/C++ blocks are delimited by { } pairs. The bindings produced inside the block override or shadow the bindings of the outside environment.

   {--- Pascal example}
   procedure cat (...);

   var dog : Integer;  {--- override any external binding of "dog"}

   begin  {--- start of the block}       
      ...
   end;   {--- end of the block, "dog" returns to external binding (if any)}

Free vs. Bound variables

A variable in a program block can be thought of as free if it is not bound within the block (e.g., an access to a global variable from within a function is an access to a free variable). Consider the following C program fragment.

   int x = 3;

   void foo(int y) {

     x = y;             /* x is free, y is bound */
   }

It is important to identify free vs. bound variables since free variables can have different bindings/meanings each time the block is evaluated (the binding depends on something external to the function).
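
A short C sketch of this effect follows. The names rate and scale are hypothetical and chosen only for illustration.

```c
int rate = 2;              /* global: free inside scale() */

int scale(int y) {         /* y is bound: the function itself declares it */
    return y * rate;       /* rate is free: its binding comes from outside */
}
```

With rate equal to 2, scale(5) is 10. If some other code later assigns rate = 3, the same call scale(5) yields 15, even though nothing inside the function changed.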

More about Bindings

Every identifier has certain attributes associated with it.
- type (i.e., simple variable, array name, function name, formal parameter, etc.)
- type of value (i.e., integer, float, etc)
- memory location
- value
- component---binding of a data object to other data objects
- referencing environment
- potentially other information as well
Binding is the process of associating attributes with an identifier.
Binding time is the time in the lifetime of an identifier when a particular attribute is bound to it.
Some attributes are bound at translation time (e.g., type in Pascal), others are bound at execution time (e.g., value and address).
Commands and expressions cannot be interpreted in isolation; their interpretation depends on how the identifiers are bound. E.g., expressions like "N+1" and commands like "N=1" are valid only if N is declared as an integer (or maybe a float) variable.
It is also possible to further complicate matters given that the same identifier can be bound to different things at different points in the program, for example:
```
  int X;

  void foo (int X)
  {
    ...
  }

  int main() {
    int X;

    foo(X);
    ...
  }
```
Here the interpretation of X depends on the current bindings of X. X can refer to either a global variable, a variable local to main or a parameter of the function foo.

Environments

An environment is a set of bindings. Each expression and command is interpreted in a particular environment, and all identifiers occurring in the expression or command must have bindings in that environment.

An environment may be thought of as a mapping from identifiers to the entities that they denote. For example:

  int X;
 
  void foo (int X)
  {
(3)
    ...
  }
 
  int main() {
(1)
    int X;
(2) 
    foo(X);
    ...
  }

The environment at point (1) is:

  X -> a global integer variable
  foo -> a function abstraction

At point (2):

  X -> an integer variable local to main
  foo -> a function abstraction

At point (3):

  X -> a parameter of the function foo
  foo -> a function abstraction

Static and Dynamic Binding

There are two obvious interpretations of bindings, called static binding and dynamic binding. With static binding (or static scoping), the function body is evaluated in the environment of the function definition, also referred to as the lexically enclosing environment. This is what we have seen in the examples so far. With dynamic binding (or dynamic scoping), the function body is evaluated in the environment of the function call.

For example, consider the following Pascal code:

  const s = 2;
  const h = 10;

  function scaled (d : Integer) : Integer;
    begin
      scaled := d * s;
    end;

  function call_scaled(blah : Integer) : Integer;
    const s = 3;
    begin
      ... scaled(h);
    end;

  begin {---main}
    ...
  end.

If the above program uses static binding, s is evaluated at the point of the function definition (i.e. in the function scaled) and the value of scaled(h) is 20. If dynamic binding is used, s is evaluated at the point of the function call (i.e. in the function call_scaled) and the value of scaled(h) is 30.

With static binding, we can determine the bindings of identifiers at compile time, but with dynamic binding we must delay this until run-time. As such, most programming languages use static binding.
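
C itself uses static binding, so the difference has to be simulated. The sketch below models an environment as a chain of bindings and evaluates the body of scaled in either the definition environment or the call environment. The Env struct and all function names are assumptions made for this illustration, not how any particular compiler works.

```c
#include <stddef.h>
#include <string.h>

/* One binding plus a link to the enclosing environment.
   (One binding per node keeps the sketch short.) */
typedef struct Env {
    const char *name;
    int value;
    const struct Env *parent;
} Env;

/* Looks up a name, innermost environment first.
   Returns 0 for an unbound name, which is good enough for a sketch. */
int lookup(const Env *env, const char *name) {
    for (; env != NULL; env = env->parent)
        if (strcmp(env->name, name) == 0)
            return env->value;
    return 0;
}

/* Body of "scaled": d * s, with s taken from the given environment. */
int scaled_body(int d, const Env *env) {
    return d * lookup(env, "s");
}

/* Static binding: the body sees the environment of the definition. */
int call_static(int d, const Env *definition_env, const Env *caller_env) {
    (void)caller_env;
    return scaled_body(d, definition_env);
}

/* Dynamic binding: the body sees the environment of the call. */
int call_dynamic(int d, const Env *definition_env, const Env *caller_env) {
    (void)definition_env;
    return scaled_body(d, caller_env);
}
```

If the global environment binds s to 2 and h to 10, and the environment inside call_scaled adds a binding of s to 3 on top of it, then call_static gives 20 and call_dynamic gives 30 for the argument h, matching the two results discussed above.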

The Run-Time Stack

storage for objects with nested lifetimes (such as function parameters and local variables)
very useful in recursion
central concept is the stack frame (also called activation record)
- [Figure: visual depiction of a frame]
- parameters: function parameters
- return address: where to resume execution when the function terminates
- dynamic link: pointer to the caller's stack frame
- static link: pointer to the lexical parent's frame (for nested functions)
- return value: where to put the return value
- local variables: the function's local variables
- also a local work space, which is for temporary storage of results
function is called - push stack frame
function is exited - pop stack frame
[Figure: visual depiction of stack calls]

an example:

  #include <stdlib.h>

  int foo(int z);         /* declared before its use in main */

  int x;                  /* storage for global variables */

  int main(void) {
    int y;                /* stack storage for local variables */
    char *str;            /* stack storage for local variables */

    str = malloc(100);    /* allocates 100 bytes of dynamic heap storage */

    y = foo(23);
    free(str);            /* deallocates 100 bytes of dynamic heap storage */
    return 0;
  }                       /* y and str deallocated and stack frame popped */

  int foo(int z) {        /* z is allocated stack storage */
    char ch[100];         /* ch is allocated stack storage */

    if (z == 23)
      foo(7);

    return 3;             /* z and ch are deallocated as stack frame is
                             popped, 3 is put on top of the stack */
  }

[Figures: the stack at the start of the program, after the first call to foo, and after the second call to foo]
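
The push and pop behaviour for this example can be traced with an explicit, simulated stack. This is only a sketch: the Frame struct is a heavily reduced activation record, and the names traced_main, traced_foo, push_frame, and pop_frame are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A very reduced activation record: function name and argument only. */
typedef struct {
    const char *function;
    int argument;
} Frame;

#define MAX_FRAMES 16

Frame frames[MAX_FRAMES];   /* the simulated run-time stack */
size_t depth = 0;           /* frames currently on the stack */
size_t deepest = 0;         /* largest depth reached so far */

void push_frame(const char *function, int argument) {
    assert(depth < MAX_FRAMES);
    frames[depth].function = function;
    frames[depth].argument = argument;
    depth++;
    if (depth > deepest)
        deepest = depth;
}

void pop_frame(void) {
    assert(depth > 0);
    depth--;
}

int traced_foo(int z) {
    push_frame("foo", z);          /* function is called: push */
    if (z == 23)
        traced_foo(7);             /* second frame for foo on top of the first */
    pop_frame();                   /* function is exited: pop */
    return 3;
}

int traced_main(void) {
    push_frame("main", 0);
    int y = traced_foo(23);
    pop_frame();
    return y;
}
```

Running traced_main leaves deepest at 3 (frames for main, foo(23), and foo(7) were live at the same time) and depth back at 0 once every call has returned.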

An Aside on Cleanup

Storage objects in stack storage die (are deallocated) when the stack frame is popped.

Storage objects in heap storage can be explicitly killed (as in C), but in other languages they are implicitly killed (garbage collected) when they are no longer referenced. For example:

  /* C example */
  char *str;
  str = malloc(100);  /* allocate a 100 byte storage object in heap storage
                         and put the pointer to it into str */
  free(str);          /* explicitly release the allocated storage object */

It is also possible to end up with pointers to dead objects. These pointers are called dangling references. For example:

  /* C example */
  char *str;
  str = malloc(100);  /* allocate a 100 byte storage object in heap storage
                         and put the pointer to it into str */
  free(str);          /* explicitly release the allocated storage object */
  *str = 'h';         /* str is now a dangling reference!  This instruction
                         attempts to put the character 'h' into what is
                         pointed at by str, but str points to storage
                         that is no longer allocated to the program */
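
One common defensive habit, shown here as a sketch rather than a complete cure, is to clear the pointer at the moment the object is freed. The helper name release is hypothetical.

```c
#include <stdlib.h>

/* Frees the object that *p points to and then clears the pointer, so
   later code can test for NULL instead of touching a dead object. */
void release(char **p) {
    free(*p);      /* free(NULL) is allowed and does nothing */
    *p = NULL;
}
```

After release(&str), str is NULL rather than dangling. This only protects the one pointer passed in: any other copy of the old address is still a dangling reference.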

Data Types

The store or memory is a collection of data values at a particular moment in the execution of the program.
It is a collection of bits and can be thought of as a series of 1's and 0's
Languages provide the user with certain basic data types
- integer
- real
- character
- boolean
- pointer
The specification of a data type consists of:
- the attributes
- the values
- the operations
Storage in a virtual computer can be organized into composite data types (also known as data objects, or structured data types), variables, and constants

Composite Data Types

A value of a composite type consists of components that may be inspected selectively, and these components in turn may have components that may be inspected selectively.

Features:

occupy more than one cell

  # Perl
  (3, 4, 5)  # creates a list with three elements
             # one element in each cell

Computer hardware/software is geared towards efficient manipulation of single memory locations (or cells), so languages typically allow single components of a composite value to be updated selectively.
```
  /* C example */
  int int_array[] = {1,2,3,4,5};  /* int_array has 5 items - 1,2,3,4,5 */
  int_array[2] = 6;               /* int_array now contains  1,2,6,4,5 */
```

Can composite variables grow (and occupy more cells) during run time? Yes, and it is often very useful, for example when the maximum array size is not known in advance. We now distinguish between three different kinds of arrays:

static array - the index set is fixed at compile time

  -- Ada example
  type Real_Array is array (1 .. 10) of Float;

  A : Real_Array;

  A(1)  := 3.0;
  A(11) := 4.0;  -- error - out of bounds!

dynamic array - the index set is fixed when the array variable is created. This can be done in languages such as PostScript and Ada.

  -- Ada example

  procedure Bar (M : Integer) is
    subtype Index is Integer range 1 .. 10;
    type Real_Array is array (Index range <>) of Float;

    A : Real_Array (2 .. M);  -- uses the current value of M
  begin
    A(4)  := 3.0;  -- maybe out of bounds!
    A(11) := 4.0;  -- error - out of bounds!
  end Bar;

The advantage of this is that the array size can be set "on the fly", but it is still subject to a maximum size.

flexible array - the index set is not fixed at all!

  // the mighty Java
  Vector any_array = new Vector();  // initial capacity 10 (by default)

  any_array.addElement(1);
  any_array.addElement(2);
  ...
  any_array.addElement(10);
  any_array.addElement(11);  // capacity exceeded - Vector doubles its
                             // capacity (doubles by default)

Perl also has flexible arrays.
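
C has no built-in flexible array, but the doubling strategy can be sketched by hand with realloc. The IntVector type and the vector_ functions below are invented for this example. The starting capacity of 10 simply mirrors the Java example above.

```c
#include <stdlib.h>

typedef struct {
    int *items;
    size_t count;      /* elements in use */
    size_t capacity;   /* elements allocated */
} IntVector;

/* Starts empty; storage is allocated on the first append. */
void vector_init(IntVector *v) {
    v->items = NULL;
    v->count = 0;
    v->capacity = 0;
}

/* Appends a value, doubling the capacity when the array is full.
   Returns 1 on success and 0 if memory could not be obtained. */
int vector_append(IntVector *v, int value) {
    if (v->count == v->capacity) {
        size_t new_capacity = v->capacity == 0 ? 10 : v->capacity * 2;
        int *grown = realloc(v->items, new_capacity * sizeof *grown);
        if (grown == NULL)
            return 0;              /* the old storage is still valid */
        v->items = grown;
        v->capacity = new_capacity;
    }
    v->items[v->count++] = value;
    return 1;
}

/* Releases the storage and returns the vector to its empty state. */
void vector_free(IntVector *v) {
    free(v->items);
    vector_init(v);
}
```

Appending the values 1 through 11 leaves count at 11 and capacity at 20: the eleventh append finds the array full and doubles it before storing the value.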

Implementation Issues

The attributes for vectors and arrays are:
- upper and lower bounds
- data type of each component (its size)
- the address of the virtual origin (element [0], or [0,0] for a 2-dimensional array)
These attributes may be stored in a descriptor called a dope vector
For 1-dimensional arrays the address of element A[i] is computed from the virtual origin (V.O.) and the element size (E.S.):
```
address of A[i] = V.O. + i x E.S.
```
Multi-dimensional arrays are stored in either row-major (the most common) or column-major order
- In row-major order the matrix is considered a vector in which each element is a sub-vector representing one row in the original matrix.
- In column-major order the matrix is considered a vector in which each element is a sub-vector representing one column in the original matrix.
- The access formula follows directly from the storage configuration
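
The access formulas can be written out as small C functions. This sketch assumes zero-based indices, so the virtual origin is the address of element [0] or [0,0]. The function names are made up for the example.

```c
#include <stddef.h>

/* Address of A[i] in a 1-dimensional array:
   virtual origin + i * element size. */
size_t address_1d(size_t virtual_origin, size_t i, size_t element_size) {
    return virtual_origin + i * element_size;
}

/* Row-major: rows are stored one after another, so skipping i rows
   skips i * columns elements before moving j elements into the row. */
size_t address_row_major(size_t virtual_origin, size_t i, size_t j,
                         size_t columns, size_t element_size) {
    return virtual_origin + (i * columns + j) * element_size;
}

/* Column-major: columns are stored one after another, so skipping j
   columns skips j * rows elements before moving i elements down. */
size_t address_column_major(size_t virtual_origin, size_t i, size_t j,
                            size_t rows, size_t element_size) {
    return virtual_origin + (j * rows + i) * element_size;
}
```

C stores its own 2-dimensional arrays in row-major order, so for int a[3][4] the byte offset of a[1][2] from a[0][0] equals address_row_major(0, 1, 2, 4, sizeof(int)).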

Type Checking

Type checking ensures that nonsensical operations are prevented (e.g., multiplication of a character by a boolean is not allowed; both sides of a logical operation must be booleans, etc.)

Type errors are a common kind of programmer error, especially when pointers are involved.

  /* C example */
  void swap(int *x, int *y) {
    ...
  }
  ...
  int a,b;
  swap(a, b);  /* type mismatch error! */
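
For comparison, here is a sketch of a version that passes the type check. The notes elide the body of swap, so the three-step exchange below is an assumption, though it is the usual way to write it.

```c
/* Exchanges the two integers that x and y point to.
   (The body is assumed; the notes leave it out.) */
void swap(int *x, int *y) {
    int tmp = *x;
    *x = *y;
    *y = tmp;
}
```

The call must supply pointers to match the parameter types, so with int a, b; the correct form is swap(&a, &b).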

When are types checked?

compile-time: also known as "static typing". Every variable and parameter has a fixed type that is chosen by the programmer, and these can only take on values of the given type. Alleviates run-time overhead.
run-time: also known as "dynamic typing". Variables and parameters have no designated type. This allows much greater flexibility, but type checking must be performed at run-time, making execution slower.
A language is strongly typed if type checking rules are enforced at both compile and run time. In a strongly typed language each object belongs to exactly one type and all type conversions are explicit.

In languages without declarations, the variables are sometimes said to be typeless, since they have no fixed type.

  # Perl example

  $x = "Hello World\n";		# x has a value of type string
  $x = 3;				# x has a value of type integer
  $z = "5";				# z has a value of type string

  $x = $z + $x;			# convert z into an integer and add
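
To see where the run-time cost of dynamic typing comes from, the sketch below imitates it in C with a tagged value: every value carries its type, and each operation inspects the tag before it proceeds. The Value type and the function names are invented for this illustration and do not describe how Perl is actually implemented.

```c
#include <stdlib.h>

typedef enum { TYPE_INTEGER, TYPE_STRING } TypeTag;

typedef struct {
    TypeTag tag;                 /* the type travels with the value */
    union {
        long integer;
        const char *string;
    } as;
} Value;

Value make_integer(long n) {
    Value v;
    v.tag = TYPE_INTEGER;
    v.as.integer = n;
    return v;
}

Value make_string(const char *s) {
    Value v;
    v.tag = TYPE_STRING;
    v.as.string = s;
    return v;
}

/* Run-time type check: integers pass through, strings are converted. */
long to_integer(Value v) {
    if (v.tag == TYPE_INTEGER)
        return v.as.integer;
    return strtol(v.as.string, NULL, 10);
}

/* Addition checks and converts both operands on every call. */
Value add(Value a, Value b) {
    return make_integer(to_integer(a) + to_integer(b));
}
```

Mirroring the Perl lines above, a variable x can first hold make_string("Hello World\n") and then make_integer(3); with z holding make_string("5"), add(z, x) produces an integer value of 8.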

How are types checked - What is type equivalence?

There are two categories of type equivalence:

Name Equivalence: Two variables have the same type only if their types have exactly the same name.

  typedef int new_int;

  int z;
  new_int y;

  z = y;  /* not name equivalent, so fails type check! */

Structural Equivalence: Two variables have the same type if their types have the same structure (that is, they describe the same set of values). The check compares the structure of the types rather than their names.

  typedef int new_int;

  typedef struct {
    int value;
  } simple_type;

  typedef struct {
    int value;
  } another_type;

  typedef struct {
    int value;
    char kind;
  } a_third_type;

  int z;
  new_int x;
  simple_type var_one;
  another_type var_two;
  a_third_type var_three;

  z=x;	/* OK by structure, not by name */
  var_one = var_two;	/* OK by structure, not by name */
  var_one = var_three;  /* not the same structure */
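
The examples above use C syntax to illustrate the two rules, not to describe what a C compiler does. As a hedged aside: C treats a typedef as just another name for the same type, while two separately declared struct types are distinct even when their members match. The sketch below (function names invented) shows the forms a C compiler accepts.

```c
typedef int new_int;

typedef struct { int value; } simple_type;
typedef struct { int value; } another_type;

/* A typedef only introduces another name for int, so a C compiler
   accepts this assignment without any conversion. */
int copy_from_alias(new_int y) {
    int z = y;
    return z;
}

/* simple_type and another_type are distinct types to a C compiler, so
   "dst = src" would be rejected. Copying member by member is the
   explicit conversion. */
simple_type to_simple(another_type src) {
    simple_type dst;
    dst.value = src.value;
    return dst;
}
```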

An example similar to your homework

The program:


PROGRAM MAIN(INPUT, OUTPUT) ;

  PROCEDURE P(V: INTEGER) ;
  VAR
     X: INTEGER ;

    PROCEDURE Q(V: INTEGER) ;
    BEGIN
      X := V ;
      WRITELN('In Q -- X = ', X:2, ', V = ', V:2) ;
    END ;

    PROCEDURE R(V: INTEGER) ;
      VAR
        X: INTEGER ;

      PROCEDURE S(V: INTEGER) ;
      BEGIN
         X := V ;
         Q(V+1) ;
         WRITELN('In S -- X = ', X:2, ', V = ', V:2) ;
      END ;
  
    BEGIN
      X := V ;
      S(V+1) ;
      WRITELN('In R -- X = ', X:2, ', V = ', V:2) ;
    END ;

  BEGIN
    X := V ;
    R(V+1) ;
    WRITELN('In P -- X = ', X:2, ', V = ', V:2);
    IF (V < 45000) THEN
       P(48048) ;
  END ;

BEGIN
   P(43680) ;
END .

The output of this Program:


In Q -- X = 43683, V = 43683
In S -- X = 43682, V = 43682
In R -- X = 43682, V = 43681
In P -- X = 43683, V = 43680
In Q -- X = 48051, V = 48051
In S -- X = 48050, V = 48050
In R -- X = 48050, V = 48049
In P -- X = 48051, V = 48048