CSCI 431: Structured Data Types
A Brief Review
Two components required to define a data type:
- Set of objects in the type (domain of values)
- Set of applicable operations
Data Types may be determined:
- Statically (at compile time)
- Dynamically (at run time)
A language's data types may be:
- Built-in (i.e., elementary data types)
- Programmer-defined (e.g., structured data types)
A declaration explicitly associates an identifier with a type
(and thus representation)
Benefits offered by data types:
- Serve as documentation
- Improve safety and correctness
- Specify interfaces for separate compilation
- Specify abstract data types
Arrays
An array is an aggregate of homogeneous data elements in which an
individual element is identified by its position
Design Issues:
- What types are legal for subscripts?
- Are subscript values range checked?
- When are subscript ranges bound?
- When does allocation take place?
- What is the maximum number of subscripts?
- Can array objects be initialized?
- Is any kind of slice allowed?
- What other high-level operations are provided?
Indexing is a mapping from indices to elements
- Syntax
- FORTRAN, PL/I, Ada use parentheses
- Most others use brackets
What type(s) are allowed for defining array subscripts?
- FORTRAN, C--int only
- Pascal--any ordinal type (int, boolean, char, enum)
- Ada--int or enum (including boolean and char)
- Java--integer types only
Binding-Time Classifications
Four categories, based on subscript binding and storage binding:
- Static
- Fixed stack-based
- Stack-based
- Heap-based
Static Arrays
- Range of subscripts and storage bindings are static
- Examples: FORTRAN 77, global arrays in C++, some arrays in Ada
- Advantage:
- Execution efficiency (no allocation or deallocation)
- Disadvantages:
- Size must be known at compile time
- Bindings are fixed for entire program
Fixed Stack-based Arrays
- Range of subscripts is statically bound, but storage is bound at
run time, each time the enclosing function is entered
- Examples: Pascal locals, C/C++ locals that
are not static
- Advantages:
- Space efficiency
- Supports recursion
- Disadvantage:
- Must know size at compile time
Stack-based Arrays
- Range and storage are dynamic, but fixed from then on for the
variable's lifetime
- Examples: Ada locals in a procedure or block
- Advantage:
- Flexibility--size need not be known until the array is about
to be used
- Disadvantage:
- Once created, array size is fixed for lifetime
Heap-based Arrays
- Subscript range and storage bindings are dynamic and not fixed
- Examples: FORTRAN 90, APL, Perl
- Advantage:
- Flexibility--arrays can grow and shrink during execution
- Disadvantages:
- More space required
- Run-time overhead
Number of Array Subscripts
- FORTRAN I allowed up to three
- FORTRAN 77 allows up to seven
- C, C++, and Java allow just one, but elements can be arrays
Array Initialization
- Usually just a list of values that are put in the array in the
order in which the array elements are stored in memory
- Examples:
- FORTRAN--uses the DATA statement, or put the values in /
... / on the declaration
- C and C++--put the values in braces; can let the compiler count
them (int stuff [] = {2, 4, 6, 8};)
- Ada--positions for the values can be specified:
SCORE : array (1..14, 1..2) of INTEGER :=
(1 => (24, 10), 2 => (10, 7),
3 =>(12, 30), others => (0, 0));
- Pascal and Modula-2 do not allow array initialization
Array Operations
- APL has lots
- Ada
- Assignment; RHS can be an aggregate, array name, or slice
(LHS can also be a slice)
- Catenation for single-dimensioned arrays
- Equality/inequality operators (= and /=)
- FORTRAN 90
- Intrinsics (subprograms) for a wide variety of array
operations (e.g., matrix multiplication, vector dot product)
Array Slices
- A slice is some substructure of an array; nothing more than a
referencing mechanism
- Slice Examples:
Implementation of Arrays
- Packed storage - components are placed sequentially, without
regard for placing each component at the beginning of an
addressable word.
- May save substantial storage - but pay in access time.
- Row major (by rows) or column major order (by columns)
- Access function maps subscript expressions to an address in the
array
Rewriting the access equation:
L-value(A[I,J]) = alpha - d1*L1 - d2*L2 + I*d1 + J*d2
Set I = 0 and J = 0:
L-value(A[0,0]) = alpha - d1*L1 - d2*L2 + 0*d1 + 0*d2
L-value(A[0,0]) = alpha - d1*L1 - d2*L2, which is a constant.
Call this constant the virtual origin (VO). It represents the
address that element A[0,0] would have, whether or not the array
actually contains such an element.
L-value(A[I,J]) = VO + I*d1 + J*d2
To access an array element, typically use a dope vector:
SUMMARY:
- Allocate storage beginning at alpha :
(U2-L2+1)*(U1-L1+1)*element_size
- d2 = element_size
- d1 = (U2-L2+1)*d2
- VO = alpha - L1*d1 - L2*d2
- To access A[I,J]: Lvalue(A[I,J]) = VO + I*d1 + J*d2
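The summary above can be sketched in a few lines of Python. This is only an illustration of the row-major formulas, not a description of any particular compiler: the function names, the dictionary standing in for the dope vector, and the sample bounds and base address are all assumptions made for the example.

```python
def make_dope_vector(alpha, l1, u1, l2, u2, element_size):
    """Precompute the row-major access constants for A[l1..u1, l2..u2]."""
    d2 = element_size                 # distance between adjacent columns
    d1 = (u2 - l2 + 1) * d2           # distance between adjacent rows
    vo = alpha - l1 * d1 - l2 * d2    # virtual origin
    size = (u2 - l2 + 1) * (u1 - l1 + 1) * element_size
    return {"vo": vo, "d1": d1, "d2": d2, "size": size,
            "l1": l1, "u1": u1, "l2": l2, "u2": u2}


def lvalue(dv, i, j):
    """Return the address of A[i, j], with subscript range checking."""
    if not (dv["l1"] <= i <= dv["u1"] and dv["l2"] <= j <= dv["u2"]):
        raise IndexError(f"subscript ({i}, {j}) out of range")
    return dv["vo"] + i * dv["d1"] + j * dv["d2"]


# Hypothetical array A[1..3, 1..4] of 4-byte elements stored at address 1000.
dv = make_dope_vector(alpha=1000, l1=1, u1=3, l2=1, u2=4, element_size=4)
first = lvalue(dv, 1, 1)   # address of the first element, equal to alpha
last = lvalue(dv, 3, 4)    # address of the last element
```

Because VO, d1, and d2 are computed once when the array is created, each access costs two multiplications and two additions, plus the optional range check.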
Associative Arrays
An associative array is an unordered collection of data elements
that are indexed by an equal number of values called keys
Design Issues:
- What is the form of references to elements?
- Is the size static or dynamic?
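As one possible answer to both design issues, here is a small sketch using Python's built-in dict, chosen only for illustration. References use ordinary subscript syntax with a key in place of an index, and the size is dynamic. The names and salary figures are made up for the example. Note that Python dicts happen to iterate in insertion order; the sketch relies only on lookup by key.

```python
# Keys are strings; values are hypothetical salaries.
salaries = {"Cedric": 75000, "Perry": 57000}

salaries["Mary"] = 55750           # inserting a new key grows the table at run time
perry = salaries["Perry"]          # a reference looks like indexing, with a key as subscript
del salaries["Cedric"]             # deleting a key shrinks the table
has_cedric = "Cedric" in salaries  # membership is tested by key
```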
Perl Associative Arrays
Character Strings
Design Issues:
- Are strings primitive or composite?
- Should they have fixed or variable length?
Are strings primitive or composite?
- Composite (arrays) in Pascal, Ada, Modula-2, ...
- Primitive in ML
- Lists in Miranda and Prolog: provides more flexibility (no length
bound)
Fixed Length or Variable Length
- Fixed length
char A[10] - C
DCL B CHAR(10) - PL/I
Used in COBOL, Ada, Fortran, Pascal
- Variable length to a declared bound
DCL D CHAR(20) VARYING - PL/I - 0 to 20 characters
F = "ABCDEFG" - C - any size; a terminating '\0' marks the end
- Unbounded length
Used in Perl and SNOBOL
Operations: relational operators, assignment, length, substr,
concatenation, (many others)
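To make the middle category concrete, here is a rough Python sketch of a string that varies in length up to a declared bound, in the spirit of CHAR(20) VARYING. The class name, the choice to keep the current length next to a fixed buffer, and the decision to truncate over-long values are all assumptions of this sketch, not a description of how any particular language implements it.

```python
class VaryingString:
    """A string whose length may vary from 0 up to a declared bound."""

    def __init__(self, bound, text=""):
        self.bound = bound
        self.buffer = [" "] * bound   # storage is allocated once, at the declared bound
        self.length = 0               # current length, kept alongside the characters
        self.assign(text)

    def assign(self, text):
        # Assumption of this sketch: text longer than the bound is truncated.
        text = text[: self.bound]
        self.buffer[: len(text)] = text
        self.length = len(text)

    def value(self):
        # Only the first `length` characters of the buffer are meaningful.
        return "".join(self.buffer[: self.length])


d = VaryingString(20, "ABCDEFG")
short = (d.length, d.value())
d.assign("X" * 30)                    # longer than the bound of 20
clipped = d.length
```

A fixed-length string needs no length field at all, while an unbounded string cannot use a preallocated buffer like this one and must reallocate or chain storage as it grows.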
String Implementation (Storage Representation)
Records
A record is a possibly heterogeneous aggregate of data elements in
which the individual elements are identified by names
Design Issues:
- What is the form of references?
- What unit operations are defined?
Record Definition Syntax
- COBOL uses level numbers to show nested records; others use
recursive definitions
Record Field References
Record Operations
- Assignment
- Pascal, Ada, and C++ allow it if the types are identical
- Initialization
Allowed in Ada, using an aggregate
- Comparison
In Ada, equal and not equal
- MOVE CORRESPONDING
In COBOL--it moves all fields in the source record to fields
with the same names in the destination record
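The effect of MOVE CORRESPONDING can be approximated in Python by modeling each record as a dict from field name to value. This is a simplified sketch: the function name and the sample records are hypothetical, and nested group items are not handled.

```python
def move_corresponding(source, destination):
    """Copy each source field that has a same-named field in destination."""
    for name, value in source.items():
        if name in destination:
            destination[name] = value


# Hypothetical records, modeled as dicts from field name to value.
input_record = {"name": "Ada", "hours": 40, "rate": 31.5}
output_record = {"name": "", "rate": 0.0, "net_pay": 0.0}
move_corresponding(input_record, output_record)
```

After the call, output_record holds the name and rate from input_record; net_pay is untouched, and hours is not copied because the destination has no field of that name.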
Comparing Records and Arrays
- Access to array elements is slower than access to record fields,
because subscripts are dynamic (field names are static)
- Dynamic subscripts could be used with record field access, but it
would disallow type checking and be much slower
Unions
A union is a type whose variables are allowed to store different
type values at different times during execution
Implementation of unions:
- Use of a tag/discriminant field
- Pascal and Modula-2: the tag can be assigned by itself, and tags
are not mandatory
- Ada: tags are mandatory and change only through whole-record
assignment
Design Issues for unions:
- What kind of type checking, if any, must be done?
- Should unions be integrated with records?
Union Examples
Union Type-Checking Problems
- Problem with Pascal's design: type checking is ineffective
- Reasons:
- User can create inconsistent unions (because the tag can be
individually assigned)
var blurb : intreal;
x : real;
blurb.tagg := true; { it is an integer }
blurb.blint := 47; { ok }
blurb.tagg := false; { it is a real }
x := blurb.blreal; { oops! }
Also, the tag is optional!
Ada Discriminated Unions
- Reasons they are safer than Pascal & Modula-2:
- Tag must be present
- All assignments to the union must include the tag value--tag
cannot be assigned by itself
- It is impossible for the user to create an inconsistent
union
- An Ada example:
type Transport is (Auto, Truck, Bus);
type Vehicle (Kind : Transport := Auto) is
record
EngineSize : Float;
case Kind is
when Auto =>
AirCon : Boolean;
AutoTrans : Boolean;
GasMilage : Float;
when Truck =>
Tons : Float;
Axles : Integer;
when Bus =>
Capacity : Integer;
ModelYr : Integer;
end case;
end record;
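The same discipline can be imitated in Python, as a rough sketch only. The frozen dataclass, the VARIANT_FIELDS table, and the get method are assumptions made for this illustration. Field names follow the Ada example, and the checks happen at run time, much as Ada checks variant access at run time.

```python
from dataclasses import dataclass, field

# Fields that are legal for each value of the discriminant.
VARIANT_FIELDS = {
    "Auto": {"AirCon", "AutoTrans", "GasMilage"},
    "Truck": {"Tons", "Axles"},
    "Bus": {"Capacity", "ModelYr"},
}


@dataclass(frozen=True)
class Vehicle:
    kind: str                      # the tag (discriminant)
    engine_size: float             # field common to every variant
    variant: dict = field(default_factory=dict)

    def __post_init__(self):
        # The tag and the variant fields are checked together at construction.
        if set(self.variant) != VARIANT_FIELDS[self.kind]:
            raise TypeError(f"fields {sorted(self.variant)} do not match tag {self.kind}")

    def get(self, name):
        # Every variant access is checked against the current tag.
        if name not in VARIANT_FIELDS[self.kind]:
            raise TypeError(f"{name} is not a field of a {self.kind}")
        return self.variant[name]


truck = Vehicle("Truck", 6.7, {"Tons": 2.5, "Axles": 3})
axles = truck.get("Axles")

try:
    truck.kind = "Bus"             # the tag cannot be assigned by itself
    tag_changed = True
except AttributeError:             # FrozenInstanceError is a subclass of AttributeError
    tag_changed = False
```

To change the kind of vehicle, the program has to build a whole new Vehicle value, which mirrors the whole-record assignment rule.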
Free Unions
Evaluation of Unions
- Useful
- Potentially unsafe in most languages
- Ada, Algol 68 provide safe versions
Sets
- A set is a type whose variables can store unordered collections
of distinct values from some ordinal type
- Design Issue:
- What is the maximum number of elements in the base type of a
set?
Set Examples
- Pascal
- no maximum size in the language definition, so each
implementation sets its own (not portable; poor writability if
the max is too small)
- Operations: union (+), intersection (*), difference (-), =,
<>, superset (>=), subset (<=), in
- Modula-2 and Modula-3
- Additional operations: INCL, EXCL, / (symmetric set
difference (elements in one but not both operands))
- Ada--does not include sets, but defines set membership
operator for all enumeration types and subrange expressions
- Java includes a class for set operations
Evaluation of Sets
- If a language does not have sets, they must be simulated, either
with enumerated types or with arrays
- Arrays are more flexible than sets, but have much slower
operations
Set implementation:
- Usually stored as bit strings and use logical operations for
the set operations
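A minimal sketch of the bit-string representation, using Python integers as the bit strings. The base type (a small list of color names) and the helper names are made up for the example. Each possible element is assigned one bit position, and every set operation becomes one or two logical operations.

```python
# Hypothetical base type: an enumeration whose values map to bit positions.
COLORS = ["red", "orange", "yellow", "green", "blue"]
BIT = {name: 1 << position for position, name in enumerate(COLORS)}


def make_set(*names):
    """Build a bit string with one bit set per member."""
    bits = 0
    for name in names:
        bits |= BIT[name]
    return bits


def members(bits):
    """Decode a bit string back into the member names."""
    return [name for name in COLORS if bits & BIT[name]]


a = make_set("red", "yellow", "blue")
b = make_set("yellow", "green")

union = a | b                        # logical OR
intersection = a & b                 # logical AND
difference = a & ~b                  # AND with the complement
symmetric_difference = a ^ b         # exclusive OR
is_member = bool(a & BIT["blue"])    # membership test: a single AND
is_subset = (b & a) == b             # b <= a
```

This is why the size of the base type matters: a set fits in one machine word only if the base type has no more values than the word has bits.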
The ultimate in bad manners: Pointers
C.A.R. Hoare (1973): "Their introduction into high-level
languages has been a step backward from which we may never
recover."
A pointer type is a type in which the range of values consists of
memory addresses and a special value, nil (or null)
Uses:
- Addressing flexibility
- Dynamic storage management
Pointers vs. References
- A reference is a "safe" pointer
- reference to an object
- no pointer arithmetic
- no unsafe casts
- no dangling references
- Typical example of pointers is C
- Typical example of references is Java
- Beware: C++ references are not references in this sense (they
can still dangle)
Pointer Design Issues
- What is the scope and lifetime of pointer variables?
- What is the lifetime of heap-based variables?
- Are pointers restricted to pointing at a particular type?
- Are pointers used for dynamic storage management, indirect
addressing, or both?
- Should a language support pointer types, reference types, or
both?
Fundamental Pointer Operations
- Assignment of an address to a pointer
- Dereferencing (explicit versus implicit)
Problems with Pointers
- Dangling pointers
- Memory leaks
- Double deallocation/heap corruption
- Aliasing
Dangling Pointers
- A pointer points to a heap-based variable that has been
deallocated
- Creating one:
- Allocate a heap-based variable and set a pointer to point
at it
- Set a second pointer to the value of the first pointer
- Deallocate the heap-based variable using the first
pointer; the second pointer now dangles
Memory Leaks
- Lost heap-based variables (garbage):
- A heap-based variable that is no longer referenced by any
program pointer
- Creating one:
- Pointer p1 is set to point to a newly created heap-based
variable
- p1 is later set to point to another newly created
heap-based variable, leaving the first one unreachable
- The process of losing heap-based variables is called memory
leakage
Double Deallocation
- When a heap-based object is explicitly deallocated twice
- Usually corrupts the data structure(s) the run-time system uses
to maintain free memory blocks (the heap)
- A special case of dangling pointers
Aliasing
- When two pointers refer to the same address
- Pointers need not be in the same program unit
- Changes made through one pointer affect the behavior of code
referencing the other pointer
- When unintended, may cause unpredictability, loss of locality of
reasoning
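Aliasing is not limited to languages with explicit pointers. The short Python sketch below uses object references rather than raw addresses, and its names are invented for the example. It shows a change made through one name in one program unit showing up through another name elsewhere.

```python
def clear_scores(scores):
    """Empty the list in place; every alias of the list sees the change."""
    scores.clear()


grades = [90, 85, 77]
backup = grades            # a second reference to the same object, not a copy
snapshot = list(grades)    # an independent copy

clear_scores(grades)       # the change is made inside another program unit
same_object = backup is grades
```

After the call, backup is empty as well, even though no statement mentions it, while snapshot still holds the original values. A reader who looks only at the code that uses backup cannot predict its contents, which is the loss of locality of reasoning described above.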
Pointer Examples
- Pascal:
- Used for dynamic storage management only
- Explicit dereferencing
- Dangling pointers and memory leaks are possible
- Ada:
- a little better than Pascal and Modula-2
- Implicit dereferencing
- All pointers are initialized to null
- Similar dangling pointer and memory leak problems for
typical implementations
- C and C++:
- for both dynamic storage management and addressing
- Explicit dereferencing and address-of operator
- Can do address arithmetic in restricted forms
- Domain type need not be fixed (void *)
- C++ reference types
- Constant pointers that are implicitly dereferenced
- Typically used for parameters
- Advantages of both pass-by-reference and pass-by-value
- FORTRAN 90 Pointers
- Can point to heap and non-heap variables
- Implicit dereferencing
- Special assignment operator (=>) for non-dereferenced
references
- Java--only references
- No pointer arithmetic
- Can only point at objects (which are all on the heap)
- No explicit deallocator (garbage collection is used)
- Means there can be no dangling references
- Dereferencing is always implicit
Evaluation of Pointers
- Dangling pointers and dangling objects are problems, as is heap
management
- Pointers are like goto's--they widen the range of cells that can
be accessed by a variable, but also complicate reasoning and
open up new problems
- Pointer services are necessary--so we can't design a language
without them