CSCI 431: Formal Languages
Computation
- It is common to think of computation in mathematical terms, but
clearly mathematical applications don't cover all of the roles
computation plays.
- We could think of computation as some sort of processing of information.
- We'll settle for a definition of computation as carrying
out a sequence of well-defined steps (a program) for the purpose of
solving a problem. (Of course, this raises the question of what counts
as a problem and what counts as a program.)
Programs
- We can think of a program as a tool for solving problems. There
are a lot of terms we could use to describe programs.
- We could think
of them as a sequence of operations for manipulating symbols or, more
simply, as some kind of rule (often a very complicated rule).
- The bottom
line is that programs are something we can write down; they can be
represented as a finite sequence of symbols from some fixed (finite)
set of legal symbols.
- For example, we build C
programs with a subset of the ASCII character set and we build Java
programs from a somewhat larger character set (described as Unicode).
Problems vs. Programs
- There is a very interesting contrast between problems and the
programs that we can use to solve them. Programs are finite, while
(interesting) problems are infinite. We can use a finite program to
handle an infinite number of problem instances.
- Another interesting point about programs concerns what we get
when we run them. If we're running a program on a particular input,
it might process the data or it might run forever. If all programs
were guaranteed to always give us an answer (eventually), the world of
theoretical computer science would be a very different place.
- Programs that always give us an answer (i.e., they never just
run forever) have a special name: we call them algorithms. There is also
the related term total function (as opposed to partial function).
Algorithms compute total functions.
Computational Power
- Given this notion of problems and programs, we can make a
meaningful stab at saying what computational power is. If my computer
(or programming) language lets me write programs that solve problems
that yours does not, then mine is more powerful. Or if I can write a
program on computer A that solves a particular problem that can't be
solved with any program on computer B, then
computer A has some computational power that is lacking in computer B.
- We trust that all general-purpose programming languages have the
same computational power. We'd like to believe that any problem we
can solve in Pascal, for example, can also be solved with Fortran or
Lisp or Prolog.
Program manipulation
- There's an idea that's very important to the study of computation
theory (which is what we are talking about right now). This is the idea
that I may want to build programs that
operate, not just on any old inputs but on other programs themselves.
- You're probably already familiar with programs that operate on other
programs. Let's see if we can think of some examples:
- Thinking about this can lead to a whole new way of thinking about
computation and formal languages. We can now have formal languages
that are made up of programs, and we can also have programs that
recognize these languages.
- Compilers
- Lexical analyzers
Languages
- What is a language? Simply put, a language is a set
of strings - a very naive but essential observation.
- There are human languages, like Chinese, English and Chichewa -
an African language.
- Each language is a set of sentences, each consisting of
a string of words or phonemes, that speakers of the language use
for human communication.
- There are artificial languages, e.g., programming languages like
Prolog, C/C++, Java.
- Each is a set of permissible statements, and each statement
is a string of tokens - keywords of the language or variable names defined
by a user.
- Notice that a language consists of only the legal
or grammatical strings.
- This implies that many possible strings are illegal
or ungrammatical in a specific language but may be perfectly
grammatical in another one.
- How do we know which strings are grammatical
in a language and which are not - that is, how do we recognize a language?
- To do this we need a device or mechanism that
determines the grammar of the language
- Before talking formally about grammars, let's talk about
strings
Strings
- A string is a finite sequence of symbols.
- The symbols are atomic.
- The set of all symbols in a language is known as the
language's vocabulary or alphabet.
- Every language has a vocabulary.
- The symbols in a string occur in a particular order.
- E.g., abc is regarded as different from cba.
- A string has a length, which is the number of symbols
in the string.
- E.g., the length of abc can be denoted as |abc| = 3.
- There is an empty string e. Its length is 0.
- A* (given A is a vocabulary): the set of
all possible strings of any length on A (i.e., consisting of any
number of symbols from A).
- A language on A is a subset (any subset) of
A*. But how about a natural language (NL)?
- Is an NL just any subset of A*, where A is its
vocabulary? - No, of course not.
- There are strong patterns and regularities in any useful
language, and these are defined by the grammar or
syntax of the language.
Grammars
- A formal grammar can be described, essentially,
as a deductive system of axioms and rules of inference
that generates the sentences of a language, much as theorems
are derived in logic.
- The axiom, or initial symbol, is usually denoted as S
- Examples of rules are: x -> y, AB -> CDA,
S -> NP VP
- What does a rule like one of those above mean?
- It means the symbol sequence on the left-hand side (LHS)
of a rule can be replaced by the right-hand side (RHS) to yield a new string,
e.g., applying AB -> CDA gives the derivation step EBABCC ==> EBCDACC.
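- As a sketch (in Python, which is ours, not from the notes), a single rewriting step over single-character symbols can be implemented directly; the function name rewrite is our own:

```python
def rewrite(s, lhs, rhs):
    """Apply a rule lhs -> rhs to the leftmost occurrence of lhs in s."""
    i = s.find(lhs)
    if i < 0:
        return s                          # the rule does not apply to s
    return s[:i] + rhs + s[i + len(lhs):]

print(rewrite("EBABCC", "AB", "CDA"))     # the step EBABCC ==> EBCDACC
```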
Formal Definitions
- Terminal alphabet, denoted as VT:
- the set of symbols from which no derivation can yield any other string.
E.g., words in a language.
- Your book uses the Greek letter Σ (which looks like the summation
symbol) to represent the terminal alphabet.
- Non-terminal alphabet, denoted VN:
- The set of symbols from which some other string can be
derived. E.g., syntactic categories in NL, like NP, V, N, etc.
- Your book uses N to represent the non-terminal alphabet.
- Formally, a grammar is a quadruple G = (VT,
VN, S, R), where:
- VT and VN are as previously defined
- S is the start symbol
- R is the set of rules (your book uses P to represent the
set of rules).
- A notational convention:
- terminals in lowercase letters, e.g., a, b, c,
boy, girl, int, main
- non-terminals in uppercase, e.g., A, B, NP, V, P.
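- To make the quadruple concrete, here is a tiny hypothetical grammar written as plain Python data, following the notational convention above; the particular words and rules are ours, not from the book:

```python
# A tiny grammar G = (VT, VN, S, R); symbols are plain strings.
# Terminals are lowercase, non-terminals uppercase (per the convention above).
VT = {"the", "boy", "runs"}                 # hypothetical terminal words
VN = {"S", "NP", "VP", "DET", "N", "V"}
S = "S"
R = [("S", ["NP", "VP"]),                   # each rule: (LHS, RHS symbols)
     ("NP", ["DET", "N"]),
     ("VP", ["V"]),
     ("DET", ["the"]),
     ("N", ["boy"]),
     ("V", ["runs"])]
G = (VT, VN, S, R)
```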
Recognition, generation and derivation
- Generation: derive (or yield) a new string from an old
one, starting from the initial symbol S, step by step, applying a
rule in R at each step until the string in question is no
longer extendible (i.e., all symbols in the string are terminal
ones.)
- A grammar used for generation can be called a production
system.
- Derivation: the process of generation. E.g.,
S ==> NP VP ==> DET N VP ==> DET N V NP ==> DET N V DET ADJ N
- Actually, each rule application can be understood as a one-step
derivation, e.g., DET N is derived from NP by the rule NP -> DET N
- Recognition is the process of determining whether a
string is in a language. A string is recognized as belonging to a
language if it can be derived from the grammar for that language.
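- The derivation above can be replayed mechanically; this is a minimal Python sketch (the helper name step is ours), rewriting the leftmost occurrence of one non-terminal at each step:

```python
def step(form, lhs, rhs):
    """Rewrite the leftmost occurrence of the non-terminal lhs as rhs."""
    i = form.index(lhs)
    return form[:i] + rhs + form[i + 1:]

# S ==> NP VP ==> DET N VP ==> DET N V NP ==> DET N V DET ADJ N
form = ["S"]
for lhs, rhs in [("S",  ["NP", "VP"]),
                 ("NP", ["DET", "N"]),
                 ("VP", ["V", "NP"]),
                 ("NP", ["DET", "ADJ", "N"])]:
    form = step(form, lhs, rhs)
    print(" ".join(form))
```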
Trees
- A tree reveals (or encodes) the hierarchical phrase
structure of a sentence (or a string), including the following
information:
- Hierarchical grouping of smaller constituents (or substrings)
into larger ones
- Grammatical type (category or just simply a name or label)
of a constituent;
- Some order among constituents, e.g., left-to-right word
order.
- Basics of trees
- Nodes: root, intermediate, leaves
- The yield of a tree: the sequence of leaves
- E.g., the word sequence of an NL sentence is the yield
of the sentence's phrase structure tree.
- Branches: direction, higher vs. lower nodes
- Label: non-leaf: a symbol from VN;
leaf: a symbol from VT.
- Relations: sister(s), mother, child(ren)
- Path of a node: the sequence of branches from
the node back to the root.
Grammars and Trees
- A grammar, whose rules are in the form A -> x,
generates a tree;
- Each rule is itself a tree, with the LHS as the root
and RHSs as the leaves;
- A tree for a sentence can be understood as a hierarchical
"concatenation" of the trees of the rules involved in the generation
of the sentence.
Chomsky Hierarchy
- It categorizes grammars into types in terms of their generative
power. There are 4 types of grammars in the Chomsky hierarchy:
- Type 0 - unrestricted rewriting systems.
- The only "restriction" is that the left-hand side of a rule must contain
at least one non-terminal symbol, i.e., the LHS cannot be an empty string
(otherwise, it is not a grammar rule in any sense).
- Type 1 - Context-sensitive grammars
- Each rule is of the form XAY -> XBY,
- where A is a single non-terminal, B is a non-empty string, and X and Y
are arbitrary strings (of symbols).
- the length of the RHS is greater than or equal to that of the LHS
- Type 2 - Context free grammars
- Rule: A -> B,
- where A is a single non-terminal and B is any string over the full
vocabulary V (terminals and non-terminals)
- Type 3 - Regular grammars
- Rule: A -> xB or A -> x,
- where x is a terminal, i.e., the first symbol on the RHS must be a
terminal and may be followed by a single non-terminal
- Each language is associated with a particular computing machine,
the minimum machine required to recognize or generate strings in that
language.
- Example grammar:
- G = ({a, b}, {S, A, B}, S, R), where
- R = { S -> e, S -> aB, S -> bA,
        A -> a, A -> aS, A -> bAA,
        B -> b, B -> bS, B -> aBB }
- What type of grammar is this?
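- Without giving away the answer, we can at least enumerate the short strings this grammar generates; this is a brute-force breadth-first sketch in Python (our own code, with e written as the empty string), and the pattern in its output is a useful hint:

```python
from collections import deque

# Rules of the example grammar; e is written as the empty string "".
rules = [("S", ""), ("S", "aB"), ("S", "bA"),
         ("A", "a"), ("A", "aS"), ("A", "bAA"),
         ("B", "b"), ("B", "bS"), ("B", "aBB")]

def generate(max_len):
    """Collect terminal strings of length <= max_len by brute-force search."""
    seen, out = {"S"}, set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if all(c in "ab" for c in form):        # no non-terminals left
            out.add(form)
            continue
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i >= 0:                       # try every occurrence of lhs
                new = form[:i] + rhs + form[i + 1:]
                # allow some slack for non-terminals still in the form
                if len(new) <= max_len + 2 and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return sorted(s for s in out if len(s) <= max_len)

print(generate(4))
```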
Finite Automata
- Finite automata are computing machines that can recognize
languages generated by a regular grammar; these languages are called
regular languages
- An automaton is an ideal abstract computing device. It reads an
input sequence of symbols from a tape, halts after all input
symbols have been read, and then signals acceptance or rejection of
the input.
- Components of an automaton
- Tape - holding the input discrete symbols;
- Reading head - scanning the input symbols in the tape;
- Control box - there is an internal state in the
box, which changes to a next state when a symbol is read. The
next state can be a different state or the same as the current one.
- There is a designated initial state;
- There is a set of final state(s).
- The change of state from one to the next in response to an input
symbol is known as an instruction or a transition,
represented as:
(CurrentState, InputSymbol, NextState)
- Examples:
- (q0, a, q0)
- (q0, b, q1)
- (q1, a, q1)
- (q1, b, q0)
- If q0 is a final state, an automaton with these rules would
recognize: "a", "abb", "abab", "aabab", etc.
- Diagram Representation
- States
- Arcs (= transitions)
- Example: (q0, b, q1) is drawn as (q0) ---b---> (q1)
- Deterministic vs. non-deterministic finite automata
- Deterministic: There is at most one arc leading out from a state
with the same symbol. That is, there are no two arcs from one state
labeled with an identical symbol in the same finite automaton.
- For any non-deterministic finite automaton, there is an
equivalent deterministic one, i.e., one that recognizes the same
language as the non-deterministic one does.
Formal Definition
- A deterministic finite automaton M is a quintuple <S, V,
R, Start, FS>, where
- S - a finite set of states
- V - a finite set of symbols (ie, the alphabet)
- R - a function from S x V to S (i.e., a set of transitions
mapping each <state, symbol> pair to a state)
- Start - an initial state
- FS - a set of final states
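- As a sketch, the quintuple for the example automaton above can be coded directly in Python (variable names are ours); the transition function R becomes a dictionary:

```python
# The example DFA: delta plays the role of R, mapping (state, symbol) -> state.
delta = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q1", ("q1", "b"): "q0"}
start, finals = "q0", {"q0"}

def accepts(s):
    """Run the DFA on s; accept iff the run ends in a final state."""
    state = start
    for ch in s:
        state = delta[(state, ch)]    # deterministic: exactly one next state
    return state in finals

for w in ["a", "abb", "abab", "aabab", "b", "ab"]:
    print(w, accepts(w))
```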
Regular Languages
- Kleene's Theorem: a language is recognized by a finite automaton <==>
it is a regular language.
That is, the class of languages recognized by FAs is exactly the class
of regular languages
- Conversion: the transition (q0, b, q1) becomes the grammar rule
q0 -> b q1, and then we can see q0 and q1 are non-terminals
in the corresponding regular grammar that generates the
same language as the original automaton does.
- Definition of regular languages (RLs): given an alphabet
A
- 1. The empty set is a RL;
- 2. For any string x in A*, {x}
is a RL;
- 3. If A and B are RLs, so are A+B (the union of A and B) and
AB (their concatenation);
- 4. If A is a RL, so is A*;
- 5. Nothing else is a RL.
- A* - Closure or Kleene star on a set of strings A:
the set of strings formed by concatenating members of A together any
number of times (including 0) and in any order.
- Example:
{a, bb}* =
{e, a, bb, aa, abb, bba, bbbb, aaa, aabb, aaabb, abba, ...}
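- A brute-force Python sketch (the helper name is our own) can enumerate the closure up to a length bound by concatenating members of A:

```python
from itertools import product

def closure_up_to(A, max_len):
    """All strings in A* of length <= max_len, as concatenations of k members of A."""
    out = {""}                               # zero concatenations give e
    for k in range(1, max_len + 1):          # k = number of pieces concatenated
        for pieces in product(A, repeat=k):
            s = "".join(pieces)
            if len(s) <= max_len:
                out.add(s)
    return sorted(out, key=lambda s: (len(s), s))

print(closure_up_to({"a", "bb"}, 4))
```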
Regular expressions
- Regular languages can also be built from regular expressions
- The Definition of a regular expression: given an alphabet
A
- 1. The empty set is a regular expression;
- 2. For any symbol x in A, x
is a regular expression;
- 3. If e is a regular expression, then so is (e);
- 4. If e1 and e2 are regular expressions, then so are
e1 e2, e1 | e2, and e1*
- Symbols such as (, ), | and * (as well as others) may be used to
specify regular expressions
- Example: ab*c
for {ac, abc, abbc, abbbc, abbbbc, ...}
- Next time we will look at the UNIX utility egrep for specifying
and using regular expressions.
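- Python's re module (a close relative of egrep's regular expressions) can already check the ab*c example; fullmatch requires the whole string to match:

```python
import re

pattern = re.compile(r"ab*c")    # the regular expression ab*c from above

for s in ["ac", "abc", "abbc", "abbbc", "bc", "abcc"]:
    print(s, bool(pattern.fullmatch(s)))
```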
Examples
- Example 1: 0and1Even
- Example 2: a(bb)*bc
- Example 3: Binary Odd Numbers
- Example 4: 00(1|0)*11
- Example 5: Even Number of b's
- Example 6: At Most Two Consecutive b
Inadequacy of regular grammars for NL
- Examples:
- 1. The cat died.
- 2. The cat the dog chased died.
- 3. The cat the dog the rat bit chased died.
- 4. The cat the dog the rat the elephant admired bit chased died.
- They are generated by a grammar that looks like S
-> (NP)^n V^n (n noun phrases followed by n verbs), which is not a
regular grammar.
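- The pattern behind these sentences can be sketched in Python; the noun-verb pairings below are read off the example sentences above, and the function name is ours:

```python
# Each noun phrase is paired with the verb it is the subject of;
# sentence(n) nests n noun phrases and then closes them with n verbs
# in reverse order, giving the center-embedded pattern NP^n V^n.
pairs = [("the cat", "died"),
         ("the dog", "chased"),
         ("the rat", "bit"),
         ("the elephant", "admired")]

def sentence(n):
    nps = [np for np, _ in pairs[:n]]
    verbs = [v for _, v in reversed(pairs[:n])]
    return " ".join(nps + verbs) + "."

print(sentence(2))   # the cat the dog chased died.
```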
- Next time we will look at other grammars.