The official definition of Java is given in the Java Language and Virtual Machine Specifications.
Phases of compilation
We’ll figure out what Java is by studying the four phases of the typical compiler. (By the way, the designers of the Java compiler used their own non-standard names for these phases.)
- lexical analysis
- syntactic analysis
- semantic analysis
- code generation
Lexical analysis
In lexical analysis, a scanner breaks the program into tokens, also called lexemes or words. Scanners are usually generated with lex or with flex, a free implementation of lex. ANTLR is often used by Java programmers for lexical analysis. However, the Java lexical analyzer is written in Java to avoid the bootstrap problem.
The first step in Java
Compilation begins by dividing the program into three types of input elements.
- white space
- comments
- tokens
White space consists of blanks, tabs, form feeds, and the two line terminator characters, newline and return.
In Java, comments are either in traditional C style, delimited by /* and */, or end-of-line C++ style, started with // and continuing to the end of the line.
The interesting stuff for the compiler is the tokens.
It's really all a bit more complicated than this, because some tokens, such as the String literal "white space", actually do contain internal blanks.
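To make the three kinds of input elements concrete, here is a minimal sketch of a splitter. It is not how the real Java compiler is written: the class name InputElements and the WS/COMMENT/TOKEN tags are made up for illustration, and the token rules are deliberately crude (numeric literals and multi-character operators are not recognized as single tokens). It does show why string literals need special handling: the blank inside "white space" must not be reported as white space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any Java library.
class InputElements {
    /** Splits source text into elements tagged WS, COMMENT, or TOKEN (simplified rules). */
    static List<String> split(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int n = src.length();
        while (i < n) {
            int start = i;
            char c = src.charAt(i);
            String kind;
            if (" \t\f\n\r".indexOf(c) >= 0) {
                // A run of blanks, tabs, form feeds, and line terminators.
                while (i < n && " \t\f\n\r".indexOf(src.charAt(i)) >= 0) i++;
                kind = "WS";
            } else if (src.startsWith("//", i)) {
                // End-of-line comment: runs up to (not including) the line terminator.
                while (i < n && src.charAt(i) != '\n' && src.charAt(i) != '\r') i++;
                kind = "COMMENT";
            } else if (src.startsWith("/*", i)) {
                // Traditional comment: runs through the closing delimiter.
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
                kind = "COMMENT";
            } else if (c == '"' || c == '\'') {
                // String or character literal: one token, internal blanks included.
                i++;
                while (i < n && src.charAt(i) != c) {
                    if (src.charAt(i) == '\\') i++; // skip the escaped character
                    i++;
                }
                i = Math.min(i + 1, n);
                kind = "TOKEN";
            } else if (Character.isJavaIdentifierPart(c)) {
                // Crude: a run of letters and digits (identifier, keyword, or digits).
                while (i < n && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                kind = "TOKEN";
            } else {
                // Crude: every other character is treated as a one-character token.
                i++;
                kind = "TOKEN";
            }
            out.add(kind + ":" + src.substring(start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String element : split("void /* of use */ main // entry point")) {
            System.out.println(element);
        }
    }
}
```

Under these simplified rules, the string literal "white space" comes back as a single TOKEN element rather than two tokens separated by white space.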
Tokens in Java
Java tokens occur in five varieties.
- identifier
- keyword
- literal
- separator
- operator
Keywords
The fifty keywords of Java are listed in Section 3.9 of the Java Language Specification.
All of these keywords, except for const and goto, have special predefined meanings in Java programs.
Presumably, const and goto were originally included in the list because they are keywords in C and C++.
Identifiers
Java has a simple definition of identifiers: (1) They are sequences of characters that start with a letter and continue with letters or digits, and (2) they are not a keyword, null, true, or false.
This means that csci202 and x15 are identifiers, but 10000bc and const are not.
But there is a complication. Java programs are technically written in Unicode, which Java represents as 16-bit UTF-16 code units, and letters and digits from other languages can be used in Java identifiers. Even more strangely, the underscore and dollar sign are considered letters “for historical reasons”.
This means that $τökéñ is an acceptable Java identifier.
However, even though it's legal, that doesn't make it right. Java programmers should follow the naming conventions for the language.
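The identifier rules above can be checked with the standard library. The following is a sketch, not the compiler's own code: the class IdentifierCheck and its method are hypothetical names, while Character.isJavaIdentifierStart, Character.isJavaIdentifierPart, and javax.lang.model.SourceVersion.isKeyword are real JDK methods. Note that SourceVersion.isKeyword answers for the latest source version the JDK knows, so its keyword list may be slightly longer than the fifty described here.

```java
import javax.lang.model.SourceVersion;

// Hypothetical helper for illustration only.
class IdentifierCheck {
    /** Rule (1): a letter followed by letters or digits. Rule (2): not a keyword, null, true, or false. */
    static boolean isJavaIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.codePointAt(0))) {
            return false;
        }
        int i = Character.charCount(s.codePointAt(0));
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        // isKeyword also returns true for the literals null, true, and false.
        return !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        // The last entry spells the dollar-sign example from the text using Unicode escapes.
        String[] candidates = {"csci202", "x15", "10000bc", "const", "null",
                               "$\u03c4\u00f6k\u00e9\u00f1"};
        for (String s : candidates) {
            System.out.println(s + " -> " + isJavaIdentifier(s));
        }
    }
}
```

The dollar sign passes the "starts with a letter" test only because Character.isJavaIdentifierStart treats it, like the underscore, as a letter.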
Literals
Literals are tokens whose meaning is to be taken literally. There are six types of literals.
- null
- boolean
- integer
- floating point
- character
- string
The only null literal is null.
The only boolean literals are false and true.
Integer literals can denote almost twenty quintillion different values.
Before Java SE 7, integers could be specified in decimal (starts with 1 to 9), octal (starts with 0), and hexadecimal (starts with 0x or 0X).
Integers could also end with an l or L to indicate they should be stored in a 64-bit long value.
Since Java SE 7, integers can also be written in binary (starts with 0b or 0B) and can contain underscores to make them easier to read.
Consequently, there are several ways to say one million.
1000000
1_000_000
0xf4240
03641100
0B11110100001001000000
0B1111_0100_0010_0100_0000
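As a quick check, the six spellings above can be placed in one array and compared. The class name OneMillion is made up for this sketch; it needs Java SE 7 or later because of the binary literals and underscores.

```java
// Hypothetical class for illustration only.
class OneMillion {
    /** The six spellings of one million listed above, as int literals. */
    static int[] spellings() {
        return new int[] {
            1000000,                    // decimal
            1_000_000,                  // decimal with underscores
            0xf4240,                    // hexadecimal
            03641100,                   // octal (leading 0)
            0B11110100001001000000,     // binary
            0B1111_0100_0010_0100_0000  // binary with underscores
        };
    }

    public static void main(String[] args) {
        for (int value : spellings()) {
            System.out.println(value);
        }
        // With an L suffix the literal has type long instead of int.
        long asLong = 1_000_000L;
        System.out.println(asLong == spellings()[0]);
    }
}
```

The leading 0 of the octal form is easy to overlook: 03641100 and 3641100 are different numbers.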
Look at Section 3.10.2 of the Java Language Specification to see the unforgiving rules for writing floating point literals. Here are a few ways to write the result of dividing 1 by 10.
0.1
1e-1
1.0e-1
0X1.999999999999AP-4
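The four spellings above denote the same double value, which the following sketch checks. The class name OneTenth is made up for illustration. Double.toHexString is a standard method and shows where the hexadecimal form comes from.

```java
// Hypothetical class for illustration only.
class OneTenth {
    /** The four spellings of one tenth listed above, as double literals. */
    static double[] spellings() {
        return new double[] {0.1, 1e-1, 1.0e-1, 0X1.999999999999AP-4};
    }

    public static void main(String[] args) {
        for (double value : spellings()) {
            // Print each value in decimal and in hexadecimal floating point form.
            System.out.println(value + " " + Double.toHexString(value));
        }
    }
}
```

The hexadecimal form spells out the stored bits exactly, which is a reminder that the double closest to one tenth is not exactly one tenth.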
Character literals are written as a single character enclosed by single quotes. Java allows the character to be specified as a Unicode escape or with a C-style escape sequence.
Here are a few character literals.
'a'
'0'
'\t'
'\''
'\0'
'\00'
'α'
'\u03b1'
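Since a char is a 16-bit number underneath, one way to see what these literals mean is to look at their numeric codes. This is a small sketch with a made-up class name, CharLiterals. It uses only the escaped spelling of the Greek letter so that the source file stays plain ASCII.

```java
// Hypothetical class for illustration only.
class CharLiterals {
    /** Numeric codes of the character literals listed above. */
    static int[] codes() {
        return new int[] {
            'a',       // 97
            '0',       // 48, the digit zero, not the number zero
            '\t',      // 9, horizontal tab
            '\'',      // 39, a single quote
            '\0',      // 0, octal escape
            '\00',     // 0, the same character written with two octal digits
            '\u03b1'   // 945, Greek small letter alpha
        };
    }

    public static void main(String[] args) {
        for (int code : codes()) {
            System.out.println(code);
        }
    }
}
```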
String literals are zero or more characters inside double quotes. Here's a final example.
"This is an \"uninteresting\" example."
Separators
There are nine one-character separator tokens. These are listed in Section 3.11.
Operators
The thirty-seven operator tokens are listed in Section 3.12.
Try it out
The following is Java. Divide it into white space, comments, and tokens.
package edu.unca.cs.csci202.jan222012;
// A Java program of sorts
public class Example1 {
    public static void /* of use */ main(String[] args) {
        int x = (int)(3.14 / (5 << 2)) ;
        for (int i=x; i<202; ++i) {
            System.out.println("This is line " + i) ;
        }
        System.out.println("abcd" == "ab" + "cd") ;
        System.out.println("abcd".equals("ab".concat("cd"))) ;
    }
}
One more thing
Tokens are as long as possible. << is a single token, not the two tokens < and < in sequence.
This is why a- -b is legal but a--b is not: the scanner reads a--b as the three tokens a, --, and b, which do not form a valid expression.
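The longest-token rule can be seen in a short sketch. The class and method names below are made up for illustration. The second method adds a related case not mentioned above: a+++b is scanned as a, ++, +, b, so it means (a++) + b rather than a + (++b).

```java
// Hypothetical class for illustration only.
class LongestToken {
    static int minusMinus(int a, int b) {
        // Tokens: a, -, -, b. The blank keeps the two minus signs apart,
        // so this is a minus (negated b).
        return a- -b;
    }

    static int plusPlusPlus(int a, int b) {
        // Tokens: a, ++, +, b. The scanner takes ++ first because it is longer than +,
        // so this is (a++) + b.
        return a+++b;
    }

    // Writing a--b here would not compile: the tokens would be a, --, b.

    public static void main(String[] args) {
        System.out.println(minusMinus(5, 3));
        System.out.println(plusPlusPlus(5, 3));
    }
}
```

If a+++b meant a + (++b), the second call would give 9 for the arguments 5 and 3. Because a++ yields the old value of a, it gives 8.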