What is legal Java? — Tokens

The official definition of Java is given in the Java Language and Virtual Machine Specifications.

Phases of compilation

We’ll figure out what Java is by studying the four phases of the typical compiler. (By the way, the designers of the Java compiler used their own non-standard names for these phases.)

  1. lexical analysis
  2. syntactic analysis
  3. semantic analysis
  4. code generation

Lexical analysis

In lexical analysis a scanner breaks the program into tokens, also called lexemes or words. Scanners are usually generated with lex or with flex, a free, open-source implementation of lex. Java programmers often use ANTLR for lexical analysis. However, the Java lexical analyzer is itself written in Java to avoid the bootstrap problem.

The first step in Java

Compilation begins by dividing the program into three types of input elements: white space, comments, and tokens.

White space consists of blanks, tabs, form feeds, and the two line separation characters, newline and return.

In Java, comments are either in traditional C style, delimited by /* and */, or in end-of-line C++ style, starting with // and continuing to the end of the line.

The interesting stuff for the compiler is the tokens.

Also, it’s really all a bit more complicated than this, because some tokens, such as the String literal "white space", actually do contain internal blanks.
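As a small illustration of the three kinds of input elements, here is a sketch (the class and method names are made up for this example). It puts a comment where white space would normally go, leaves out white space around an operator, and uses a string literal with an internal blank.

```java
public class InputElements {

    // The traditional comment below separates the tokens int and twice,
    // just as white space would.
    static int/* no blanks needed here */twice(int n) {
        return n*2;   // no white space is required around the * operator
    }

    // The blank inside the quotes belongs to the string literal token;
    // it is not white space between tokens.
    static int literalLength() {
        return "white space".length();
    }

    public static void main(String[] args) {
        System.out.println(twice(21));
        System.out.println(literalLength());
    }
}
```

Writing code this way is legal but hard to read; the point is only that the scanner does not care.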

Tokens in Java

Java tokens occur in five varieties: keywords, identifiers, literals, separators, and operators.


The fifty keywords of Java (as of Java SE 7) are listed in Section 3.9 of the Java Language Specification. All of these keywords, except for const and goto, have special predefined meanings in Java programs. Presumably, const and goto were originally included in the list because they are keywords in C and C++.


Java has a simple definition of identifiers: (1) they are sequences of characters that start with a letter and continue with letters or digits, and (2) they are not a keyword, null, true, or false. This means that csci202 and x15 are identifiers, but 10000bc and const are not.

But there is a complication. Java programs are technically encoded in 16-bit Unicode, and letters and digits from many languages can be used in Java identifiers. Even more strangely, the underscore and dollar sign are considered letters “for historical reasons”. This means that $τökéñ is an acceptable Java identifier.

However, even though it’s legal, that doesn’t make it right. Java programmers should follow the naming conventions for the language.
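The two identifier rules can be checked with methods from the standard library. The following is a minimal sketch, not the compiler's own code; the class name IdentifierCheck and the method isIdentifier are invented for this example. It assumes that SourceVersion.isKeyword, which also reports null, true, and false, is an acceptable stand-in for rule (2). That method follows the latest language version known to the JDK, so on newer JDKs it may reject a few more words than the fifty keywords discussed here.

```java
import javax.lang.model.SourceVersion;

public class IdentifierCheck {

    // Returns true if s has the shape of an identifier and is not a
    // keyword, boolean literal, or null literal.
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        // The last sample is the accented identifier from the text,
        // spelled with Unicode escapes so the file stays plain ASCII.
        String[] samples = { "csci202", "x15", "10000bc", "const",
                             "$\u03c4\u00f6k\u00e9\u00f1" };
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```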


Literals are tokens whose meaning is to be taken literally. There are six types of literals.

The only null literal is null.

The only boolean literals are false and true.

There are almost twenty quintillion different integer literals. Before Java SE 7, integers could be specified in decimal (starts with 1 to 9), octal (starts with 0), and hexadecimal (starts with 0x or 0X). Integers could also end with an l or L to indicate they should be stored in a 64-bit long value.

Since Java SE 7, integers can also be written in binary (starts with 0b or 0B) and can contain underscores between digits to make them easier to read. Consequently, there are several ways to say one million.
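For instance, each literal in this sketch denotes the int value one million (the class name OneMillion and the method forms are made up for illustration):

```java
public class OneMillion {

    // Each element is a different literal for the same int value.
    static int[] forms() {
        return new int[] {
            1000000,                      // decimal
            1_000_000,                    // decimal with underscores
            0xF4240,                      // hexadecimal
            0xF_4240,                     // hexadecimal with an underscore
            03641100,                     // octal
            0b1111_0100_0010_0100_0000    // binary
        };
    }

    public static void main(String[] args) {
        for (int value : forms()) {
            System.out.println(value);    // each line prints 1000000
        }
        System.out.println(1_000_000L);   // the same value stored as a long
    }
}
```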

Look at Section 3.10.2 of the Java Language Specification to see the unforgiving rules for writing floating point literals. Here are a few ways to write the result of dividing 1 by 10.
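A sketch of some double literals that all denote that value (the class name OneTenth and the method forms are made up for illustration):

```java
public class OneTenth {

    // Each element is a different double literal for the same value.
    static double[] forms() {
        return new double[] {
            0.1,
            .1,                      // the leading zero is optional
            0.1d,                    // explicit double suffix
            1e-1,                    // decimal exponent
            1E-1D,                   // upper-case exponent and suffix
            100e-3,
            0x1.999999999999ap-4     // hexadecimal floating point form
        };
    }

    public static void main(String[] args) {
        for (double value : forms()) {
            System.out.println(value == 1.0 / 10);   // true for every form
        }
        // A float literal carries less precision than a double literal,
        // so this comparison prints false.
        System.out.println(0.1f == 0.1);
    }
}
```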

Character literals are written as a single character enclosed by single quotes. Java allows the character to be specified as a Unicode escape or with a C-style escape sequence.

Here are a few character literals.
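The sketch below collects some of them in an array (the class name CharLiterals and the method samples are made up for illustration):

```java
public class CharLiterals {

    static char[] samples() {
        return new char[] {
            'a',        // an ordinary character
            '\t',       // C-style escape for tab
            '\'',       // a single quote must be escaped
            '\\',       // so must a backslash
            '\101',     // octal escape for the character with code 65
            '\u0041'    // Unicode escape for the same character
        };
    }

    public static void main(String[] args) {
        for (char c : samples()) {
            System.out.println((int) c);   // print the numeric code of each literal
        }
    }
}
```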

String literals are zero or more characters inside double quotes. Here’s a final example.
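A few string literals, as a sketch (the class name StringLiterals and the method samples are made up for illustration):

```java
public class StringLiterals {

    static String[] samples() {
        return new String[] {
            "",                      // the empty string: zero characters
            "white space",           // blanks inside the quotes are part of the token
            "She said \"hello\"",    // escaped double quotes
            "tab\tseparated",        // C-style escape
            "caf\u00e9"              // Unicode escape for an accented e
        };
    }

    public static void main(String[] args) {
        for (String s : samples()) {
            System.out.println(s.length() + ": " + s);
        }
    }
}
```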


There are nine one-character separator tokens. These are listed in Section 3.11.


The thirty-seven operator tokens are listed in Section 3.12.

Try it out

The following is Java. Divide it into white space, comments, and tokens.

package edu.unca.cs.csci202.jan222012;

  // A Java program of sorts

public class Example1 {

    public static void /* of use */ main(String[] args) {
        int x = (int)(3.14 / (5 << 2)) ;
        for (int i=x; i<202; ++i) {
            System.out.println("This is line " + i) ;
        }
        System.out.println("abcd" == "ab" + "cd") ;
        System.out.println("abcd".equals("ab".concat("cd"))) ;
    }
}

One more thing

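Tokens are as long as possible: at each step the scanner takes the longest sequence of characters that forms a token. So << is a single token, not the two tokens < and < in sequence. This is why a- -b is legal but a--b is not: the latter is scanned as a, --, and b, which is not a valid expression.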
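The effect of the longest-token rule can be seen in a short sketch (the class and method names are made up for illustration). The illegal form a--b appears only in a comment, since a class containing it would not compile.

```java
public class LongestToken {

    // Scanned as a, -, -, b: subtraction of a negated value.
    // Writing a--b instead would be scanned as a, --, b and rejected.
    static int minusMinus(int a, int b) {
        return a - -b;
    }

    // Scanned as a, ++, +, b because ++ is the longest token available
    // when the scanner reaches the first plus sign.
    static int plusPlusPlus(int a, int b) {
        return a+++b;
    }

    // << is one token, so this is a left shift.
    static int shift() {
        return 1<<2;
    }

    public static void main(String[] args) {
        System.out.println(minusMinus(5, 3));     // 5 - (-3)
        System.out.println(plusPlusPlus(5, 3));   // (a++) + b
        System.out.println(shift());
    }
}
```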