How to speed up a program
Start with a dot product.
```c
float dotP = 0.0;
for (int i = 0; i < 1000; ++i) {
    dotP = dotP + X[i] * Y[i];
}
```
Apply the usual tricks.
- Register allocation for `dotP` and `i`
- Strength reduction for `&X[i]` (eliminate the `i*4`)
- Loop unrolling – `dotP = dotP + X[i] * Y[i] + X[i+1] * Y[i+1]` (see the sketch after this list)
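Putting the tricks together, here is a sketch of how the loop might look after unrolling by two and replacing the indexed addressing with advancing pointers (one way to express the strength reduction; since the trip count of 1000 is even, no cleanup loop is needed):

```c
/* Sketch: the dot product after unrolling by two and strength reduction.
   X and Y are assumed to be arrays of 1000 floats, as in the original loop. */
float dotP = 0.0f;
const float *xp = X;                 /* advancing pointers replace the i*4  */
const float *yp = Y;                 /* address arithmetic on every access  */
for (int i = 0; i < 1000; i += 2) {
    dotP = dotP + xp[0] * yp[0]
                + xp[1] * yp[1];     /* two iterations of work per trip     */
    xp += 2;
    yp += 2;
}
```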
Still, most of the time will be spent retrieving numbers from the arrays.
Locality in the dot product loop
- temporal data locality – `i` and `dotP` are used in every iteration
- temporal instruction locality – the same instructions are executed in every iteration
- spatial data locality – `X[i]` and `X[i+1]` (likewise `Y[i]` and `Y[i+1]`) are referenced in successive iterations
- spatial instruction locality – the instructions are executed in sequence
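To tie these back to the code, here is the same dot-product loop again, annotated with where each kind of locality shows up (just a commented copy of the example above):

```c
/* The dot-product loop from above, annotated with its locality properties. */
float dotP = 0.0f;
for (int i = 0; i < 1000; ++i) {   /* temporal instruction locality: the same
                                      instructions run on every iteration     */
    dotP = dotP + X[i] * Y[i];     /* temporal data locality: dotP and i are
                                      reused every iteration.
                                      spatial data locality: X[i] and X[i+1]
                                      (and Y[i], Y[i+1]) sit in adjacent
                                      memory and are touched on successive
                                      iterations                              */
}
```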
The memory hierarchy
According to the book, access times vary quite a bit. See CMU lecture notes (p 44).
- Disk: 3,000,000 nsec
- SSD: 80,000 nsec
- DRAM (in DIMM): 20 nsec
- SRAM (register and cache): 1 nsec
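A rough back-of-the-envelope using the figures above: the 1000-iteration dot product touches 2000 array elements. If every access went all the way to DRAM at about 20 nsec, the memory traffic alone would cost roughly 2000 × 20 nsec = 40,000 nsec (40 µsec); if the same data were already sitting in SRAM cache at about 1 nsec, it would cost roughly 2 µsec. (This ignores prefetching and the fact that one cache line holds several floats, so it only indicates the order of magnitude.)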
On-chip caching
Now let’s look at the design of a typical cache. Start with a quick look at CMU lecture notes on cache memories (pp 3–14) and a longer look at the Intel 486DX homework example.
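As a concrete illustration of one piece of that design, here is a minimal sketch of how a cache splits an address into tag, set index, and byte offset. The geometry used (a direct-mapped cache with 32-byte lines and 128 sets, i.e. 4 KB) is an assumed example for illustration, not the 486DX's actual cache organization.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: direct-mapped, 32-byte lines, 128 sets = 4 KB. */
#define LINE_BYTES   32u    /* bytes per cache line          */
#define NUM_SETS     128u   /* number of sets                */
#define OFFSET_BITS  5u     /* log2(LINE_BYTES)              */
#define INDEX_BITS   7u     /* log2(NUM_SETS)                */

int main(void) {
    uint32_t addr   = 0x1234ABCDu;                        /* arbitrary example address */
    uint32_t offset = addr & (LINE_BYTES - 1);            /* which byte within the line */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* which set to look in  */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS); /* compared against the stored tag */

    printf("addr=0x%08" PRIX32 "  tag=0x%" PRIX32 "  set=%" PRIu32 "  offset=%" PRIu32 "\n",
           addr, tag, index, offset);
    return 0;
}
```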
Buffers — the software cache
Buffering is a technique used by programmers, programming languages, and operating systems to reduce the number of disk accesses.
- Programming languages
  - C/C++ stream buffers (see the sketch after this list)
  - Java `BufferedInputStream`
  - Python buffered streams
- Operating systems
  - MVS JCL `RECFM=FB`
  - Linux buffer cache
  - Oracle buffer cache (See Lee Johnson)
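As an illustration of the language-level case, here is a small sketch of C stream buffering: stdio gives every `FILE` a buffer, so most `getc` calls are satisfied from memory rather than by a disk access. The file name `data.txt` and the 64 KB buffer size are arbitrary example values.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: reading a file one character at a time through a stdio buffer. */
int main(void) {
    FILE *fp = fopen("data.txt", "rb");
    if (fp == NULL) { perror("fopen"); return EXIT_FAILURE; }

    /* Give the stream a 64 KB buffer; getc() then goes to the operating
       system only about once per 64 KB instead of once per character. */
    static char buf[64 * 1024];
    setvbuf(fp, buf, _IOFBF, sizeof buf);

    long count = 0;
    int c;
    while ((c = getc(fp)) != EOF) {
        count++;        /* almost all of these reads hit the buffer */
    }
    printf("read %ld bytes\n", count);

    fclose(fp);
    return 0;
}
```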