How to speed up a program
Start with a dot product.
```c
float dotP = 0.0;
for (int i = 0; i < 1000; ++i) {
    dotP = dotP + X[i] * Y[i];
}
```
Apply the usual tricks.
- Register allocation for `dotP` and `i`
- Strength reduction for `&X[i]` (eliminate the `i*4`)
- Loop unrolling – `dotP = dotP + X[i] * Y[i] + X[i+1] * Y[i+1]` (see the sketch after this list)
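Putting the tricks together, here is a sketch of how the loop might look after unrolling by two and replacing the indexed addressing with advancing pointers (one way to express the strength reduction; since the trip count of 1000 is even, no cleanup loop is needed):

```c
/* Sketch: the dot product after unrolling by two and strength reduction.
   X and Y are assumed to be arrays of 1000 floats, as in the original loop. */
float dotP = 0.0f;
const float *xp = X;                 /* advancing pointers replace the i*4  */
const float *yp = Y;                 /* address arithmetic on every access  */
for (int i = 0; i < 1000; i += 2) {
    dotP = dotP + xp[0] * yp[0]
                + xp[1] * yp[1];     /* two iterations of work per trip     */
    xp += 2;
    yp += 2;
}
```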
Still, most of the time will be spent retrieving numbers from the arrays.
Locality in the dot product loop
- temporal data locality – `i` and `dotP` are used in every iteration
- temporal instruction locality – the same instructions are executed in every iteration
- spatial data locality – `X[i]` and `X[i+1]` (likewise `Y[i]` and `Y[i+1]`) are referenced in successive iterations
- spatial instruction locality – the instructions are executed in sequence
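To tie these back to the code, here is the same dot-product loop again, annotated with where each kind of locality shows up (just a commented copy of the example above):

```c
/* The dot-product loop from above, annotated with its locality properties. */
float dotP = 0.0f;
for (int i = 0; i < 1000; ++i) {   /* temporal instruction locality: the same
                                      instructions run on every iteration     */
    dotP = dotP + X[i] * Y[i];     /* temporal data locality: dotP and i are
                                      reused every iteration.
                                      spatial data locality: X[i] and X[i+1]
                                      (and Y[i], Y[i+1]) sit in adjacent
                                      memory and are touched on successive
                                      iterations                              */
}
```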
The memory hierarchy
According to the book, access times vary quite a bit. See CMU lecture notes (p 44).
- Disk: 3,000,000 nsec
- SSD: 80,000 nsec
- DRAM (in DIMM): 20 nsec
- SRAM (register and cache): 1 nsec
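A rough back-of-the-envelope using the figures above: the 1000-iteration dot product touches 2000 array elements. If every access went all the way to DRAM at about 20 nsec, the memory traffic alone would cost roughly 2000 × 20 nsec = 40,000 nsec (40 µsec); if the same data were already sitting in SRAM cache at about 1 nsec, it would cost roughly 2 µsec. (This ignores prefetching and the fact that one cache line holds several floats, so it only indicates the order of magnitude.)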
On-chip caching
Now let’s look at the design of a typical cache. Start with a quick look at CMU lecture notes on cache memories (pp 3–14) and a longer look at the Intel 486DX homework example.
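As a concrete illustration of one piece of that design, here is a minimal sketch of how a cache splits an address into tag, set index, and byte offset. The geometry used (a direct-mapped cache with 32-byte lines and 128 sets, i.e. 4 KB) is an assumed example for illustration, not the 486DX's actual cache organization.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: direct-mapped, 32-byte lines, 128 sets = 4 KB. */
#define LINE_BYTES   32u    /* bytes per cache line          */
#define NUM_SETS     128u   /* number of sets                */
#define OFFSET_BITS  5u     /* log2(LINE_BYTES)              */
#define INDEX_BITS   7u     /* log2(NUM_SETS)                */

int main(void) {
    uint32_t addr   = 0x1234ABCDu;                        /* arbitrary example address */
    uint32_t offset = addr & (LINE_BYTES - 1);            /* which byte within the line */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* which set to look in  */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS); /* compared against the stored tag */

    printf("addr=0x%08" PRIX32 "  tag=0x%" PRIX32 "  set=%" PRIu32 "  offset=%" PRIu32 "\n",
           addr, tag, index, offset);
    return 0;
}
```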
Buffers — the software cache
Buffering is a technique used by programmers, programming languages, and operating systems to reduce the number of disk accesses.
- Programming languages
  - C/C++ stream buffers (see the sketch after this list)
  - Java `BufferedInputStream`
  - Python buffered streams
- Operating systems
  - MVS JCL `RECFM=FB`
  - Linux buffer cache
  - Oracle buffer cache (See Lee Johnson)
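As an illustration of the language-level case, here is a small sketch of C stream buffering: stdio gives every `FILE` a buffer, so most `getc` calls are satisfied from memory rather than by a disk access. The file name `data.txt` and the 64 KB buffer size are arbitrary example values.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: reading a file one character at a time through a stdio buffer. */
int main(void) {
    FILE *fp = fopen("data.txt", "rb");
    if (fp == NULL) { perror("fopen"); return EXIT_FAILURE; }

    /* Give the stream a 64 KB buffer; getc() then goes to the operating
       system only about once per 64 KB instead of once per character. */
    static char buf[64 * 1024];
    setvbuf(fp, buf, _IOFBF, sizeof buf);

    long count = 0;
    int c;
    while ((c = getc(fp)) != EOF) {
        count++;        /* almost all of these reads hit the buffer */
    }
    printf("read %ld bytes\n", count);

    fclose(fp);
    return 0;
}
```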