Computer memory has always been slower than the CPU to which it is connected. This has probably been true of every stored-program machine ever made, going back 60 years. As CPU speeds have increased exponentially (following Moore's Law), memory speed has not kept up, so the gap between memory and CPU has grown much wider. To keep the CPU from wasting its speed waiting on memory, modern machines use a hierarchy of memory levels, each larger and slower than the level below it. The PCs that we use have five levels:
- CPU registers: a few hundred bytes, effectively zero access time
- L1 data cache: 8Kb, very low access time
- L2 data cache: 256Kb - 2Mb depending on model, low access time
- main memory (DRAM): 512Mb - 2Gb, medium access time
- virtual memory: on the hard drive, very large and very long access time
Program code and data reside in virtual memory and are pulled into the lower levels as needed, displacing the data that has been used least recently (this is somewhat oversimplified). Data must be pulled into the lowest level, the CPU registers, before any operation can be performed on it. Data in any level has copies in all higher levels, and the hardware keeps the copies consistent.
The 4x3 counting program and its data fit completely in main memory, so no virtual memory access is needed as the program runs. (If they didn't fit in main memory, the program would be too slow to be useful.) The scalar quantities, such as the 64-bit sum and various locals on the stack, fit in the L1 data cache and can be accessed quickly. In the inner loop the sum and loop counter fit in the CPU registers, as can be seen in the previous post. The 346*2*2 = 1384-byte offset tables fit in L1 cache, and since they are used just after being written they won't be displaced by other data. The 5775*5775*2*2 = 133Mb static tables fit in main memory but not in any lower level. Data is pulled into L2, L1, and the CPU registers as needed, and then displaced as other portions of the tables are needed.
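As a quick check on those sizes, here is a minimal sketch assuming that V2, V4, and V8 are simply 2-, 4-, and 8-byte unsigned integers; the actual typedefs (and the per-base-pair offset array names used below) are in the previous post, so treat the names here as illustrative.
Code:
// Assumed typedefs -- the real definitions are in an earlier post.
#include <cstdint>
typedef uint16_t V2;   // 2-byte table entry
typedef uint32_t V4;   // 4-byte product
typedef uint64_t V8;   // 8-byte sum

V2 topOffset[346];            // 346*2 = 692 bytes (botOffset is the same size)
V2 topCounts[5775][5775];     // 5775*5775*2 = ~66.7 million bytes per table, x2 tables = ~133Mb

static_assert(sizeof(topOffset) == 692, "one offset table");
static_assert(sizeof(topCounts[0]) == 11550, "one row of a static table");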
For efficiency reasons, optimized over all kinds of computation, data is pulled into the lower levels in chunks whose size depends on the level. For example, data is pulled from main memory into the L2 cache in 64-byte chunks called cache lines. Thus fetching a V2 (2-byte) table entry that is not already in L2 causes 64 bytes to be pulled from main memory into the L2 cache, in the hope that the neighboring bytes will also be needed soon.
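One way to picture the cost: a 64-byte line holds 32 of the 2-byte table entries, so touching even one entry in a line transfers the whole line from main memory. A small illustrative helper (mine, not from the original program) that counts the lines touched by a set of reads makes the point:
Code:
#include <set>
#include <cstdint>
#include <cstddef>

// Counts the memory traffic caused by a set of 2-byte reads: every distinct
// 64-byte line touched costs a full line transfer, no matter how few of its
// bytes are actually used.
size_t bytesTransferred(const V2 *table, const int *offsets, int n)
{
    std::set<std::uintptr_t> lines;
    for (int i = 0; i < n; ++i)
        lines.insert(reinterpret_cast<std::uintptr_t>(&table[offsets[i]]) / 64);
    return lines.size() * 64;   // one full cache line per distinct line touched
}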
That hope is borne out for most algorithms, but not for the 4x3 counting. For one complete execution of the inner loop (346 iterations), 692 bytes are needed from one 11550-byte row of topCounts and 692 bytes are needed from one 11550-byte row of botCounts. Because of the 64-byte cache line, and because the table offsets are scattered more or less randomly over the 0 - 5774 range, most of the 11550 bytes of each row are read from main memory during one complete execution of the inner loop. To make matters far worse, each execution of the inner loop needs data from randomly distributed rows of the tables (as determined by topIndex and botIndex). These data are rarely in L2 cache, so almost every execution of the inner loop reads nearly 23100 bytes from the relatively slow main memory. This completely wastes all of the extraordinary methods that the P4 uses to keep busy and make the loop run fast.
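To make the access pattern concrete, here is roughly what the inner loop from the previous post does, in simplified form (the real code is in that post; topIndex, botIndex, and the 346-entry offset arrays are the per-base-pair values):
Code:
// Simplified sketch of the original inner loop's memory behaviour.
// The 346 offsets scatter 2-byte reads across one 11550-byte row of each
// big table, and topIndex/botIndex jump to a different row for each base pair.
for (int k = 0; k < 346; ++k)
    sum += (V4)topCounts[topIndex][topOffset[k]] *
           (V4)botCounts[botIndex][botOffset[k]];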
The key to speeding up the 4x3 counting is to reorder the memory access pattern so that each table row is accessed as many times as the counting needs it before moving on to the next row. For a typical base pair the inner loop executes 2-3 million times, and since there are only 5775 table rows, each row is needed 350 - 500 times on average. The counting program is rewritten as follows:
Code:
// static tables for selected base pair
V2 topCounts[5775][5775];
V2 botCounts[5775][5775];
V2 topOffsets[173][346][346];
V2 botOffsets[173][346][346];

struct HitData
{
    V2 topIndex;
    V2 botIndex;
    V2 offset0, offset1;
};
HitData hitArray[3000000];   // this is actually allocated on the heap
                             // with size as needed

// Note: this is a more complete rendering of the counting loops than was
// given in the previous post (but still much simplified).
int cnt0, cnt1, cnt2, cnt3;
V8 sum = 0;
int h = 0;
for (cnt0 = 0; cnt0 < 173; ++cnt0)
    for (cnt1 = 0; cnt1 < 346; ++cnt1)
        if (base pair matches)
        {
            // compute values of topOffsets[cnt0][cnt1][0..345]
            // and botOffsets[cnt0][cnt1][0..345]

            // now compute the hitArray elements for this base pair
            for (cnt2 = 0; cnt2 < 346; ++cnt2)
            {
                hitArray[h].topIndex = <compute value of topIndex>;
                hitArray[h].botIndex = <compute value of botIndex>;
                hitArray[h].offset0 = cnt0;
                hitArray[h].offset1 = cnt1;
                ++h;
            }
        }

// sort the h elements of hitArray in increasing order of topIndex

for (cnt2 = 0; cnt2 < h; ++cnt2)
    for (cnt3 = 0; cnt3 < 346; ++cnt3)
    {
        HitData *d = &hitArray[cnt2];
        sum += (V4)topCounts[d->topIndex][topOffsets[d->offset0][d->offset1][cnt3]] *
               (V4)botCounts[d->botIndex][botOffsets[d->offset0][d->offset1][cnt3]];
    }
The effect of sorting hitArray by topIndex is that all of the accesses to each row of topCounts are done before moving on to the next row. Thus all of the topCounts data are in L2 cache when needed (except on the rare transitions from one row to the next). The accesses to botCounts are still random, however (although this can also be fixed, as will be seen later). This cuts the reads from main memory in half, and roughly doubles the speed of the program.
One cost of this method is the time needed to sort the 2 - 3 million entries of hitArray. Due to the nature of the data, a fast O(N) sort can be used. The sorting time is very small compared to the time saved by halving the main memory accesses.
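The post doesn't spell out which O(N) sort is used; one natural fit, given that topIndex only takes values 0..5774, is a counting sort keyed on topIndex. Here is a sketch under that assumption (the helper and its scratch-buffer interface are mine):
Code:
#include <vector>

// One possible O(N) sort for hitArray -- an assumption, since the original
// post doesn't say which sort is used. Counting sort on topIndex works
// because topIndex only takes values 0..5774.
void sortByTopIndex(HitData *a, int n, HitData *scratch)
{
    std::vector<int> start(5775 + 1, 0);
    for (int i = 0; i < n; ++i)        // histogram of topIndex values
        ++start[a[i].topIndex + 1];
    for (int r = 0; r < 5775; ++r)     // prefix sums: first slot for each row
        start[r + 1] += start[r];
    for (int i = 0; i < n; ++i)        // scatter into sorted order (stable)
        scratch[start[a[i].topIndex]++] = a[i];
    for (int i = 0; i < n; ++i)        // copy back into hitArray
        a[i] = scratch[i];
}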
Another cost is that we've added a new 173*346*346*2 = 41Mb static table, and a 3000000*8 = 24Mb array (the actual array is smaller because offset0 and offset1 are packed into one V2). The cost is negligible, however. First, these tables easily fit in main memory, so virtual memory is not needed. Second, access to these tables is strictly sequential, not random, so memory efficiency is excellent. Reading one cache line at a time works in our favor here.
Note, however, that the entire 41Mb table is written and then not used until much later. Normally writing it would displace all of the data in L2 on its way out to main memory, which hurts efficiency. The P4 has a special instruction to avoid this, used when the programmer knows that the data being written will not be needed until much later. It is called a non-temporal store, and it writes directly to main memory without displacing data in L2. We gather 4 V2 values in a CPU register and write 8 bytes at a time using a non-temporal store.
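The post doesn't name the instruction; on the P4 the obvious candidate for an 8-byte non-temporal store is MOVNTQ (SSE). Here is one plausible way to do the "gather 4 V2 values and write 8 bytes" step using that intrinsic; the helper and its name are mine, not taken from the original program:
Code:
#include <xmmintrin.h>   // _mm_set_pi16, _mm_stream_pi, _mm_sfence

// Pack four 2-byte values into one 64-bit MMX register and write them with
// MOVNTQ, which goes to memory through write-combining buffers instead of
// displacing data in the caches. (One plausible implementation, not
// necessarily the one the original program uses.)
void streamFourV2(V2 *dst, V2 a, V2 b, V2 c, V2 d)
{
    __m64 packed = _mm_set_pi16(d, c, b, a);   // a ends up at dst[0], d at dst[3]
    _mm_stream_pi((__m64 *)dst, packed);       // non-temporal 8-byte store
}
// After the whole table has been written: _mm_sfence(); _mm_empty();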
So now we've doubled the speed of the program at a cost of increased complexity. In the next post I'll describe additional methods on the way to a final program that is over three times faster than the original.