## 4x3 Sudoku counting

Good luck! The 3x4 case will be quite an achievement! Any hope of extending the methods to count the number of 12x12 Latin squares?
frazer

Posts: 46
Joined: 06 June 2005

The current method of counting 4x3 takes advantage of the splitting into bands, so I don't see how to use this immediately for Latin squares.

There is another way I was hoping to use for 12x12 Latin squares: the RxC band counting algorithm, which for C=1 gives the number of RxR Latin squares. The complexity, though, probably grows exponentially in R. PatmaxDaddy has implemented this method for R=5; I don't know how far he got on R=6. It will almost certainly be too heavy for R=12.

Yet another way to try is to use the algorithm to create a recursive formula (in C) for the number of RxC bands. A recursive formula is known for R=3 (the Franel numbers, see Red Ed's post). I'm sure one exists for any R, but again, getting it for R=12 might be too hard.

I think 6x2 Sudoku is easier than 12x12 Latin squares (but harder than 4x3 Sudoku).
kjellfp

Posts: 140
Joined: 04 October 2005

### 4x3 Counting complete

PatmaxDaddy and I have completed the count on the number of 4x3 Sudoku grids. The final answer we got was

81,171,437,193,104,932,746,936,103,027,318,645,818,654,720,000

or 8.117144e+46. Prime factorization is

2^40 * 3^12 * 5^4 * 7 * 11 * 19 * 53 * 134243 * 4443941 * 4804942621

The earlier estimate of 8.1064e+46 was 0.132% off.

The entire counting took 2568 hours of computer time, spread over as many as 9 different computers.
kjellfp

Posts: 140
Joined: 04 October 2005

### Re: 4x3 Counting complete

Excellent! Well done!

What a relief, too, that the total was a close match with the estimate ...
Red Ed

Posts: 633
Joined: 06 June 2005

### Reliability

I thought I should say a bit on different tests we did, which together make me confident in the correctness of the number.

**Band count check** This was a test of whether the lookup tables of band counts held correct values. For each base case (see earlier posts) we had created a 5775x5775 lookup table of band counts. By summing each table and multiplying by the number of boxes for each base case, we were able to recompute the already known number of 4x3 Sudoku bands.

**Gangster size check** The equivalence class size of each gangster class and its band counts were also stored. Using these numbers, we again recovered the number of 4x3 Sudoku bands, as well as the correct number of band configurations.

**Table lookup symmetries** The 5775x5775 lookup tables turned out to be symmetric for each base, as we expected.

**Gangster order** It could happen, for some reason, that the boxes for each gangster were badly stored. A program was therefore created to read in the gangsters and verify that (a) they come in lexicographical order and (b) each gangster is lexicographically first within its orbit under all the actions used to create the gangster equivalence classes. The program showed that this was the case. The lexicographical order used was the same as when the gangsters were originally found.

**3x3 counting** The programs were rewritten as little as possible in order to redo the 3x3 Sudoku counting. An extra common factor was even introduced so that the number of bytes used for the entries in the lookup tables would vary from base to base, a situation the 4x3 counting had to deal with. This test reconfirmed the 3x3 number.

**Top band permutation** An alternative gangster file was created with 24 entries. These entries were given by choosing 4 boxes and storing the 24 ways to permute them. This gives very different sub-counts on the different base pairs, but when the counts are summed, we should always get the same number for each of the 24 cases. This test was done several times for different choices of the top band, and always came out correct.

**The 250-sample** We had several versions of the program, from my original one, which passed all tests but was very slow, to PatmaxDaddy's improvements. The original version was nevertheless used to create a sample of the counts for every 250th gangster. These samples were tested against all the final results produced. This turned out to be useful, as we detected one bad count sequence. The error was caused by some table data that were misloaded; we don't know why. A correct rerun was done by both of us. Apart from this case, all counts checked out fine against the samples.

**The estimation** There was already an estimate of the expected number, which we believed could not miss the correct number by more than 1%. The final miss turned out to be 0.132%. This test could not be 100% reliable though - the bad run already mentioned would only have changed the final result by a tiny fraction of 1%.

**The power of two** This is the test that, more than any other, convinces me that we are right. When looking at the already known numbers of RxC Sudoku grids, we can divide out a factor of (RC)!*R!^(C-1)*C!^(R-1)*(R-1)!*(C-1)!, which comes from looking at the symmetries. The number we then get turns out to be divisible by 2^N for some N which grows as R and C grow. For 3xC Sudoku, the sequence so far was 2^0, 2^4, 2^7, and for 4xC Sudoku, the sequence was 2^2, 2^11, so we should expect at least 2^15 to divide, probably even more. When the counts were eventually completed, the factor turned out to be 2^19. This happened even though a large fraction of the 144578 gangsters gave numbers with lower 2-power factors when count, top-band count and equivalence class size were multiplied together. It would be too big a coincidence to get such a high 2-power from a wrong final result.
kjellfp

Posts: 140
Joined: 04 October 2005

Many congratulations to you both on getting this count! Wonderful work; you should write it up sometime.

Oddly enough, I'd met the Franel numbers before, in relation to my "real" work; they appear in work of Stienstra and Beukers on Picard-Fuchs equations of certain K3 surfaces, and Cusick's nice recurrence relation is typical of those appearing in Stienstra-Beukers. I haven't got anywhere with finding anything similar for the 4xC counts yet, but there are certain features (in the prime factorisation, for example) which suggest that there may be something similar here. Just as for the Franel numbers, one seems to be able to predict the power of (at least) some prime numbers p dividing the 4xC count in terms of the base p representation of C. (This is pretty common amongst sums which are made from binomial coefficients.) I'll report back if I get anywhere, although I'm not intending to think much more about them in the next few weeks.
frazer

Posts: 46
Joined: 06 June 2005

### 4x3 speed techniques

As Kjell mentioned, the 4x3 counting took 2568 hours (107 days) of computer time. This took just over three weeks of calendar time using a variety of computers; at one point we had access to nine machines. The final program we used was at least three times faster than the original, although it's hard to say exactly, since the computers differed significantly in capabilities, and some benefited more from the speedups than others. The machines ranged from a 2GHz laptop with 512Kb of L2 cache to a 3.8 GHz PC with 2Mb of L2. The laptop used 19% of the total time to do 10.6% of the total work, while the PC used only 20.6% of the total time to do 36% of the work. The other computers fell somewhere in between.

A variety of methods were used to get a factor-of-three speedup. These methods are generally applicable for getting maximum performance out of a PC or similar machine, particularly for programs involving large tables. Although this forum has typically been more devoted to mathematics than software engineering, I'm going to take a chance that some may be interested in this topic, so I'll post some explanation of the methods as I have time.

The methods I plan to discuss are 1) rearranging data access patterns to maximize cache performance; 2) creating a software pipeline to keep the CPU as busy as possible; 3) prefetching data that will be needed soon to reduce the time that the CPU spends waiting; and 4) aligning data on cache line boundaries to improve memory access efficiency.

Please stay tuned for more details.

PatmaxDaddy

Posts: 63
Joined: 18 October 2005

One of the key insights that Kjell made is to split the counting into 45 base pairs, each of which can be counted separately. Without this, the static tables needed would be so large that they would not fit in the memory available on any current-generation computer we could afford, and this would make the computation intractable. The base pair design also allows the computation to be partitioned easily among many computers, each of which can do selected pairs independently.

Even with the base pair innovation, the computation was projected to require nearly a year on a current-generation PC (3GHz Intel P4). With the multiple computers we used, it would still have taken at least three months (we had nine machines only briefly; only five were available for the entire three weeks that it actually took, including the slow laptop).

The program as originally written made very poor use of the computational resources available on P4 processors. It spent the vast majority of its time waiting for data to be available, and very little time actually multiplying and adding numbers. To understand this and see how to fix it, we start by looking at the inner loops.

The inner loops of the 4x3 counting program look like this (simplified):

```
// abbreviations for unsigned integers
typedef unsigned short     V2;
typedef unsigned long      V4;
typedef unsigned long long V8;

// static tables for selected base pair
V2 topCounts[5775][5775];
V2 botCounts[5775][5775];

// inner loops
V2 topOffsets[346], botOffsets[346];
int topIndex, botIndex;
V8 sum = 0;
if (base pair matches)
  for (i = 0; i < 346; i++)
  {
    // compute values of topOffsets, botOffsets, topIndex, botIndex here
    for (j = 0; j < 346; ++j)
      sum += (V4)topCounts[topIndex][topOffsets[j]] *
             (V4)botCounts[botIndex][botOffsets[j]];
  }
```

Here is how the loop translates into machine language, using C syntax (r0 - r6 are 32-bit CPU registers):

```
        V2 *r0, *r1;
        V4 r2, r3, r4, r5;
        int r6;

        r0 = topCounts[topIndex];
        r1 = botCounts[botIndex];
        r4 = (V4) sum;
        r5 = (V4)(sum >> 32);
        r6 = 0;
loop:   r2 = topOffsets[r6];
        r3 = botOffsets[r6];
        r2 = r0[r2];
        r3 = r1[r3];
        r2 *= r3;
        r4 += r2;
        r5 += carry;
        r6++;
        if (r6 < 346) goto loop;
```

Each of the above C statements corresponds to exactly one machine instruction (Intel assembly language has a different syntax, following a 50-year-old tradition; more modern assembly languages look much more like the above). The above loop is nearly optimal for Intel processors; the compiler we used doesn't do quite as good a job as this, but it gets pretty close.

The P4 goes to extraordinary lengths to make this loop run as fast as possible. It doesn't wait for one instruction to complete before starting the next. It may have dozens of instructions in partial stages of execution at one time, each either performing some arithmetic or logical operation, or waiting for the data it needs. It has many computational units that can operate in parallel; address generation, addition, multiplication, floating point, memory access, and others use different units. As soon as the data and computational unit are available the operation starts, and as soon as the result is available it is written to the specified destination.

The P4 can have multiple iterations of the loop executing at once. For example, it can be fetching data for the next iteration while doing the adds and multiply for the previous. In order to do this it predicts the outcome of the "if" instruction, using statistics of past outcomes and other hints. It may even execute both code paths before the outcome is known, and then discard the results of the unused path once the outcome is determined (this is called speculative execution).

The P4 doesn't do a perfect job of keeping busy, however. Sometimes an assembly language programmer (or a very good compiler, the likes of which I've never seen) can help out by reordering instructions, prefetching data, constructing software pipelines, and other techniques. More on this later.

One important consequence of the above is that the time to execute one iteration of the loop is not equal to the sum of the times of each instruction. The time is more like the longest time that any execution unit needs to do its work. In the above loop, one unit does the three adds, one does the multiply, and one does the if/goto. The two fetches from the offset arrays hit the data cache (more on this later), and are handled separately from the two fetches from the big static tables, which almost always must be read from RAM. For this loop, those two fetches from RAM take 5 - 6 times longer than the time needed by any other execution unit, and so determine the time needed to execute the loop. The adds, multiply, if/goto, and offset array accesses effectively add zero time. One could eliminate them entirely and not significantly speed things up, and one could add more computation and not significantly slow things down. The CPU spends almost all its time waiting for data, and very little doing the main work of multiplying and adding.

In my next post I'll explain why the RAM access is so slow and how to fix it.

PatmaxDaddy

Posts: 63
Joined: 18 October 2005

Computer memory has always been slower than the CPU to which it is connected. This has probably been true of every stored-program machine ever made, going back 60 years. As CPU speeds have increased exponentially (following Moore's Law), memory speed has not kept up, so the discrepancy between memory and CPU has grown much greater. To keep CPU speed from being wasted on waiting for memory, modern machines use a hierarchy of memory levels, each larger and slower than the levels below it. The PCs that we use have five levels:

CPU registers: A few hundred bytes, effectively zero access time.
L1 data cache: 8Kb, very low access time
L2 data cache: 256Kb - 2Mb depending on model, low access time
main memory (DRAM): 512Mb - 2Gb, medium access time
virtual memory: on hard drive, very large and very long access time

Programs and data reside in virtual memory and are pulled into lower levels as needed, displacing data that has been used least recently (this is somewhat oversimplified). Data must be pulled into the lowest level, the CPU registers, for any operation to be performed. Data in any level has copies in all higher levels, and the copies are kept consistent by the hardware.

The 4x3 counting program and data fit completely in main memory, so no virtual memory access is needed as the program runs. (If they didn't fit in main memory, the program would be too slow to be useful.) The scalar quantities, such as the 64-bit sum and various locals on the stack, fit in the L1 data cache and can be accessed quickly. In the inner loop the sum and loop counter fit in the CPU registers, as can be seen in the previous post. The 346*2*2 = 1384-byte offset tables fit in L1 cache, and since they are used just after being written they won't be displaced by other data. The 5775*5775*2*2 = 133Mb static tables fit in main memory but not in any lower level. Data is pulled into L2, L1, and the CPU registers as needed, and then displaced as other portions of the tables are needed.

For efficiency reasons, optimized over all kinds of computation, data is pulled into lower levels in chunks whose size depends on the level. For example, data is pulled from main memory into L2 cache in 64-byte chunks called cache lines. Thus fetching a V2 (2 bytes) table entry that is not already in L2 causes 64 bytes to be pulled from main memory into the L2 cache. The hope in doing that is that neighboring bytes will also be needed.

This is true for most algorithms, but not for the 4x3 counting. For one complete execution of the inner loop (346 iterations), 692 bytes are needed from one 11550-byte row in topCounts and 692 bytes are needed from one 11550-byte row in botCounts. Due to the 64-byte cache line and the fact that the table offsets are somewhat randomly distributed in the 0 - 5774 range, most of the 11550 bytes are read from main memory for each static table during one complete execution of the inner loop. To make matters far worse, each inner loop execution needs data from randomly distributed rows in the tables (as determined by topIndex and botIndex). These data are rarely in L2 cache, so that almost every execution of the inner loop reads nearly 23100 bytes from the relatively slow main memory. This completely wastes all of the extraordinary methods that the P4 uses to keep busy and make the loop run fast.

The key to speeding up 4x3 counting is to reorder the memory access pattern so that each table row is accessed as many times as is needed by the counting before moving on to the next row. For a typical base pair, the inner loop executes 2-3 million times, and since there are only 5775 table rows each will on average be needed 350 - 500 times. The counting program is rewritten as follows:

```
// static tables for selected base pair
V2 topCounts[5775][5775];
V2 botCounts[5775][5775];
V2 topOffsets[173][346][346];
V2 botOffsets[173][346][346];

struct HitData
{
  V2 topIndex;
  V2 botIndex;
  V2 offset0, offset1;
};
HitData hitArray[3000000];    // this is actually allocated on the heap
                              // with size as needed

// Note: this is a more complete rendering of the counting loops than was
// given in the previous post (but still much simplified).
int cnt0, cnt1, cnt2, cnt3;
V8 sum = 0;
int h = 0;
for (cnt0 = 0; cnt0 < 173; cnt0++)
  for (cnt1 = 0; cnt1 < 346; ++cnt1)
    if (base pair matches)
    {
      // compute values of topOffsets[cnt0][cnt1][0..345]
      // and botOffsets[cnt0][cnt1][0..345]

      // now compute current element of hitArray
      for (cnt2 = 0; cnt2 < 346; ++cnt2)
      {
        hitArray[h].topIndex = <compute value of topIndex>;
        hitArray[h].botIndex = <compute value of botIndex>;
        hitArray[h].offset0  = cnt0;
        hitArray[h].offset1  = cnt1;
        ++h;
      }
      // sort elements of hitArray in increasing order of topIndex
    }

for (cnt2 = 0; cnt2 < h; cnt2++)
  for (cnt3 = 0; cnt3 < 346; cnt3++)
  {
    HitData *d = &hitArray[cnt2];
    sum += (V4)topCounts[d->topIndex][topOffsets[d->offset0][d->offset1][cnt3]] *
           (V4)botCounts[d->botIndex][botOffsets[d->offset0][d->offset1][cnt3]];
  }
```

The effect of sorting hitArray by topIndex is that all of the accesses to each row of topCounts are done before moving on to the next row. Thus all of the topCounts data are in L2 cache when needed (except on the rare transitions from one row to the next). The accesses to botCounts are still random, however (although this can also be fixed, as will be seen later). This cuts the reads from main memory in half, and roughly doubles the speed of the program.

One cost of this method is the time needed to sort the 2 - 3 million entries of hitArray. Due to the nature of the data, a fast O(N) sort can be used. The sorting time is very small compared to the time saved by halving the main memory accesses.
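The post doesn't say which O(N) sort was used; since the keys are just row indices in 0..5774, a counting sort is a natural fit. A hypothetical sketch (the HitData layout is from the listing above; `sort_by_top` is my name, not the program's):

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned short V2;
typedef struct { V2 topIndex, botIndex, offset0, offset1; } HitData;

#define KEYS 5775   /* topIndex is a row number in the 5775x5775 tables */

/* Stable counting sort of hitArray by topIndex: O(n + KEYS). */
void sort_by_top(HitData *a, int n)
{
    int *count = calloc(KEYS + 1, sizeof(int));
    HitData *out = malloc((size_t)n * sizeof(HitData));
    for (int i = 0; i < n; i++) count[a[i].topIndex + 1]++;
    for (int k = 1; k <= KEYS; k++) count[k] += count[k - 1];
    /* count[key] now holds the first output slot for that key */
    for (int i = 0; i < n; i++) out[count[a[i].topIndex]++] = a[i];
    memcpy(a, out, (size_t)n * sizeof(HitData));
    free(out);
    free(count);
}
```

Two linear passes and one prefix sum over 5776 counters; for 2-3 million entries this is far cheaper than the memory traffic it saves.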

Another cost is that we've added a new 173*346*346*2 = 41Mb static table, and a 3000000*8 = 24Mb array (the actual array is smaller because offset0 and offset1 are packed into one V2). The cost is negligible, however. First, these tables easily fit in main memory, so virtual memory is not needed. Second, access to these tables is strictly sequential, not random, so memory efficiency is excellent. Reading one cache line at a time works in our favor here.

Note, however, that the entire 41Mb table is written and then used much later. Normally this would displace all of the data in L2 on its way out to main memory, which hurts efficiency. The P4 has a special instruction to avoid this that is used when the programmer knows that data is being written that is not needed until much later. It is called a non-temporal store, and it writes directly to main memory without displacing data in L2. We gather 4 V2 values in a CPU register and write 8 bytes at a time using a non-temporal store.
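On a compiler with SSE2 intrinsics, the gather-and-stream write described above might look like the following sketch; `pack4`, `store_nt` and the portable fallback path are mine, not the actual program's:

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Pack four V2 count values into one 64-bit word, low V2 first. */
uint64_t pack4(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return (uint64_t)a | ((uint64_t)b << 16)
         | ((uint64_t)c << 32) | ((uint64_t)d << 48);
}

/* Write 8 bytes without displacing L2 data (movnti on the P4);
   the plain store is a portable fallback. */
void store_nt(uint64_t *dst, uint64_t packed)
{
#if defined(__SSE2__) && defined(__x86_64__)
    _mm_stream_si64((long long *)dst, (long long)packed);
#else
    *dst = packed;
#endif
}

/* Non-temporal stores are weakly ordered; fence before reading back. */
void flush_nt(void)
{
#if defined(__SSE2__)
    _mm_sfence();
#endif
}
```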

So now we've doubled the speed of the program at a cost of increased complexity. In the next post I'll describe additional methods on the way to a final program that is over three times faster than the original.

PatmaxDaddy

Posts: 63
Joined: 18 October 2005

I think that's a good point, but I'm really curious how that would work, so I would like to see your other posts.
nathanmcmaster

Posts: 5
Joined: 19 April 2006

### Alignment of Static Tables
Each execution of the inner loop reads one 692-byte row from each of topOffsets and botOffsets. 692 bytes occupy 11 cache lines, so 704 bytes are actually read. But this is a slight oversimplification, and now we need to dig a little deeper. Cache lines must also be aligned on 64-byte boundaries. Rows in topOffsets and botOffsets, however, are aligned only on V2 (2-byte) boundaries. If a row happens to start at an offset of 0 - 12 bytes within a cache line, 11 cache lines are indeed used. But if the offset within a cache line is 14 - 62 bytes, 12 cache lines are needed, so that 768 bytes are read for that row. The expected value is 11.78 cache lines, or about 754 bytes.

Clearly it would be more efficient if every row were aligned on a cache line boundary, so that 11 cache lines are always used. This is very easy to do:
```
// unaligned tables, same as previous post
V2 topOffsets[173][346][346];
V2 botOffsets[173][346][346];

// cache-aligned tables
__declspec(align(64)) V2 topOffsets[173][346][352];
__declspec(align(64)) V2 botOffsets[173][346][352];
```

The __declspec is Microsoft-specific, but equivalents are probably available for most compilers. Increasing the row size from 346 to 352 elements keeps each row aligned at a cost of some wasted memory, but since the extra space still fits easily in main memory there is no performance penalty.

This simple change in the declaration of the tables provides a small increase in speed. The increase is maybe only a few percent, but over a very long counting run it could add up to a couple of days saved. The change adds no complexity to the program, and so is clearly worth doing.

This is an example of a very general technique in high-performance computing. In the 4x3 program the effect is small, but often it can be much larger. Furthermore, accidental changes in alignment can affect speed in ways that seem mysterious. For example, one could add a simple printf at the end of a program and find that the main computation got faster or slower, for no reason that could possibly seem related to adding the printf. But if the seemingly unrelated code change affects the alignment of data in memory (maybe because a literal string now needs to be stored), the code performance can change. You are not hallucinating.

(Note for experts: The L2 cache on the P4 is 8-way set associative. Often data can be laid out in memory so as to take advantage of the associativity of the cache. Doing so can provide some control over the cache replacement algorithm, which often works against a particular computation. In the 4x3 code I know I could have gotten a significant speed improvement if I had better control of cache replacement, but there seems to be no good way to lay out the big tables to achieve this. Ideally one would like complete manual control of the cache, as is often found in DSP architectures.)

Much more to come. I hope some may find this discussion useful or at least interesting. Questions are always welcome.

PatmaxDaddy

Posts: 63
Joined: 18 October 2005

Sorting the elements of hitArray by topIndex has the effect of separating the array into 5775 bins, where each element of a bin has the same topIndex (some bins may be empty). As previously noted, one row of topCounts is read repeatedly for each element of a bin, which allows the row to stay in L2, halving (approximately) the reads from main memory. Quite a bit more than one row of topCounts will fit in L2, however, and we can take advantage of this to reduce main memory reads much further.

Consider k consecutive bins in hitArray; we will refer to k such bins as a bucket. We sort the elements of each bucket by botIndex, using a fast O(N * log(N)) sort. Each bucket now contains k different values of topIndex, and all elements of equal botIndex are now consecutive in the array. As long as k is small enough that k rows of topCounts will fit in L2, there is no increase in main memory reads from topCounts caused by this sort. Reads from botCounts, however, are reduced by the average number s of consecutive elements with the same botIndex, over the entire hitArray. The value k is defined by a parameter called sortBot. Reads from main memory to botCounts are reduced by approximately the factor s.

This average clearly increases as sortBot increases. There is a tradeoff to make: higher values of sortBot reduce reads from botCounts, but at some point increase reads from topCounts, as the sortBot rows of topCounts no longer fit in L2 and begin to be displaced by other rows. The best value depends on the size of L2.

The machines we used had three L2 sizes: 512Kb, 1Mb, and 2Mb. We found empirically that the best value of sortBot for the 512Kb L2 was around 10; for the 1Mb and 2Mb L2, the best value was around 70. The reason for the seven-fold increase when the size of L2 merely doubled is that for 512Kb much of the cache is used for other data. When the cache doubles, much of the extra space can hold rows of topCounts. I don't know why the 2Mb L2 was no better than the 1Mb; it may have something to do with the cache replacement algorithm and the associativity of the cache.

For sortBot = 10, over the entire 4x3 count s = 2.32. For sortBot = 70, it was 6.05. Over all values of sortBot, s = 3.28.
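A sketch of how the bucket re-sort by botIndex might be coded, assuming the bin boundaries binStart[] are available from the earlier sort by topIndex (the names are mine, and qsort stands in for whatever sort the real program used):

```c
#include <stdlib.h>

typedef unsigned short V2;
typedef struct { V2 topIndex, botIndex, offset0, offset1; } HitData;

static int by_bot(const void *x, const void *y)
{
    const HitData *p = x, *q = y;
    return (int)p->botIndex - (int)q->botIndex;
}

/* Re-sort hitArray in buckets of k (= sortBot) consecutive topIndex
   bins.  binStart[b] is the index in a[] where bin b begins; it has
   nBins+1 entries, so binStart[nBins] is the total element count. */
void sort_buckets(HitData *a, const int *binStart, int nBins, int k)
{
    for (int b = 0; b < nBins; b += k) {
        int end = (b + k < nBins) ? b + k : nBins;
        qsort(a + binStart[b],
              (size_t)(binStart[end] - binStart[b]),
              sizeof(HitData), by_bot);
    }
}
```

After this pass, equal botIndex values are consecutive within each bucket, which is all that is needed for the row of botCounts to stay in L2 across those elements.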

The following is a 2-dimensional histogram of the execution time t of the inner loop (346 iterations), plotted against s, for one base pair that was run on the fast PC with sortBot = 70. The horizontal axis is s (square-root compressed), the vertical axis is time in nanoseconds. The brightness of each point shows the population density at a given (s,t). Note how time decreases as s increases. Note also that for most values of s there are some points at much higher values of t than average; this is caused by the PC doing other work during the run, for example responding to network traffic.

The average time to execute the inner loop for this base pair is 621 nanoseconds, so on average each iteration takes 6.8 clock ticks (3.8 GHz clock). That's moving along pretty well.

With the sorting improvements the time spent reading from main memory is much closer to the time needed by the CPU to do the calculations, and so improvements to the coding of the inner loop will further improve performance (with the original version, no improvement to the inner loop made any difference). In the next post I'll describe the coding of the inner loop as a four-stage pipeline, using the Streaming SIMD Extensions (SSE2) available on the P4. I will write all machine instructions using a C syntax rather than Intel assembler, so that anyone who can read C can follow the code.
Last edited by PatmaxDaddy on Tue Apr 25, 2006 3:43 pm, edited 1 time in total.

PatmaxDaddy

Posts: 63
Joined: 18 October 2005

### Software Pipeline
The inner counting loop can be performed by a pipeline that looks like this:
```
               topCounts    botCounts
                   |            |
 topOffsets        v            v
     |        -----------    ----------    -------    ----------
     +------>|  fetch    |->|  fetch   |->|  mul  |->| add to  |
     +------>|  offsets  |  |  counts  |  |       |  |  sum    |
     |        -----------    ----------    -------    ----------
 botOffsets

// Non-pipelined machine language equivalent, from
// the previous post
loop:   r2 = topOffsets[r6];        // fetch offsets
        r3 = botOffsets[r6];
        r2 = r0[r2];                // fetch counts
        r3 = r1[r3];
        r2 *= r3;                   // multiply
        r4 += r2;                   // add to sum
        r5 += carry;
        r6++;                       // loop control
        if (r6 < 346) goto loop;
```

Instead of executing each block serially, as is done in the non-pipelined machine language loop shown, we execute each block in parallel. Imagine that a clock is attached to each block, and each time the clock ticks each block looks at its inputs and produces a new output. While it takes four clock ticks for data to pass all the way through the pipeline, each block can start a new operation on every tick, and so the throughput of the pipeline is only one tick per item (in our case, 346 ticks).

The P4 is a deep-pipeline machine that tries to automatically order the operations to achieve parallel operation (this was described in a previous post, although the actual operation of the P4 is far more complex). Machines are not as smart as human programmers, though, and we can help considerably by writing the loop as a software pipeline. Here is what it might look like:
```
V2 *r0, *r1;      // -> topCounts, botCounts
V4 r2, r3;        // fetch offsets here
V8 r4;            // sum
int r6;           // loop counter
V4 r7;            // product goes here
V4 r8, r9;        // fetch topCounts, botCounts here

//       fetch                fetch
//      offsets               counts         mul        add
//      ----------------------------------------------------------
loop:                                                   r4 += r7;
                                             r7 = r8;
                              r8 = r0[r2];
        r2 = topOffsets[r6];
                                             r7 *= r9;
                              r9 = r1[r3];
        r3 = botOffsets[r6];
        r6++;
        if (r6 < 346) goto loop;
```

Read this code first in the following order: down the "fetch offsets" column, then down "fetch counts", then down "mul", then down "add". You will note that in this order, the instructions are essentially identical to the non-pipelined code. The major difference is that three more registers are used, which hold intermediate values in the pipeline. Reading the code in this way, you'll notice that the computation is correct. (A minor change is that the sum is held in a 64-bit register to simplify the code.)

The CPU doesn't see the four different columns, of course. Those are used just to make the loop easier to read. The CPU executes the loop in the order given (internally it can rearrange the order, but only in ways that preserve the computation specified by the given order).

When the loop is executed, iteration i fetches a pair of offsets, i+1 uses them to fetch the counts, i+2 multiplies the counts, and i+3 adds the product to the running sum. Thus each iteration of the loop is simultaneously working on four separate sets of data, and so it reflects the above pipeline diagram. Note that registers (r2, r3, r7, r8, r9) are always modified just after the contents from the previous iteration are used. Thus the delay from the time any data is loaded from memory or computed, to the time it is needed, is always one entire loop iteration. This gives each execution unit in the P4 the maximum possible time to finish an operation before the results are needed. This loop cooperates with the internal parallelism of the P4 to make the loop execute significantly faster.

Before the sorting improvements described in previous posts, the fetch counts operation took so long that the pipelined code ran no faster than a non-pipelined version. In that case the internal parallelism of the P4 was able to do as much as could be done, which wasn't much. But with the sorting the fetch counts time is much closer to the computation time, and so the pipelined loop is significantly faster.

Note that the first three iterations of the pipeline produce invalid data, and that after the loop completes the three final results are still in the pipeline. A small amount of preamble and postscript code takes care of that.
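One way to picture that preamble and postscript is to guard each pipeline stage so that stage k only runs while its index i-k is in range, and to run the loop three extra times so the last products drain out. This C sketch is an illustration of the idea only, not the actual preamble code (the function name and types are hypothetical):

```c
#include <stdint.h>

// Four-stage pipeline with explicit prime/drain: iteration i fetches
// offsets for index i, loads counts for index i-1, multiplies for
// index i-2, and adds the product for index i-3. The guards stand in
// for the preamble (skip invalid early stages) and the postscript
// (three extra iterations to drain the final products).
uint64_t pipelined_sum(const uint16_t *topCounts, const uint16_t *botCounts,
                       const uint32_t *topOffsets, const uint32_t *botOffsets)
{
    uint64_t sum = 0, product = 0;   // r4, r7
    uint32_t top = 0, bot = 0;       // r8, r9
    uint32_t toff = 0, boff = 0;     // r2, r3

    for (int i = 0; i < 346 + 3; i++) {
        if (i >= 3) sum += product;                 // add stage (index i-3)
        if (i >= 2) product = (uint64_t)top * bot;  // multiply stage (index i-2)
        if (i >= 1) {                               // count-fetch stage (index i-1)
            top = topCounts[toff];
            bot = botCounts[boff];
        }
        if (i < 346) {                              // offset-fetch stage (index i)
            toff = topOffsets[i];
            boff = botOffsets[i];
        }
    }
    return sum;
}
```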

Use of SSE2 Instructions

The big problem with a software pipeline is that it needs more registers to hold intermediate values. The P4 does not have nearly enough general registers (the seven used by the non-pipelined loop were stretching it; the base register used to manage the stack has to be taken over temporarily for the loop).

The solution is to use the additional registers provided by the Streaming SIMD Extensions (version 2). There are eight such registers, each 128 bits long. Each register can hold and operate on 16 bytes, 8 16-bit values, 4 32-bit values, or 2 64-bit values (values are signed or unsigned). These operations are done in parallel, which is what SIMD means (single instruction, multiple data).

The actual loop we use executes 173 times, with each iteration doing two sets of data, fully pipelined as described above. However, instead of two multiplies, two adds, and two assignments (r7 = r8) per loop, the SIMD instructions are used to do these operations in parallel, so that only one of each is needed. We use the 128-bit registers to hold 2 64-bit unsigned values. We end up with two sums that have to be combined at the end of the counting. This also prevents the number of additional registers we need from doubling. Note that the loop control need only be done once per loop, halving the number of times it needs to be executed.
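As an illustration of the two-lane arithmetic (not the actual register-resident pipelined loop), SSE2 provides `_mm_mul_epu32`, which multiplies the low 32 bits of each 64-bit lane into two full 64-bit products, and `_mm_add_epi64`, which accumulates both lanes in one instruction. A minimal sketch using compiler intrinsics; the function name and array layout are assumptions:

```c
#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>

// Accumulate two 32-bit count products per iteration into two 64-bit
// lanes, combining the lane sums only at the end, as described above.
uint64_t simd_sum(const uint32_t *topPairs, const uint32_t *botPairs, int n)
{
    __m128i sum = _mm_setzero_si128();
    for (int i = 0; i < n; i++) {
        // place one 32-bit count in the low half of each 64-bit lane
        __m128i t = _mm_set_epi64x(topPairs[2*i + 1], topPairs[2*i]);
        __m128i b = _mm_set_epi64x(botPairs[2*i + 1], botPairs[2*i]);
        // _mm_mul_epu32: low 32 bits of each lane -> full 64-bit product
        sum = _mm_add_epi64(sum, _mm_mul_epu32(t, b));
    }
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, sum);
    return lanes[0] + lanes[1];   // the final combine of the two sums
}
```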

In a previous post I mentioned that average iteration time is 6.8 clock ticks (for sortBot = 70). That's equivalent to 0.76 ticks per instruction for the loop that doesn't use SSE2. That's pretty good use of CPU resources, although it would be faster still if the code didn't have to read from those big tables.

One more topic left to post: using prefetch to speed up access to the big tables a little more.

Posts: 63
Joined: 18 October 2005

### Prefetch
Within each bucket as described in previous posts, there are runs of equal botIndex whose length varies from run to run, with an average length that depends on the parameter sortBot. All of the rows of topCounts needed by the bucket are found in L2, except for a short while after a transition from one bucket to the next.

Similarly, on a transition from one run to the next a new row of botCounts must be loaded into L2 from main memory, where it can be used several times over the course of the run. Each step of the run, corresponding to one element of hitArray and one execution of the inner loop, reads 346 V2 values from the row of botCounts specified by the constant botIndex of the run. Due to the essentially random order of reads from the 5775-element row, and the 64-byte cache line, most of the row is loaded into L2 on the first step. Most of what remains is loaded on the second step, and by the third and subsequent steps loads from main memory are rare. Of course a run may have only one step, or it might have dozens; the average length is, as previously posted, 2.32 for sortBot = 10 and 6.05 for sortBot = 70.

So there is a lot of main memory traffic on the first step, which means the inner loop is mostly waiting for memory, and very little traffic on third and subsequent steps (if any), which means the loop is perhaps mostly waiting for computation (I say "perhaps" because it's very hard to tell). As with most computers, the P4 works best when there is a balance between memory traffic and computation, and we clearly have something of an imbalance.

We'd like to shift some of the memory traffic to later steps to achieve more balance, and hence more speed. This can be done with a special instruction called prefetch. With this instruction the programmer specifies a memory address, and the CPU causes the cache line containing the address to be loaded into L2 (if it is already there, or the address is invalid, nothing happens). The idea is that sometime during the later steps of a run, when memory traffic is low, we prefetch the row of botCounts needed for the next run. When the next run starts, much of the needed counts will already be in L2.

It may seem that doing this just shifts the memory traffic from the beginning of a run to the end of the previous run, and so would be of no value. This is not the case, however, because the data being prefetched is not needed during the end of the previous run, so the CPU doesn't have to wait for it in order to do the computation on which it is working.

Prefetch works best when the instructions are spread out, rather than all being executed one right after another. Thus we'd like to work the prefetch into the inner loop itself, executing one prefetch for each iteration of the loop. This also avoids the need for a separate prefetch loop, which would have loop control overhead. Our inner loop, as described in a previous post, executes 173 times. Each row of botCounts uses 181 cache lines. This fortunate coincidence suggests that on the last step of a run we prefetch 173 cache lines of the row of botCounts needed for the next run. Prefetching the remaining eight cache lines is not worth the trouble.
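The cache-line arithmetic checks out: 5775 two-byte counts per row is 11550 bytes, or ceil(11550/64) = 181 lines of 64 bytes. A sketch of the per-iteration prefetch follows; the helper name is hypothetical, and `__builtin_prefetch` is the GCC/Clang spelling of what is the prefetcht1 instruction at the machine level:

```c
#include <stddef.h>
#include <stdint.h>

// On each of the 173 inner-loop iterations of a run's last step,
// prefetch one 64-byte cache line of the next run's botCounts row.
// This covers 173 of the row's 181 lines; the remaining 8 are not
// worth the trouble. Arguments 0, 2 mean "for reading, moderate
// temporal locality" (roughly the T1 hint: load into L2).
static void prefetch_next_row(const uint16_t *nextRow, int iter)
{
    if (iter < 173)
        __builtin_prefetch((const char *)nextRow + (size_t)iter * 64, 0, 2);
}
```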

The prefetch makes a noticeable difference in the speed of the counting, although not nearly as great as the sorting. I suspect that, over the course of the counting, the amount of time it saved was less than the time I spent writing and debugging the code. Typically, however, one does not write a computer program that is intended to be run only once, so prefetching can be of considerable value. In many cases, furthermore, prefetching improves performance much more than we observed for the 4x3 counting. Finally, there is a certain aesthetic pleasure in getting code to run as fast as possible.

What the programmer would really like to have, as I hinted in a previous post, is manual control of the cache; complete control of when data goes in and when it is replaced. The hard-wired cache control of general-purpose computers often works against a particular algorithm, and significantly wastes computational resources. Many DSPs (digital signal processors, essentially general-purpose CPUs but with instruction sets more geared for signal processing) offer just such a facility. They have high-speed memory (like our L1 or L2, but not called a cache) and a memory transfer engine that operates independently of and in parallel with the CPU. This engine has its own instructions that it executes to move data around. An experienced programmer can achieve nearly optimal balance between memory traffic and computation.

The cache is probably the best compromise for general-purpose computation, as opposed to specialized signal processing algorithms coded by expert programmers. Still, it is a compromise, and often the exquisitely high clock rates and internal parallelism that Intel engineers provide go to waste.

Posts: 63
Joined: 18 October 2005

Wow, good stuff.

I guess many readers will, like me, be happy hacking 'C' but not confident with assembler. Should 'C' programmers just concentrate on making best use of the cache, and maybe hand-code some software pipelines, but forget about non-temporal stores, SSE2, prefetch and other assembler magic? And always pass '-O9' to gcc ...?

And what about finding bottlenecks in the code? Do I have to use a profiler, or can I get away with gettimeofday(), or clock() or that rdtsc thingamajig?
Red Ed

Posts: 633
Joined: 06 June 2005
