4x3 Sudoku counting

Everything about Sudoku that doesn't fit in one of the other sections

4x3 Sudoku counting

Postby kjellfp » Wed Dec 14, 2005 10:17 am

Some of us have started a discussion on how to find the total number of 4x3 sudoku grids, i.e. the number of 12x12 grids filled with symbols from 1 to 12 such that

- Every row has 12 different symbols
- Every column has 12 different symbols
- Every base 4x3-box has 12 different symbols

A base 4x3-box is one of the natural disjoint 12 boxes of size 4x3 we can divide the grid into.

(The number of 3x4 sudoku grids is of course the same problem)

Below follows a brief summary on my contribution so far, together with some thoughts from dukuso and PatmaxDaddy.

In principle, the 4x3-counting is identical to the 3x3-counting. The process has two parts:

1. Precalcualtion on the 4x3-gangsters (bands of size 4x12). We need to find the size of each equivalence class, the numbers of bands for each gangster, and how this can be used to find the number of bands from any column configuration.

2. Counting. The number of loops is (#gangsters) * 346/2 * 346 * 346 * 346 (346 is the number of 4x3 box column configuration not in conflict with a fix configuration, just like the number 56 for 3x3-counting).

I don’t know about the first part, but I hope it is possible in a few hours. The second will take much more time, maybe a year or two (at least 83 days on my PC if the inner loop had as few CPU clock ticks as when doing the 3x3-counting, but for this purpose, we can not store everything on the machine stack). The results from the first part should be stored to files, while the final counting should read in these files and produce regular output files, so it can pick up the thread on a rerun after the program stops. The final counting should of course also involve a task list on which gangsters to do the counting for, if we are splitting the work into several computers.

Memory is a problem here, so we must use reductions that we didn’t need for 3x3-counting. First, notice that given a pair (B1,B2) of 4x3 boxes, they are configuration equivalent to a pair (Id,B2’) where Id is the configuration {1,2,3,4},{5,6,7,8},{9,10,11,12} and B2’ can be one out of 9 base boxes. We can recognize which of the base boxes to be used out of a 3x3-matrix where position (i,j) says how many elements column i in B1 and column j in B2 have in common. The 9 matrices (up to row permutation and column permutation) are

Code: Select all
#1: 400   #2: 400   #3: 400
    040       031       022
    004       013       022

#4: 310   #5: 310   #6: 310
    103       121       112
    031       013       022

#7: 220   #8: 211   #9: 112
    202       121       112
    022       112       220


So given a band, to use it in lookup tables to find number of solutions etc, we transform it into a band configuration Id|B2|B3|B4 where B2 is one among the 9 base box configurations. There are then 5775 choices for B3 and B4 (just as there were 280 choices in the 3x3-counting). Totally we have 9x5775x5775 different band configurations to handle, this is about 300.000.000. That requires 300 MB for every byte we want to assign to each band.

So what do we want to assign? What we actually need it for, is to find the number of solutions for a configuration. We could store the band gangster index, but that requires at least 17 bits of memory (depends on the number of gangsters, which I only know must be more than 96752). We could store the number of solutions directly, but that will be 7962624 in the worst case, and 65719 in average. General cases might be trapped between a higher and lower limit whose difference is less than 2^16, and so the special cases could be handled differently. This is the most direct method with respect to looking up the values.

PatmaxDaddy had a nice remark to my thoughts. Though there are about 100.000 gangsters, the different unique numbers of solutions for a gangster could be much smaller (for 3x3-counting, there are 44 gangsters, but only 16 different numbers occur as the number solutions for a configuration), and so probably much less than 16 bits of information is required for each configuration.

dukoso has suggested to do part 1 first, and then think. I agree.

To lower the risk of ending up with a wrong number, we should rewrite the program we end up with to do the 3x3-counting first (but we introduce 'problems' on the numbers used, to have the difficulties on memory as for 4x3-counting).
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby dukuso » Wed Dec 14, 2005 10:38 am

OK, this is off-topic here, but since you seem to be interested in
these enumerations...

Another open problem is to enumerate 9*9*9 sudoku cubes.
I've been talking about this with Condor in PM
and I think it might be doable , no timing model
yet, just some vague estimates.
That seems surprising on first view but there are lots of
constraints which make the total number rather low.
So e.g. the number of 9*9*9 sudokus could be smaller
than the number of 9*9*3 sudokus
dukuso
 
Posts: 479
Joined: 25 June 2005

Postby kjellfp » Wed Dec 14, 2005 10:43 am

What are the rules on 9x9x9 sudoku?
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby dukuso » Wed Dec 14, 2005 10:54 am

fill a 9*9*9 with digits from {1,2,3,4,5,6,7,8,9}
such that each of the 81 rows, 81 columns,81 piles,
3*81 aligned 1*3*3 subcubes contains each digit exactly once.

see this old thread:
http://forum.enjoysudoku.com/viewtopic.php?p=6890&highlight=dion#6890

you might give up on 3*4 and concentrate on 9*9*9 ... ?!
then let's start a new thread. But right
now I've not so much time, got some other priorities.
dukuso
 
Posts: 479
Joined: 25 June 2005

Postby kjellfp » Wed Dec 14, 2005 12:13 pm

There is also an alternative to the 4x3-counting, and that is the 3x4-counting. I have an idea on a way to do Nx4-counting, and I used it for N=1 for an alternative way to count the number of latin 4x4-squares by head.

The idea is to look at configurations of the symbols in the union of the first two bands, and do the same for the two last. An example for N=2 might help: Concider the following 2x4 sudoku grid

4781 5326
6352 1847

5163 2478
7824 3561

3246 8715
1578 6234

8637 4152
2415 7683

If we group together the two first and last bands into "superbands", and reorder the elements in each column we get

4121 1321
5352 2446
6763 3567
7884 5878

1215 4112
2436 6233
3547 7654
8678 8785

The second superband configuration is given by the first.

We should of course create "supergansters" to represent the superbands, and so for each such superganster, count the number of 2-band solutions as well as its equivalence class size. The same can be done for its "dual" supergangster, before we multiply together the results.

The property of a "superband" is that every symbol occurs twice in every "super box" (2Nx4 box), but never twice within the same column.

To count the number of 2-band solutions of a supergangster, we run through all ways to split the superband configuration into two band configurations such that every symbol occurs once in every box. And then we multiply the numbers of solutions for each of the two bands together, and sum up.

These are ideas, not estimated yet, so I don't know if this is a better or worse idea for doing the 4x3-counting. But I can see this puts more burdens on the first counting part (finding equivalence classes and number of band soultions for superbands), but probably less on the second.
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby kjellfp » Wed Dec 14, 2005 3:56 pm

OK, a little bit more on superbands for 3x4-counting.

As soon as we have a superband, I think counting the solutions can be trivial and relatively fast. An underlying work is to do 3x4-band counting, gangster generation, equivalence class computation etc. It's much faster and less memory consuming than 4x3. The average number of solutions per band configuration is 627 and the expected number of gangsters has a lower bound of 1271, maybe it's around 2000-3000, I don't know.

Then for superbands, take the following example of a box in a super band (I call it a "superbox"):

2112
3346
4759
596A
8A7B
BC8C

Let N(a,b) be the number of symbols column a and b have in common. If {a,b,c,d} is any permutation of {1,2,3,4} then N(a,b)=N(c,d). If we sort the sequence { N(1,2), N(1,3), N(1,4) }, we get a sequence a,b,c with the property

0 <= a <=b <= c and a+b+c=6.

There are seven possible such sequences:

006, 015, 024, 033, 114, 123, 222

That gives the box an index from 1 to 7. For the above example, we have
N(1,2) = 1 N(1,3) = 3 N(1,4) = 2
which gives index 6.

The good thing is that the dual superbox (the superbox from band 3 and 4) always has the same index. The three superboxes indices of a superband, gives totally 84 different parts to split up the superbands into, where the superband is in the same part as its dual. That can simplify the memory consumption of the calculation because counting grids where the two superbands have indices 2,2,7 can be run independetly from the counting where the indices are 1,2,5 etc.

Then we can start reducing the number of superband gangsters for each such part. The first superbox has one choice. The second and third much much more, but I hope running through transformations might give a number of gangsters that we can handle.
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby kjellfp » Sun Jan 29, 2006 4:54 pm

Some calculations on 3x4- and 4x3-counting, and why I eventually believe ordinary band by band 4x3-counting is better than 3x4 with superbands.

Some number have to be derived.

3x4-counting

For a stack of 4 3x4-boxes, how many choices are there for the column configurations of box 1, 2, 3 and 4?

The first box can be chosen in B1=369600 ways
There are twelve symbols, three should be picked out for each of the four columns. This gives the number 12! / (3!)^4 = 369600

The second box can be chosen in B2=13833 ways
The number itself is independent of the choice of the first box. There are at least two ways of counting the number. Either look systematically through all isomorphic cases of how the configuration of box 2 looks compared to box 1, or count by computer. I have done both and got the same answer.

The superband configurations of two boxes can be chosen in S=32496156 ways
This corresponds to the number of ways to give four subsets of {1..12} such that each set contains six different symbols, and every symbol occurs in excactly two sets. This number is derived as follows:

Let a, b and c be the number of elements in the first column that also occur in respectively column 2, 3 and 4. Then a+b+c=6. As demonstrated by PatmaxDaddy (C=3), a, b and c are also the number of common symbols in respectively column 3 & 4, column 2 & 4, and column 2 & 3. The number of superband configurations with these properties for a,b, and c is given as the number of ways to choose the symbols to fill the pattern, that is 12!/(a!*a!*b!*b!*c!*c!). This can also be written as 924 * (6!/(a!*b!*c!))^2 because 924 = 12!/(6!*6!). This sums up to 924*35169=32496156.

The third box can in average be chosen in B3=157.633485 ways
Note that the number for each case depends on the choices for the two first boxes, and will differ.

The total number of stacks with given box column configurations is the total number of stacks divided by 3!^16. Using the number from Wikipedia, we have 12!*3!^12*2180544/3!^16 = B1*2180544 stack configurations. This should be the same as B1*B2*B3, and so we have B3=2180544/B2=157.633485

There is another way to calculate an estimate, which slightly differs from the answer. Chose box 1, box 2, which defines a superband configuration for box 3+4. We are asking for the number of ways to split this superband into two box configurations, and that can be estimated (but this is not exact, hence the difference) to be done in B1*B2/S=157.331741 ways.

There are about SC=863722099 non-isomorphic ways to chose a super band of three superboxes
Non-isomorphic in the configuration sence, hence this is the number of "superband gangsters". That is, the number of legal superbands modulo the following operations:
- The six symbols in any column can be permuted independent from each others.
- The four column inside a superbox can be permuted
- The three superboxes can be permuted
- The 12 symbols can be permuted

The number X of superbands is the number of superband configurations times the ways to permute the cells inside each column. That is X=(S*6!^4)^3. The estimated number of gangsters is then X divided by the umber of permutations, i.e. X/(6!^12 * 4!^3 * 3!) which gives the number above.

This number is only correct under the assumption that every permutation produces different configurations. This is not entierly true, but the experience from similar cases on such expected numbers is that the bigger the final numbers are, the closer they are (relatively) two their estimate. For such a big number as above, the estimate is probably not wrong by more than 2%.

Then we finally have:

3x4-counting using superbands requires about 1.6916e+15 loops

For each superband class, we can split box 1, 2 and 3 in about B3 ways. An overall swapping of band 3 and 4 helps us to reduce the number of loops by a factor of 2, so we end up with the number of loops to be SC*B3^3/2.

4x3-counting

This number is a bit simpler to estimate.

The number of 4x3 gangsters is about 96752

The number of choices for each of the four box configurations is 12!/4!^3=34650. If we also order the columns in a box, we get another reduction of 3!, so we get 5775 configurations. We can assume that the first box is the canonical 1234/5678/9abc, while we have 5775^3 choices for the other three boxes. We have 4!^3 ways to permute the symbols inside each column in the first box, and 3! ways to permute the column before we symbolpermute back to the canonical form. We can also permute the four boxes in 4! ways. This gives the estimate of 5775^3/(4!^3*3!*4!) gangsters. Again, as for superband gangsters, this is an estimate, but probably pretty accurate.

4x3-counting using superbands requires about 6.9322e+14 loops

For each gangster, there are 346 choices for each of the boxes on the next band. A reduction factor of 2 comes from swapping band 2 and 3, so we have 96752*346^4/2 loops

Conclusion

Band by band seems to be faster than superbands.

Memory has been mentioned earlier as a problem. I think that by not using a 9x5775x5775 lookup table, but reducing more into equivalence classes of the first three boxes, we might get down to a lookup table of about 2.000.000 entires. But this again requires more calculation in the innermost loop. And we might ask if this will be so expensive that the superband method is the fastest after all.

I'm still optimistic. If it's possible with a code that runs in about a year, the job can be split up in several PCs, as many as possible may contribute and run calculations on their PCs, so we may type out the 4x3-number one day.
kjellfp
 
Posts: 140
Joined: 04 October 2005

Re: 4x3 Sudoku counting

Postby kjellfp » Sun Feb 12, 2006 12:06 am

In the beginning of this thread, I wrote:In principle, the 4x3-counting is identical to the 3x3-counting. The process has two parts:

1. Precalcualtion on the 4x3-gangsters (bands of size 4x12). We need to find the size of each equivalence class, the numbers of bands for each gangster, and how this can be used to find the number of bands from any column configuration.

2. Counting. The number of loops is (#gangsters) * 346/2 * 346 * 346 * 346 (346 is the number of 4x3 box column configuration not in conflict with a fix configuration, just like the number 56 for 3x3-counting).

I don’t know about the first part, but I hope it is possible in a few hours.

The first part is now done. It completed in 87 minutes.

The number of 4x3-band gangsters is 144578, so the earlier assumption on about 100000 classes was too optimistic.

5240 different numbers occur as the number of bands with a given configuration (compared to 16 different numbers for 3x3 sudoku bands). The highest common divisor for them all is 8. The most commonly used number of bands is 72704 and occurs 141 times.

There are essentially 1546 different ways to choose a tripple (B1,B2,B3) of boxes when box permutaion is not allowed. In other words: Given the family F of tripples of boxes, they group into 1546 equivalence classes under the action of the group of transformations generated by:

- Permuting the order of the four symbols inside a column (independently from the other columns)
- Permuting the three columns inside a box
- Permuting the 12 symbols

(If we also allow the permutation of B1, B2 and B3, we will get even fewer equivalence classes, but this gets harder to handle for the final count).

So instead of my original idea about a 9x5775x5775 lookuptable, I'm down to 1546x5775. We probably need two bytes of information for each entry in the table, that gives totally 17.856.300 bytes of information, still a lot, too much for the machine stack, but something the memory can handle.

I have several ideas about how to reduce the use of memory even more, and how to do as little as possible in the very innermost loop in the final count, I might put them out later.

The total number of possible band configurations, as well as the total number of 4x3 bands, are calculated from the derived results and checked against their already known values from previous work.
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby PatmaxDaddy » Sun Feb 12, 2006 1:43 pm

An 18 Mb table is not bad, I used an even bigger table in my very first 3x3 counting porgram. If you can reduce it by enough that it will fit in L2 cache (probably around 1 Mb on your machine) that is worth doing, but to reduce it by only a factor of 2 or 3 isn't worth the trouble, particularly if doing so adds work in the inner loop. Sometimes increasing the table size is good if it reduces work in the inner loop.

The most important thing in understanding memory performance is the memory heirarchy: 6 general registers, about 64Kb of on-chip L1 cache, about 1 Mb of high-speed L2 cache, about 1Gb of main memory (DRAM), and finally the relatively slow hard drive (virtual memory). If you can get your data to fit in a smaller memory unit it is a huge win. The second most important thing is to prefer sequential memory addresses to minimize cache misses. Sometimes the least-recently-used cache replacement method use by the hardware can really hurt performance, but this can often be fixed by fetching data well before you need it. Sometimes one has to resort to assembly language to do that.

I can usually get a factor of 2 improvement in speed by coding the inner loop in assembly language; sometimes more, sometimes less. A huge improvement over compiled code can be achieved if the computation can take advantage of the SIMD-parallel MMX instructions, but I have never found this to be possible for these grid counting algorithms. The need for lookup tables kills SIMD parallel computation.
PatmaxDaddy
 
Posts: 63
Joined: 18 October 2005

Postby kjellfp » Mon Feb 13, 2006 12:15 am

OK, a little bit more on my thoughts about memory consumption, and how splitting up the code can reduce that problem.

Here are the boxes, each has 4 rows and 3 columns:
Code: Select all
+---+---+---+---+
|B1 |B2 |B3 |B4 |
+---+---+---+---+
|B5 |B6 |B7 |B8 |
+---+---+---+---+
|B9 |B10|B11|B12|
+---+---+---+---+

We already have the band results on a file, so what we need is a program that receives the contents of B1, B2, B3, B4 and returns the counting of how to complete B5,..,B12. Following the pattern for 3x3, we run through the 173 choices for the cc (column configuration) of B5, that also gives the cc for B9. Then through the 346 choices for the cc of B6, this gives the cc of B10.

At this stage we stop up and look at which of the 9 base cases (B5,B6) belong to, and same for (B9,B10). The program is designed to only handle one particular case for (B5,B6) and one for (B9,B10). Other cases are handled by other programs.

Now, for a particular choice for B6 and B10 to be handled by this program, we iterate through the choices for B7 and B11. To get down to one of the 1546 cases for (B5,B6;B7), we need some transformation, and this implies a transformation on each choice for B8 and B12 as well. There are (at least) three ways of doing this, different choices have different influences on memory used and operations needed in the inner loop.

#1. A direct 5775x5775 lookup table gives the band count when the arguments are B7 and B8. This is the original idea about a 9x5775x5775 table, except that there is only one choice among the 9 for the first argument (otherwise, the program would skip this situation). Advantage: Right on target (few operations in the inner loop). Disadvantage: Memroy consuming.

#2. With B5 and B6 given, all operations available to bring a B7 to its desired form, are among the A automorphisms for (B5,B6). The size of a needed lookup-table to give the box value of B8 after the transformation, is A*5775. The values for A for the 9 cases are 82944, 1728, 1536, 648, 384, 144, 64, 48, 48. The good thing is that the smaller the value of A is, the higher is the possibility that (B4,B5) is of our type (among the 9). For the case A=82944, this approach is hopeless, the other cases take less memory than 5775x5775. Compared to #1, this occupies less memory (almost always), but has one extra lookup operation in the inner loop.

#3. The transformation bringing B7 to it's needed form, must leave the cannonical box fixed, and so it can be split into four parts: Permutation inside column 1 in the canonical box, same for column 2 and 3, and finally the permutation of the three columns. Run the action of these four parts on B8 sequentially. The lookup table for this must be created anyway,
even if we go for #1 or #2 above: Compared to #2, this takes even less memory, but several operations are needed in the inner loop.

Which of the three (or others, if there are good suggestions) we use for (B5,B6,B7,B8) is independent of the one for (B9,B10,B11,B12). I guess that #1 is the best for the first among the 9 cases, #2 is the best for the others.

(Sorry if this was messy, but it's late, and I just want to end the day now...)
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby deam3r » Mon Feb 13, 2006 5:08 am

wow all of it is very cool and intresting!
deam3r
 
Posts: 10
Joined: 12 February 2006

Postby kjellfp » Wed Feb 22, 2006 11:44 pm

I get time now and then to do some tiny progress.

The current state is that I've started writing on part 2 by experimenting with different ways to implement the counting.

The current code is estimated to complete in 315 days on my hardware.

To optimize my code, I'd like to ask PatmaxDaddy (or anybody else knowing these things better than me) for some advice.

The current code is (strongly simplified) like this:

Code: Select all
unsigned short count1[5775][5775];
unsigned short count2[5775][5775];

<read in count1 and count2 from file>

void count_gangster() {

  for (<many loops>) {

    <calculate b1 and b3>

    for (i=0; i<346; i++)
      total += count1[b1][b2[i]] * count2[b3][b4[i]];

  }

}


I want to improve my code by using the memory in a better way in the inner loop. I thought something like this way of caching into a more local variable could help, though I have no experience in using structs
Code: Select all
struct CountStruct {
  unsigned short count[5775];
};

CountStruct count1[5775];
CountStruct count2[5775];

<read in count1 and count2 from file>

void count_gangster() {

  CountStruct local_count1, local_count2;

  for (<many loops>) {

    <calculate b1 and b3>
    local_count1 = count1[b1];
    local_count2 = count2[b3];

    for (i=0; i<346; i++)
      total += local_count1.count[b2[i]] * local_count2.count[b4[i]];

  }

}

Testing proved that this only slowed down the code. What I really want is maybe some assembly programming like this:
Code: Select all
unsigned short count1[5775][5775];
unsigned short count2[5775][5775];

<read in count1 and count2 from file>

void count_gangster() {

  unsigned short local_count1[5775];
  unsigned short local_count2[5775];

  for (<many loops>) {

    <calculate b1 and b3>

    <Assembly code copying count1[b1][0...5774] -> local_count1[0...5774]>
    <Assembly code copying count2[b3][0...5774] -> local_count2[0...5774]>

    for (i=0; i<346; i++)
      total += local_count1[b2[i]] * local_count2[b4[i]];

  }

}

Is this doable? Could it help? The last time I did assembler programming was over 20 years ago, so I need some help to know if there is an easy way to move blocks of data this way, and how to do it.
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby PatmaxDaddy » Fri Feb 24, 2006 1:30 am

kjellfp wrote:Is this doable? Could it help?
It's doable, but all that copying will only make it slower. Local data on the stack is NOT accessed faster than static data. The thing that matters is where the data are physically, and the places are, in decreasing order of speed, CPU registers, L1 cache, L2 cache, DRAM, disk. You have 133 Mb of counts, which will fit in DRAM but not L2 cache. The data will be copied by the hardware from DRAM to L2, and from L2 to L1, as it is accessed. Copying it from static arrays to the stack just copies from one place in L1 cache to another, and is as waste of time.

There may be options for improving access speed, however. Cache is optimized for repeated access to the same locations, not for sequential access to a long array of data. When your access misses the L1 cache or L2 cache, a long delay (called a "stall") occurs. The problem is that the CPU does not know what data you will need next, but you as the programmer often do. You can tell the CPU to prefetch data you know you will need soon to reduce or eliminate stalls. There is a special assembly instruction for this on recent Pentium CPUs. It may or may not help in your case; more analysis is needed.

Are you using a Pentium 4 CPU? What compiler are you using? Can you post a disassembly of the inner loop?
PatmaxDaddy
 
Posts: 63
Joined: 18 October 2005

Postby kjellfp » Fri Feb 24, 2006 7:05 pm

The source code used for 4x3-counting will be sent to you on mail.

I have a Pentium 4 CPU, 1GB of DRAM, 3 GHz CPU. I use cygwin for Windows, with the g++ compiler. I will of course consider using another compiler if that's a benifit.

The main algorithm receives the contents of the first band, boxes b_00, b_01, b_02, b_03 (I denote the boxes as b_ij for band i, stack j, counting starting at 0). b_00 is always the cannonical box, 0123/4567/89AB. The list of the 346 pairs of boxes that fit in as b_10 and b_20 are precalculated. They are stored such that the first 173 boxes are enough when chosing b_10 and b_20 (like only 28 out of the 56 boxes are needed in the first loop for 3x3-counting).

In the beginning, the list of the 346 fitting boxes, together with a change of basis from knowing b_01, b_02, b_03 helps me calculate the list of the 346 choices for b_11,b_12,b_13,b_21,b_22,b_23.

The first loop is the 173 choices for the pairs (b_10,b_20).

At this stage, the collections of the 346 choices for b_11,b_12,b_13,b_21,b_22,b_23 are recalculated with respect to a change of basis bringing b_10 and b_20 to the cannonical box.

The second loop is the 346 choices for the pairs (b_11,b_21).

At this stage we can find the base index BASE1 of the pair (b_10,b_11), there are 9 choices for BASE1. We also find the base index BASE2 of the pair (b_20,b_21).

To save memory, this version of the program only handles one particular choice for (BASE1,BASE2) (together with (BASE2,BASE1) as it uses the same set of data). So if we don't get our particular choice for BASE1 and BASE2 now, we leave the calculation for another version of the program, and skip to the next iteration of the loop.

If we got our case for BASE1 and BASE2, we move on by a change of basis sending b_11 and b_21 to the cannonical form with repsect to their base, and recalculate again the choices for b_12, b_13, b_22, b_23.

The third and fourth loop is the 346 choices for the pairs (b_12,b_22) and inside that, the 346 choices for the pairs (b_13,b_23), and so we sum up the 346x346 terms, each being a product of the count lookup value for (b_12,b_13) and for (b_22,b_23).

I think the right place to do a DRAM to L2 copy of data is just before the third loop. We then know which 346x346 values to move over for the two tables, so if this is possible to do in one "stall", that would maybe be great. But is this possible? I know the 346 rows and columns of my original 5775x5775 matrix that I want to move over, but they come widely distributed in the original matrix, and not as "neighbouring" rows and columns.

If this is possible, what I need is 346 x 346 entries in two different arrays where the values occupy 2 bytes. That's a total of 346*346*2*2 = 478864 bytes, well within the size of the L2 cache.

This is how I conceptually imagine the code put out here before could look (though I have no idea about how this actually is done)
Code: Select all
unsigned short count1[5775][5775];
unsigned short count2[5775][5775];

<read in count1 and count2 from file>

void count_gangster() {

  for (<loops chosing boxes in first two stacks>) {

    <calculate list of 346 boxes for b_12, b_13, b_22, b_23>

    // Prefetch DRAM-stored data to L2 cache
    <inside one stall do> {
      for (i=0; i<346; i++) for (j=0; j<346; j++)
        <fetch count1[b_12[i]][b_13[j]]>;
      for (i=0; i<346; i++) for (j=0; j<346; j++)
        <fetch count2[b_22[i]][b_23[j]]>;
    }

    for (i=0; i<346; i++)
      for (j=0; j<346; j++)
        total += count1[b_12[i]][b_13[j]] * count2[b_22[i]][b_23[j]];

  }

}

(Of course I rather cache the b_12[i]-value instead of looking it up every time inside the j-loop, but that's not the issue here)
kjellfp
 
Posts: 140
Joined: 04 October 2005

4x3 progress

Postby kjellfp » Fri Mar 24, 2006 9:36 am

PatmaxDaddy and I have spent some time on the 4x3 counting challenge, and we have now started the real count.

I originally wrote some code to do the counting, this would complete in about 320 days. PatmaxDaddy has then used his valuable skills to improve the code and write parts of it in assembly, so we are probably now down to less than 100 days. His coding has been an excellent contribution to the work.

We are now using up to 5 computers for the counting on a more or less full-time basis. The job now is basically to sit down and wait!
kjellfp
 
Posts: 140
Joined: 04 October 2005

Next

Return to General

cron