C programmers using large tables and looking for high speed should definitely be aware of the cache properties of the machine they're using. Understanding your memory access pattern, rearranging it to avoid random access, and aligning data to cache line boundaries can all be done in C. The sorting we used for 4x3 counting made by far the biggest difference in speed, and it was all written in C.
Some C compilers provide a means of doing SSE2 operations in C, using intrinsic function calls. I know Microsoft Visual C++ does; I don't know about others, and I've never used them since I prefer just writing assembly. I would consider using SSE2 when your computation can benefit from SIMD parallelism, i.e. operating on 1-, 2-, 4-, or 8-byte integers, or 4- or 8-byte floats, in parallel. This does not work well when the data have to be fetched from random locations, but when they are packed in sequential locations the performance can be great.
I would probably forget about non-temporal stores and prefetch. Software pipelines are only useful in assembly language.
Note that most compilers make writing an inner loop in assembly language, embedded in the C program and with access to C data, rather painless. Writing a short inner loop may be a nice project for the more adventurous C programmer.
I don't know about GCC and -O9, but do let it optimize as much as it can. I use MS Visual C++, and I'm generally impressed with the code it generates.
I almost never use a profiler; I can't remember the last time I used one. First choice for finding bottlenecks is to step back from the computer and think about what's going on. This works well when you have a good mental model of what the CPU is actually doing. You'll note that all of my explanations of speeding up 4x3 are based on an analysis of memory and CPU characteristics, not on empirical timing data. Of course it has taken me three decades to get to this point, and there is still quite a bit I can't figure out about what's going on inside the P4.
Second choice, for me anyway, is to measure execution times. It does help to know where to look, since it's impractical to time everything (I guess that's what a profiler does, but you have to learn how to use it, and you use it so infrequently that you have to relearn each time). clock() is good for timing relatively slow things; on my machine it has a resolution of 1 ms, so timing events less than 10 ms is unreliable. I use rdtsc because it has CPU clock tick resolution, and can be used for timing very short events, like one iteration of an inner loop. I wrap rdtsc in a C++ class for convenience.
Keep in mind that the computer is always doing other things besides running your program, so measured times can vary. One approach is to run the code say 100 times and get an average, but of course this takes 100 times longer. When I want to time code I make sure to close every other program, and I will go so far as to unplug the Ethernet cable so there won't be any network traffic to respond to. I'm not sure if this really helps, but it's easy to do.