Hi mladen,
champagne wrote: running 7 cores in parallel, I see a deterioration in the timings that I roughly estimate at around 20% (but it can be worse).
You should be happy if the performance degradation is only 20%.
You are likely right, but since my priority was to test pass 1 (BTW I found another small bug in the vicinity of the last one), I just tested the sample with the processor fully loaded.
I’ll try to get a better estimate of the deterioration later, on the “final” version of the code.
I started the first 10 “band 1” on the same processor (the best one) and I am waiting for the 3 bands with the highest number of ED bands 1+2 to finish, to get an idea of the average run time per ED band 1+2 in this area (2.7% of the bands 3 to test in pass 1; it would be much more in pass 2).
For code where memory access is the bottleneck, merely running a second instance of the program in parallel can halve the processing speed, because that is enough to pollute the cache.
Following the simplified hardware diagram
RAM <-> cache <-> registers <-> ALU
the ideal scalability is achievable only when the inner loops work only on registers.
Clear, but not so easy to apply. Here, the inner loop uses a 256x256 matrix at a fixed place and the code is limited to a few tens of instructions, for example:
Code:
    for (ix = 0; ix < nxc; ix++, px += 2) {     // 2 x 64-bit words per X entry
        store = ix << 8;                        // low 8 bits are left for iy
        register uint64_t *py = vyc;
        for (iy = 0; iy < nyc; iy++, py += 2) { // 2 x 64-bit words per Y entry
            if ((px[0] & py[0]) || (px[1] & py[1]))
                continue;                       // X and Y share a bit: skip the pair
            *pstore++ = store | iy;             // (ix << 8) | iy
        }
    }
where nxc = nyc = 256; this loop likely accounts for about 80% of the core time.
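For anyone who wants to try that loop in isolation, here is a minimal self-contained sketch of it. The function name `collect_disjoint_pairs`, the `uint32_t` output type and the way the tables are passed in are assumptions made for the example, not the layout of the real program. The two loop-invariant X words are copied to locals so the compiler can keep them in registers, and the test uses a bitwise `|` instead of `||`, which gives the same result with a single branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: vxc and vyc each hold 2 uint64_t words per entry
   (one 128-bit mask), matching the px[0]/px[1] accesses of the loop above.
   out must have room for nxc * nyc entries.
   Returns the number of (ix, iy) pairs stored. */
size_t collect_disjoint_pairs(const uint64_t *vxc, int nxc,
                              const uint64_t *vyc, int nyc,
                              uint32_t *out)
{
    /* iy is packed into the low 8 bits, so nyc cannot exceed 256 */
    assert(nxc >= 0 && nyc >= 0 && nyc <= 256);

    uint32_t *pstore = out;
    const uint64_t *px = vxc;
    for (int ix = 0; ix < nxc; ix++, px += 2) {
        const uint32_t store = (uint32_t)ix << 8;
        const uint64_t x0 = px[0], x1 = px[1];  /* invariant in the inner loop */
        const uint64_t *py = vyc;
        for (int iy = 0; iy < nyc; iy++, py += 2) {
            if ((x0 & py[0]) | (x1 & py[1]))
                continue;                       /* masks overlap: reject the pair */
            *pstore++ = store | (uint32_t)iy;   /* (ix << 8) | iy */
        }
    }
    return (size_t)(pstore - out);
}
```

With 256 entries on each side, the Y table is 256 x 16 bytes = 4 KB, so it should stay in the L1 cache as long as nothing else evicts it.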
The art is either to 
- find a good balance between memory operations and ALU instructions (for example, a few additional tables with intermediate results could work fine on a single thread but cause cache pollution with 2 or more parallel threads, so that total performance is better without them),
If you are thinking of lookup tables, I tried to kill most of them for the reasons explained here. Some more cleaning is needed to reach the optimum.
- instead of running several instances of the process, run a single instance and run its middle loops in parallel. This increases the chances that different threads use large amounts of the same data from the same memory addresses, leaving more cache free for the thread-specific data.
Not so easy to implement, which is why I stick to what I can manage: the same program run in parallel. But one could think of running the same band 1 in parallel, sharing the corresponding tables.
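A rough sketch of what sharing the tables inside one process could look like, under the assumption that OpenMP is available (the program discussed here does not use it, as far as this thread shows). The function name `collect_pairs_sliced` and the constant `NCHUNK` are hypothetical. Each slice scans its own range of ix against the whole shared Y table and writes to its own region of the output, so the threads share only read-only data.

```c
#include <stddef.h>
#include <stdint.h>

#define NCHUNK 8   /* hypothetical number of work slices, not from the real code */

/* vxc and vyc hold 2 uint64_t words per entry and are only read.
   out must have room for nxc * nyc entries. Slice c stores its pairs starting
   at out[first_ix_of_slice * nyc] and reports how many it stored in counts[c]. */
void collect_pairs_sliced(const uint64_t *vxc, int nxc,
                          const uint64_t *vyc, int nyc,
                          uint32_t *out, size_t counts[NCHUNK])
{
    /* Ignored (serial execution) unless compiled with OpenMP, e.g. -fopenmp */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < NCHUNK; c++) {
        const int first = (int)((long)nxc * c / NCHUNK);
        const int last  = (int)((long)nxc * (c + 1) / NCHUNK);
        uint32_t *base = out + (size_t)first * (size_t)nyc;  /* private output area */
        uint32_t *pstore = base;
        for (int ix = first; ix < last; ix++) {
            const uint64_t x0 = vxc[2 * ix], x1 = vxc[2 * ix + 1];
            const uint32_t store = (uint32_t)ix << 8;
            const uint64_t *py = vyc;                        /* shared, read-only */
            for (int iy = 0; iy < nyc; iy++, py += 2) {
                if ((x0 & py[0]) | (x1 & py[1]))
                    continue;                                /* masks overlap */
                *pstore++ = store | (uint32_t)iy;
            }
        }
        counts[c] = (size_t)(pstore - base);
    }
}
```

The results end up in NCHUNK separate runs instead of one contiguous list, so the consumer has to walk them slice by slice using `counts`. Whether this beats several independent processes depends on how much of the working set is really common to the threads, which would have to be measured.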
found no time to look whether your source code is read/write or write-only.
Do you mean watching or contributing?
My last attempt to improve the process is to apply a 2X3Y or a 3X3Y for the bands with a high number of valid bands.
This goes in the same direction, but the drawback is twofold:
- time spent in the “vectors building”,
- more chunks far from the 256x256 size,
so this has to be excluded for bands with a small number of valid bands.
Cheers
champagne