
Get the core hrend and vrend assembly routines to compile properly in GCC #19

Open
Ericson2314 opened this issue Apr 4, 2013 · 11 comments

Comments

@Ericson2314
Owner

Certainly use "p" (link-time constant) constraints for pointers to global variables. Ideally, also use "dummy variables" to avoid hard-coding any intermediate constants.
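
For illustration, a minimal sketch of the idea (not the actual hrend/vrend code): angstart stands in for one of the global tables (its real type may differ), the asm body is a placeholder load, and a plain "r" register constraint is used instead of the "p" constraint mentioned above, but the point is the same: the symbol's address reaches the template through an operand rather than being hard-coded in the string.

/* Sketch only: angstart is a stand-in global table and the asm body is a
   placeholder.  The table's address comes in through an operand instead of
   being spelled out as a literal symbol inside the template string. */
extern int angstart[];

static inline int first_angstart_entry(void)
{
    int value;
    __asm__ ("movl (%1), %0"
             : "=r" (value)      /* output: any general-purpose register */
             : "r"  (angstart)   /* input: address of the global table */
             : "memory");        /* the asm reads memory not listed as an operand */
    return value;
}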

@Lensman

Lensman commented Apr 4, 2013

I'm pretending I didn't see this one.

@Ericson2314
Owner Author

:) It actually shouldn't be too bad, just a lot of tedious stuff. I am going to do it without dummy constraints first and just use hard-coded clobbers.
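
As a rough sketch of the difference (placeholder asm, hypothetical helper names): a hard-coded clobber list names the scratch registers the template uses, while a dummy output operand lets GCC pick the scratch register instead.

/* Style 1: hard-coded clobber - the template uses eax directly and the
   clobber list tells GCC it gets trashed.  Placeholder code, not hrend/vrend. */
static inline void double_in_place(int *p)
{
    __asm__ ("movl (%0), %%eax\n\t"
             "addl %%eax, %%eax\n\t"
             "movl %%eax, (%0)"
             :
             : "r" (p)
             : "eax", "memory");
}

/* Style 2: dummy output operand - GCC chooses the scratch register, so no
   register name is hard-coded into the template. */
static inline void double_in_place_v2(int *p)
{
    int scratch;                    /* only exists to reserve a register */
    __asm__ ("movl (%1), %0\n\t"
             "addl %0, %0\n\t"
             "movl %0, (%1)"
             : "=&r" (scratch)      /* early-clobber so it never aliases %1 */
             : "r" (p)
             : "memory");
}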

@Ericson2314
Owner Author

a7594c3

@Lensman

Lensman commented May 4, 2013

Just a quick note: if you want the C version of these functions to run faster, use this (inverse sqrt):

#include <stdint.h>   /* for int32_t; plain long is 8 bytes on 64-bit Linux, which breaks the bit trick */

inline float f_rsqrt( const float number )
{
    int32_t i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( int32_t * ) &y;                    // reinterpret the float's bits as an integer
    i  = 0x5f3759df - ( i >> 1 );               // magic constant minus halved exponent
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    // 2nd iteration, this can be removed if you don't need accuracy
    //y  = y * ( threehalfs - ( x2 * y * y ) );
    return y;
}
then change these lines in vrendz/hrendz:
*(float *)(p0+i) = (float)c0->dist/rsqrt(dirx*dirx+diry*diry);
to
*(float *)(p0+i) = (float)c0->dist*f_rsqrt(dirx*dirx+diry*diry);

It makes the C version run at about 50% of the speed of the asm version, which is an adequate improvement. I've nearly finished converting the renderer to intrinsics, so this issue is almost complete.
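
For what it's worth, a quick standalone check of the approximation against libm looks something like this (hypothetical test harness, assuming f_rsqrt from the snippet above is in scope):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x;
    for (x = 0.25f; x <= 4096.0f; x *= 2.0f) {
        float approx = f_rsqrt(x);
        float exact  = 1.0f / sqrtf(x);
        printf("x=%8.2f  approx=%.6f  exact=%.6f  rel err=%+.4f%%\n",
               x, approx, exact, 100.0f * (approx - exact) / exact);
    }
    return 0;
}

With a single Newton-Raphson iteration the relative error stays well under a percent, which matches the point below about one iteration being enough.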

@Lensman

Lensman commented May 4, 2013

You don't need the second iteration of the Newton-Raphson approximation. One iteration is adequate in the renderer, as the inputs are quantized from an original full-precision sqrt and stored in a lookup table in Ken's original code.

@Ericson2314
Owner Author

Ah, this is http://en.wikipedia.org/wiki/Fast_inverse_square_root, I assume? I'd hope there is also an intrinsic that uses the dedicated single instruction for this.

@Lensman

Lensman commented May 5, 2013

That's right, it's the infamous code from Quake that has had whole articles written about it. The intrinsic is the reciprocal sqrt, which you will find in the v/hrend(z)sse part of the renderer, which operates on 4 values at a time.
I've analysed the renderer in AMD CodeAnalyst, and it's completely memory constrained when using SSE. The C version doesn't have the same issues, as it plods through the data in lockstep with the memory fetches anyway.
The only way to make that bit faster is to refactor the castdat structure so that color and distance are not stored next to each other, or possibly to get rid of the lookup altogether and just calculate in registers. I'll put that on the back burner as an experiment for future meanderings.
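
For reference, the packed form with the SSE intrinsic plus one Newton-Raphson refinement looks roughly like this (a sketch, not the actual v/hrend(z)sse code):

#include <xmmintrin.h>   /* SSE: _mm_rsqrt_ps and friends */

/* Approximate 1/sqrt(x) for four floats at once with rsqrtps (roughly a
   12-bit estimate), then sharpen it with one Newton-Raphson step.
   Illustrative only - the renderer's real SSE path differs. */
static inline __m128 rsqrt4(__m128 x)
{
    const __m128 half       = _mm_set1_ps(0.5f);
    const __m128 threehalfs = _mm_set1_ps(1.5f);

    __m128 y  = _mm_rsqrt_ps(x);
    __m128 yy = _mm_mul_ps(y, y);
    /* y = y * (1.5 - 0.5 * x * y * y) */
    return _mm_mul_ps(y, _mm_sub_ps(threehalfs,
                                    _mm_mul_ps(_mm_mul_ps(half, x), yy)));
}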

@Ericson2314
Owner Author

Wait, so is it the instruction itself or the intrinsic that is hurt by being memory-access bound? Also, could you push your work?

@Lensman

Lensman commented May 5, 2013

I'll tidy up a bit and push so you can have a look.
When I say memory bound, in this instance, the renderer is trying to take advantage of lookup tables (the angstart table in this case), which is an integration of angles made by vline/hline. These fetches and lookups may be redundant on newer CPUs, because they can calculate sincosf faster than a memory access, hence making the function memory bound. Did I explain that correctly?

@Lensman

Lensman commented May 5, 2013

As far as intrinsics go, any intrinsics that use the __m64 datatype are not supported on x86_64; that's not to say that you can't use MMX registers in assembly. It can all be mitigated with ifdefs, so there can still be a non-fatbin version of the executable which is coalesced at compile time. It's just something to be aware of. Take a look at the Intel optimization manual for gotchas. It's related to emms as well, because MMX registers are shared with the FPU's 80-bit registers. All of x86/x87 is a kludge, because of backwards compatibility. Just like Windows, the price you pay for a general solution is complexity. Compared to most chipsets x86 is a frankenmonster ;)
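
A rough sketch of the ifdef pattern (function names are made up for illustration; only the #if structure is the point):

#if defined(__x86_64__) || defined(_M_X64)
  /* 64-bit build: avoid __m64, use the SSE2 equivalents instead */
  #include <emmintrin.h>

  static inline void add_pixels8(unsigned char *dst, const unsigned char *src)
  {
      __m128i a = _mm_loadl_epi64((const __m128i *)dst);
      __m128i b = _mm_loadl_epi64((const __m128i *)src);
      _mm_storel_epi64((__m128i *)dst, _mm_adds_epu8(a, b));
  }
#else
  /* 32-bit build: __m64 is available, but remember emms before touching x87 */
  #include <mmintrin.h>

  static inline void add_pixels8(unsigned char *dst, const unsigned char *src)
  {
      __m64 a = *(const __m64 *)dst;
      __m64 b = *(const __m64 *)src;
      *(__m64 *)dst = _mm_adds_pu8(a, b);   /* saturating add of 8 bytes */
      _mm_empty();                          /* emms: release the shared x87/MMX registers */
  }
#endif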

@Ericson2314
Owner Author

Ha, I thought you were going to say "Just like Windows, the price you pay for backwards compatibility, is kludge".

OK, yeah, I didn't make the connection between no __m64 on x86-64 and intrinsics. Yeah, fatbin stuff just makes it harder to think; I wouldn't mind having dedicated binaries. Ideally our builds will mostly be MinGW anyway, where intrinsics and x86-64 work together fine.
