
Get the core hrend and vrend assembly routines to compile properly in GCC #19

Open
Ericson2314 opened this issue Apr 4, 2013 · 11 comments

Comments

@Ericson2314
Owner

Certainly use "p" (link-time constant) constraints for pointers to global variables. Ideally, also use "dummy variables" to avoid hard-coding any intermediate constants.
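
For illustration, a minimal sketch of the idea (not the actual hrend/vrend code): angstart stands in for one of the global tables (its real type may differ), the asm body is a placeholder load, and a plain "r" register constraint is used instead of the "p" constraint mentioned above, but the point is the same: the symbol's address reaches the template through an operand rather than being hard-coded in the string.

/* Sketch only: angstart is a stand-in global table and the asm body is a
   placeholder.  The table's address comes in through an operand instead of
   being spelled out as a literal symbol inside the template string. */
extern int angstart[];

static inline int first_angstart_entry(void)
{
    int value;
    __asm__ ("movl (%1), %0"
             : "=r" (value)      /* output: any general-purpose register */
             : "r"  (angstart)   /* input: address of the global table */
             : "memory");        /* the asm reads memory not listed as an operand */
    return value;
}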

@Lensman

Lensman commented Apr 4, 2013

I'm pretending I didn't see this one.

@Ericson2314
Owner Author

:) It actually shouldn't be too bad, just a lot of tedious stuff. I am going to do it without dummy constraints first and just use hard-coded clobbers.
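
As a rough sketch of the difference (placeholder asm, hypothetical helper names): a hard-coded clobber list names the scratch registers the template uses, while a dummy output operand lets GCC pick the scratch register instead.

/* Style 1: hard-coded clobber - the template uses eax directly and the
   clobber list tells GCC it gets trashed.  Placeholder code, not hrend/vrend. */
static inline void double_in_place(int *p)
{
    __asm__ ("movl (%0), %%eax\n\t"
             "addl %%eax, %%eax\n\t"
             "movl %%eax, (%0)"
             :
             : "r" (p)
             : "eax", "memory");
}

/* Style 2: dummy output operand - GCC chooses the scratch register, so no
   register name is hard-coded into the template. */
static inline void double_in_place_v2(int *p)
{
    int scratch;                    /* only exists to reserve a register */
    __asm__ ("movl (%1), %0\n\t"
             "addl %0, %0\n\t"
             "movl %0, (%1)"
             : "=&r" (scratch)      /* early-clobber so it never aliases %1 */
             : "r" (p)
             : "memory");
}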

@Ericson2314
Owner Author

a7594c3

@Lensman

Lensman commented May 4, 2013

Just a quick note: if you want the C version of these functions to run faster, use this (inverse sqrt):

#include <stdint.h>   /* for int32_t; plain long is 8 bytes on 64-bit Linux, which breaks the bit trick */

inline float f_rsqrt( const float number )
{
    int32_t i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( int32_t * ) &y;                    // reinterpret the float's bits as an integer
    i  = 0x5f3759df - ( i >> 1 );               // magic constant minus halved exponent
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    // 2nd iteration, this can be removed if you don't need accuracy
    //y  = y * ( threehalfs - ( x2 * y * y ) );
    return y;
}
then change these lines in vrendz/hrendz:
*(float *)(p0+i) = (float)c0->dist/rsqrt(dirx*dirx+diry*diry);
to
*(float *)(p0+i) = (float)c0->dist*f_rsqrt(dirx*dirx+diry*diry);

It makes the C version run at about 50% of the speed of the asm version, which is an adequate improvement. I've nearly finished converting the renderer to intrinsics, so this issue is almost complete.
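
For what it's worth, a quick standalone check of the approximation against libm looks something like this (hypothetical test harness, assuming f_rsqrt from the snippet above is in scope):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x;
    for (x = 0.25f; x <= 4096.0f; x *= 2.0f) {
        float approx = f_rsqrt(x);
        float exact  = 1.0f / sqrtf(x);
        printf("x=%8.2f  approx=%.6f  exact=%.6f  rel err=%+.4f%%\n",
               x, approx, exact, 100.0f * (approx - exact) / exact);
    }
    return 0;
}

With a single Newton-Raphson iteration the relative error stays well under a percent, which matches the point below about one iteration being enough.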

@Lensman

Lensman commented May 4, 2013

You don't need the second iteration of the Newton-Raphson approximation. One iteration is adequate in the renderer, as the inputs are quantized from an original full-precision sqrt and stored in a lookup table in Ken's original code.

@Ericson2314
Owner Author

Ah, this is http://en.wikipedia.org/wiki/Fast_inverse_square_root, I assume? I'd hope there is also an intrinsic that uses the dedicated single instruction for this.

@Lensman

Lensman commented May 5, 2013

That's right, it's the infamous code from Quake that has had whole articles written about it. The intrinsic is the reciprocal sqrt, which you will find in the v/hrend(z)sse part of the renderer, which operates on 4 values at a time.
I've analysed the renderer in AMD CodeAnalyst, and it's completely memory constrained when using SSE. The C version doesn't have the same issues, as it plods through the data in lockstep with the memory fetches anyway.
The only way to make that bit faster is to refactor the castdat structure so that color and distance are not stored next to each other, or possibly to get rid of the lookup altogether and just calculate in registers. I'll put that on the back burner as an experiment for future meanderings.
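
For reference, the packed form with the SSE intrinsic plus one Newton-Raphson refinement looks roughly like this (a sketch, not the actual v/hrend(z)sse code):

#include <xmmintrin.h>   /* SSE: _mm_rsqrt_ps and friends */

/* Approximate 1/sqrt(x) for four floats at once with rsqrtps (roughly a
   12-bit estimate), then sharpen it with one Newton-Raphson step.
   Illustrative only - the renderer's real SSE path differs. */
static inline __m128 rsqrt4(__m128 x)
{
    const __m128 half       = _mm_set1_ps(0.5f);
    const __m128 threehalfs = _mm_set1_ps(1.5f);

    __m128 y  = _mm_rsqrt_ps(x);
    __m128 yy = _mm_mul_ps(y, y);
    /* y = y * (1.5 - 0.5 * x * y * y) */
    return _mm_mul_ps(y, _mm_sub_ps(threehalfs,
                                    _mm_mul_ps(_mm_mul_ps(half, x), yy)));
}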

@Ericson2314
Owner Author

Wait, so is it the instruction itself or the intrinsic that is hurt by being memory-access bound? Also, could you push your work?

@Lensman

Lensman commented May 5, 2013

I'll tidy up a bit and push so you can have a look.
When I say memory bound, in this instance, the renderer is trying to take advantage of lookup tables (the angstart table in this case), which is an integration of angles made by vline/hline. These fetches and lookups may be redundant on newer CPUs, because they can calculate sincosf faster than a memory access, hence making the function memory bound. Did I explain that correctly?

@Lensman

Lensman commented May 5, 2013

As far as intrinsics go, any intrinsics that use the __m64 datatype are not supported on x86_64; that's not to say that you can't use MMX registers in assembly. It can all be mitigated with ifdefs, so there can still be a non-fatbin version of the executable which is coalesced at compile time. It's just something to be aware of. Take a look at the Intel optimization manual for gotchas. It's related to emms as well, because MMX registers are shared with the FPU's 80-bit registers. All of x86/x87 is a kludge, because of backwards compatibility. Just like Windows, the price you pay for a general solution is complexity. Compared to most chipsets x86 is a frankenmonster ;)
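
A rough sketch of the ifdef pattern (function names are made up for illustration; only the #if structure is the point):

#if defined(__x86_64__) || defined(_M_X64)
  /* 64-bit build: avoid __m64, use the SSE2 equivalents instead */
  #include <emmintrin.h>

  static inline void add_pixels8(unsigned char *dst, const unsigned char *src)
  {
      __m128i a = _mm_loadl_epi64((const __m128i *)dst);
      __m128i b = _mm_loadl_epi64((const __m128i *)src);
      _mm_storel_epi64((__m128i *)dst, _mm_adds_epu8(a, b));
  }
#else
  /* 32-bit build: __m64 is available, but remember emms before touching x87 */
  #include <mmintrin.h>

  static inline void add_pixels8(unsigned char *dst, const unsigned char *src)
  {
      __m64 a = *(const __m64 *)dst;
      __m64 b = *(const __m64 *)src;
      *(__m64 *)dst = _mm_adds_pu8(a, b);   /* saturating add of 8 bytes */
      _mm_empty();                          /* emms: release the shared x87/MMX registers */
  }
#endif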

@Ericson2314
Owner Author

Ha, I thought you were going to say "Just like Windows, the price you pay for backwards compatibility, is kludge".

OK, yeah, I didn't make the connection between no __m64 on x86-64 and intrinsics. Yeah, fatbin stuff just makes it harder to think; I wouldn't mind having dedicated binaries. Ideally our builds will mostly be MinGW anyway, where intrinsics and x86-64 work together fine.
