This is a very dirty PR, more a POC than anything else at this point.
half-rs is using a fork (starkat99/half-rs#98) to get some intrinsics for pure f16 computing that do not exist yet.
I then hackishly added them into gemm:
Copy-pasted the code for f16 gemm (which does f16 -> f32 SIMD -> matmul -> f16) to do purely f16 -> f16.
The code requires black_box atm for the compiler to be happy. This is most likely an error of mine in the half-rs intrinsics implementation (I used the arm! macro but do not understand how that affects the compiler).
I didn't re-optimize this afterwards to make sure cache lines were adapted or anything of the sort.
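To make the difference between the two paths concrete, here is a minimal scalar sketch. It is not the code in this PR: stable Rust has no built-in f16, so `f32` stands in for the narrow storage type and `f64` for the wider compute type, and the function names `gemm_widen` and `gemm_native` are made up for illustration. The real kernels are SIMD and blocked; this only shows where the conversions happen.

```rust
// Hypothetical stand-ins: `Narrow` plays the role of f16 (storage type),
// `Wide` plays the role of f32 (wider compute type).
type Narrow = f32;
type Wide = f64;

/// Widening path: convert operands up, accumulate in the wide type,
/// convert the result back down once per output element.
fn gemm_widen(m: usize, n: usize, k: usize, a: &[Narrow], b: &[Narrow]) -> Vec<Narrow> {
    let mut c = vec![0.0 as Narrow; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc: Wide = 0.0;
            for p in 0..k {
                acc += (a[i * k + p] as Wide) * (b[p * n + j] as Wide);
            }
            c[i * n + j] = acc as Narrow;
        }
    }
    c
}

/// Native path: multiply and accumulate directly in the narrow type,
/// with no conversions at all.
fn gemm_native(m: usize, n: usize, k: usize, a: &[Narrow], b: &[Narrow]) -> Vec<Narrow> {
    let mut c = vec![0.0 as Narrow; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc: Narrow = 0.0;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

fn main() {
    // Row-major 2x3 times 3x2.
    let a: [Narrow; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b: [Narrow; 6] = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0];
    println!("{:?}", gemm_widen(2, 2, 3, &a, &b));
    println!("{:?}", gemm_native(2, 2, 3, &a, &b));
}
```

The native path skips the conversions but accumulates in lower precision, so with a large K the results can drift further from the widened version.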
Current results:
GGML WITHOUT ACCELERATE (f32xf16) -> f32 : 220ms (1 thread) - 197ms (8 threads)
GEMM (f16xf16) -> f16: 134ms (1 thread) - 110ms (8 threads)
M, N, K: 4096 x 128 x 11108
For reference, Accelerate seems to take ~25ms for the same op, and threading seems to decrease its performance, which I guess is because Accelerate already uses threading underneath.
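To put the timings on a common scale, they can be converted to throughput. The sketch below assumes the usual 2·M·N·K operation count for a matmul (one multiply and one add per inner step); the helper names are hypothetical and the timings are the ones quoted above.

```rust
/// Floating-point operations for an (M x K) * (K x N) matmul,
/// counting one multiply and one add per inner-loop step.
fn matmul_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Throughput in GFLOP/s for a given operation count and wall time in ms.
fn gflops_per_sec(flops: u64, millis: f64) -> f64 {
    flops as f64 / (millis * 1e-3) / 1e9
}

fn main() {
    let flops = matmul_flops(4096, 128, 11108);
    println!("total: {} FLOP", flops);

    // Timings taken from the results listed in this PR description.
    let runs = [
        ("GGML (f32xf16) -> f32, 1 thread", 220.0),
        ("GGML (f32xf16) -> f32, 8 threads", 197.0),
        ("GEMM (f16xf16) -> f16, 1 thread", 134.0),
        ("GEMM (f16xf16) -> f16, 8 threads", 110.0),
        ("Accelerate (approx.)", 25.0),
    ];
    for (label, ms) in runs {
        println!("{}: {:.1} GFLOP/s", label, gflops_per_sec(flops, ms));
    }
}
```

These are rough numbers from single measurements, so they only indicate the order of magnitude of the gap to Accelerate.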