F16 intrinsics standalone #14

Narsil · 2023-08-01T11:49:02Z

This is a very dirty PR, more a POC than anything else at this point.

It seems to work and be correct (it passes in every scenario I tried).
It is faster than without the intrinsics.

half-rs is using a fork (starkat99/half-rs#98) to get some intrinsics for pure f16 computation that do not exist yet.

Then I hackishly added them into gemm:

Copy-pasted the code for f16 gemm (which does f16 -> f32simd -> matmul -> f16) to do purely f16 -> f16.

The code requires black_box at the moment for the compiler to be happy. This is most likely an error of mine in the half-rs intrinsics implementation (I used the arm! macro but do not understand how that affects the compiler).
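For readers unfamiliar with it, here is a minimal sketch of what wrapping a kernel step in `std::hint::black_box` looks like. It is not the code from this PR: `fma_lane` is a hypothetical stand-in for an f16 multiply-add intrinsic, and it uses f32 so the sketch runs on stable Rust without the half-rs fork. `black_box` returns its argument unchanged and only hints to the compiler that the value should be treated as opaque.

```rust
use std::hint::black_box;

// Hypothetical stand-in for an f16 fused multiply-add intrinsic.
// Plain f32 arithmetic is used so this compiles on stable Rust.
fn fma_lane(acc: f32, a: f32, b: f32) -> f32 {
    acc + a * b
}

// Dot product where every accumulator update goes through black_box.
// black_box is an identity function; it only discourages the optimizer
// from reasoning about (or reordering around) the wrapped value.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b) {
        acc = black_box(fma_lane(acc, x, y));
    }
    acc
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [0.5, 0.5, 0.5, 0.5];
    println!("dot = {}", dot(&a, &b));
}
```

Since `black_box` is only a best-effort hint, relying on it for correctness usually points at a problem elsewhere, which is consistent with the suspicion above about the intrinsics implementation.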

I didn't re-optimize this afterwards to make sure cache lines were adapted or anything of the sort.

Current results:

GGML WITHOUT ACCELERATE (f32xf16) -> f32 : 220ms (1 thread) - 197ms (8 threads)
GEMM (f16xf16) -> f16: 134ms (1 thread) - 110ms (8 threads)
M, N, K : 4096 x 128 x 11108

For reference, Accelerate seems to take ~25ms for the same op, and threading seems to decrease its performance (which I guess is because Accelerate already uses threading underneath).
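To put these timings on a common scale, here is a small sketch that converts them to throughput. It assumes the usual 2·M·N·K operation count for a matrix product (one multiply and one add per term); the function names are made up for illustration, and the timings are the ones quoted above.

```rust
/// Floating-point operations in a GEMM with dimensions M, N, K,
/// counting one multiply and one add per inner-product term.
fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Throughput in GFLOP/s for a run that took `millis` milliseconds.
fn gflops_per_sec(flops: u64, millis: f64) -> f64 {
    flops as f64 / (millis * 1e-3) / 1e9
}

fn main() {
    // M, N, K from the benchmark above.
    let flops = gemm_flops(4096, 128, 11108);

    // (label, time in ms) pairs taken from the results above.
    let runs = [
        ("GGML f32xf16, 1 thread", 220.0),
        ("GGML f32xf16, 8 threads", 197.0),
        ("GEMM f16xf16, 1 thread", 134.0),
        ("GEMM f16xf16, 8 threads", 110.0),
        ("Accelerate (approx.)", 25.0),
    ];
    for (label, ms) in runs {
        println!("{label}: {:.1} GFLOP/s", gflops_per_sec(flops, ms));
    }
}
```

The operation count is about 11.6 GFLOP per call, so the gap between the f16 path and Accelerate is roughly the ratio of the timings (110ms vs ~25ms).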

LaurentMazare and others added 15 commits July 1, 2023 17:29

Vectorize the inner loop of f16's pack_generic.

a9f610e

Remove the old version.

4b189f2

Wasm simd ops (f32 only).

926d13a

Fix multiplier.

57922f0

Adding a parallelism bench.

8d205b6

Merge remote-tracking branch 'narsil/wasm_simd' into f16-vec-plus-wasm-simd

e91056e

Fixing large multi-threading (-40% improvement for parallelism benchmark)

0da2128

Fix.

c1a5b31

Fix tests.

76ea6bd

Format.

b11ea6f

Merge pull request #2 from Narsil/rayon3

f577d01

This improves drastically overthreading issue (>48cores)

Merge pull request #3 from LaurentMazare/f16-vec-plus-wasm-simd

c03b453

F16 vec plus wasm simd

Using m1 intrinsics for f16xf16

c7a1ceb

Removing black box.

a8f0280

Cleanup.

c2d2173

Narsil closed this Aug 1, 2023
