KernelIntrinsics #562

Open · wants to merge 3 commits into base: vc/pocl
Conversation

@vchuravy (Member) commented Feb 4, 2025

The goal is to allow kernels to be written without relying on the KernelAbstractions macros.

cc: @maleadt @pxl-th

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 0% with 28 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (f038d8c) to head (0a8301c).

Files with missing lines      Patch %   Lines
src/KernelAbstractions.jl     0.00%     12 Missing ⚠️
src/macros.jl                 0.00%     10 Missing ⚠️
src/pocl/backend.jl           0.00%      6 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           vc/pocl    #562   +/-   ##
=======================================
  Coverage     0.00%   0.00%           
=======================================
  Files           21      21           
  Lines         1509    1519   +10     
=======================================
- Misses        1509    1519   +10     



```julia
Returns the unique local work-item ID.
"""
function get_local_id end
```
Member:
So IIUC, backends should implement these like below, right?

```julia
function get_local_id()
    return (threadIdx().x, threadIdx().y, threadIdx().z)
end
```

Member Author:

Yeah, basically. My goal is to replace the old internal functions that people had to override with definitions based on these functions.
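For illustration, here is a hedged sketch of what a backend definition of such an intrinsic could look like. The function names and the NamedTuple-style return follow the docstrings in this PR; the degenerate single-work-item semantics and the `FakeBackend` module are assumptions made purely for this example, not this PR's actual implementation:

```julia
# Hypothetical sketch: a trivial "backend" providing the intrinsics
# directly, instead of overriding KernelAbstractions' old internals.
# The single-work-item semantics below are an assumption for
# illustration only.
module FakeBackend

function get_local_id()
    # Degenerate backend: one work-item per workgroup, so the local
    # ID is always (1, 1, 1).
    return (x = Int32(1), y = Int32(1), z = Int32(1))
end

function get_local_size()
    # Matching workgroup size for the single-work-item model.
    return (x = Int32(1), y = Int32(1), z = Int32(1))
end

end # module

id = FakeBackend.get_local_id()
@assert id.x == 1 && id.y == 1 && id.z == 1
```

A real GPU backend would instead forward to its native indexing intrinsics, as in the CUDA-style `threadIdx()` example above.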

@vchuravy (Member Author) commented Feb 5, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.


Resolved (outdated) review threads: src/KernelAbstractions.jl, src/intrinsics.jl (two threads)
github-actions bot (Contributor) commented Feb 5, 2025

Benchmark Results

Benchmark    main    0a8301c...    ratio (main/0a8301c1af8f52...)
saxpy/default/Float16/1024 0.731 ± 0.01 μs 0.0524 ± 0.026 ms 0.0139
saxpy/default/Float16/1048576 0.175 ± 0.0081 ms 0.891 ± 0.024 ms 0.197
saxpy/default/Float16/16384 3.33 ± 0.029 μs 0.0639 ± 0.028 ms 0.0522
saxpy/default/Float16/2048 0.913 ± 0.013 μs 0.0589 ± 0.024 ms 0.0155
saxpy/default/Float16/256 0.585 ± 0.0089 μs 0.0558 ± 0.027 ms 0.0105
saxpy/default/Float16/262144 0.0443 ± 0.00068 ms 0.271 ± 0.026 ms 0.164
saxpy/default/Float16/32768 6.02 ± 0.057 μs 0.0761 ± 0.028 ms 0.0791
saxpy/default/Float16/4096 1.31 ± 0.025 μs 0.0655 ± 0.026 ms 0.0199
saxpy/default/Float16/512 0.642 ± 0.0074 μs 0.0578 ± 0.026 ms 0.0111
saxpy/default/Float16/64 0.555 ± 0.0057 μs 0.0585 ± 0.027 ms 0.00948
saxpy/default/Float16/65536 11.6 ± 0.12 μs 0.104 ± 0.028 ms 0.111
saxpy/default/Float32/1024 0.633 ± 0.011 μs 0.0569 ± 0.026 ms 0.0111
saxpy/default/Float32/1048576 0.23 ± 0.022 ms 0.473 ± 0.033 ms 0.486
saxpy/default/Float32/16384 2.8 ± 0.26 μs 0.0557 ± 0.026 ms 0.0503
saxpy/default/Float32/2048 0.743 ± 0.054 μs 0.0543 ± 0.024 ms 0.0137
saxpy/default/Float32/256 0.568 ± 0.0059 μs 0.0559 ± 0.027 ms 0.0101
saxpy/default/Float32/262144 0.0446 ± 0.0029 ms 0.162 ± 0.035 ms 0.275
saxpy/default/Float32/32768 5.32 ± 0.56 μs 0.0612 ± 0.027 ms 0.0868
saxpy/default/Float32/4096 1.13 ± 0.094 μs 0.0591 ± 0.025 ms 0.019
saxpy/default/Float32/512 0.601 ± 0.0069 μs 0.0559 ± 0.026 ms 0.0108
saxpy/default/Float32/64 0.557 ± 0.0057 μs 0.0575 ± 0.026 ms 0.00969
saxpy/default/Float32/65536 11.7 ± 1.2 μs 0.0763 ± 0.029 ms 0.153
saxpy/default/Float64/1024 0.747 ± 0.019 μs 0.0574 ± 0.026 ms 0.013
saxpy/default/Float64/1048576 0.485 ± 0.041 ms 0.499 ± 0.038 ms 0.971
saxpy/default/Float64/16384 5.36 ± 0.49 μs 0.0568 ± 0.026 ms 0.0944
saxpy/default/Float64/2048 1.14 ± 0.092 μs 0.0515 ± 0.024 ms 0.0221
saxpy/default/Float64/256 0.574 ± 0.0081 μs 0.0583 ± 0.027 ms 0.00985
saxpy/default/Float64/262144 0.11 ± 0.011 ms 0.173 ± 0.03 ms 0.635
saxpy/default/Float64/32768 12.2 ± 1.3 μs 0.0637 ± 0.026 ms 0.191
saxpy/default/Float64/4096 1.71 ± 0.22 μs 0.0601 ± 0.025 ms 0.0284
saxpy/default/Float64/512 0.626 ± 0.014 μs 0.0555 ± 0.027 ms 0.0113
saxpy/default/Float64/64 0.551 ± 0.008 μs 0.0585 ± 0.027 ms 0.00942
saxpy/default/Float64/65536 24.3 ± 2.7 μs 0.0867 ± 0.027 ms 0.28
saxpy/static workgroup=(1024,)/Float16/1024 2.15 ± 0.024 μs 0.0514 ± 0.026 ms 0.0419
saxpy/static workgroup=(1024,)/Float16/1048576 0.163 ± 0.012 ms 0.9 ± 0.03 ms 0.181
saxpy/static workgroup=(1024,)/Float16/16384 4.4 ± 0.097 μs 0.0608 ± 0.026 ms 0.0723
saxpy/static workgroup=(1024,)/Float16/2048 2.32 ± 0.027 μs 0.0579 ± 0.024 ms 0.04
saxpy/static workgroup=(1024,)/Float16/256 2.79 ± 0.03 μs 0.0554 ± 0.026 ms 0.0504
saxpy/static workgroup=(1024,)/Float16/262144 0.0419 ± 0.0015 ms 0.27 ± 0.027 ms 0.155
saxpy/static workgroup=(1024,)/Float16/32768 6.8 ± 0.18 μs 0.074 ± 0.026 ms 0.0919
saxpy/static workgroup=(1024,)/Float16/4096 2.64 ± 0.036 μs 0.0578 ± 0.026 ms 0.0458
saxpy/static workgroup=(1024,)/Float16/512 3.24 ± 0.035 μs 0.0544 ± 0.026 ms 0.0595
saxpy/static workgroup=(1024,)/Float16/64 2.49 ± 0.22 μs 0.0587 ± 0.027 ms 0.0424
saxpy/static workgroup=(1024,)/Float16/65536 12.7 ± 0.36 μs 0.103 ± 0.026 ms 0.123
saxpy/static workgroup=(1024,)/Float32/1024 2.32 ± 0.026 μs 0.0552 ± 0.026 ms 0.0421
saxpy/static workgroup=(1024,)/Float32/1048576 0.238 ± 0.02 ms 0.462 ± 0.041 ms 0.514
saxpy/static workgroup=(1024,)/Float32/16384 4.52 ± 0.37 μs 0.0527 ± 0.025 ms 0.0857
saxpy/static workgroup=(1024,)/Float32/2048 2.47 ± 0.041 μs 0.0509 ± 0.024 ms 0.0485
saxpy/static workgroup=(1024,)/Float32/256 2.75 ± 0.048 μs 0.0558 ± 0.026 ms 0.0492
saxpy/static workgroup=(1024,)/Float32/262144 0.0586 ± 0.0037 ms 0.159 ± 0.035 ms 0.369
saxpy/static workgroup=(1024,)/Float32/32768 7.54 ± 0.67 μs 0.0587 ± 0.026 ms 0.128
saxpy/static workgroup=(1024,)/Float32/4096 2.76 ± 0.091 μs 0.0557 ± 0.026 ms 0.0496
saxpy/static workgroup=(1024,)/Float32/512 2.77 ± 0.03 μs 0.0569 ± 0.026 ms 0.0487
saxpy/static workgroup=(1024,)/Float32/64 2.76 ± 4.5 μs 0.0562 ± 0.026 ms 0.0492
saxpy/static workgroup=(1024,)/Float32/65536 15.6 ± 1.3 μs 0.0749 ± 0.029 ms 0.208
saxpy/static workgroup=(1024,)/Float64/1024 2.3 ± 0.056 μs 0.0571 ± 0.026 ms 0.0403
saxpy/static workgroup=(1024,)/Float64/1048576 0.513 ± 0.033 ms 0.501 ± 0.044 ms 1.02
saxpy/static workgroup=(1024,)/Float64/16384 7.47 ± 0.53 μs 0.0541 ± 0.025 ms 0.138
saxpy/static workgroup=(1024,)/Float64/2048 2.6 ± 0.1 μs 0.0493 ± 0.023 ms 0.0527
saxpy/static workgroup=(1024,)/Float64/256 2.64 ± 0.057 μs 0.0561 ± 0.025 ms 0.047
saxpy/static workgroup=(1024,)/Float64/262144 0.101 ± 0.012 ms 0.171 ± 0.03 ms 0.591
saxpy/static workgroup=(1024,)/Float64/32768 15.4 ± 1.1 μs 0.0627 ± 0.026 ms 0.246
saxpy/static workgroup=(1024,)/Float64/4096 3.21 ± 0.24 μs 0.055 ± 0.026 ms 0.0584
saxpy/static workgroup=(1024,)/Float64/512 2.65 ± 0.061 μs 0.0555 ± 0.026 ms 0.0478
saxpy/static workgroup=(1024,)/Float64/64 2.6 ± 0.053 μs 0.0548 ± 0.026 ms 0.0474
saxpy/static workgroup=(1024,)/Float64/65536 26.7 ± 3 μs 0.0842 ± 0.027 ms 0.317
time_to_load 0.319 ± 0.0027 s 1.12 ± 0.0072 s 0.285

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions" -> "Benchmark a pull request" -> [the most recent run] -> "Artifacts" (at the bottom).

```julia
function get_global_size end

"""
    get_global_id()::@NamedTuple{x::Int32, y::Int32, z::Int32}
```
Member Author:

Should this be Int32 or Int64?

Member Author:

OpenCL defines these as `Csize_t`.
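For context, OpenCL relates these queries by `get_global_id(d) = get_group_id(d) * get_local_size(d) + get_local_id(d)` (0-based, ignoring the global offset). A minimal sketch of that arithmetic translated to 1-based Julia indices, purely illustrative and not part of this PR:

```julia
# Illustrative only: OpenCL's 0-based identity
#   get_global_id(d) = get_group_id(d) * get_local_size(d) + get_local_id(d)
# rewritten for 1-based group/local indices.
global_id(group_id, local_size, local_id) =
    (group_id - 1) * local_size + local_id

@assert global_id(1, 64, 1) == 1      # first item of first group
@assert global_id(2, 64, 1) == 65     # first item of second group
@assert global_id(3, 64, 64) == 192   # last item of third group
```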

@maleadt (Member) commented Feb 6, 2025

So the idea is to decouple the back-ends from KA.jl, instead implementing KernelIntrinsics.jl? What's the advantage; do you envision packages other than KA.jl to build their kernel DSL on top of KernelIntrinsics.jl?

@github-actions github-actions bot (Contributor) left a comment:

Some suggestions could not be made:

  • src/pocl/nanoOpenCL.jl
    • lines 670-674


@vchuravy (Member Author) commented Feb 6, 2025

> So the idea is to decouple the back-ends from KA.jl, instead implementing KernelIntrinsics.jl? What's the advantage; do you envision packages other than KA.jl to build their kernel DSL on top of KernelIntrinsics.jl?

I am very unsure whether I will be able to pull off #558, so I want to remove the implicit `if validindex(ctx)` check, and I want to make #559 possible.

My goal is to allow a graceful transition of the macro DSL to something closer to "just" OpenCL, removing the extra overhead introduced by the arbitrary dimensions, etc.

This would allow us to write performance-critical kernels directly and solve the issue of how to write kernels that use barriers correctly.

If I can figure out #558, the macro-based DSL can stick around; otherwise I will encourage folks to move their kernels to KernelIntrinsics.

@vchuravy vchuravy changed the base branch from vc/pocl to 02-07-allow_opt-out_of_implicit_bounds-checking February 7, 2025 11:31
@anicusan (Member) commented Feb 7, 2025

Will KA/KI still be a greatest common denominator of the GPU backends, or are you looking to introduce optional intrinsics? How will the `groupreduce` API fare in terms of portability?

@vchuravy (Member Author) commented Feb 7, 2025

> Will KA/KI still be a greatest common denominator of the GPU backends

The intrinsics proposed here are the greatest common denominator. I could see us adding some more intrinsics for reductions, but that is TBD.
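To give a sense of what a reduction built on barrier-style intrinsics involves, here is a hedged, sequential CPU simulation of a tree `groupreduce`. Every name here is hypothetical (not this PR's API); `shared` stands in for workgroup-local memory, and in a real kernel each stride would be separated by a barrier:

```julia
# Hypothetical sketch of a tree groupreduce, simulated sequentially.
# In a real kernel, each iteration of the inner loop runs on one
# work-item, and a barrier separates successive strides. Assumes a
# power-of-two number of values, as tree reductions typically do.
function simulated_groupreduce(op, values::Vector{T}) where {T}
    shared = copy(values)          # stand-in for workgroup-local memory
    stride = length(shared) ÷ 2
    while stride >= 1
        for i in 1:stride          # each i models one work-item's step
            shared[i] = op(shared[i], shared[i + stride])
        end
        # a real kernel would synchronize here before halving the stride
        stride ÷= 2
    end
    return shared[1]
end

@assert simulated_groupreduce(+, collect(1.0:8.0)) == 36.0
@assert simulated_groupreduce(max, [3, 7, 2, 9]) == 9
```

The portability question is exactly where the barrier goes: any backend that supplies a convergent `@synchronize`-like intrinsic can express this pattern.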

@vchuravy vchuravy force-pushed the 02-07-allow_opt-out_of_implicit_bounds-checking branch from 48e3752 to e565304 on February 7, 2025 13:51
@vchuravy vchuravy changed the base branch from 02-07-allow_opt-out_of_implicit_bounds-checking to vc/pocl February 7, 2025 13:52
KernelAbstractions currently creates kernels that look like:

```julia
if __validindex(ctx)
   # Body
end
```

This is problematic due to the convergence requirement on
`@synchronize`.
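To make the convergence problem concrete, here is a pure-Julia simulation (not KA code; the counting `barrier!` is a stand-in for `@synchronize`). When the barrier sits inside the validity guard, out-of-range work-items never reach it, so the workgroup cannot converge; guarding only the body on each side of the barrier fixes this:

```julia
# Simulation: a barrier that counts how many work-items reach it.
arrivals = Ref(0)
barrier!() = (arrivals[] += 1)

n = 10          # problem size
nitems = 12     # launched work-items (rounded up to the workgroup size)

# Pattern KernelAbstractions currently emits: the whole body,
# including the barrier, is guarded by __validindex.
arrivals[] = 0
for i in 1:nitems
    if i <= n            # implicit __validindex(ctx) check
        barrier!()       # @synchronize inside the guard
    end
end
@assert arrivals[] == n          # only 10 of 12 items hit the barrier

# Convergent pattern: every work-item reaches the barrier; only the
# actual work is guarded.
arrivals[] = 0
for i in 1:nitems
    if i <= n
        # body before the barrier
    end
    barrier!()
    if i <= n
        # body after the barrier
    end
end
@assert arrivals[] == nitems     # all 12 items converge
```

On real hardware the first pattern is undefined behavior rather than a short count, which is why removing the implicit check matters for kernels that use barriers.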
4 participants