KernelIntrinsics #562

Open · wants to merge 3 commits into base: vc/pocl
Conversation

@vchuravy (Member) commented Feb 4, 2025

The goal is to allow kernels to be written without relying on the KernelAbstractions macros.

cc: @maleadt @pxl-th

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 0% with 28 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (f038d8c) to head (0a8301c).

Files with missing lines      Patch %   Lines
src/KernelAbstractions.jl     0.00%     12 Missing ⚠️
src/macros.jl                 0.00%     10 Missing ⚠️
src/pocl/backend.jl           0.00%      6 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           vc/pocl    #562   +/-   ##
=======================================
  Coverage     0.00%   0.00%           
=======================================
  Files           21      21           
  Lines         1509    1519   +10     
=======================================
- Misses        1509    1519   +10     



```julia
Returns the unique local work-item ID.
"""
function get_local_id end
```
Member:
So IIUC, backends should implement these like below, right?

```julia
function get_local_id()
    return (threadIdx().x, threadIdx().y, threadIdx().z)
end
```

Member Author:

Yeah, basically. My goal is to replace the old internal functions that people had to override with definitions based on these functions.
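For illustration, here is a hedged sketch of what a backend definition of such an intrinsic could look like. The function names and the NamedTuple-style return follow the docstrings in this PR; the degenerate single-work-item semantics and the `FakeBackend` module are assumptions made purely for this example, not this PR's actual implementation:

```julia
# Hypothetical sketch: a trivial "backend" providing the intrinsics
# directly, instead of overriding KernelAbstractions' old internals.
# The single-work-item semantics below are an assumption for
# illustration only.
module FakeBackend

function get_local_id()
    # Degenerate backend: one work-item per workgroup, so the local
    # ID is always (1, 1, 1).
    return (x = Int32(1), y = Int32(1), z = Int32(1))
end

function get_local_size()
    # Matching workgroup size for the single-work-item model.
    return (x = Int32(1), y = Int32(1), z = Int32(1))
end

end # module

id = FakeBackend.get_local_id()
@assert id.x == 1 && id.y == 1 && id.z == 1
```

A real GPU backend would instead forward to its native indexing intrinsics, as in the CUDA-style `threadIdx()` example above.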

@vchuravy (Member Author) commented Feb 5, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.


Resolved (outdated) review threads: src/KernelAbstractions.jl, src/intrinsics.jl (two threads)
github-actions bot (Contributor) commented Feb 5, 2025

Benchmark Results

Benchmark    main    0a8301c...    ratio (main/0a8301c1af8f52...)
saxpy/default/Float16/1024 0.731 ± 0.01 μs 0.0524 ± 0.026 ms 0.0139
saxpy/default/Float16/1048576 0.175 ± 0.0081 ms 0.891 ± 0.024 ms 0.197
saxpy/default/Float16/16384 3.33 ± 0.029 μs 0.0639 ± 0.028 ms 0.0522
saxpy/default/Float16/2048 0.913 ± 0.013 μs 0.0589 ± 0.024 ms 0.0155
saxpy/default/Float16/256 0.585 ± 0.0089 μs 0.0558 ± 0.027 ms 0.0105
saxpy/default/Float16/262144 0.0443 ± 0.00068 ms 0.271 ± 0.026 ms 0.164
saxpy/default/Float16/32768 6.02 ± 0.057 μs 0.0761 ± 0.028 ms 0.0791
saxpy/default/Float16/4096 1.31 ± 0.025 μs 0.0655 ± 0.026 ms 0.0199
saxpy/default/Float16/512 0.642 ± 0.0074 μs 0.0578 ± 0.026 ms 0.0111
saxpy/default/Float16/64 0.555 ± 0.0057 μs 0.0585 ± 0.027 ms 0.00948
saxpy/default/Float16/65536 11.6 ± 0.12 μs 0.104 ± 0.028 ms 0.111
saxpy/default/Float32/1024 0.633 ± 0.011 μs 0.0569 ± 0.026 ms 0.0111
saxpy/default/Float32/1048576 0.23 ± 0.022 ms 0.473 ± 0.033 ms 0.486
saxpy/default/Float32/16384 2.8 ± 0.26 μs 0.0557 ± 0.026 ms 0.0503
saxpy/default/Float32/2048 0.743 ± 0.054 μs 0.0543 ± 0.024 ms 0.0137
saxpy/default/Float32/256 0.568 ± 0.0059 μs 0.0559 ± 0.027 ms 0.0101
saxpy/default/Float32/262144 0.0446 ± 0.0029 ms 0.162 ± 0.035 ms 0.275
saxpy/default/Float32/32768 5.32 ± 0.56 μs 0.0612 ± 0.027 ms 0.0868
saxpy/default/Float32/4096 1.13 ± 0.094 μs 0.0591 ± 0.025 ms 0.019
saxpy/default/Float32/512 0.601 ± 0.0069 μs 0.0559 ± 0.026 ms 0.0108
saxpy/default/Float32/64 0.557 ± 0.0057 μs 0.0575 ± 0.026 ms 0.00969
saxpy/default/Float32/65536 11.7 ± 1.2 μs 0.0763 ± 0.029 ms 0.153
saxpy/default/Float64/1024 0.747 ± 0.019 μs 0.0574 ± 0.026 ms 0.013
saxpy/default/Float64/1048576 0.485 ± 0.041 ms 0.499 ± 0.038 ms 0.971
saxpy/default/Float64/16384 5.36 ± 0.49 μs 0.0568 ± 0.026 ms 0.0944
saxpy/default/Float64/2048 1.14 ± 0.092 μs 0.0515 ± 0.024 ms 0.0221
saxpy/default/Float64/256 0.574 ± 0.0081 μs 0.0583 ± 0.027 ms 0.00985
saxpy/default/Float64/262144 0.11 ± 0.011 ms 0.173 ± 0.03 ms 0.635
saxpy/default/Float64/32768 12.2 ± 1.3 μs 0.0637 ± 0.026 ms 0.191
saxpy/default/Float64/4096 1.71 ± 0.22 μs 0.0601 ± 0.025 ms 0.0284
saxpy/default/Float64/512 0.626 ± 0.014 μs 0.0555 ± 0.027 ms 0.0113
saxpy/default/Float64/64 0.551 ± 0.008 μs 0.0585 ± 0.027 ms 0.00942
saxpy/default/Float64/65536 24.3 ± 2.7 μs 0.0867 ± 0.027 ms 0.28
saxpy/static workgroup=(1024,)/Float16/1024 2.15 ± 0.024 μs 0.0514 ± 0.026 ms 0.0419
saxpy/static workgroup=(1024,)/Float16/1048576 0.163 ± 0.012 ms 0.9 ± 0.03 ms 0.181
saxpy/static workgroup=(1024,)/Float16/16384 4.4 ± 0.097 μs 0.0608 ± 0.026 ms 0.0723
saxpy/static workgroup=(1024,)/Float16/2048 2.32 ± 0.027 μs 0.0579 ± 0.024 ms 0.04
saxpy/static workgroup=(1024,)/Float16/256 2.79 ± 0.03 μs 0.0554 ± 0.026 ms 0.0504
saxpy/static workgroup=(1024,)/Float16/262144 0.0419 ± 0.0015 ms 0.27 ± 0.027 ms 0.155
saxpy/static workgroup=(1024,)/Float16/32768 6.8 ± 0.18 μs 0.074 ± 0.026 ms 0.0919
saxpy/static workgroup=(1024,)/Float16/4096 2.64 ± 0.036 μs 0.0578 ± 0.026 ms 0.0458
saxpy/static workgroup=(1024,)/Float16/512 3.24 ± 0.035 μs 0.0544 ± 0.026 ms 0.0595
saxpy/static workgroup=(1024,)/Float16/64 2.49 ± 0.22 μs 0.0587 ± 0.027 ms 0.0424
saxpy/static workgroup=(1024,)/Float16/65536 12.7 ± 0.36 μs 0.103 ± 0.026 ms 0.123
saxpy/static workgroup=(1024,)/Float32/1024 2.32 ± 0.026 μs 0.0552 ± 0.026 ms 0.0421
saxpy/static workgroup=(1024,)/Float32/1048576 0.238 ± 0.02 ms 0.462 ± 0.041 ms 0.514
saxpy/static workgroup=(1024,)/Float32/16384 4.52 ± 0.37 μs 0.0527 ± 0.025 ms 0.0857
saxpy/static workgroup=(1024,)/Float32/2048 2.47 ± 0.041 μs 0.0509 ± 0.024 ms 0.0485
saxpy/static workgroup=(1024,)/Float32/256 2.75 ± 0.048 μs 0.0558 ± 0.026 ms 0.0492
saxpy/static workgroup=(1024,)/Float32/262144 0.0586 ± 0.0037 ms 0.159 ± 0.035 ms 0.369
saxpy/static workgroup=(1024,)/Float32/32768 7.54 ± 0.67 μs 0.0587 ± 0.026 ms 0.128
saxpy/static workgroup=(1024,)/Float32/4096 2.76 ± 0.091 μs 0.0557 ± 0.026 ms 0.0496
saxpy/static workgroup=(1024,)/Float32/512 2.77 ± 0.03 μs 0.0569 ± 0.026 ms 0.0487
saxpy/static workgroup=(1024,)/Float32/64 2.76 ± 4.5 μs 0.0562 ± 0.026 ms 0.0492
saxpy/static workgroup=(1024,)/Float32/65536 15.6 ± 1.3 μs 0.0749 ± 0.029 ms 0.208
saxpy/static workgroup=(1024,)/Float64/1024 2.3 ± 0.056 μs 0.0571 ± 0.026 ms 0.0403
saxpy/static workgroup=(1024,)/Float64/1048576 0.513 ± 0.033 ms 0.501 ± 0.044 ms 1.02
saxpy/static workgroup=(1024,)/Float64/16384 7.47 ± 0.53 μs 0.0541 ± 0.025 ms 0.138
saxpy/static workgroup=(1024,)/Float64/2048 2.6 ± 0.1 μs 0.0493 ± 0.023 ms 0.0527
saxpy/static workgroup=(1024,)/Float64/256 2.64 ± 0.057 μs 0.0561 ± 0.025 ms 0.047
saxpy/static workgroup=(1024,)/Float64/262144 0.101 ± 0.012 ms 0.171 ± 0.03 ms 0.591
saxpy/static workgroup=(1024,)/Float64/32768 15.4 ± 1.1 μs 0.0627 ± 0.026 ms 0.246
saxpy/static workgroup=(1024,)/Float64/4096 3.21 ± 0.24 μs 0.055 ± 0.026 ms 0.0584
saxpy/static workgroup=(1024,)/Float64/512 2.65 ± 0.061 μs 0.0555 ± 0.026 ms 0.0478
saxpy/static workgroup=(1024,)/Float64/64 2.6 ± 0.053 μs 0.0548 ± 0.026 ms 0.0474
saxpy/static workgroup=(1024,)/Float64/65536 26.7 ± 3 μs 0.0842 ± 0.027 ms 0.317
time_to_load 0.319 ± 0.0027 s 1.12 ± 0.0072 s 0.285

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions" -> "Benchmark a pull request" -> [the most recent run] -> "Artifacts" (at the bottom).

```julia
function get_global_size end

"""
    get_global_id()::@NamedTuple{x::Int32, y::Int32, z::Int32}
```
Member Author:

Should this be Int32 or Int64?

Member Author:

OpenCL defines these as `Csize_t`.
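For context, OpenCL relates these queries by `get_global_id(d) = get_group_id(d) * get_local_size(d) + get_local_id(d)` (0-based, ignoring the global offset). A minimal sketch of that arithmetic translated to 1-based Julia indices, purely illustrative and not part of this PR:

```julia
# Illustrative only: OpenCL's 0-based identity
#   get_global_id(d) = get_group_id(d) * get_local_size(d) + get_local_id(d)
# rewritten for 1-based group/local indices.
global_id(group_id, local_size, local_id) =
    (group_id - 1) * local_size + local_id

@assert global_id(1, 64, 1) == 1      # first item of first group
@assert global_id(2, 64, 1) == 65     # first item of second group
@assert global_id(3, 64, 64) == 192   # last item of third group
```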

@maleadt (Member) commented Feb 6, 2025

So the idea is to decouple the back-ends from KA.jl, instead implementing KernelIntrinsics.jl? What's the advantage; do you envision packages other than KA.jl to build their kernel DSL on top of KernelIntrinsics.jl?

@github-actions github-actions bot (Contributor) left a comment:

Some suggestions could not be made:

  • src/pocl/nanoOpenCL.jl
    • lines 670-674


@vchuravy (Member Author) commented Feb 6, 2025

> So the idea is to decouple the back-ends from KA.jl, instead implementing KernelIntrinsics.jl? What's the advantage; do you envision packages other than KA.jl to build their kernel DSL on top of KernelIntrinsics.jl?

I am very unsure whether I will be able to pull off #558, so I want to remove the implicit `if validindex(ctx)` check, and I want to make #559 possible.

My goal is to allow a graceful transition of the macro DSL to something closer to "just" OpenCL, removing the extra overhead introduced by the arbitrary dimensions, etc.

This would allow us to write performance-critical kernels directly and solve the issue of how to write kernels that use barriers correctly.

If I can figure out #558, the macro-based DSL can stick around; otherwise I will encourage folks to move their kernels to KernelIntrinsics.

@vchuravy vchuravy changed the base branch from vc/pocl to 02-07-allow_opt-out_of_implicit_bounds-checking February 7, 2025 11:31
@anicusan (Member) commented Feb 7, 2025

Will KA/KI still be a greatest common denominator of the GPU backends, or are you looking to introduce optional intrinsics? How will the `groupreduce` API fare in terms of portability?

@vchuravy (Member Author) commented Feb 7, 2025

> Will KA/KI still be a greatest common denominator of the GPU backends

The intrinsics proposed here are the greatest common denominator. I could see us adding some more intrinsics for reductions, but that is TBD.
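To give a sense of what a reduction built on barrier-style intrinsics involves, here is a hedged, sequential CPU simulation of a tree `groupreduce`. Every name here is hypothetical (not this PR's API); `shared` stands in for workgroup-local memory, and in a real kernel each stride would be separated by a barrier:

```julia
# Hypothetical sketch of a tree groupreduce, simulated sequentially.
# In a real kernel, each iteration of the inner loop runs on one
# work-item, and a barrier separates successive strides. Assumes a
# power-of-two number of values, as tree reductions typically do.
function simulated_groupreduce(op, values::Vector{T}) where {T}
    shared = copy(values)          # stand-in for workgroup-local memory
    stride = length(shared) ÷ 2
    while stride >= 1
        for i in 1:stride          # each i models one work-item's step
            shared[i] = op(shared[i], shared[i + stride])
        end
        # a real kernel would synchronize here before halving the stride
        stride ÷= 2
    end
    return shared[1]
end

@assert simulated_groupreduce(+, collect(1.0:8.0)) == 36.0
@assert simulated_groupreduce(max, [3, 7, 2, 9]) == 9
```

The portability question is exactly where the barrier goes: any backend that supplies a convergent `@synchronize`-like intrinsic can express this pattern.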

@vchuravy vchuravy force-pushed the 02-07-allow_opt-out_of_implicit_bounds-checking branch from 48e3752 to e565304 on February 7, 2025 13:51
@vchuravy vchuravy changed the base branch from 02-07-allow_opt-out_of_implicit_bounds-checking to vc/pocl February 7, 2025 13:52
KernelAbstractions currently creates kernels that look like:

```julia
if __validindex(ctx)
   # Body
end
```

This is problematic due to the convergence requirement on
`@synchronize`.
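To make the convergence problem concrete, here is a pure-Julia simulation (not KA code; the counting `barrier!` is a stand-in for `@synchronize`). When the barrier sits inside the validity guard, out-of-range work-items never reach it, so the workgroup cannot converge; guarding only the body on each side of the barrier fixes this:

```julia
# Simulation: a barrier that counts how many work-items reach it.
arrivals = Ref(0)
barrier!() = (arrivals[] += 1)

n = 10          # problem size
nitems = 12     # launched work-items (rounded up to the workgroup size)

# Pattern KernelAbstractions currently emits: the whole body,
# including the barrier, is guarded by __validindex.
arrivals[] = 0
for i in 1:nitems
    if i <= n            # implicit __validindex(ctx) check
        barrier!()       # @synchronize inside the guard
    end
end
@assert arrivals[] == n          # only 10 of 12 items hit the barrier

# Convergent pattern: every work-item reaches the barrier; only the
# actual work is guarded.
arrivals[] = 0
for i in 1:nitems
    if i <= n
        # body before the barrier
    end
    barrier!()
    if i <= n
        # body after the barrier
    end
end
@assert arrivals[] == nitems     # all 12 items converge
```

On real hardware the first pattern is undefined behavior rather than a short count, which is why removing the implicit check matters for kernels that use barriers.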
4 participants