NVRTC kernels for cast-transpose #258
/te-ci
If there are no objections, I will merge this after the 1.6 release.
const size_t num_tiles) {
__launch_bounds__(block_size)
cast_transpose_general_kernel(const IType * __restrict__ const input,
const CType * const noop,
Was the `__restrict__` qualifier intentionally omitted here?
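For context on the question above, here is a minimal host-side sketch of a signature where every pointer parameter, including the no-op flag, carries the restrict qualifier. Everything in it is an assumption made for illustration: `scaled_copy` and `SKETCH_RESTRICT` are hypothetical names, `float` stands in for the kernel's `IType`/`CType`, and the convention that a flag value of 1 skips the work is a guess at the intent of the `noop` argument, not something this thread confirms.

```cpp
#include <cassert>
#include <cstddef>

// GCC, Clang, and NVCC spell the qualifier __restrict__; MSVC uses __restrict.
#if defined(_MSC_VER)
#define SKETCH_RESTRICT __restrict
#else
#define SKETCH_RESTRICT __restrict__
#endif

// Hypothetical host-side stand-in for a kernel signature.
// The restrict qualifier promises the compiler that the pointers do not alias,
// which can allow loads to be reordered or cached more aggressively.
// Returns the number of elements written.
std::size_t scaled_copy(const float* SKETCH_RESTRICT const input,
                        const float* SKETCH_RESTRICT const noop,
                        float* SKETCH_RESTRICT const output,
                        const float scale,
                        const std::size_t n) {
  assert(input != nullptr && output != nullptr);
  // Assumed convention: a non-null flag equal to 1 means "skip all work".
  if (noop != nullptr && noop[0] == 1.0f) {
    return 0;
  }
  for (std::size_t i = 0; i < n; ++i) {
    output[i] = input[i] * scale;
  }
  return n;
}
```

Calling `scaled_copy` with a flag set to `1.0f` leaves `output` untouched, while passing `nullptr` for the flag performs the copy as usual.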
LGTM, besides the early return from the kernel configuration constructors
* Add NVRTC kernels for cast-transpose
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Update copyright year
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add noop flag to NVRTC cast-transpose kernel
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Apply suggestions from code review
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
This is a follow-up to #138 that adds NVRTC kernels for cast-transpose.
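For readers unfamiliar with the operation, the following is a rough CPU reference of what a cast-transpose is generally understood to compute: a scaled cast of the input, the transpose of that result, and the maximum absolute input value. It is a sketch under stated assumptions, not the kernels in this PR. `cast_transpose_reference` and `CastTransposeResult` are hypothetical names, `float` stands in for the FP8 output type, and taking the amax over the unscaled input is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for the reference below.
struct CastTransposeResult {
  std::vector<float> cast;        // row-major [rows, cols]
  std::vector<float> transposed;  // row-major [cols, rows]
  float amax;                     // max |x| over the unscaled input (assumed)
};

// CPU reference: float stands in for the FP8 output type, so the "cast"
// here is only the multiplication by the scale factor.
CastTransposeResult cast_transpose_reference(const std::vector<float>& input,
                                             const std::size_t rows,
                                             const std::size_t cols,
                                             const float scale) {
  assert(input.size() == rows * cols);
  CastTransposeResult result{std::vector<float>(rows * cols),
                             std::vector<float>(rows * cols), 0.0f};
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      const float x = input[i * cols + j];
      result.amax = std::max(result.amax, std::fabs(x));
      const float y = x * scale;
      result.cast[i * cols + j] = y;        // same layout as the input
      result.transposed[j * rows + i] = y;  // element (i, j) lands at (j, i)
    }
  }
  return result;
}
```

A reference like this is mainly useful for checking a GPU kernel's output on small matrices; the point of a fused kernel is to produce both layouts in a single pass over the input.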
I've also tweaked the kernel selection heuristics to be a bit more general. There are some performance differences at intermediate scale, but nothing too drastic:
Transpose kernel times
Evaluated FP8 transposes on an H100 PCIe using square matrices with convenient dimensions (a power of two, or halfway between two consecutive powers of two).
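As an illustration of the dimension pattern described above, this sketch enumerates sizes that are either a power of two or the midpoint between two consecutive powers of two. The helper name `benchmark_dims` and the range bounds are hypothetical; the actual sizes used in the benchmark are not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns every dimension in [min_dim, max_dim] that is
// a power of two or halfway between two consecutive powers of two.
std::vector<std::size_t> benchmark_dims(const std::size_t min_dim,
                                        const std::size_t max_dim) {
  assert(min_dim >= 2 && min_dim <= max_dim);
  std::vector<std::size_t> dims;
  for (std::size_t p = 2; p <= max_dim; p *= 2) {
    if (p >= min_dim) {
      dims.push_back(p);
    }
    const std::size_t mid = p + p / 2;  // halfway between p and 2 * p
    if (mid >= min_dim && mid <= max_dim) {
      dims.push_back(mid);
    }
  }
  return dims;
}
```

With an assumed range of 1024 to 4096, `benchmark_dims(1024, 4096)` yields 1024, 1536, 2048, 3072, and 4096.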