Implement and extend data-specific cartesian index #1902
Conversation
Force-pushed from b41eb05 to cd17c13
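For reference, here is a minimal, hypothetical sketch of how a copyto! benchmark over the same problem size can be set up with BenchmarkTools. The names (`FT`, `dst`, `src`) are illustrative, and plain CPU `Array`s stand in for the CUDA-backed VIJFH datalayout actually benchmarked below:

```julia
# Hypothetical benchmark sketch; plain Arrays stand in for the VIJFH datalayout
# and its parent CuArray used in the actual runs below.
using BenchmarkTools

FT = Float64
Nv, Nij, Nf, Nh = 63, 4, 1, 5400   # matches problem_size = (63, 4, 4, 1, 5400)
dst = zeros(FT, Nv, Nij, Nij, Nf, Nh)
src = rand(FT, Nv, Nij, Nij, Nf, Nh)

# Time the pointwise copy; on the GPU this is the kernel whose reads/writes
# must be coalesced to achieve good bandwidth.
@benchmark copyto!($dst, $src)
```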
On the Clima node, here's how the benchmark performs:

Main

Benchmarking ClimaCore copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 116.978 μs … 3.014 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 120.999 μs ┊ GC (median): 0.00%
Time (mean ± σ): 121.499 μs ± 29.288 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▂▃▄▅▅▇█▅▇▇▇▆▅▆▆▇▄▄▄▂▁▁
▁▁▁▁▁▁▁▂▂▂▂▃▄▅▆█▇██████████████████████████▆▆▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂ ▅
117 μs Histogram: frequency by time 125 μs <
Memory estimate: 720 bytes, allocs estimate: 28.
Benchmarking array copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 112.709 μs … 3.142 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 114.778 μs ┊ GC (median): 0.00%
Time (mean ± σ): 115.419 μs ± 30.815 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▄▄▇█▆▄▄▂▁
▁▁▁▁▂▂▂▂▃▃▄▆▆███████████▇▆▅▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
113 μs Histogram: frequency by time 119 μs <
Memory estimate: 1.50 KiB, allocs estimate: 65.
Test Summary: | Pass Total Time
copyto! with Nf = 1 | 1 1 29.2s
Test.DefaultTestSet("copyto! with Nf = 1", Any[], 1, false, false, true, 1.721913773875106e9, 1.721913803097613e9, false, "REPL[9]")

This PR

Benchmarking ClimaCore copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 64.350 μs … 2.607 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 67.989 μs ┊ GC (median): 0.00%
Time (mean ± σ): 68.561 μs ± 27.209 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▆▆▅▅▃▄▄▆▄▅▆██▆▆▅▄▄▅▄▂▁▁▁▁
▁▁▂▃▃▄▄▅▅▄▄▄▄▆███████████████████████████▇▆▆▅▅▅▄▄▃▃▃▃▂▂▂▁▁▁ ▅
64.4 μs Histogram: frequency by time 72.2 μs <
Memory estimate: 720 bytes, allocs estimate: 28.
Benchmarking array copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 111.759 μs … 4.475 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 113.349 μs ┊ GC (median): 0.00%
Time (mean ± σ): 113.971 μs ± 43.903 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅▇▇██▇▇▅▄▂▁
▁▁▁▁▁▂▂▄▅▆██████████████▇▆▅▄▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
112 μs Histogram: frequency by time 117 μs <
Memory estimate: 1.50 KiB, allocs estimate: 65.
Test Summary: | Pass Total Time
copyto! with Nf = 1 | 1 1 28.5s
Test.DefaultTestSet("copyto! with Nf = 1", Any[], 1, false, false, true, 1.721913933078668e9, 1.721913961585961e9, false, "REPL[9]")

From a hardware performance POV, we'll now have:

julia> function bandwidth_efficiency(;
problem_size=(63,4,4,1,5400),
float_type,
device_bandwidth_GBs=2_039, # clima A100 specs: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
kernel_time_s,
n_reads_writes
)
N = prod(problem_size)
GB = N*n_reads_writes*sizeof(float_type)/1024^3
achieved_bandwidth_GBs = GB/kernel_time_s
percent_efficiency = achieved_bandwidth_GBs/device_bandwidth_GBs*100
@info "Bandwidth info" problem_size float_type achieved_bandwidth_GBs device_bandwidth_GBs percent_efficiency
end
bandwidth_efficiency (generic function with 1 method)
julia> bandwidth_efficiency(; # main (with index swapping)
float_type=Float64, kernel_time_s=(116.978)/10^6, n_reads_writes=2
)
┌ Info: Bandwidth info
│ problem_size = (63, 4, 4, 1, 5400)
│ float_type = Float64
│ achieved_bandwidth_GBs = 693.3782472802712
│ device_bandwidth_GBs = 2039
└ percent_efficiency = 34.00579927809079
julia> bandwidth_efficiency(; # this PR (no index swapping)
float_type=Float64, kernel_time_s=(64.350)/10^6, n_reads_writes=2
)
┌ Info: Bandwidth info
│ problem_size = (63, 4, 4, 1, 5400)
│ float_type = Float64
│ achieved_bandwidth_GBs = 1260.4506699355334
│ device_bandwidth_GBs = 2039
└ percent_efficiency = 61.817100045881965

From https://developer.download.nvidia.com/GTC/PDF/1083_Wang.pdf, slide 35:
So I'd say that this PR puts us in the "good" range for pointwise broadcast expressions, 🎉. cc @tapios

Next, we should fix the stencil kernels. That will be trickier since the indexing is split up, and I think some of those kernels may use the non-cartesian indexing, but we should see similar speedups there.

For future self: the reason we're faster than ordinary CuArrays is that our datalayouts now know completely static sizes, so offsets can be determined at compile time. For CuArrays, the sizes are not known at compile time.
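As a rough illustration of that last point (a hypothetical sketch, not ClimaCore's actual types): when the extents live in the type parameters, the stride/offset arithmetic constant-folds, whereas `size(::Array)`/`size(::CuArray)` is only known at runtime:

```julia
# Hypothetical illustration: extents as type parameters let the compiler fold
# the stride/offset arithmetic; a plain Array/CuArray only knows size() at runtime.
struct StaticSized5D{T, Nv, Nij, Nf, Nh, A <: AbstractArray{T, 5}}
    array::A
end
Base.parent(x::StaticSized5D) = x.array

@inline function linear_offset(::StaticSized5D{T, Nv, Nij, Nf, Nh},
                               v, i, j, f, h) where {T, Nv, Nij, Nf, Nh}
    # Column-major offset built from compile-time constants (Nv, Nij, Nf).
    return v + Nv * ((i - 1) + Nij * ((j - 1) + Nij * ((f - 1) + Nf * (h - 1))))
end

@inline function Base.getindex(x::StaticSized5D, v::Int, i::Int, j::Int, f::Int, h::Int)
    return @inbounds parent(x)[linear_offset(x, v, i, j, f, h)]
end

# Usage: all extents are baked into the type, so the offset is a compile-time expression.
x = StaticSized5D{Float64, 63, 4, 1, 5400, Array{Float64, 5}}(rand(63, 4, 4, 1, 5400))
x[1, 1, 1, 1, 1]
```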
Force-pushed from cd17c13 to 1cd4f4f
Force-pushed from 1cd4f4f to 476a184
Force-pushed from 476a184 to 2c15143
LGTM, but I'd add a comment about the need for redefining the broadcast machinery before merging (as you did in the comment).
Force-pushed from 5a4e9c0 to 137ac8b
Extend data-specific cartesian index
Force-pushed from 137ac8b to 2243091
Just to cycle back on this PR: it turns out that this solution is pretty effective for Float64, but not so much for Float32 (xref).
Superseded by #2055.
This PR extends and generalizes #1900, and supersedes it.
This PR effectively fixes the current issue, where we have uncoalesced reads/writes in our pointwise broadcast expressions.
The fix was to specialize on broadcast expressions where we don't need to use a universal cartesian index for all datalayouts, but can instead use a datalayout-specific index (defined as `DataSpecificCartesianIndex`). Then, we define `getindex`/`setindex!` on datalayouts for this index type, and do not swap the indices, since we launch the kernel with `CartesianIndices` corresponding to `size(parent(data))` (as opposed to `size(data)` for the generalized case). The result is that all data is accessed in a contiguous / coalesced fashion.
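A minimal sketch of that idea (the names `MyLayout`, `MyDataSpecificCartesianIndex`, and `pointwise_copy!` are illustrative stand-ins, not the actual ClimaCore definitions): the index wraps a `CartesianIndex` in the parent array's memory order, `getindex`/`setindex!` forward to `parent(data)` without any permutation, and the loop (or kernel launch) runs over `CartesianIndices(size(parent(data)))`:

```julia
# Hypothetical sketch; not the actual ClimaCore code.
struct MyDataSpecificCartesianIndex{N}
    I::CartesianIndex{N}
end

# A toy "datalayout": just a wrapper around its parent array.
struct MyLayout{T, N, A <: AbstractArray{T, N}}
    array::A
end
Base.parent(d::MyLayout) = d.array

# getindex/setindex! go straight to the parent array: no index swapping.
@inline function Base.getindex(d::MyLayout, idx::MyDataSpecificCartesianIndex)
    return @inbounds parent(d)[idx.I]
end
@inline function Base.setindex!(d::MyLayout, val, idx::MyDataSpecificCartesianIndex)
    @inbounds parent(d)[idx.I] = val
end

# The "kernel" iterates CartesianIndices of size(parent(data)), not size(data),
# so the data is walked in memory (coalesced) order.
function pointwise_copy!(dst::MyLayout, src::MyLayout)
    for I in CartesianIndices(size(parent(dst)))
        dst[MyDataSpecificCartesianIndex(I)] = src[MyDataSpecificCartesianIndex(I)]
    end
    return dst
end

# Usage (CPU illustration):
src = MyLayout{Float64, 5, Array{Float64, 5}}(rand(63, 4, 4, 1, 5400))
dst = MyLayout{Float64, 5, Array{Float64, 5}}(zeros(63, 4, 4, 1, 5400))
pointwise_copy!(dst, src)
```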