Fix incorrect timing results for CUDA.@elapsed #2118
Conversation
Could you also add a test?
Added a test that checks that a naive matmul's runtime scales (approximately) cubically. This might be a bit flaky, though, so I'm open to suggestions on other ways to test this.
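A test along those lines might look like the following sketch (hypothetical code; it assumes the `simple_matmul` kernel and arrays from the example below, and the tolerance band is arbitrary):

```julia
using CUDA, Test

# A naive matmul does O(N^3) work, so doubling N should multiply the
# runtime by roughly 2^3 = 8.
t1 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
t2 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 1000)

# Wide bounds to keep the test from being too flaky.
@test 4 < t2 / t1 < 16
```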
I was more thinking that you changed the macro to interpolate `code` and not `ex`. So what did this fix? Reading the code, it seems the only difference is the splitting off of keyword arguments?
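For context, keyword-argument splitting in a varargs macro might look roughly like this (an illustrative sketch, not CUDA.jl's actual implementation; `@my_elapsed` is a made-up name):

```julia
# Separate `key = value` arguments from the expression to be timed.
macro my_elapsed(ex...)
    kwargs = [x for x in ex if Meta.isexpr(x, :(=))]
    code   = [x for x in ex if !Meta.isexpr(x, :(=))]
    @assert length(code) == 1 "expected exactly one expression to time"
    quote
        t0 = time_ns()
        $(esc(code[1]))          # splice the expression itself, not `code`
        (time_ns() - t0) / 1e9
    end
end
```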
Here's an example illustrating the issue:

```julia
using CUDA

function simple_matmul(C, A, B, N)
    for i = 1:N
        for j = 1:N
            @inbounds elem = C[i, j]
            for k = 1:N
                @inbounds elem += A[i, k] * B[k, j]
            end
            @inbounds C[i, j] = elem
        end
    end
    nothing
end

A = CuArray(rand(Float32, (1000, 1000)))
B = CuArray(rand(Float32, (1000, 1000)))
C = similar(A)

# warmup
@cuda simple_matmul(C, A, B, 500)

N1 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
N2 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 1000)
println(N1)
println(N2)
```

With this PR, this outputs:
Without this PR, I get:
For other kernels I've seen similar results:
Yes, and I'm not exactly sure why the code as written in #2113 leads to this behaviour, because, as you said, the only difference should be the keyword arguments, which shouldn't impact the results.
What is the …
So... They are identical?
No, there's a small difference:
vs.
(Admittedly, it also took me some time to notice...)
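The snippets being compared are not reproduced here, but the comparison itself is easy to redo. A sketch of how to inspect the expansion (exact output depends on the CUDA.jl version):

```julia
using CUDA

# Print the expansion of CUDA.@elapsed; on the buggy version the
# benchmarked expression shows up wrapped in a 1-element tuple.
ex = Base.remove_linenums!(@macroexpand CUDA.@elapsed sum(CUDA.rand(10)))
println(ex)
```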
Force-pushed from 9ce5f7f to 42126a5.
Anyway, the problem thus is that the code is inserted as a 1-element tuple containing the to-be-benchmarked expression, so that expression is never actually executed. I adapted the test to check this.
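The mechanism can be demonstrated without a GPU. Here is a minimal sketch (with made-up macro names) of how splicing the varargs tuple instead of the expression turns the timed code into a no-op:

```julia
# Buggy pattern: `code` is the macro's varargs tuple, and the tuple itself
# is interpolated. A tuple spliced into an AST is a literal that evaluates
# to itself, so the Expr inside it is never executed.
macro elapsed_buggy(code...)
    quote
        t0 = time_ns()
        $(code)                  # splices `(Expr,)` — a no-op
        (time_ns() - t0) / 1e9
    end
end

# Fixed pattern: interpolate the expression itself, so it actually runs.
macro elapsed_fixed(code...)
    quote
        t0 = time_ns()
        $(esc(code[1]))
        (time_ns() - t0) / 1e9
    end
end

@elapsed_buggy sleep(0.5)   # ≈ 0.0: sleep never runs
@elapsed_fixed sleep(0.5)   # ≈ 0.5
```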