One quick solution is simply to make the slow path an error, which #612 allows. If you accidentally have Float64 showing up, you really want to fix that, not have batched_mul choose the least-awful path.
> If you accidentally have Float64 showing up, you really want to fix that, not have batched_mul choose the least-awful path.
I actually discovered this the other way around: code I wanted to run in double precision had Float32s introduced, since Float32 is the default eltype used by CUDA.rand when no type is specified.
But yes, I agree with the sentiment, and that seems like a sensible approach.
Small bug that can lead to massive slowdowns.
Minimal example:
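The original snippet did not survive here; below is a sketch of the kind of reproduction described, assuming NNlib's `batched_mul` with CUDA arrays (the array sizes and the use of BenchmarkTools are illustrative, not from the original report):

```julia
using NNlib, CUDA, BenchmarkTools

# Batched matrix multiply: 100 independent 128×128 matmuls.
A = CUDA.rand(Float32, 128, 128, 100)   # TA = Float32
B = CUDA.rand(Float64, 128, 128, 100)   # TB = Float64 (mismatched)

# Fast path: matching eltypes dispatch to CUBLAS's batched gemm.
@btime CUDA.@sync batched_mul($A, $(Float32.(B)))

# Slow path: mixed eltypes hit the generic fallback,
# which launches one kernel per slice.
@btime CUDA.@sync batched_mul($A, $B)
```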
For TA == TB == Float32 we get the expected performance. When the types differ, rather than promoting, the generic fallback method is called, which launches one kernel per matmul and is incredibly slow.
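Until an error or automatic promotion lands, one workaround is to promote the inputs yourself before calling `batched_mul`, so the matching-eltype fast path is taken. A sketch (the helper name is mine, not part of NNlib):

```julia
using NNlib

# Hypothetical wrapper: promote both batches to a common eltype
# before dispatching, so batched_mul sees matching types.
function batched_mul_promoted(A::AbstractArray{TA,3}, B::AbstractArray{TB,3}) where {TA,TB}
    T = promote_type(TA, TB)
    batched_mul(T.(A), T.(B))   # broadcasted conversion; no-op copy if already T
end
```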