add GPUCompiler precompilation caching #425
You forgot to commit |
Codecov Report
Patch coverage has no change and project coverage change:
Additional details and impacted files
@@ Coverage Diff @@
## master #425 +/- ##
===========================================
- Coverage 87.08% 76.87% -10.21%
===========================================
Files 24 25 +1
Lines 2943 2993 +50
===========================================
- Hits 2563 2301 -262
- Misses 380 692 +312
☔ View full report in Codecov by Sentry. |
Could you explain what the purpose/design of this PR is? It's not at all clear to me, and looking downstream, lots of functionality is entirely unused (e.g. I'm not sure why this needs anything in GPUCompiler.jl at all. Shouldn't it be sufficient for downstream packages to trigger a compilation to cache whatever they need, e.g., how JET.jl does it: https://github.com/aviatesk/JET.jl/blob/b688eda6eb50a18e9e218d32650d2de23f085d50/src/JET.jl#L1382-L1396
Updated initial comment and added some example code. Hope this clears some things up! |
Not really, sorry. Could you describe what's the problem you want to solve, why it doesn't work with current precompilation tools, and why you opted for the design you did? Those global undocumented macros (doing questionable things) are a very non-Julian API. |
The main issue is that GPUCompiler's GLOBAL_CI_CACHES do not persist across Julia sessions. This commit fixes that, at the cost of requiring some user input, and should improve time-to-first-x for anything that depends on GPUCompiler. One downstream use case is Enzyme (a pull request is in the works); another is CUDA. I have been working with @vchuravy on this, and the eventual extension is to cache binary code between runs, not just type hints. The reason for so much user involvement and for the macros is that this was the simplest way forward. The macros create a local cache, outside the user's control, with a unique id that does not conflict with user code. We want a unique cache to eliminate duplication in the cache. We also tried making all of this run at init time, but that was too late: the caches had already been serialized by that point, so we needed user involvement. We definitely want to reduce this, but this is a first polished attempt at the matter.
Why not? It's just a global dict, why doesn't it get serialized in the .ji file? |
It is serialized; it just happens too early in the process. By the time the dependent packages have inserted into the cache, it is too late for the global. Additionally, multiple children are now allowed to mutate the cache and still see some cache improvement.
Repeating my comment from Slack: Is this because the global is serialized as part of the GPUCompiler.ji, and isn't part of, e.g., CUDA.jl's precompilation image? In that case, you could override If that turns out to be the way to do it, we could even remove the global CI cache here to force users to bring their own (and thus get proper precompilation of GPUCompiler-inferred CIs). |
So the overarching design consideration is: Users of Enzyme.jl/CUDA.jl/AMDGPU.jl should be able to "precompile" their code. Each user package will need to declare an "anchor"/cache that will be serialized into the
So it is not the down-stream packages of GPUCompiler that need to bring their own cache, We use the cachefile of That's at least the high-level design Collin and I came up with. |
The new content in examples should likely go into test.
CodeCache(cache::CodeCache) = new(GPUCompiler.copyAndFilter(cache.dict))
end

function copyAndFilter(dict::IdDict)
What is this needed for?
That is used in https://github.com/collinwarner/GPUCompiler.jl/blob/3dbe9d5b7c7c5f56f18553f0e4d4bd9c2bdcaca5/src/precompile_native.jl#L102
It creates a CodeCache that contains unbounded entries only. Used when snapshotting.
Can we just write this as filter(validate_codecache, cache.dict), where validate_codecache is:
function validate_codecache(cache)
    for ci in cache
        # Reject the entry as soon as any CodeInstance has a bounded world range.
        if ci.max_world < typemax(typeof(ci.max_world))
            return false
        end
    end
    return true
end
But that seems overeager. Are we guaranteed just one entry? Or do we want to remove all CIs that don't have max_world?
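To make the two options in this comment concrete, here is a minimal sketch of the per-instance alternative: drop only the bounded instances under each key instead of rejecting the whole entry. `MockCodeInstance`, `is_unbounded`, and `filter_unbounded` are hypothetical names for illustration, not part of GPUCompiler, and the real cache layout may differ.

```julia
# Hypothetical stand-in for Core.CodeInstance: only the world-age bounds matter here.
struct MockCodeInstance
    min_world::UInt
    max_world::UInt
end

# An instance is "unbounded" when max_world is the largest representable world age.
is_unbounded(ci::MockCodeInstance) = ci.max_world == typemax(typeof(ci.max_world))

# Keep every key that still has at least one unbounded instance,
# and drop only the bounded instances stored under it.
function filter_unbounded(dict::IdDict)
    out = IdDict{Any,Vector{MockCodeInstance}}()
    for (key, cis) in dict
        kept = filter(is_unbounded, cis)
        isempty(kept) || (out[key] = kept)
    end
    return out
end

cache = IdDict{Any,Vector{MockCodeInstance}}()
cache[:f] = [MockCodeInstance(1, 5), MockCodeInstance(6, typemax(UInt))]
cache[:g] = [MockCodeInstance(1, 3)]

filtered = filter_unbounded(cache)
```

Under this variant `:f` survives with its one unbounded instance, whereas an all-or-nothing predicate would discard `:f` entirely.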
Why does the API consist of macros? Why doesn't something like this work:
module DownstreamPackage
using GPUCompiler, CUDA
const cache_snapshot = GPUCompiler.ci_cache_snapshot()
include("precompile.jl")
const cache = GPUCompiler.ci_cache_delta(cache_snapshot)
__init__() = GPUCompiler.ci_cache_insert(cache)
end
That would seem to work. Updating now |
Downstream packages probably should not serialize the entire cache snapshot, and rather do something like:
module DownstreamPackage
using GPUCompiler, CUDA
const cache = let
    cache_snapshot = GPUCompiler.ci_cache_snapshot()
    include("precompile.jl")
    GPUCompiler.ci_cache_delta(cache_snapshot)
end
__init__() = GPUCompiler.ci_cache_insert(cache)
end
But that doesn't change the actual API.
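The snapshot/delta/insert flow proposed here can be sketched without GPUCompiler at all. The following is only an illustration of the intended semantics on a plain `Dict`; `MOCK_GLOBAL_CACHE`, `mock_snapshot`, `mock_delta`, and `mock_insert!` are hypothetical names, and the real `ci_cache_*` functions in this PR may behave differently.

```julia
# Hypothetical stand-in for GPUCompiler.GLOBAL_CI_CACHES: job kind => cached entry names.
const MOCK_GLOBAL_CACHE = Dict{Symbol,Set{String}}()

# Copy the current state so later additions can be told apart from it.
mock_snapshot() = Dict{Symbol,Set{String}}(k => copy(v) for (k, v) in MOCK_GLOBAL_CACHE)

# Return only the entries added since `snapshot` was taken.
function mock_delta(snapshot)
    delta = Dict{Symbol,Set{String}}()
    for (k, entries) in MOCK_GLOBAL_CACHE
        new_entries = setdiff(entries, get(snapshot, k, Set{String}()))
        isempty(new_entries) || (delta[k] = new_entries)
    end
    return delta
end

# Merge a saved delta back into the global cache (the role __init__ plays above).
function mock_insert!(delta)
    for (k, entries) in delta
        union!(get!(() -> Set{String}(), MOCK_GLOBAL_CACHE, k), entries)
    end
    return MOCK_GLOBAL_CACHE
end

MOCK_GLOBAL_CACHE[:native] = Set(["kernel_a"])   # present before the package loads
snap = mock_snapshot()
push!(MOCK_GLOBAL_CACHE[:native], "kernel_b")    # added by the precompile workload
delta = mock_delta(snap)                         # what the package would serialize
empty!(MOCK_GLOBAL_CACHE)                        # simulate a fresh Julia session
mock_insert!(delta)
```

Serializing only the delta, as the `let` block suggests, keeps each downstream package from storing entries that other packages already contributed.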
Changed the API to follow @maleadt's advice, which leads to a cleaner interface. Added an example kernel with caching at test/ExamplePersistentCache/GPUKernel.jl. With this you get a persistent cache, which reduces recompilation time on consecutive calls of using Package after restarting Julia.
Remaining work is to test integration with downstream packages such as Enzyme, Oceananigans, CUDA, AMDGPU, etc. Additionally, there are potentially some algorithmic improvements to the merge algorithm that would bring precompile times with and without this feature more in line.
src/precompilation_cache.jl
Outdated
Reloads Global Cache from global variable which stores the previous
cached results
"""
function reinit_cache(LOCAL_CACHE)
What is this used for?
Oops, remnant code.
Dead code, removed
cc @aviatesk, this may be relevant to DAECompiler (as a workaround, until we have the ability to update another module's globals, i.e., a ci cache). |
We see a greater performance improvement when this is used during Enzyme.jl's precompilation phase: EnzymeAD/Enzyme.jl#760
export ci_cache_snapshot, ci_cache_delta, ci_cache_insert, precompile_gpucompiler

function ci_cache_snapshot()
    cleaned_cache_to_save = IdDict()
Is this just copy(GPUCompiler.GLOBAL_CI_CACHES)?
There is an additional pass when constructing the CodeCache that removes CodeInstances with finite world ranges. I could split that process into two phases, copying and then filtering. I thought that since we were already doing one pass over the data, we could add the filtering in directly.
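As a rough sketch of the trade-off described in this reply, the two approaches can be compared on a simplified cache. `Entry`, `unbounded`, `copy_then_filter`, and `copy_and_filter` are hypothetical names, and the `Dict` of vectors is an assumed simplification of the real CodeCache structure.

```julia
# Hypothetical entry type: `max_world` mirrors the CodeInstance field discussed above.
struct Entry
    name::String
    max_world::UInt
end

unbounded(e::Entry) = e.max_world == typemax(UInt)

# Two phases: copy everything, then drop bounded entries from the copy.
function copy_then_filter(cache::Dict{Symbol,Vector{Entry}})
    out = Dict(k => copy(v) for (k, v) in cache)
    for v in values(out)
        filter!(unbounded, v)
    end
    return out
end

# One pass: only the entries that survive the filter are ever copied.
copy_and_filter(cache::Dict{Symbol,Vector{Entry}}) =
    Dict(k => filter(unbounded, v) for (k, v) in cache)

cache = Dict(:native => [Entry("a", 10), Entry("b", typemax(UInt))])
two_phase = copy_then_filter(cache)
one_pass = copy_and_filter(cache)
```

Both produce the same result and leave the source cache untouched; the single pass simply avoids allocating copies of entries that are discarded anyway.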
Adds the ability to precompile code in GPUCompiler.GLOBAL_CI_CACHES. This taps into the non-GPU caching of global constants to write out the current instance of the global cache and reload it on initialization. It requires the user to declare, initialize, and snapshot a local cache; the user then calls GPUCompiler.precompile_gpucompiler. Mainly, this adds an API that downstream packages such as Enzyme and CUDA can use to cache instances of their functions. A sample SimpleGPU and Example.jl illustrate usage.