Turn MPI and CUDA to extensions and standardize how to context and device and picked #74

Closed
Sbozzolo opened this issue Apr 1, 2024 · 0 comments · Fixed by #75

Sbozzolo commented Apr 1, 2024

As things stand, everything that depends on ClimaComms inherits its dependencies: MPI and CUDA. These are not lightweight, especially CUDA. I think that package extensions might work well here (this is what #4 proposed, and extensions are now a good way to implement it).

The only downside is that downstream packages will have to explicitly depend on and load MPI/CUDA. This would be quite painful to implement because we rely heavily on automatic selection of the context and the device.

The key functions are:

    function context(device = device())
        name = get(ENV, "CLIMACOMMS_CONTEXT", nothing)
        if !isnothing(name)
            if name == "MPI"
                return MPICommsContext()
            elseif name == "SINGLETON"
                return SingletonCommsContext()
            else
                error("Invalid context: $name")
            end
        end
        # detect common environment variables used by MPI launchers
        #  PMI_RANK appears to be used by MPICH and srun
        #  OMPI_COMM_WORLD_RANK appears to be used by OpenMPI
        if haskey(ENV, "PMI_RANK") || haskey(ENV, "OMPI_COMM_WORLD_RANK")
            return MPICommsContext(device)
        else
            return SingletonCommsContext(device)
        end
    end

and

    function device()
        env_var = get(ENV, "CLIMACOMMS_DEVICE", nothing)
        if !isnothing(env_var)
            if env_var == "CPU"
                return Threads.nthreads() > 1 ? CPUMultiThreaded() :
                       CPUSingleThreaded()
            elseif env_var == "CPUSingleThreaded"
                return CPUSingleThreaded()
            elseif env_var == "CPUMultiThreaded"
                return CPUMultiThreaded()
            elseif env_var == "CUDA"
                return CUDADevice()
            else
                error("Invalid CLIMACOMMS_DEVICE: $env_var")
            end
        end
        if CUDA.functional()
            return CUDADevice()
        else
            return Threads.nthreads() == 1 ? CPUSingleThreaded() :
                   CPUMultiThreaded()
        end
    end

These functions read an environment variable and, if it is not set, try to guess a reasonable device or context. This is nice because, for the most part, we don't have to worry about setting anything and the code just runs (but see also #67: sometimes the guesses are incorrect). Unfortunately, I think that guessing the correct device/context is incompatible with using extensions, because specific packages have to be loaded first (e.g., CUDA). With the current implementation of PR #75, one would have to load CUDA explicitly to use a GPU on a GPU-capable system; otherwise the device would be set to CPU. This is particularly annoying in CI runs, where we would have to add logic to handle the different cases and import the relevant packages.
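
To make the problem concrete, here is a minimal sketch of what guessing looks like once CUDA is no longer a hard dependency. It is not the code in PR #75: `cuda_loaded` and `guess_device_name` are hypothetical names, and the function returns a string rather than a device object so that it runs without ClimaComms. The guess can only see CUDA if something else in the session has already loaded it.

```julia
# Illustrative sketch only; not the code in PR #75.
# CUDA's UUID, the same one used in the proposal in this issue.
const CUDA_PKGID =
    Base.PkgId(Base.UUID("052768ef-5323-5732-b1bb-66c8b64840ba"), "CUDA")

# True only if some code in this session has already loaded CUDA.
cuda_loaded() = haskey(Base.loaded_modules, CUDA_PKGID)

# Hypothetical stand-in for `device()` that returns a name instead of a
# device object.
function guess_device_name()
    # A real implementation would also need to check `CUDA.functional()`.
    cuda_loaded() && return "CUDA"
    return Threads.nthreads() == 1 ? "CPUSingleThreaded" : "CPUMultiThreaded"
end
```

On a GPU node where nobody ran `using CUDA`, this falls back to a CPU device without any warning, which is the behavior that would have to be worked around in every CI script.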

Instead, I propose to deprecate the automatic selection of the device and let ClimaComms load the relevant module when asked.
This might look like:

function device()
    env_var = get(ENV, "CLIMACOMMS_DEVICE", nothing)
    if !isnothing(env_var)
        if env_var == "CPU"
            return Threads.nthreads() > 1 ? CPUMultiThreaded() :
                   CPUSingleThreaded()
        elseif env_var == "CPUSingleThreaded"
            return CPUSingleThreaded()
        elseif env_var == "CPUMultiThreaded"
            return CPUMultiThreaded()
        elseif env_var == "CUDA"
            pkgid = Base.PkgId(Base.UUID("052768ef-5323-5732-b1bb-66c8b64840ba"), "CUDA")
            if !haskey(Base.loaded_modules, pkgid)
                try
                    Base.eval(Main, :(using CUDA))
                catch err
                    error("Cannot load CUDA.jl. Make sure it is included in your environment stack.")
                end
            end
            return CUDADevice()
        else
            error("Invalid CLIMACOMMS_DEVICE: $env_var")
        end
    end
end

This loads CUDA when CLIMACOMMS_DEVICE=CUDA. The user's only responsibility is then to ensure that CUDA is in the environment they are using. The function will fail when the environment variable CLIMACOMMS_DEVICE is not set. (This solution currently loads CUDA into Main, but it may be better to load it within ClimaComms.)
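
One possible way to avoid touching Main is sketched below. This is an assumption about how it could be done, not what ClimaComms does: `load_package` is a hypothetical helper, and it relies on `Base.identify_package` and `Base.require`, which exist in Base but are not exported and are closer to loader internals than to a stable public API.

```julia
# Hypothetical helper, not part of ClimaComms: load a package by name
# without evaluating `using` in Main.
function load_package(name::String)
    # Look the package up in the active environment stack.
    # `Base.identify_package` returns `nothing` if it cannot be found.
    pkgid = Base.identify_package(name)
    isnothing(pkgid) && error(
        "Cannot find $name. Make sure it is included in your environment stack.",
    )
    # Returns the module, loading it first if it is not already loaded.
    return Base.require(pkgid)
end
```

In the `device()` proposal, a call like `load_package("CUDA")` could replace the `Base.eval(Main, :(using CUDA))` branch. The lookup by name also removes the need to hard-code the UUID, at the cost of trusting whatever package named CUDA is in the environment.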

@Sbozzolo Sbozzolo added the enhancement New feature or request label Apr 1, 2024
@Sbozzolo Sbozzolo added this to the Maintenance and Improvements milestone Apr 18, 2024
@Sbozzolo Sbozzolo self-assigned this Apr 22, 2024
@Sbozzolo Sbozzolo changed the title from "Turn MPI and CUDA to extensions" to "Turn MPI and CUDA to extensions and standardize how to context and device and picked" Apr 24, 2024