Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we remove haskey(ENV, "PMI_RANK") || haskey(ENV, "OMPI_COMM_WORLD_RANK") check in context_type? #94

Open
charleskawczynski opened this issue Oct 1, 2024 · 4 comments

Comments

@charleskawczynski
Copy link
Member

This seems to be causing issues for people when they don't explicitly request MPI (and it automatically gets scooped up). I think it might be less surprising if we change

function context_type()
    name = get(ENV, "CLIMACOMMS_CONTEXT", nothing)
    if !isnothing(name)
        if name == "MPI"
            return :MPICommsContext
        elseif name == "SINGLETON"
            return :SingletonCommsContext
        else
            error("Invalid context: $name")
        end
    end
    # detect common environment variables used by MPI launchers
    #   PMI_RANK appears to be used by MPICH and srun
    #   OMPI_COMM_WORLD_RANK appears to be used by OpenMPI
    if haskey(ENV, "PMI_RANK") || haskey(ENV, "OMPI_COMM_WORLD_RANK")
        return :MPICommsContext
    else
        return :SingletonCommsContext
    end
end

to

function context_type()
    name = get(ENV, "CLIMACOMMS_CONTEXT", nothing)
    if !isnothing(name)
        if name == "MPI"
            return :MPICommsContext
        elseif name == "SINGLETON"
            return :SingletonCommsContext
        else
            error("Invalid context: $name")
        end
    else
      return :SingletonCommsContext
    end
end
@haakon-e
Copy link
Member

I was just hit by this, and am for the time being working around it by doing a few additional checks, e.g.:

auto_ctx = ClimaComms.context_type() == :MPICommsContext
two_or_more_gpus = length(split(cuda_devices, ','))  2
two_or_more_slurm_tasks = parse(Int, get(ENV, "SLURM_NTASKS", 0))  2
is_mpi_ctx = auto_ctx && (two_or_more_gpus || two_or_more_slurm_tasks)

main_context = if is_mpi_ctx
    import MPI
    ClimaComms.MPICommsContext(device)
else
    ClimaComms.SingletonCommsContext(device)
end

@nefrathenrici
Copy link
Member

I support this change, using both the CLIMACOMMS_CONTEXT and the MPI variables leads to unexpected behavior, we should improve this check or just use the CLIMACOMMS_CONTEXT to set the context.

@nefrathenrici
Copy link
Member

The main issue I have run into is that srun sets PMI_RANK, so we should change this variable check to something like MPICH_VERSION, or cast a wider net of variable checks to be safer.

@charleskawczynski
Copy link
Member Author

charleskawczynski commented Dec 19, 2024

Maybe we could change the context to contain a string / symbol indicating why that context was returned. Then, users could print:

context = ClimaComms.context()
@show ClimaComms.why(context) # prints, e.g., :explicitly_requested

This could be really helpful on the user-side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants