Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Context and Endpoint classes to enable non-Communicator use-cases #166

Merged
merged 17 commits into from
Sep 6, 2023

Conversation

olsaarik
Copy link
Contributor

This PR implements and closes #137. The new Endpoint and Context classes expose the connection establishing functionality from Communicator, which now is only responsible for tying together the bootstrapper with a context.

The largest breaking change here is that Communicator.connectOnSetup(...) now returns the Connection wrapped inside a NonblockingFuture. This is because with the way Context is implemented a Connection is now fully initialized on construction.

Some smaller breaking API changes from this change are that RegisteredMemory no longer has a rank() function (as there maybe no concept of rank), and similarly Connection has no remoteRank() and tag() functions. The latter are replaced by remoteRankOf and tagOf functions in Communicator.

A new EndpointConfig class is introduced to avoid duplication of the IB configuration parameters in the APIs of Context and Communicator. The usual usage pattern of just passing in a Transport still works due to an implicit conversion into EndpointConfig.

Miscellaneous changes:
-Cleans up how the PIMPL pattern is applied by making both the Impl struct and the pimpl_ pointers private for all relevant classes in the core API.
-Enables ctest to be run from the build root directory.

@olsaarik
Copy link
Contributor Author

olsaarik commented Sep 1, 2023

I rebased to main, which had a conflict related to the closing of the CUDA IPC handle on RegisteredMemory deconstruction. I ended up improving the whole RegisteredMemory serialization such that now the whole thing round-trips through multiple rounds of serialization/deserialization. There is a new originalDataPtr field that always stores the pointer passed in by the user when the memory was registered. Then depending on whether the memory being deserialized is local to the process and/or machine, the pointer returned by data() will either be the original pointer, a CUDA IPC pointer or nullptr.

I also resurrected one test that was related to this (it had an unrelated bug).

Copy link
Contributor

@Binyang2014 Binyang2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Could you provide a demonstration on how to set up a connection using Context? This would make the workflow easier to understand.

src/context.cc Show resolved Hide resolved
src/registered_memory.cc Show resolved Hide resolved
Copy link
Contributor

@chhwang chhwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great

include/mscclpp/core.hpp Outdated Show resolved Hide resolved
@olsaarik
Copy link
Contributor Author

olsaarik commented Sep 5, 2023

@Binyang2014 for the demonstration of how to use Context, does the comment at the top of the class declaration look good? I will also add a Python equivalent testing that kind of client-server scenario later (tracked here #173).

@Binyang2014
Copy link
Contributor

@Binyang2014 for the demonstration of how to use Context, does the comment at the top of the class declaration look good? I will also add a Python equivalent testing that kind of client-server scenario later (tracked here #173).

Yes, the comment looks good, but sample code would be even easier to follow. The idea of Python testing sounds good to me.

@chhwang chhwang merged commit 828be48 into main Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Connection without Bootstrap
3 participants