API library#
-
struct ncclConfig_t#
Communicator configuration.
Users can assign values to attributes to specify the behavior of a communicator.
Public Members
-
size_t size#
Should not be touched
-
unsigned int magic#
Should not be touched
-
unsigned int version#
Should not be touched
-
int blocking#
Whether calls should block
-
int cgaClusterSize#
Cooperative group array cluster size
-
int minCTAs#
Minimum number of cooperative thread arrays (blocks)
-
int maxCTAs#
Maximum number of cooperative thread arrays (blocks)
-
const char *netName#
Force NCCL to use a specific network
-
int splitShare#
Allow communicators to share resources
-
int trafficClass#
Traffic class
-
const char *commName#
Name of the communicator
-
int collnetEnable#
Check for collnet enablement
-
int CTAPolicy#
CTA policy
-
int shrinkShare#
Shrink size
-
int nvlsCTAs#
Number of NVLS cooperative thread arrays (blocks)
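The fields above are typically set starting from NCCL_CONFIG_INITIALIZER, which fills every attribute with its NCCL_CONFIG_UNDEF_* default so that unset fields keep their built-in behavior. A minimal sketch (assuming ncclCommInitRankConfig and an already-distributed ncclUniqueId named id; the values chosen are illustrative only):

```c
/* Sketch: initialize every field to its undefined default, then
 * override only the attributes of interest. */
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0;   /* non-blocking communicator */
config.minCTAs  = 4;   /* lower bound on blocks per collective */
config.maxCTAs  = 16;  /* upper bound on blocks per collective */

ncclComm_t comm;
/* nranks, rank, and id are assumed to have been set up already */
ncclResult_t res = ncclCommInitRankConfig(&comm, nranks, id, rank, &config);
```

With blocking set to 0, initialization may return ncclInProgress, in which case completion can be polled with ncclCommGetAsyncError.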
-
struct ncclSimInfo_t#
-
struct ncclUniqueId#
Opaque unique id used to initialize communicators.
The ncclUniqueId must be passed to all participating ranks
Public Members
-
char internal[NCCL_UNIQUE_ID_BYTES]#
Opaque array
- file mainpage.txt
- file nccl.h.in
- #include <hip/hip_runtime.h>
- #include <hip/hip_fp16.h>
- #include <limits.h>
Defines
-
NCCL_H_#
-
NCCL_MAJOR#
-
NCCL_MINOR#
-
NCCL_PATCH#
-
NCCL_SUFFIX#
-
NCCL_VERSION_CODE#
-
NCCL_VERSION(X, Y, Z)#
-
RCCL_BFLOAT16#
-
RCCL_FLOAT8#
-
RCCL_GATHER_SCATTER#
-
RCCL_ALLTOALLV#
-
RCCL_ALLREDUCE_WITH_BIAS#
-
NCCL_COMM_NULL#
-
NCCL_UNIQUE_ID_BYTES#
-
NCCL_CONFIG_UNDEF_INT#
-
NCCL_CONFIG_UNDEF_PTR#
-
NCCL_SPLIT_NOCOLOR#
-
NCCL_UNDEF_FLOAT#
-
NCCL_WIN_DEFAULT#
-
NCCL_WIN_COLL_SYMMETRIC#
-
NCCL_CTA_POLICY_DEFAULT#
-
NCCL_CTA_POLICY_EFFICIENCY#
-
NCCL_SHRINK_DEFAULT#
-
NCCL_SHRINK_ABORT#
-
NCCL_SIM_INFO_INITIALIZER#
Typedefs
-
typedef struct ncclComm *ncclComm_t#
Opaque handle to communicator.
A communicator contains information required to facilitate collective communication calls
-
typedef struct ncclWindow *ncclWindow_t#
Functions
-
ncclResult_t ncclMemAlloc(void **ptr, size_t size)#
-
ncclResult_t pncclMemAlloc(void **ptr, size_t size)#
-
ncclResult_t ncclMemFree(void *ptr)#
-
ncclResult_t pncclMemFree(void *ptr)#
-
ncclResult_t ncclCommShrink(ncclComm_t comm, int *excludeRanksList, int excludeRanksCount, ncclComm_t *newcomm, ncclConfig_t *config, int shrinkFlags)#
Shrink existing communicator.
Ranks in excludeRanksList will be removed from the existing communicator. Within the new communicator, ranks will be re-ordered to fill the gaps left by removed ones. If config is NULL, the new communicator will inherit the original communicator's configuration. The flag enables NCCL to adapt to various states of the parent communicator; see the NCCL_SHRINK flags.
- Parameters:
comm – [in] Original communicator object for this rank
excludeRanksList – [in] List of ranks to be excluded
excludeRanksCount – [in] Number of ranks to be excluded
newcomm – [out] Pointer to new communicator
config – [in] Configuration for the new communicator. May be NULL to inherit from comm
shrinkFlags – [in] Flag to adapt to various states of the parent communicator (see NCCL_SHRINK flags)
- Returns:
Result code. See Result Codes for more details.
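A usage sketch (assuming comm is an existing 8-rank communicator; the excluded rank and flag choice are illustrative only):

```c
/* Sketch: remove rank 3 from an existing communicator. The
 * remaining ranks are renumbered contiguously in newcomm. */
int exclude[] = {3};
ncclComm_t newcomm;
ncclResult_t res = ncclCommShrink(comm, exclude, 1, &newcomm,
                                  NULL /* inherit config */,
                                  NCCL_SHRINK_DEFAULT);
```

NCCL_SHRINK_ABORT is listed among the defines above and, per its name, would be the flag to use when the parent communicator is in an error state.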
-
ncclResult_t pncclCommShrink(ncclComm_t comm, int *excludeRanksList, int excludeRanksCount, ncclComm_t *newcomm, ncclConfig_t *config, int shrinkFlags)#
-
ncclResult_t ncclCommInitRankScalable(ncclComm_t *newcomm, int nranks, int myrank, int nId, ncclUniqueId *commIds, ncclConfig_t *config)#
Creates a new communicator (multi thread/process version), similar to ncclCommInitRankConfig.
Allows using more than one ncclUniqueId (up to one per rank), indicated by nId, to accelerate the init operation. The number of ncclUniqueIds and their order must be the same on every rank.
- Parameters:
newcomm – [out] Pointer to new communicator
nranks – [in] Total number of ranks participating in this communicator
myrank – [in] Current rank
nId – [in] Number of unique IDs
commIds – [in] List of unique IDs
config – [in] Configuration for the new communicator. May be NULL for the default configuration
- Returns:
Result code. See Result Codes for more details.
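A usage sketch (assuming two root ranks generate the IDs and distribute them to all ranks; how the IDs are distributed, e.g. via MPI, is application-specific):

```c
/* Sketch: initialize with 2 unique IDs. Every rank must pass the
 * same IDs in the same order. */
ncclUniqueId ids[2];
/* ...each of the two root ranks calls ncclGetUniqueId, and the
 * resulting IDs are broadcast to all ranks in a fixed order... */
ncclComm_t comm;
ncclResult_t res = ncclCommInitRankScalable(&comm, nranks, myrank,
                                            2, ids, NULL);
```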
-
ncclResult_t ncclCommRegister(const ncclComm_t comm, void *buff, size_t size, void **handle)#
-
ncclResult_t ncclCommDeregister(const ncclComm_t comm, void *handle)#
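These registration calls carry no description in this listing; the following is a plausible usage sketch only (assuming the buffer comes from ncclMemAlloc and is later used by collectives on comm):

```c
/* Sketch: allocate a buffer, register it with the communicator so
 * collectives can take a fast path on it, then tear everything down
 * in reverse order. */
void *buff = NULL;
void *handle = NULL;
ncclMemAlloc(&buff, bytes);
ncclCommRegister(comm, buff, bytes, &handle);
/* ...collective operations using buff... */
ncclCommDeregister(comm, handle);
ncclMemFree(buff);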
-
ncclResult_t ncclCommWindowRegister(ncclComm_t comm, void *buff, size_t size, ncclWindow_t *win, int winFlags)#
-
ncclResult_t ncclCommWindowDeregister(ncclComm_t comm, ncclWindow_t win)#
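Similarly undocumented here; a sketch under the assumption that winFlags takes one of the NCCL_WIN_* defines listed above:

```c
/* Sketch: register a window over a buffer for symmetric collective
 * use, then deregister it before freeing the buffer. */
ncclWindow_t win;
ncclCommWindowRegister(comm, buff, bytes, &win, NCCL_WIN_COLL_SYMMETRIC);
/* ...collectives using the window... */
ncclCommWindowDeregister(comm, win);
```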
-
ncclResult_t ncclGroupSimulateEnd(ncclSimInfo_t *simInfo)#
-
ncclResult_t pncclGroupSimulateEnd(ncclSimInfo_t *simInfo)#
- group Result Codes
The various result codes that RCCL API calls may return
Enums
-
enum ncclResult_t#
Result type.
Return codes aside from ncclSuccess indicate that a call has failed
Values:
-
enumerator ncclSuccess#
No error
-
enumerator ncclUnhandledCudaError#
Unhandled HIP error
-
enumerator ncclSystemError#
Unhandled system error
-
enumerator ncclInternalError#
Internal Error - Please report to RCCL developers
-
enumerator ncclInvalidArgument#
Invalid argument
-
enumerator ncclInvalidUsage#
Invalid usage
-
enumerator ncclRemoteError#
Remote process exited or there was a network error
-
enumerator ncclInProgress#
RCCL operation in progress
-
enumerator ncclNumResults#
Number of result types
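A common pattern is to wrap every API call in a checking macro. This sketch assumes ncclGetErrorString, which converts a ncclResult_t into a human-readable message; treating ncclInProgress as non-fatal matches its meaning for non-blocking communicators:

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: abort with a readable message on any failing RCCL call. */
#define NCCLCHECK(call)                                            \
    do {                                                           \
        ncclResult_t res_ = (call);                                \
        if (res_ != ncclSuccess && res_ != ncclInProgress) {       \
            fprintf(stderr, "RCCL error %s:%d: %s\n",              \
                    __FILE__, __LINE__, ncclGetErrorString(res_)); \
            exit(1);                                               \
        }                                                          \
    } while (0)
```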
- group Communicator Configuration
Structure that allows for customizing Communicator behavior via ncclCommInitRankConfig
Defines
-
NCCL_CONFIG_INITIALIZER#
- group Version Information
API call that returns RCCL version
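The version is reported as a single packed integer. The sketch below reproduces the packing behavior of the NCCL_VERSION macro defined above, as an assumption based on recent nccl.h headers (the encoding changed at release 2.9 to make room for two-digit minor versions):

```c
/* Assumed reproduction of the NCCL_VERSION macro: releases up to
 * 2.8 used major*1000; later releases use major*10000. */
#define NCCL_VERSION(X, Y, Z)                              \
    (((X) <= 2 && (Y) <= 8) ? (X) * 1000 + (Y) * 100 + (Z) \
                            : (X) * 10000 + (Y) * 100 + (Z))
```

Under this assumption, NCCL_VERSION(2, 18, 3) evaluates to 21803, the value a 2.18.3 build would report at runtime.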
- group Communicator Initialization/Destruction
API calls that operate on communicators. Communicator objects are used to launch collective communication operations. Unique ranks between 0 and N-1 must be assigned to each HIP device participating in the same communicator. Using the same HIP device for multiple ranks of the same communicator is not supported at this time.
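A one-rank-per-process initialization sketch (the use of MPI to distribute the unique ID is an assumption; any transport that delivers the same bytes to every rank works):

```c
/* Sketch: rank 0 creates the unique ID, every rank receives it,
 * selects its HIP device, and joins the communicator. */
ncclUniqueId id;
ncclComm_t comm;
if (myRank == 0) ncclGetUniqueId(&id);
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
hipSetDevice(localDevice);
ncclCommInitRank(&comm, nRanks, id, myRank);
/* ...collective operations... */
ncclCommDestroy(comm);
```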
- group Error Checking Calls
API calls that check for errors
- group Communicator Information
API calls that query communicator information
- group API Enumerations
Enumerations used by collective communication calls
Enums
-
enum ncclRedOp_dummy_t#
Dummy reduction enumeration.
Dummy reduction enumeration used to determine value for ncclMaxRedOp
Values:
-
enumerator ncclNumOps_dummy#
-
enum ncclRedOp_t#
Reduction operation selector.
Enumeration used to specify the various reduction operations. ncclNumOps is the number of built-in ncclRedOp_t values and serves as the least possible value for dynamic ncclRedOp_t values constructed by the ncclRedOpCreate functions.
ncclMaxRedOp is the largest valid value for ncclRedOp_t and is defined to be the largest signed value (since compilers are permitted to use signed enums) that won’t grow sizeof(ncclRedOp_t) when compared to previous RCCL versions to maintain ABI compatibility.
Values:
-
enumerator ncclSum#
Sum
-
enumerator ncclProd#
Product
-
enumerator ncclMax#
Max
-
enumerator ncclMin#
Min
-
enumerator ncclAvg#
Average
-
enumerator ncclNumOps#
Number of built-in reduction ops
-
enumerator ncclMaxRedOp#
Largest value for ncclRedOp_t
-
enum ncclDataType_t#
Data types.
Enumeration of the various supported datatypes
Values:
-
enumerator ncclInt8#
-
enumerator ncclChar#
-
enumerator ncclUint8#
-
enumerator ncclInt32#
-
enumerator ncclInt#
-
enumerator ncclUint32#
-
enumerator ncclInt64#
-
enumerator ncclUint64#
-
enumerator ncclFloat16#
-
enumerator ncclHalf#
-
enumerator ncclFloat32#
-
enumerator ncclFloat#
-
enumerator ncclFloat64#
-
enumerator ncclDouble#
-
enumerator ncclBfloat16#
-
enumerator ncclFloat8e4m3#
-
enumerator ncclFloat8e5m2#
-
enumerator ncclNumTypes#
- group Custom Reduction Operator
API calls relating to creating/destroying custom reduction operators that pre-multiply local source arrays prior to reduction
Enums
-
enum ncclScalarResidence_t#
Location and dereferencing logic for scalar arguments.
Enumeration specifying memory location of the scalar argument. Based on where the value is stored, the argument will be dereferenced either while the collective is running (if in device memory), or before the ncclRedOpCreate() function returns (if in host memory).
Values:
-
enumerator ncclScalarDevice#
Scalar is in device-visible memory
-
enumerator ncclScalarHostImmediate#
Scalar is in host-visible memory
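A sketch of the pre-multiply-sum operator described above (assuming ncclRedOpCreatePreMulSum and ncclRedOpDestroy; since the scalar resides in host memory, ncclScalarHostImmediate means it is dereferenced before create returns):

```c
/* Sketch: build a custom "multiply by 0.5, then sum" reduction,
 * i.e. an average over two contributions. */
float scalar = 0.5f;
ncclRedOp_t premulSum;
ncclRedOpCreatePreMulSum(&premulSum, &scalar, ncclFloat,
                         ncclScalarHostImmediate, comm);
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, premulSum,
              comm, stream);
ncclRedOpDestroy(premulSum, comm);
```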
- group Collective Communication Operations
Collective communication operations must be called separately for each communicator in a communicator clique.
They return when the operations have been enqueued on the HIP stream. Since they may perform inter-CPU synchronization, each call has to be made from a different thread or process, or must use group semantics (see below).
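The enqueue-then-synchronize pattern can be sketched as follows (buffer, count, comm, and stream are assumed to exist; passing the same pointer for send and receive performs the operation in place):

```c
/* Sketch: in-place all-reduce summation across the clique. The call
 * returns once enqueued on stream, not when the reduction is done. */
ncclAllReduce(buff /* sendbuff */, buff /* recvbuff, in place */,
              count, ncclFloat, ncclSum, comm, stream);
hipStreamSynchronize(stream); /* wait for completion */
```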
- group MSCCL Algorithm
API calls relating to the optional MSCCL algorithm datapath
Typedefs
-
typedef int mscclAlgoHandle_t#
Opaque handle to MSCCL algorithm.
- group Group semantics
When managing multiple GPUs from a single thread, and since RCCL collective calls may perform inter-CPU synchronization, we need to “group” calls for different ranks/devices into a single call.
Grouping RCCL calls as being part of the same collective operation is done using ncclGroupStart and ncclGroupEnd. ncclGroupStart will enqueue all collective calls until the ncclGroupEnd call, which will wait for all calls to be complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations are enqueued on the streams, not that the operation is effectively done.
Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd, but not together.
Group semantics also allow fusing multiple operations on the same device to improve performance (for aggregated collective calls), or permit concurrent progress of multiple send/receive operations.
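The single-thread, multi-GPU case described above can be sketched as follows (comms, streams, and the buffers are assumed to have been set up, one per device; without the group calls the first collective could block waiting on the others):

```c
/* Sketch: one thread driving nDev GPUs inside one group. */
ncclGroupStart();
for (int i = 0; i < nDev; i++) {
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat,
                  ncclSum, comms[i], streams[i]);
}
ncclGroupEnd(); /* operations are enqueued, not necessarily complete */
```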
- page Deprecated List
- Global mscclLoadAlgo (const char *mscclAlgoFilePath, mscclAlgoHandle_t *mscclAlgoHandle, int rank)
This function has been removed from the public API.
- Global mscclRunAlgo (const void *sendBuff, const size_t sendCounts[], const size_t sDisPls[], void *recvBuff, const size_t recvCounts[], const size_t rDisPls[], size_t count, ncclDataType_t dataType, int root, int peer, ncclRedOp_t op, mscclAlgoHandle_t mscclAlgoHandle, ncclComm_t comm, hipStream_t stream)
This function has been removed from the public API.
- Global mscclUnloadAlgo (mscclAlgoHandle_t mscclAlgoHandle)
This function has been removed from the public API.
- dir src
- page RCCL Documentation
Introduction#
RCCL (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. There is also initial support for direct GPU-to-GPU send and receive operations. It has been optimized to achieve high bandwidth on platforms using PCIe and xGMI, as well as on networks using InfiniBand Verbs or TCP/IP sockets. RCCL supports an arbitrary number of GPUs installed in a single node or multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.
The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency. For best performance, small operations can be either batched into larger operations or aggregated through the API.
RCCL API Contents#
RCCL API File#