RCCL library specification
This document describes the RCCL library API.
Communicator functions
Functions in this group: ncclGetUniqueId, ncclCommInitRank, ncclCommInitAll, ncclCommDestroy, ncclCommAbort, ncclCommCount, ncclCommCuDevice, ncclCommUserRank.
Collective communication operations
Collective communication operations must be called separately for each communicator in a communicator clique.
They return once the operations have been enqueued on the HIP stream.
Because they may perform inter-CPU synchronization, each call must be made from a different thread or process, or the calls must use group semantics (see below).
Functions in this group: ncclReduce, ncclBcast, ncclBroadcast, ncclAllReduce, ncclReduceScatter, ncclAllGather, ncclSend, ncclRecv, ncclGather, ncclScatter, ncclAllToAll.
Group semantics
When a single thread manages multiple GPUs, the calls for the different ranks/devices must be "grouped" into a single call, because RCCL collective calls may perform inter-CPU synchronization.
Calls are marked as part of the same collective operation by bracketing them with ncclGroupStart and ncclGroupEnd. ncclGroupStart enqueues all collective calls until the ncclGroupEnd call, which waits for all of those calls to complete. Note that for collective communication, ncclGroupEnd guarantees only that the operations have been enqueued on the streams, not that the operations themselves have finished.
Both collective communication calls and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd.
Functions in this group: ncclGroupStart, ncclGroupEnd.
Library functions
Functions in this group: ncclGetVersion, ncclGetErrorString.
Types
A few data structures are internal to the library. The types used to refer to these structures are given below. Use these types to create handles and to pass them between library functions.
-
typedef struct ncclComm *ncclComm_t
Opaque handle to communicator.
A communicator contains the information required to perform collective communication calls.
-
struct ncclUniqueId
Opaque unique id used to initialize communicators.
The ncclUniqueId must be passed to all participating ranks.
Enumerations
This section lists all the enumerations used by the library.
-
enum ncclResult_t
Result type.
Any return code other than ncclSuccess indicates that a call has failed.
Values:
-
enumerator ncclSuccess
No error
-
enumerator ncclUnhandledCudaError
Unhandled HIP error
-
enumerator ncclSystemError
Unhandled system error
-
enumerator ncclInternalError
Internal Error - Please report to RCCL developers
-
enumerator ncclInvalidArgument
Invalid argument
-
enumerator ncclInvalidUsage
Invalid usage
-
enumerator ncclRemoteError
Remote process exited or there was a network error
-
enumerator ncclInProgress
RCCL operation in progress
-
enumerator ncclNumResults
Number of result types
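As an illustration of how these result codes are typically consumed, the following is a minimal sketch, not the library's own implementation. The enum is a local stand-in so that the block compiles without RCCL installed, and it assumes the enumerators are numbered in the order listed above, starting at 0. The helpers `describe_result` and `call_failed` are hypothetical names introduced here; in real code, the library function ncclGetErrorString fills the role of the first one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the enum declared in the RCCL header, so that this
 * sketch compiles without RCCL installed. Real code should include the
 * header instead. Assumption: enumerators are numbered in the order
 * listed in this document, starting at 0. */
typedef enum {
  ncclSuccess = 0,
  ncclUnhandledCudaError,
  ncclSystemError,
  ncclInternalError,
  ncclInvalidArgument,
  ncclInvalidUsage,
  ncclRemoteError,
  ncclInProgress,
  ncclNumResults
} ncclResult_t;

/* Hypothetical helper (not part of RCCL): returns the description text
 * given for each enumerator in this document. */
static const char *describe_result(ncclResult_t r) {
  static const char *const text[ncclNumResults] = {
    [ncclSuccess]            = "No error",
    [ncclUnhandledCudaError] = "Unhandled HIP error",
    [ncclSystemError]        = "Unhandled system error",
    [ncclInternalError]      = "Internal Error - Please report to RCCL developers",
    [ncclInvalidArgument]    = "Invalid argument",
    [ncclInvalidUsage]       = "Invalid usage",
    [ncclRemoteError]        = "Remote process exited or there was a network error",
    [ncclInProgress]         = "RCCL operation in progress",
  };
  if ((int)r < 0 || (int)r >= (int)ncclNumResults) {
    return "Unknown result code";
  }
  assert(text[r] != NULL); /* every enumerator should have an entry */
  return text[r];
}

/* Hypothetical helper: per the description above, anything other than
 * ncclSuccess is treated as a failed call. */
static bool call_failed(ncclResult_t r) {
  return r != ncclSuccess;
}
```

A caller would typically test `call_failed(result)` after each library call and log `describe_result(result)` before cleaning up.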
-
enum ncclRedOp_t
Reduction operation selector.
Enumeration used to specify the various reduction operations. ncclNumOps is the number of built-in ncclRedOp_t values and serves as the least possible value for dynamic ncclRedOp_t values constructed by the ncclRedOpCreate functions.
ncclMaxRedOp is the largest valid value for ncclRedOp_t. To maintain ABI compatibility, it is defined as the largest signed value (since compilers are permitted to use signed enums) that does not increase sizeof(ncclRedOp_t) compared to previous RCCL versions.
Values:
-
enumerator ncclSum
Sum
-
enumerator ncclProd
Product
-
enumerator ncclMax
Max
-
enumerator ncclMin
Min
-
enumerator ncclAvg
Average
-
enumerator ncclNumOps
Number of built-in reduction ops
-
enumerator ncclMaxRedOp
Largest value for ncclRedOp_t
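The ranges described above can be sketched as a small classifier. This is an illustration under stated assumptions, not library code: the enum is a local stand-in so the block compiles without RCCL, the built-in operations are assumed to be numbered in the listed order starting at 0, and ncclMaxRedOp is assumed to be 0x7fffffff, which corresponds to a 4-byte enum. The names `op_kind_t`, `classify_op`, and `example_usage` are hypothetical.

```c
#include <assert.h>

/* Local stand-in for the enum declared in the RCCL header.
 * Assumptions: built-in ops are numbered in the listed order from 0,
 * and ncclMaxRedOp is the largest signed 32-bit value. */
typedef enum {
  ncclSum = 0,
  ncclProd = 1,
  ncclMax = 2,
  ncclMin = 3,
  ncclAvg = 4,
  ncclNumOps = 5,
  ncclMaxRedOp = 0x7fffffff
} ncclRedOp_t;

/* Hypothetical classification of a raw operation value. */
typedef enum {
  OP_BUILTIN,       /* one of the built-in reduction operations */
  OP_DYNAMIC_RANGE, /* in the range reserved for dynamically created ops */
  OP_INVALID        /* outside the valid range of ncclRedOp_t */
} op_kind_t;

/* Takes a wide integer so that out-of-range inputs can be represented.
 * A value in the dynamic range is only potentially valid: this sketch
 * cannot tell whether an operation with that value was actually created. */
static op_kind_t classify_op(long long value) {
  if (value < 0 || value > (long long)ncclMaxRedOp) {
    return OP_INVALID;
  }
  if (value < (long long)ncclNumOps) {
    return OP_BUILTIN;
  }
  return OP_DYNAMIC_RANGE;
}

/* Short usage example. */
static void example_usage(void) {
  assert(classify_op(ncclSum) == OP_BUILTIN);
  assert(classify_op(ncclNumOps) == OP_DYNAMIC_RANGE);
  assert(classify_op(-1) == OP_INVALID);
}
```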
-
enum ncclDataType_t
Data types.
Enumeration of the supported data types.
Values:
-
enumerator ncclInt8
-
enumerator ncclChar
-
enumerator ncclUint8
-
enumerator ncclInt32
-
enumerator ncclInt
-
enumerator ncclUint32
-
enumerator ncclInt64
-
enumerator ncclUint64
-
enumerator ncclFloat16
-
enumerator ncclHalf
-
enumerator ncclFloat32
-
enumerator ncclFloat
-
enumerator ncclFloat64
-
enumerator ncclDouble
-
enumerator ncclBfloat16
-
enumerator ncclFloat8e4m3
-
enumerator ncclFloat8e5m2
-
enumerator ncclNumTypes
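Because the communication functions take an element count together with a data type, it is often useful to convert the pair into a byte size when allocating buffers. The following is a minimal sketch of that conversion. The enum is a local stand-in so the block compiles without RCCL; the numeric values, including the aliases (for example ncclChar for ncclInt8 and ncclHalf for ncclFloat16), are assumptions that should be checked against the installed header. The helpers `dtype_size_bytes` and `buffer_bytes` are hypothetical names, not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the enum declared in the RCCL header.
 * Assumption: values follow the listed order, and names that denote the
 * same type share one value. */
typedef enum {
  ncclInt8 = 0, ncclChar = 0,
  ncclUint8 = 1,
  ncclInt32 = 2, ncclInt = 2,
  ncclUint32 = 3,
  ncclInt64 = 4,
  ncclUint64 = 5,
  ncclFloat16 = 6, ncclHalf = 6,
  ncclFloat32 = 7, ncclFloat = 7,
  ncclFloat64 = 8, ncclDouble = 8,
  ncclBfloat16 = 9,
  ncclFloat8e4m3 = 10,
  ncclFloat8e5m2 = 11,
  ncclNumTypes = 12
} ncclDataType_t;

/* Hypothetical helper: size of one element in bytes, or 0 if the value
 * is not a data type (for example ncclNumTypes). */
static size_t dtype_size_bytes(ncclDataType_t t) {
  switch (t) {
    case ncclInt8:
    case ncclUint8:
    case ncclFloat8e4m3:
    case ncclFloat8e5m2:
      return 1;
    case ncclFloat16:
    case ncclBfloat16:
      return 2;
    case ncclInt32:
    case ncclUint32:
    case ncclFloat32:
      return 4;
    case ncclInt64:
    case ncclUint64:
    case ncclFloat64:
      return 8;
    default:
      return 0;
  }
}

/* Hypothetical helper: bytes needed for `count` elements of type `t`.
 * Returns 0 for an unknown type. */
static size_t buffer_bytes(size_t count, ncclDataType_t t) {
  size_t size = dtype_size_bytes(t);
  assert(size == 0 || count <= SIZE_MAX / size); /* guard against overflow */
  return count * size;
}
```

For example, `buffer_bytes(1024, ncclFloat)` gives the allocation size for a buffer of 1024 single-precision elements.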