RCCL library specification#

This document describes the API of the RCCL library.

Communicator functions#

Functions:

ncclGetUniqueId

ncclCommInitRank

ncclCommInitAll

ncclCommDestroy

ncclCommAbort

ncclCommCount

ncclCommCuDevice

ncclCommUserRank

Collective communication operations#

Collective communication operations must be called separately for each communicator in a communicator clique.

They return once the operations have been enqueued on the HIP stream.

Because they may perform inter-CPU synchronization, each call must be made from a different thread or process, or must use group semantics (see below).

Functions:

ncclReduce

ncclBcast

ncclBroadcast

ncclAllReduce

ncclReduceScatter

ncclAllGather

ncclSend

ncclRecv

ncclGather

ncclScatter

ncclAllToAll

Group semantics#

When a single thread manages multiple GPUs, the NCCL collective calls for the different ranks/devices must be “grouped” into a single call, because each collective call may perform inter-CPU synchronization.

NCCL calls are grouped into the same collective operation with ncclGroupStart and ncclGroupEnd. After ncclGroupStart, all collective calls are queued until the ncclGroupEnd call, which waits for all of them to complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations have been enqueued on the streams, not that they have finished executing.

Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd.
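The queue-then-launch behavior described above can be pictured with a small host-side toy model. This is only a sketch of deferred submission, not RCCL's implementation: every name in it (`demoGroupStart`, `demoSubmit`, `demoGroupEnd`, `demoRecordLaunch`) is hypothetical, it touches no GPU, and the depth counter used for nesting is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PENDING 16

/* Hypothetical operation type: something to launch for a given rank. */
typedef void (*demoOpFn)(int rank);

static int      g_depth;                        /* number of open groups      */
static demoOpFn g_pendingFn[DEMO_MAX_PENDING];  /* operations waiting to run  */
static int      g_pendingRank[DEMO_MAX_PENDING];
static size_t   g_pendingCount;
static int      g_launched;                     /* launches performed so far  */

/* Sample operation: only counts how many launches happened. */
static void demoRecordLaunch(int rank)
{
    (void)rank;
    ++g_launched;
}

/* Open a group: later submissions are queued instead of launched. */
static void demoGroupStart(void)
{
    ++g_depth;
}

/* Submit an operation: launched at once outside a group, queued inside one. */
static void demoSubmit(demoOpFn fn, int rank)
{
    if (g_depth == 0) {
        fn(rank);
        return;
    }
    assert(g_pendingCount < DEMO_MAX_PENDING);
    g_pendingFn[g_pendingCount] = fn;
    g_pendingRank[g_pendingCount] = rank;
    ++g_pendingCount;
}

/* Close a group: when the outermost group closes, launch everything queued. */
static void demoGroupEnd(void)
{
    assert(g_depth > 0);
    if (--g_depth > 0)
        return;
    for (size_t i = 0; i < g_pendingCount; ++i)
        g_pendingFn[i](g_pendingRank[i]);
    g_pendingCount = 0;
}
```

In this model, two `demoSubmit` calls placed between `demoGroupStart` and `demoGroupEnd` leave `g_launched` unchanged until the group closes, which mirrors how a single thread can issue the calls for several ranks without any one call having to finish before the next is issued.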

Functions:

ncclGroupStart

ncclGroupEnd

Library functions#

Functions:

ncclGetVersion

ncclGetErrorString

Types#

A few data structures are internal to the library. The pointer types to these structures are given below. Users need these types to create handles and pass them between library functions.

typedef struct ncclComm *ncclComm_t

Opaque handle to communicator.

A communicator contains the information required to carry out collective communication calls.

struct ncclUniqueId

Opaque unique id used to initialize communicators.

The ncclUniqueId must be passed to all participating ranks.

Enumerations#

This section lists the enumerations used by the library.

enum ncclResult_t

Result type.

Return codes aside from ncclSuccess indicate that a call has failed.

Values:

enumerator ncclSuccess

No error

enumerator ncclUnhandledCudaError

Unhandled HIP error

enumerator ncclSystemError

Unhandled system error

enumerator ncclInternalError

Internal Error - Please report to RCCL developers

enumerator ncclInvalidArgument

Invalid argument

enumerator ncclInvalidUsage

Invalid usage

enumerator ncclRemoteError

Remote process exited or there was a network error

enumerator ncclInProgress

RCCL operation in progress

enumerator ncclNumResults

Number of result types
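One common way to act on these return codes is to route every call through a small checking helper. The sketch below is self-contained, so it declares a stand-in enum `demoResult_t` whose enumerators follow the order listed above, and two hypothetical helpers, `demoResultString` and `demoCheck`. None of these names belong to RCCL; code that includes the RCCL header would use ncclResult_t and ncclGetErrorString instead.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for ncclResult_t so the sketch compiles without RCCL.
   Hypothetical type; enumerator order copied from the list above. */
typedef enum {
    demoSuccess,
    demoUnhandledCudaError,
    demoSystemError,
    demoInternalError,
    demoInvalidArgument,
    demoInvalidUsage,
    demoRemoteError,
    demoInProgress,
    demoNumResults
} demoResult_t;

/* Hypothetical helper: maps a result code to the description given above. */
static const char *demoResultString(demoResult_t r)
{
    static const char *const text[demoNumResults] = {
        "No error",
        "Unhandled HIP error",
        "Unhandled system error",
        "Internal Error - Please report to RCCL developers",
        "Invalid argument",
        "Invalid usage",
        "Remote process exited or there was a network error",
        "RCCL operation in progress"
    };
    if ((int)r < 0 || (int)r >= (int)demoNumResults)
        return "Unknown result code";
    return text[r];
}

/* Returns 0 when r is the success code; otherwise logs and returns -1. */
static int demoCheck(demoResult_t r, const char *what)
{
    assert(what != NULL);
    if (r == demoSuccess)
        return 0;
    fprintf(stderr, "%s failed: %s\n", what, demoResultString(r));
    return -1;
}
```

A caller would write something like `if (demoCheck(result, "communicator init") != 0) return 1;` after each call, so that a failure is reported with the name of the step that produced it.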

enum ncclRedOp_t

Reduction operation selector.

Enumeration used to specify the various reduction operations. ncclNumOps is the number of built-in ncclRedOp_t values and serves as the least possible value for dynamic ncclRedOp_t values constructed by the ncclRedOpCreate functions.

ncclMaxRedOp is the largest valid value for ncclRedOp_t and is defined to be the largest signed value (since compilers are permitted to use signed enums) that won’t grow sizeof(ncclRedOp_t) when compared to previous RCCL versions to maintain ABI compatibility.

Values:

enumerator ncclSum

Sum

enumerator ncclProd

Product

enumerator ncclMax

Max

enumerator ncclMin

Min

enumerator ncclAvg

Average

enumerator ncclNumOps

Number of built-in reduction ops

enumerator ncclMaxRedOp

Largest value for ncclRedOp_t
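As an illustration of what the five built-in operators compute, the sketch below is a host-side reference that combines element i of every rank's buffer into element i of the output. It assumes the usual element-wise reading of these names, with the average taken as the sum divided by the number of ranks. The `demoRedOp` enum and `demoReduce` function are hypothetical stand-ins, not part of RCCL, and nothing here runs on a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the built-in ncclRedOp_t values listed above.
   Hypothetical names; not taken from the RCCL header. */
typedef enum { DEMO_SUM, DEMO_PROD, DEMO_MAX, DEMO_MIN, DEMO_AVG } demoRedOp;

/* Host-side reference: bufs[r] points to rank r's `count` floats.
   out[i] receives the reduction of element i across all ranks. */
static void demoReduce(const float *const *bufs, int nranks, size_t count,
                       demoRedOp op, float *out)
{
    assert(nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        float acc = bufs[0][i];
        for (int r = 1; r < nranks; ++r) {
            float v = bufs[r][i];
            switch (op) {
            case DEMO_SUM:
            case DEMO_AVG:  acc += v; break;   /* average divides later */
            case DEMO_PROD: acc *= v; break;
            case DEMO_MAX:  if (v > acc) acc = v; break;
            case DEMO_MIN:  if (v < acc) acc = v; break;
            }
        }
        if (op == DEMO_AVG)
            acc /= (float)nranks;
        out[i] = acc;
    }
}
```

With two ranks holding {1, 8} and {3, 2}, this reference gives {4, 10} for the sum, {3, 16} for the product, {3, 8} for the maximum, {1, 2} for the minimum, and {2, 5} for the average.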

enum ncclDataType_t

Data types.

Enumeration of the supported data types.

Values:

enumerator ncclInt8
enumerator ncclChar
enumerator ncclUint8
enumerator ncclInt32
enumerator ncclInt
enumerator ncclUint32
enumerator ncclInt64
enumerator ncclUint64
enumerator ncclFloat16
enumerator ncclHalf
enumerator ncclFloat32
enumerator ncclFloat
enumerator ncclFloat64
enumerator ncclDouble
enumerator ncclBfloat16
enumerator ncclFloat8e4m3
enumerator ncclFloat8e5m2
enumerator ncclNumTypes
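When sizing buffers it helps to know how many bytes one element of each type occupies. The sketch below maps each distinct entry to an element size inferred from its name (8, 16, 32, or 64 bits), and treats ncclChar, ncclInt, ncclHalf, ncclFloat, and ncclDouble as aliases of the sized names that precede them. The `demoDataType` enum and both functions are hypothetical stand-ins, so the block compiles without the RCCL header.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in mirroring the distinct ncclDataType_t entries listed above.
   Hypothetical names; the real enum is declared in the RCCL header. */
typedef enum {
    DEMO_INT8, DEMO_UINT8, DEMO_INT32, DEMO_UINT32, DEMO_INT64, DEMO_UINT64,
    DEMO_FLOAT16, DEMO_FLOAT32, DEMO_FLOAT64, DEMO_BFLOAT16,
    DEMO_FLOAT8E4M3, DEMO_FLOAT8E5M2, DEMO_NUM_TYPES
} demoDataType;

/* Bytes per element, or 0 for a value that is not a data type. */
static size_t demoTypeSize(demoDataType t)
{
    switch (t) {
    case DEMO_INT8: case DEMO_UINT8:
    case DEMO_FLOAT8E4M3: case DEMO_FLOAT8E5M2:
        return 1;
    case DEMO_FLOAT16: case DEMO_BFLOAT16:
        return 2;
    case DEMO_INT32: case DEMO_UINT32: case DEMO_FLOAT32:
        return 4;
    case DEMO_INT64: case DEMO_UINT64: case DEMO_FLOAT64:
        return 8;
    default:
        return 0;
    }
}

/* Bytes needed to hold `count` elements of type t. */
static size_t demoBufferBytes(size_t count, demoDataType t)
{
    size_t size = demoTypeSize(t);
    assert(size != 0);  /* reject DEMO_NUM_TYPES and other non-types */
    return count * size;
}
```

For example, `demoBufferBytes(1024, DEMO_FLOAT32)` evaluates to 4096, while the same element count in `DEMO_BFLOAT16` needs 2048 bytes.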