API library#

struct ncclConfig_t#

Communicator configuration.

Users can assign values to these attributes to specify the behavior of a communicator

Public Members

size_t size#

Should not be touched

unsigned int magic#

Should not be touched

unsigned int version#

Should not be touched

int blocking#

Whether or not calls should block

int cgaClusterSize#

Cooperative group array cluster size

int minCTAs#

Minimum number of cooperative thread arrays (blocks)

int maxCTAs#

Maximum number of cooperative thread arrays (blocks)

const char *netName#

Force NCCL to use a specific network

int splitShare#

Allow communicators to share resources

int trafficClass#

Traffic class

const char *commName#

Name of the communicator

int collnetEnable#

Enable the use of collnet

int CTAPolicy#

CTA Policy

int shrinkShare#

Allow shrunk communicators to share resources with the parent communicator

int nvlsCTAs#

Number of NVLS cooperative thread arrays (blocks)
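Taken together, these fields are usually set on a struct initialized with NCCL_CONFIG_INITIALIZER (defined below) before the communicator is created. A minimal sketch, assuming an RCCL install, a valid ncclUniqueId, and illustrative field values (the header path may differ by install):

```c
#include <rccl/rccl.h>  /* header path may differ by install */

/* Sketch: customize a communicator at creation time. Fields left at their
 * NCCL_CONFIG_INITIALIZER defaults keep the built-in behavior. */
ncclResult_t init_custom(ncclComm_t *comm, int nranks, int rank, ncclUniqueId id) {
    ncclConfig_t config = NCCL_CONFIG_INITIALIZER; /* sets size/magic/version */
    config.blocking = 0;   /* nonblocking init; poll with ncclCommGetAsyncError */
    config.minCTAs  = 4;   /* bound the number of thread blocks (CTAs) used */
    config.maxCTAs  = 16;
    return ncclCommInitRankConfig(comm, nranks, rank, id, &config);
}
```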

struct ncclSimInfo_t#

Public Members

size_t size#
unsigned int magic#
unsigned int version#
float estimatedTime#
struct ncclUniqueId#

Opaque unique id used to initialize communicators.

The ncclUniqueId must be passed to all participating ranks

Public Members

char internal[NCCL_UNIQUE_ID_BYTES]#

Opaque array

file mainpage.txt
file nccl.h.in
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <limits.h>

Defines

NCCL_H_#
NCCL_MAJOR#
NCCL_MINOR#
NCCL_PATCH#
NCCL_SUFFIX#
NCCL_VERSION_CODE#
NCCL_VERSION(X, Y, Z)#
RCCL_BFLOAT16#
RCCL_FLOAT8#
RCCL_GATHER_SCATTER#
RCCL_ALLTOALLV#
RCCL_ALLREDUCE_WITH_BIAS#
NCCL_COMM_NULL#
NCCL_UNIQUE_ID_BYTES#
NCCL_CONFIG_UNDEF_INT#
NCCL_CONFIG_UNDEF_PTR#
NCCL_SPLIT_NOCOLOR#
NCCL_UNDEF_FLOAT#
NCCL_WIN_DEFAULT#
NCCL_WIN_COLL_SYMMETRIC#
NCCL_CTA_POLICY_DEFAULT#
NCCL_CTA_POLICY_EFFICIENCY#
NCCL_SHRINK_DEFAULT#
NCCL_SHRINK_ABORT#
NCCL_SIM_INFO_INITIALIZER#

Typedefs

typedef struct ncclComm *ncclComm_t#

Opaque handle to communicator.

A communicator contains information required to facilitate collective communication calls

typedef struct ncclWindow *ncclWindow_t#

Functions

ncclResult_t ncclMemAlloc(void **ptr, size_t size)#
ncclResult_t pncclMemAlloc(void **ptr, size_t size)#
ncclResult_t ncclMemFree(void *ptr)#
ncclResult_t pncclMemFree(void *ptr)#
ncclResult_t ncclCommShrink(ncclComm_t comm, int *excludeRanksList, int excludeRanksCount, ncclComm_t *newcomm, ncclConfig_t *config, int shrinkFlags)#

Shrink existing communicator.

Ranks in excludeRanksList will be removed from the existing communicator. Within the new communicator, ranks are re-ordered to fill the gaps left by the removed ones. If config is NULL, the new communicator inherits the original communicator’s configuration. The shrinkFlags argument lets NCCL adapt to various states of the parent communicator; see the NCCL_SHRINK flags.

Parameters:
  • comm[in] Original communicator object for this rank

  • excludeRanksList[in] List of ranks to be excluded

  • excludeRanksCount[in] Number of ranks to be excluded

  • newcomm[out] Pointer to new communicator

  • config[in] Configuration for the new communicator. May be NULL to inherit the configuration from comm

  • shrinkFlags[in] Flag to adapt to various states of the parent communicator (see NCCL_SHRINK flags)

Returns:

Result code. See Result Codes for more details.
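For example, a fault-tolerance layer might drop a single failed rank and continue on a smaller communicator. A hedged sketch (the helper name is illustrative, the header path may differ by install, and error handling is elided):

```c
#include <rccl/rccl.h>  /* header path may differ by install */

/* Sketch: exclude one rank from comm. Passing NULL for config inherits the
 * parent's configuration; NCCL_SHRINK_DEFAULT assumes a healthy parent,
 * while NCCL_SHRINK_ABORT also handles a parent with aborted operations. */
ncclResult_t drop_rank(ncclComm_t comm, int failedRank, ncclComm_t *newcomm) {
    int exclude[1] = { failedRank };
    return ncclCommShrink(comm, exclude, 1, newcomm, NULL, NCCL_SHRINK_DEFAULT);
}
```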

ncclResult_t pncclCommShrink(ncclComm_t comm, int *excludeRanksList, int excludeRanksCount, ncclComm_t *newcomm, ncclConfig_t *config, int shrinkFlags)#
ncclResult_t ncclCommInitRankScalable(ncclComm_t *newcomm, int nranks, int myrank, int nId, ncclUniqueId *commIds, ncclConfig_t *config)#

Creates a new communicator (multi thread/process version), similar to ncclCommInitRankConfig.

Allows the use of more than one ncclUniqueId (up to one per rank), as indicated by nId, to accelerate the init operation. The number of ncclUniqueIds and their order must be the same for every rank.

Parameters:
  • newcomm[out] Pointer to new communicator

  • nranks[in] Total number of ranks participating in this communicator

  • myrank[in] Current rank

  • nId[in] Number of unique IDs

  • commIds[in] List of unique IDs

  • config[in] Configuration for the new communicator. May be NULL for the default configuration

Returns:

Result code. See Result Codes for more details.
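A sketch of the calling pattern (it assumes the ncclUniqueIds were generated by designated ranks and distributed to all participants, e.g. over MPI; every rank must pass the same IDs in the same order, and the header path may differ by install):

```c
#include <rccl/rccl.h>  /* header path may differ by install */

/* Sketch: initialize with several unique IDs to speed up bootstrap at scale.
 * Passing NULL for config uses the default communicator configuration. */
ncclResult_t init_scalable(ncclComm_t *comm, int nranks, int myrank,
                           ncclUniqueId *ids, int nIds) {
    return ncclCommInitRankScalable(comm, nranks, myrank, nIds, ids, NULL);
}
```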

ncclResult_t ncclCommRegister(const ncclComm_t comm, void *buff, size_t size, void **handle)#
ncclResult_t ncclCommDeregister(const ncclComm_t comm, void *handle)#
ncclResult_t ncclCommWindowRegister(ncclComm_t comm, void *buff, size_t size, ncclWindow_t *win, int winFlags)#
ncclResult_t ncclCommWindowDeregister(ncclComm_t comm, ncclWindow_t win)#
ncclResult_t ncclGroupSimulateEnd(ncclSimInfo_t *simInfo)#
ncclResult_t pncclGroupSimulateEnd(ncclSimInfo_t *simInfo)#
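The allocation and registration calls above are typically paired: a buffer allocated with ncclMemAlloc and registered against a communicator can be used by that communicator's collectives. A hedged sketch (header path may differ by install; error handling elided; assumes an initialized communicator):

```c
#include <stddef.h>
#include <rccl/rccl.h>  /* header path may differ by install */

/* Sketch: allocate, register, use, then tear down in reverse order. */
void with_registered_buffer(ncclComm_t comm, size_t bytes) {
    void *buf = NULL, *handle = NULL;
    ncclMemAlloc(&buf, bytes);                    /* library-managed allocation */
    ncclCommRegister(comm, buf, bytes, &handle);  /* register for this comm */
    /* ... launch collectives that read/write buf ... */
    ncclCommDeregister(comm, handle);
    ncclMemFree(buf);
}
```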
group Result Codes

The various result codes that RCCL API calls may return

Enums

enum ncclResult_t#

Result type.

Return codes aside from ncclSuccess indicate that a call has failed

Values:

enumerator ncclSuccess#

No error

enumerator ncclUnhandledCudaError#

Unhandled HIP error

enumerator ncclSystemError#

Unhandled system error

enumerator ncclInternalError#

Internal Error - Please report to RCCL developers

enumerator ncclInvalidArgument#

Invalid argument

enumerator ncclInvalidUsage#

Invalid usage

enumerator ncclRemoteError#

Remote process exited or there was a network error

enumerator ncclInProgress#

RCCL operation in progress

enumerator ncclNumResults#

Number of result types

group Communicator Configuration

Structure that allows for customizing Communicator behavior via ncclCommInitRankConfig

Defines

NCCL_CONFIG_INITIALIZER#
group Version Information

API call that returns RCCL version

group Communicator Initialization/Destruction

API calls that operate on communicators. Communicator objects are used to launch collective communication operations. Unique ranks between 0 and N-1 must be assigned to each HIP device participating in the same Communicator. Using the same HIP device for multiple ranks of the same Communicator is not supported at this time.

group Error Checking Calls

API calls that check for errors

group Communicator Information

API calls that query communicator information

group API Enumerations

Enumerations used by collective communication calls

Enums

enum ncclRedOp_dummy_t#

Dummy reduction enumeration.

Dummy reduction enumeration used to determine value for ncclMaxRedOp

Values:

enumerator ncclNumOps_dummy#
enum ncclRedOp_t#

Reduction operation selector.

Enumeration used to specify the various reduction operations. ncclNumOps is the number of built-in ncclRedOp_t values and serves as the lowest possible value for dynamic ncclRedOp_t values constructed by the ncclRedOpCreate functions.

ncclMaxRedOp is the largest valid value for ncclRedOp_t. It is defined to be the largest signed value (since compilers are permitted to use signed enums) that won’t grow sizeof(ncclRedOp_t) compared to previous RCCL versions, in order to maintain ABI compatibility.

Values:

enumerator ncclSum#

Sum

enumerator ncclProd#

Product

enumerator ncclMax#

Max

enumerator ncclMin#

Min

enumerator ncclAvg#

Average

enumerator ncclNumOps#

Number of built-in reduction ops

enumerator ncclMaxRedOp#

Largest value for ncclRedOp_t

enum ncclDataType_t#

Data types.

Enumeration of the various supported datatypes

Values:

enumerator ncclInt8#
enumerator ncclChar#
enumerator ncclUint8#
enumerator ncclInt32#
enumerator ncclInt#
enumerator ncclUint32#
enumerator ncclInt64#
enumerator ncclUint64#
enumerator ncclFloat16#
enumerator ncclHalf#
enumerator ncclFloat32#
enumerator ncclFloat#
enumerator ncclFloat64#
enumerator ncclDouble#
enumerator ncclBfloat16#
enumerator ncclFloat8e4m3#
enumerator ncclFloat8e5m2#
enumerator ncclNumTypes#
group Custom Reduction Operator

API calls relating to creating and destroying a custom reduction operator that pre-multiplies local source arrays prior to reduction

Enums

enum ncclScalarResidence_t#

Location and dereferencing logic for scalar arguments.

Enumeration specifying memory location of the scalar argument. Based on where the value is stored, the argument will be dereferenced either while the collective is running (if in device memory), or before the ncclRedOpCreate() function returns (if in host memory).

Values:

enumerator ncclScalarDevice#

Scalar is in device-visible memory

enumerator ncclScalarHostImmediate#

Scalar is in host-visible memory
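For example, ncclRedOpCreatePreMulSum uses this enumeration to decide when to read its scalar argument. A hedged sketch (assumes an initialized communicator, a device buffer, and a stream; header path may differ by install; error handling elided):

```c
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>  /* header path may differ by install */

/* Sketch: all-reduce that pre-multiplies local inputs by 0.5 before summing.
 * ncclScalarHostImmediate: the scalar is read before the create call returns,
 * so the host variable can be reused immediately afterwards. */
void scaled_sum(ncclComm_t comm, float *buf, size_t count, hipStream_t stream) {
    float scale = 0.5f;
    ncclRedOp_t op;
    ncclRedOpCreatePreMulSum(&op, &scale, ncclFloat32, ncclScalarHostImmediate, comm);
    ncclAllReduce(buf, buf, count, ncclFloat32, op, comm, stream);
    ncclRedOpDestroy(op, comm);  /* free the dynamic op once no new calls need it */
}
```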

group Collective Communication Operations

Collective communication operations must be called separately for each communicator in a communicator clique.

They return when the operations have been enqueued on the HIP stream. Since they may perform inter-CPU synchronization, each call must be made from a different thread or process, or must use Group Semantics (see below).
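A minimal sketch of one such call (assumes the communicator, device buffers, and stream were set up earlier; header path may differ by install; error handling elided):

```c
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>  /* header path may differ by install */

/* Sketch: sum `count` floats across all ranks. The call returns once the
 * operation is enqueued; synchronize the stream to know it completed. */
void sum_across_ranks(ncclComm_t comm, const float *sendbuf, float *recvbuf,
                      size_t count, hipStream_t stream) {
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat32, ncclSum, comm, stream);
    hipStreamSynchronize(stream);
}
```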

group MSCCL Algorithm

API calls relating to the optional MSCCL algorithm datapath

Typedefs

typedef int mscclAlgoHandle_t#

Opaque handle to MSCCL algorithm.

group Group semantics

When managing multiple GPUs from a single thread, and since RCCL collective calls may perform inter-CPU synchronization, we need to “group” calls for different ranks/devices into a single call.

Grouping RCCL calls as being part of the same collective operation is done using ncclGroupStart and ncclGroupEnd. ncclGroupStart will enqueue all collective calls until the ncclGroupEnd call, which will wait for all calls to be complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations are enqueued on the streams, not that the operation is effectively done.

Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd, but not together.

Group semantics also make it possible to fuse multiple operations on the same device to improve performance (for aggregated collective calls), or to permit concurrent progress of multiple send/receive operations.
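The pattern above can be sketched for one thread driving several GPUs (assumes per-device communicators, buffers, and streams already exist; header path may differ by install; error handling and device selection elided):

```c
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>  /* header path may differ by install */

/* Sketch: group per-device all-reduce calls so a single thread can drive
 * multiple ranks without deadlocking on inter-CPU synchronization. */
void grouped_allreduce(ncclComm_t *comms, float **send, float **recv,
                       size_t count, hipStream_t *streams, int ndev) {
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(send[i], recv[i], count, ncclFloat32, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();  /* operations are enqueued, not necessarily finished */
    for (int i = 0; i < ndev; ++i)
        hipStreamSynchronize(streams[i]);
}
```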

page Deprecated List

Global mscclLoadAlgo  (const char *mscclAlgoFilePath, mscclAlgoHandle_t *mscclAlgoHandle, int rank)

This function has been removed from the public API.

Global mscclRunAlgo  (const void *sendBuff, const size_t sendCounts[], const size_t sDisPls[], void *recvBuff, const size_t recvCounts[], const size_t rDisPls[], size_t count, ncclDataType_t dataType, int root, int peer, ncclRedOp_t op, mscclAlgoHandle_t mscclAlgoHandle, ncclComm_t comm, hipStream_t stream)

This function has been removed from the public API.

Global mscclUnloadAlgo  (mscclAlgoHandle_t mscclAlgoHandle)

This function has been removed from the public API.

dir src
page RCCL Documentation

Introduction#

RCCL (pronounced “Rickle”) is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. There is also initial support for direct GPU-to-GPU send and receive operations. It has been optimized to achieve high bandwidth on platforms using PCIe and xGMI, as well as networks using InfiniBand Verbs or TCP/IP sockets. RCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency. For best performance, small operations can be either batched into larger operations or aggregated through the API.

RCCL API Contents#

RCCL API File#