How to run NVSHMEM with Slurm

Question

I’m getting started with NVSHMEM and wanted to begin with a simple example, but I can’t get it to work. This is the program (test.cu):

#include <nvshmem.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    // Initialize the NVSHMEM library
    nvshmem_init();

    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    fprintf(stdout, "PE %d of %d has started ...\n", mype, npes);

    // Finalize the NVSHMEM library
    nvshmem_finalize();

    return 0;
}

I compile and run it with the following sbatch file:

#!/bin/bash -l
#SBATCH --nodes=2                          # number of nodes
#SBATCH --ntasks=8                         # number of tasks
#SBATCH --ntasks-per-node=4                # number of tasks per node
#SBATCH --gpus-per-task=1                  # number of gpu per task
#SBATCH --cpus-per-task=1                  # number of cores per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=gpu                    # partition
#SBATCH --account=p200301                  # project account
#SBATCH --qos=default                      # SLURM qos

module load NCCL OpenMPI CUDA NVSHMEM && \
nvcc -rdc=true -ccbin g++ -I $NVSHMEM_HOME/include test.cu -o test \
    -L $NVSHMEM_HOME/lib -lnvshmem_host -lnvshmem_device -lucs -lucp && \
srun -n 8 ./test

The expected output would be something like:

PE 0 of 8 has started ...
PE 1 of 8 has started ...
PE 2 of 8 has started ...
.....

Instead the output I get is:

PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...

I think I am missing something crucial but simple. Can somebody enlighten me?
