This work presents the design and implementation of a multi-GPU concurrent FIFO queue built on MPI, CUDA, and NVSHMEM. The Bellman-Ford algorithm serves as a case study for evaluating the queue, and this multi-GPU implementation is the first known instance of its kind. Experimental results show that the multi-GPU queue achieves a maximum speedup of 3.92× and an average speedup of 3.04× over the single-GPU baseline on four NVIDIA A100 GPUs. Applied to the Bellman-Ford algorithm, the multi-GPU system achieves a maximum speedup of 3.03× and an average speedup of 2.65× over the single-GPU implementation, measured on 10 graphs of different kinds taken from the SuiteSparse Matrix Collection.
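To illustrate why Bellman-Ford is a natural workload for a FIFO queue, the following is a minimal sequential sketch of the queue-driven variant of the algorithm, in which only vertices whose distance has just improved are enqueued for further relaxation. It is not the implementation evaluated here: it runs on a single CPU thread, uses `std::queue` in place of a concurrent multi-GPU queue, and assumes the graph contains no negative-weight cycle. The names `Edge` and `bellman_ford_queue` are hypothetical and chosen for illustration only.

```cpp
#include <limits>
#include <queue>
#include <vector>

// Hypothetical edge type for this sketch: target vertex and edge weight.
struct Edge {
    int to;
    long long weight;
};

// Queue-driven Bellman-Ford (sequential sketch).
// Assumption: the graph has no negative-weight cycle; otherwise the loop
// below would not terminate.
std::vector<long long> bellman_ford_queue(
    const std::vector<std::vector<Edge>>& adj, int source) {
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(adj.size(), INF);
    std::vector<bool> in_queue(adj.size(), false);
    std::queue<int> work;  // stand-in for the concurrent FIFO queue

    dist[source] = 0;
    work.push(source);
    in_queue[source] = true;

    while (!work.empty()) {
        int u = work.front();
        work.pop();
        in_queue[u] = false;

        // dist[u] is always finite here, because a vertex is enqueued
        // only after its distance has been set.
        for (const Edge& e : adj[u]) {
            long long candidate = dist[u] + e.weight;
            if (candidate < dist[e.to]) {
                dist[e.to] = candidate;
                // Enqueue the improved vertex unless it is already pending.
                if (!in_queue[e.to]) {
                    work.push(e.to);
                    in_queue[e.to] = true;
                }
            }
        }
    }
    return dist;  // unreachable vertices keep the value INF
}
```

In a parallel setting, many threads would dequeue and enqueue vertices at the same time, so the throughput of the queue itself becomes a limiting factor for the algorithm. This is the reason the two sets of speedups (queue alone and Bellman-Ford) are reported separately.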