High-Performance Deep Learning: Integrating OpenMP, MPI and CUDA
Monali B. Suthar *
Department of Computer Engineering, Silver Oak University, Ahmedabad, Gujarat, India.
Satvik V. Khara
Department of Computer Engineering, Silver Oak University, Ahmedabad, Gujarat, India.
Gaurav D. Tivari
Department of Computer Engineering, Silver Oak University, Ahmedabad, Gujarat, India.
*Author to whom correspondence should be addressed.
Abstract
The proliferation of deep learning algorithms in areas such as computer vision, cybersecurity and the analysis of big data from the Internet of Things (IoT) has sharply increased computational and memory requirements, making high-performance computing (HPC) infrastructure essential. This paper examines hybrid parallel programming strategies that combine OpenMP, MPI and CUDA to accelerate deep learning workloads. An experimental setup is devised to evaluate shared-memory parallelism (OpenMP), distributed-memory parallelism (MPI), GPU computing (CUDA) and a hybrid MPI-CUDA configuration on an HPC system. The workload for the experiments is convolutional neural network (CNN) training on a multi-core cluster with GPU support. The experiments show that OpenMP offers efficient intra-node parallelisation but cannot scale beyond a single shared-memory environment. MPI scales across distributed nodes, but its communication cost rises as the number of nodes grows, which reduces efficiency. CUDA achieves significant speedups on computationally intensive tasks but, on its own, does not scale efficiently across multiple nodes. The hybrid MPI-CUDA framework performs best among the configurations tested: it offers the most favourable trade-off between computation and communication, which yields better scalability, higher efficiency and shorter training times. Including OpenMP in the framework improves coordination between the CPU and the GPU.
Keywords: deep learning, high-performance computing (HPC), hybrid parallel programming, OpenMP, MPI, CUDA, GPU acceleration, distributed training, scalability analysis, performance evaluation