High-Performance Deep Learning: Integrating OpenMP, MPI and CUDA

Monali B. Suthar; Satvik V. Khara; Gaurav D. Tivari

doi:10.9734/bpi/nhstc/v10/7584

Page: 97-108

Monali B. Suthar *

Department of Computer Engineering, Silver Oak University, Ahmedabad, Gujarat, India.

Satvik V. Khara

Department of Computer Engineering, Silver Oak University, Ahmedabad, Gujarat, India.

Gaurav D. Tivari

Department of Computer Engineering, Silver Oak University, Ahmedabad, Gujarat, India.

*Author to whom correspondence should be addressed.

Abstract

The proliferation of deep learning algorithms in areas such as computer vision, cybersecurity and Internet of Things (IoT) big data analytics has sharply increased computational and memory requirements, making high-performance computing (HPC) infrastructure essential. This paper examines the performance of hybrid parallel programming strategies that combine OpenMP, MPI and CUDA to accelerate deep learning workloads. An experimental setup is devised to test shared-memory parallelism (OpenMP), distributed-memory parallelism (MPI), GPU computing (CUDA) and a hybrid MPI-CUDA configuration on an HPC system. The workload is CNN training on a multi-core cluster with GPU support. The experiments show that OpenMP provides efficient intra-node parallelisation but cannot scale beyond a single shared-memory node. MPI scales across distributed nodes, but its communication cost rises as the number of nodes grows, which reduces efficiency. CUDA achieves significant speedups on computationally intensive tasks but does not, on its own, scale efficiently across multiple nodes. The hybrid MPI-CUDA configuration performs best: it offers the most favourable trade-off between computation and communication, yielding better scalability and efficiency and shorter training times. Adding OpenMP to the framework further improves coordination between the CPU and the GPU.
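
The scalability and efficiency comparisons summarised above are commonly quantified with speedup and parallel efficiency. The following is a minimal sketch under a simple cost model; the symbols $T_{\text{comp}}$, $T_{\text{comm}}(p)$ and the model itself are illustrative assumptions and are not taken from the chapter. Here $p$ is the number of workers (nodes or GPUs), and communication is assumed to be absent for a single worker, so $T(1) = T_{\text{comp}}$.

```latex
\begin{align}
  T(p) &= \frac{T_{\text{comp}}}{p} + T_{\text{comm}}(p)
        && \text{(assumed training time on $p$ workers)} \\
  S(p) &= \frac{T(1)}{T(p)}
        && \text{(speedup)} \\
  E(p) &= \frac{S(p)}{p}
        = \frac{T_{\text{comp}}}{T_{\text{comp}} + p\,T_{\text{comm}}(p)}
        && \text{(parallel efficiency)}
\end{align}
```

Under this assumed model, efficiency stays close to 1 only while $p\,T_{\text{comm}}(p)$ remains small relative to $T_{\text{comp}}$, which is one way to read the observation that MPI efficiency drops as communication cost grows with the node count.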

Keywords: deep learning, high-performance computing (HPC), hybrid parallel programming, OpenMP, MPI, CUDA, GPU acceleration, distributed training, scalability analysis, performance evaluation

How to Cite

Suthar, M. B., Khara, S. V., & Tivari, G. D. (2026). High-Performance Deep Learning: Integrating OpenMP, MPI and CUDA. New Horizons of Science, Technology and Culture Vol. 10, 97–108. https://doi.org/10.9734/bpi/nhstc/v10/7584