Received February 24, 2018, accepted March 28, 2018, date of publication April 9, 2018, date of current version May 16, 2018.
Digital Object Identifier 10.1109/ACCESS.2018.2823299

Performance and Power Efficient Massive Parallel Computational Model for HPC Heterogeneous Exascale Systems

M. USMAN ASHRAF, FATHY ALBURAEI EASSA, AIIAD AHMAD ALBESHRI, AND ABDULLAH ALGARNI
Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

Corresponding author: M. Usman Ashraf ([email protected])


This work was supported in part by the Deanship of Scientific Research at King Abdulaziz University, Jeddah, under Grant RG-3-611-38.

ABSTRACT The emerging high-performance computing (HPC) Exascale supercomputing system, anticipated to be available around 2020, will help unravel many scientific mysteries. This extraordinary system is expected to deliver a thousand-fold increase in computing power over the current Petascale systems. The prospective platform will lead development communities and researchers from conventional homogeneous architectures toward heterogeneous architectures that combine energy-efficient GPU devices with traditional CPUs. In trying to reach ExaFlops performance on such an Ultrascale system, present technologies face several challenges. Massive parallelism is one of them, and it requires a novel, low-power parallel programming approach to attain massive performance. This paper introduces a new parallel programming model that achieves massive parallelism by combining coarse-grained and fine-grained parallelism over inter-node and intra-node computation, respectively. The proposed framework is a tri-hybrid of MPI, OpenMP, and the compute unified device architecture (MOC) that computes input data over a heterogeneous framework. We implemented the proposed model in a linear algebraic dense matrix multiplication application and compared the quantified metrics with well-known basic linear algebra subroutine libraries, namely the CUDA Basic Linear Algebra Subroutines library (cuBLAS) and the KAUST Basic Linear Algebra Subprograms (KBLAS). MOC outperformed all the implemented methods and achieved massive performance while consuming less power. The proposed MOC approach can be considered an initial and leading model for emerging Exascale computing systems.

INDEX TERMS Exascale computing, HPC, massive parallelism, super computing, energy efficiency, hybrid programming, CUDA, OpenMP, MPI.

I. INTRODUCTION
High Performance Computing (HPC) generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business [1], [2]. Supercomputing and parallel computing are terms closely related to HPC.

The fundamental idea behind HPC is that if, for instance, a single computer takes 100 hours to complete a job, the same task can be solved in 1 hour by using 100 computers. A single node in a supercomputer may not be especially powerful on its own, but it can be when all the resources are used together.

Initially, the utilization of HPC was limited to a few particular applications, such as simple simulations, engineering, and oil and gas, owing to the massive cost of HPC systems. Nowadays, however, HPC is being utilized in various areas such as data mining, social media services, education, and industry. In the recent past, many HPC applications, such as climate and environmental modelling, computational fluid dynamics (CFD), molecular nanotechnology [3], intelligent planetary spacecraft [4], and many other big data applications, have demanded an extremely powerful computing system. HPC pioneers and researchers assert that the emerging supercomputing systems, ''Exascale systems'', will be presented by the start of the next decade [5], [6]. This heterogeneous-architecture-based HPC supercomputing framework will give a thousand-fold increase in performance over existing Petascale systems.
Such a massively performing HPC system will enable the unscrambling of many scientific mysteries by achieving an ExaFlops number of calculations within seconds [7]. According to the development trajectory toward Exascale computing systems, it is anticipated that such a system will be comprised of a huge number of heterogeneous nodes, where each node will be configured with conventional multi-core CPUs and many-core accelerated GPU devices [8], [9]. The rising heterogeneity of HPC systems is leading to increasingly complex platforms at a time when the demand for greater computing power continues to expand.

TABLE 1. Exascale computing challenges.

For leading supercomputing systems, the key challenge is the massive power consumption during HPC data processing. Current HPC supercomputing architectures host on the order of 10 million cores and consume about 25 to 60 MW of power. Applying this power-consumption ratio, Exascale computing systems would demand up to 130 MW to achieve ExaFlops, which is not affordable for most countries. HPC pioneers such as the United States Department of Energy (US DoE), Intel, IBM and AMD have therefore defined limits for Exascale systems: energy consumption ≈ 25-30 MW, capital cost ≈ 200 M USD, targeted time of delivery ≈ 2020-2022, and up to 100 million system cores [10]. It is very challenging for current HPC systems to achieve ExaFlops-level performance under these hard limitations. A key element of the strategy going forward is the co-design of applications, programming environments, frameworks and architectures. In addition, hardware breakthroughs under the power-consumption limitations are also required.

A. EXASCALE CHALLENGES
The primary challenge for the Exascale system is that it does not exist yet. However, in trying to achieve ExaFlops-level performance under the defined strict limitations, current technologies face several fundamental challenges. At a broad level, these challenges can be categorized into the themes listed in Table 1 [6].

One conventional approach to upgrading an HPC system is to enhance the clock speed. Owing to extraordinary heat dissipation, clock speed will remain pinned at roughly 1 GHz, so the alternative approach of increasing the number of cores will be adopted [12]. According to the restrictions indicated above for Exascale systems, we cannot increase the number of cores beyond one hundred million. Ultimately, increasing the number of cores can provide the targeted performance level, but only with massive power consumption. An alternative solution is ''massive parallelism'', which requires improving the programming environment. According to Jacobsen et al. [13], the performance of multi-level parallelism in a tri-hierarchy model can be promising for Exascale computing systems. However, it should be implemented and investigated on larger clusters with more than two GPUs and with different domain decomposition strategies.

This paper introduces a hybrid of MPI+OMP+CUDA (MOC), a new massive parallel programming model for large-scale heterogeneous cluster systems. In this article, we adopt the term tri-hybrid for ''MPI+OMP+CUDA'', abbreviated as ''MOC''. MOC provides three levels of parallelism, namely coarse-grain, fine-grain and finer-granularity parallelism, by computing data over inter-node, intra-node and accelerated NVIDIA GPU devices, respectively. Building over previous work, HPC pioneers introduced cuBLAS (the CUDA Basic Linear Algebra Subroutines library) [14] and KBLAS (the KAUST Basic Linear Algebra Subprograms) [15], which expose more performance-tuning knobs and can maintain decent throughput across previous and current accelerated GPU hardware generations without code rewriting.
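For orientation, a call into one of these libraries looks roughly as follows. This is a minimal, hedged sketch of a double-precision matrix multiply through the cuBLAS interface; the matrix dimension n and the device pointers d_A, d_B and d_C are placeholders, and error handling is omitted.

#include <cublas_v2.h>

/* Sketch: C = alpha * A * B + beta * C on one GPU via cuBLAS (column-major). */
void dgemm_cublas(int n, const double *d_A, const double *d_B, double *d_C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);                           /* create a cuBLAS context */
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,    /* no transposition        */
                n, n, n,                             /* square n x n matrices   */
                &alpha, d_A, n, d_B, n,              /* inputs, leading dim n   */
                &beta, d_C, n);                      /* output matrix           */
    cublasDestroy(handle);                           /* release the context     */
}

KBLAS exposes a similar handle-based interface; such library calls are the baselines that MOC is compared against later in the paper.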
Although these models provides a decent performance cores more than one hundred million. Ultimately, increase but cannot be considered for emerging Exascale computing in number cores can provide targeted performance level but systems due to massive power consumption. As discussed with massive power consumption. An alternate solution is above the characteristics of Exascale systems that will deal ‘massive parallelism’, which required improving the pro- with big data HPC applications. However, with respect to gramming environment. According to Jacobsen et al. [13], two fundamental HPC metrics ‘‘performance and power con- the performance of multi-level parallelism in tri-hierarchy sumption’’, MOC outperformed to existing state-of-the-art model can be promising for Exascale computing systems. on larger dataset computations. MOC attains asymptotically However, it should be implemented and investigated on larger up to 30% and 40% speedup against the best implementa- clusters with more than two GPUs and different domain tions on heterogeneous multiprocessor CPUs and accelerated decomposition strategies. NVIDA GPUs. This paper introduces hybrid of MPI+OMP+CUDA Further, the paper is organized in such way that section II (MOC), a new massive parallel programming model for carried out a detailed background of parallel program- largescale heterogeneous cluster systems. In this article, ming models used in MOC proposed model. In addition, 23096 VOLUME 6, 2018 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model we have accomplished a comprehensive literature review program execution. According to latest build OpenMP 4.5, of these models with different hierarchies and reported the OMP also facilitate the programmer to run application code HPC facing challenges for today and future Exascale com- on accelerated GPU devices. Along with GPU computation, puting systems. Section III highlights our contribution in synchronization between host CPUs and GPU cores was current study. Section IV demonstrates the MOC model improved that is capable to run multiple task in a group in detail and presents the features and functionalities of format using ‘taskgroup’ construct. In addition, load balanc- inter-node, intra-node and GPU computations in MOC. ing in during loop parallelization was also improved using Section V describes the experimental setup including plat- ‘taskloop’ directive [22]. Although OpenMP has become a form, HPC metrics and measurement mechanism of these very famous model to deal parallel programming merely metrics. Furthermore, section VI shows the experimental it is implementable only for single node architectures but results of MOC and compared with existing state-of-the-art cannot be used for cluster systems having multiple nodes. implementations. Section VII describes the critical parame- By future perspectives, it has been observed that OpenMP can ters necessary to tune the MOC model for Exascale comput- be used in hybrid with MPI for future comping systems where ing systems. In addition, this section presents an anticipated OpenMP will perform intra-node execution. configuration model for Exascale system and we conclude in section VIII. 3) CUDA In the light of accelerated NVIDIA GPU programming, II. 
BACKGROUND MATERIAL AND LITERATURE REVIEW NVIDIA presented Compute Unified Device Architec- We now give a brief background of parallel programming ture (CUDA) a remarkable model that achieve massive par- models including MPI, OpenMP and CUDA that has been allelism by running user input data over accelerated GPUs used in MOC model followed by a comprehensive literature cores. CUDA architecture is also available in C++ and review in section 2.2. FORTRAN programming languages [23]. The recent stable CUDA release 8.0 introduced a new optimization schemes that improve the performance in the system. Accordingly, A. TECHNOLOGY BACKGROUND we can make grid and block level optimization by creating 1) MPI multiple stream-processors for each SM on GPU. Further MPI is a notable autonomous library that has been utilized new CUDA release 9.1 was presented that contained the for correspondence among the explicit procedures in dis- new profiling mechanisms. This new build was supportive tributed framework. Basically, the standard version MPI was for multiple GPU architectures including pascal and lambda introduced in 1994 [16]. Later on, number of modifications compilers [24]. In CUDA model, sequential code is paral- were in made to provide new features in different versions. lelized by executing through CUDA kernel. In recent past, some challenges including environmental According to basic structure of CUDA programming, layouts, message passing in heterogeneous cluster systems, before calling CUDA kernel, some pre-processing are per- blocking/non-blocking data distribution and receiving were formed where firstly memory allocation is performed for addressed in MPI 3.0 version [17]. Although the original GPU devices with equal number of variables used at host side. development in MPI was not for Exascale consideration but Further, data is transferred from host to GPU devices using a progressive improvement made it the promising consider- particular methods provided by CUDA. Once data transfor- ation for emerging HPC systems. In the light of Exascale mation is confirmed, CUDA kernel is called for GPU compu- computing systems, it still requires several considerations tation. At this stage, we can use multiple CUDA kernels to run such as low power consuming strategies in message pass- over multiple accelerated NVIDIA GPU devices. A detailed ing among the heterogeneous cores, synchronization han- overview has been presented in figure 1 [25]. dling in non-blocking strategy and memory management mechanisms [18], [19]. B. LITERATURE REVIEW Parallelism has brought about a great revolution to enhance 2) OPENMP the performance in the computer. Parallelism was intro- Open Specification for Multi-Processing (OpenMP) single duced firstly in the 90s and still being explored to deal instruction multiple data (SIMD) based a new model was targeted Exascale computing systems. The primary objective introduced for CPU thread level parallelism in 1997 [20]. of Exascale systems is to deal big data HPC applications OpenMP parallelize the code over CPU threads using differ- such as climate and environmental modelling, computation ent directives, library routines and clauses. Throughout the fluid dynamics (CFD), molecular nanotechnology, intelligent OpenMP development, these shared memory standards are planetary spacecraft and many other applications that are available in C++ and FORTRAN languages. In OpenMP 4.0 required to run on HPC systems. In order to deal these HPC version, a number of new features were introduced. 
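The CUDA processing flow summarized above and depicted in Figure 1 (allocate device memory, copy inputs from the host to the GPU, launch the kernel, and copy the results back) can be captured in a short sketch. The following is a generic illustration rather than the MOC code itself; the kernel scale and its launch configuration are placeholder choices.

#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)           /* placeholder kernel        */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;         /* global thread index       */
    if (i < n) x[i] *= a;
}

void run_on_gpu(float *h_x, int n)
{
    float *d_x;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_x, bytes);                               /* 1. allocate device memory */
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   /* 2. copy host -> device    */
    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, n, 2.0f);              /* 3. launch the kernel      */
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   /* 4. copy device -> host    */
    cudaFree(d_x);                                         /* 5. free device memory     */
}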
These applications, a variety of PPMs such as High-Performance features includes ‘‘new atomic operations’’ that are used FORTRAN (HPF) [26] and an explicit message-passing inter- for fine grain synchronization [21]. It also contained many face (MPI) were introduced to attain TFlops. Terascale com- tasks extensions and error handling clauses that maintain the puting systems were based on coarse-grained parallelism that VOLUME 6, 2018 23097 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model FIGURE 2. Hierarchy Navigation in the Programming Model. system in the spectral element method and performing effi- cient computations using multiple threads. Since a decade ago, a sensational change happened in hardware advancements in IT. Many-cores architecture based new energy-efficient and powerful devices has been introduced such as Graphical Processing Unit (GPU) by NVIDIA [33], Many Integrated Cores (MIC) by Intel [36]. FIGURE 1. Processing flow on CUDA. These GPUs devices were also introduced by different pioneers such as ADM-GPU [34], ARM [35]. These Single- Instruction Multiple-Data (SIMD)-architecture-based many- was accomplished at the inter-node level through single- cores devices contained thousands of cores in single chip that hierarchy models such as MPI. To increase the performance are much powerful than conventional CPUs. The legacy GPU in Terascale systems, many new approaches were intro- devices were applicable only for graphical processing but new duced, such as pure parallelism, in situ processing [27], and GPU models called General-Purpose Graphical Processing out-of-core and multi-resolution techniques. However, pure Unit (GPGPU) are available to compute general purpose data parallelism was conceived as a suitable paradigm. These sug- processing in HPC applications along with graphical com- gested models were not able to address the challenges of the putation. These GPGPUs can be programmed using several higher-order CFD applications and required thread-level par- accelerated programming models such as OpenCL [37], [38], allel computing in a cluster system. Later on, dual-hierarchy OpenACC [39], CUDA and OpenMP that can program GPUs. model (MPI + X) was introduced for Petascale supercomput- According to different experiences and consequences, ing systems [28]. The objective of the Petascale system was CUDA is most promising model for accelerated program- to achieve both coarse-grained and fine-grained parallelism ming which is optimizable at thread level. through inter-node and intra-node processing. Therefore, The previous hybrid models could deal only homogenous a hybrid model of MPI (to parallelize data at the inter-node systems but not heterogeneous cluster systems. However level) and OpenMP (to parallelize at the intra-node level) was by software perspectives, new hybrid programming models proposed by Dong et al. [29]. This hybrid technique enhanced were required that could utilize these energy-efficient accel- the performance in solving HPC applications. The hybrid erated devices along with traditional CPUs [40], [41]. model of MPI and OpenMP [30] for coarse-grained paral- Pennycook et al. [42] proposed a new hybrid (MPI + lelism shows good scalability compared to single-hierarchy- CUDA) approach to implement in NAS LU benchmark. level parallelism (pure MPI and pure OpenMP 3.0) with Rakić et al. 
[43] introduced the similar MPI + CUDA respect to both the problem size and the number of pro- parallelization of a finite strip program for geometric non- cessors for a fixed problem size. Nevertheless, the use of linear analysis. The hybrid of MPI+CUDA is applica- multiple threading in a hybrid paradigm increases the thread ble on heterogeneous cluster system where multiple CPU management overhead in thread creation/destruction and syn- processors are configured along with accelerated NVIDIA chronization considerably with the increase in the number GPU devices. Likely, another hybrid of OpenMP + CUDA of threads [31]. To update the thread-level parallelism and approach was introduced to achieve massive performance address the overhead in thread creation/destruction and syn- by computing data over single node having heterogeneous chronization, OpenMP 4.0 was released in 2013 as discussed processors [44]–[46]. The similar functionally was achieved in last section. This new version was equipped with new fea- by Howison et al. [47] through hybrid of MPI+ PThread. tures for error handling, tasking extensions, atomics and sup- These hybrid approaches could achieve coarse grain and port for accelerated computation. The primary challenge with fine-grain parallelism through MPI and GPU computa- this hybrid model was massive power consumption while data tions respectively. To achieve massive parallelism in the transferring and communication among CPU processors [32]. system, the hierarchy level in PPMs was shifted from However, new hybrid PPM approaches and hardware devices dual- to tri-level by adding another layer of parallelism. were required for localizing the work from the distributed Figure 2 demonstrates the increasing hierarchy in parallel 23098 VOLUME 6, 2018 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model programming models. Toward massive parallel computing evaluated HPC metrics including performance and using tri-hierarchy level, symmetric multiprocessor (SMP) power consumption. based new model was introduced to run on larger clus- - We implemented the same problem in CuBLAS and ter systems in [48]. In this model, communication among KAUST basic linear algebra subprograms (KBLAS) the SMP nodes was made using message passing interface. most famous Linear Algebra Subroutines libraries. Fur- Further second level of parallel computing starts over thermore, we compare the results with MOC suggested CPU threads within the node that was accomplished by model. OpenMP. Continuing parallel computing it starts third level - Based on MOC consequences, anticipated configura- of parallelism by vectorization for each processing ele- tions and predictive performance and power consump- ment. The primary aim of this approach was to achieve tion have been presented for future Exascale computing massive parallelism by combining coarse and fine grains system. within each SMP. Hybrid approach doesn’t allow the mes- sage passing in SMP nodes which was the main advantage IV. TRI-HYBRID MPI+OPENMP+CUDA (MOC) over flat MPI. Further this tri-hybrid model was used to PARALLEL PROGRAMMING MODEL solve several three dimensional linear elastic problems and In this section, we have presented the proposed tri- achieved 3.80 TFlops. From smaller executions, both tri- hybrid parallel programming model for Exascale comput- hybrid and flat MPI achieve better performance but hybrid ing system. 
Based on the hierarchy navigation in previous model outperformed to flat MPI for larger systems with parallel programming models, the proposed approach is a multiple SMP nodes. Although the performance in tri-hybrid hybrid of MPI, OpenMP and CUDA and abbreviated as model was good enough but huge power consumption was MOC. MOC contains three major level of computations such the biggest challenge for HPC technologies. According to as inter-node, intra-node and accelerated GPU devices. The Amarasinghe et al. [50], unanimous implementation of exist- detailed workflow of these three parallel computing level has ing models and powerful GPU devices for better perfor- been illustrated in Figure 3. Each computation level has been mance of the system should be reinvestigated. On the road discussed in detail as follows: toward Exascale computing systems, it has been antici- pated that in tri-hybrid third level of parallel computing A. INTER-NODE COMPUTATION model X will be replaced by accelerated processing through Before interacting with MOC model, some prerequisites are energy efficient GPU computations. In order to decide this necessary to determine about targeted system that includes X model, several models were recommended in different host CPU cores and its architecture, number of racks if studies [51]–[54]. These studies recommended OpenACC, targeted system is larger cluster, total number of nodes in CUDA at top consideration where OpenACC exceeded the the system, the GPU devices for accelerated computing and performance of the Compute Unified Device Architecture type of GPUs, memory type and levels. Once these speci- (CUDA) by approximately 50%. Moreover, it exceeded fications are determined, parallel computing zones started. CUDA’s performance by up to 98%. Conversely, metrics such MOC provides basically three levels of parallel zones where as optimization and program flexibility, thread synchroniza- first and top level is obtained through inter-node computation. tion and other advanced features are attainable in CUDA Inter-node computation was achieved by MPI that commu- but not in OpenACC. Under these metrics, HPC hetero- nicate among host CPUS processors in all connected nodes. geneous systems prevents unnecessary usage of resources. MPI defines two types of processes such as master pro- Eventually, we finalized the X model as CUDA to com- cess and slave process where master process is indicated pute accelerated GPU devices in current studies, which with rank ‘0’ and slave processes are represented with non- is expected the promising model for Exascale computing zero ranks. Before data distribution over processes, there system. are some fundamental MPI statements that are necessary to define these ranks and communication size over MPI world. Continuing the parallel computing, MPI master processes III. CONTRIBUTIONS distribute the data over all connected nodes through slave Our contribution in this paper can be summarized as processes. In order to distribute and receive the data, several follows: methods are available to use. For MOC model, we imple- - Proposed a new tri-hybrid MPI + OpenMP + CUDA mented blocking methods MPI_Send() and MPI_Recv() for (MOC) massive parallel computing model for Exascale sending and receiving data. Although these methods are not computing system that combine coarse-grain, fine grain as much efficient as non-blocking Isend() and Irec() but and finer granularity through inter-node, intra-node and we blocking methods maintain the synchronization. 
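A minimal sketch of this inter-node level is given below. It is illustrative only and does not reproduce the actual MOC data layout: the master at rank 0 sends equal-sized chunks to the slave ranks and collects partial results using the blocking MPI_Send()/MPI_Recv() pair discussed above; the chunk size and buffer handling are simplified assumptions.

#include <mpi.h>
#include <stdlib.h>

/* Coarse-grain level: rank 0 (master) scatters chunks, slave ranks return results. */
void internode_level(double *data, int chunk)
{
    int rank, size;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* master process          */
        for (int p = 1; p < size; p++)
            MPI_Send(data + p * chunk, chunk, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
        for (int p = 1; p < size; p++)                 /* gather partial results  */
            MPI_Recv(data + p * chunk, chunk, MPI_DOUBLE, p, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else {                                           /* slave processes         */
        double *local = malloc(chunk * sizeof(double));
        MPI_Recv(local, chunk, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... intra-node (OpenMP) and GPU (CUDA) levels would process 'local' here ... */
        MPI_Send(local, chunk, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        free(local);
    }
    MPI_Finalize();
}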
In our accelerated GPU computations. implementation, we didn’t use any optimization during data - We proposed a tri-hybrid algorithm and theoretical distribution however, this level of parallelism provides only model to evaluate MOC model complexity. coarse grain parallelism. After distrusting data over CPU pro- - We implemented MOC in linear algebra dense cesses, the next parallel computing zone started as described matrix multiplication using different kernel sizes and below. VOLUME 6, 2018 23099 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model FIGURE 3. Workflow of the Hybrid Parallel Programming Model. B. INTRA-NODE COMPUTATION (START GPU device. At this stage, data is computed over thousands of FROM HERE TO NEXT) cores in parallel and obtained finer granularity. For a cluster Intra-node computation is second level of parallelism where system having larger number of GPU devices, it’s difficult to distributed data over host CPU cores is computed within write the kernels each time. However, MOC model contained the node. This computation is performed over CPU threads. a generic form of CUDA kernel that receive/return data in These threads can be parallelized through different paral- template format and execute accordingly. Once data compu- lel programming models. Once of the most famous parallel tation over GPU devices is completed, it transferred back over programming model to parallelize CPU threads is OpenMP. host cores and controlled by OpenMP threads from where As discussed above, OpenMP can be used to program both it was initiated. Similarly, OpenMP complete its execution CPU cores and GPU devices as well. In MOC implementa- within the pragma and return data to MPI slave processes. tion, we used OpenMP to program parallelize CPU threads After receiving data from all these levels, MPI master thread and achieved fine grain parallelism. OpenMP programming collect data from slave processes and return the results back model contains one main outer pragma that initiate the par- to user call. In such way, we achieve three level of parallelism allel zone. Each statement written within this pragma is from MOC model. computed in parallel. However, to achieve fine grain par- Algorithm in Listing 1, describes the individual role of allelism, we implemented multiple looping directives and MPI, OpenMP and CUDA in MOC model that provide three sections directives, and refine the parallelism. Within these levels of parallelism as discussed above. Analysing compu- pragmas, we defined the third level of parallelism called GPU tation and communication cost of an algorithm depicts that computation. In order to optimize the resource, we reserved whether the algorithm is useful or not. Generally, running the similar number OpenMP threads as number of available time of any algorithm depends on multiple factors such as GPU devices. single/multi-processor system, read/write speed to memory, bit system (32/64 bits) and the input data. Theoretically, C. ACCELERATED GPU COMPUTATION an algorithm is evaluated by calculating the time and space The third level of parallelism in MOC model was performed complexity in it [63], [64]. Space complexity is related to through data processing over accelerated GPU devices. Each memory types used in the system. Now a day, we have CPU process was reserved for every GPU device. 
Therefore, advanced memory devices that overcome the space issues a looping statement reserve a specific GPU device every time and consequently ignore the space complexity considera- and transfer data from host to GPU device. This data is further tions. Generally, a parallel algorithm is analysed by following computed in CUDA kernel that run the code on specific factors. 23100 VOLUME 6, 2018 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model In order to evaluate these parameters in MOC model, let’s assume that N is the problem size in bytes. In MOC model, each process pi will process these bytes equally as w = N /p where w is the workload and f is considered the percentage of workload w that cannot be processed within the parallel region. According to Amdahl’s law, we consider the Tt the time factor for processing by adding t threads on each p. Using asymptotic analysis O(N ) approach, the required work N amount for N bytes can be expressed as O( pT t ). During processing of any parallel algorithm, it incurs some amount of communication overhead [65]. In MOC implan- tation, we tried to reduce the communication rounds from three phases including sending, computing and receiving; and assume the overhead cost as To . During sending round, let us assume that s bytes data will be send by a pro- cess pi from working region. Therefore, the communication complexity for sending s bytes will be as O( SNp ). Similarly, the multithreaded processes can compute C bytes of data using share memory. During data computation over pro- cesses, many overhead possibilities exist there such as wait time during accessing shared data, processes synchronization etc. However, these overheads during data computation can be expressed as o = N /M where M is the output of data computation. Thus the complexity of data computation round N can be determined as O( oCp ). Once the program execution is done by all the processes, the third round of data receiving takes start that can be described as O(log (p + Rec)) where Rec is timestamps for receiving processed data. Overall the total time complexity of MOC algorithm can be summarized as (Tm = Tc + To ) where Tc represents the computation cost of input data and To is the communication overhead cost. N Tc = O( ) (a) pTt N N To = O( ) + O( ) + O(log(p + Rec)) (b) Sp oCp N 1 1 1 Tm ≈ O + + + O(log(p + Rec)) (c) p Tt S oC Equation (c) elaborates that the complexity in MOC algo- rithm depend on multiple parameters under data computation and communication among the processes. V. EXPERIMENTAL SETUP This section explains the selected experimental platform for proposed MOC model. In experiments, we measured dif- ferent HPC metrics that includes performance factors such as execution time, number of achieved flops, power con- sumption and the energy efficiency in the system. A detailed description of these metrics is explained in following section. A. EXPERIMENTAL PLATFORM In order to evaluate the suggested MOC model, all experi- List. 1. The Tri-Hybrid MOC Algorithm ments were performed on Aziz supercomputer available in VOLUME 6, 2018 23101 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model HPC centre, King Abdulaziz University, Jeddah, Saudi Ara- programming approaches are being introduced to program bia. Aziz-Fujistu Primergy CX400 Intel Xeon Truescale QDR accelerated energy efficient devices that can minimize the supercomputer is manufactured by Fujistu [55]. According power consumption level. 
Generally, a system is evaluated to top-500 supercomputing list, Aziz was ranked at 360th according to its energy consumption, which indicates the position [56]. Originally Aziz was develop to deal HPC appli- power rate at which processing was executed, as described cations in Saudi Arabia and collaborated projects. It con- in equation (2). tained 492 number of nodes which are interlinked within the Z t racks through InfiniBand where 380 are regular and 112 are E(kWh) = V · I (dt) (2) large nodes with additional specifications. Previously Aziz 0 was capable to run the applications only on homogeneous From above equation, we can calculate the total energy con- node but due to requirements of massive parallel comput- sumption of a system by integrating the energy consumption, ing, it was upgraded by adding two SIMD-architecture-based which is composed of the bandwidth, memory contention, accelerated NVIDA Tesla k-20 GPU devices where each parallelism and behavior of the application in the HPC paral- device has 2496 CUDA cores. Moreover, two MIC devices lel system, as described in equation (3). with an Intel Xeon Phi Coprocessor with 60 cores were also Z t installed to upgrade homogeneous computational architec- Esystem = BandW (dt) + MemC (dt) + Prll (dt) 0 ture. In such way, Aziz contains total 11904 number of core + Bhv(dt) (3) in it. Regarding memory, each regular node contained by 96 GB and larger (FAT) nodes with 256 GB. Each node has On the basis of the dictated factors and the fundamental Intel E5-2695v2 processor that contains twelve physical cores energy evaluation, we quantified these factors in the current with 2.4 GHz processing power. Overall Aziz all nodes and study with respect to system performance and power con- accelerated devices are interlinked through three different sumption. For any heterogeneous cluster system, the power networks including user, management and InfiniBand net- consumption can be calculated by summation of the prod- work. For Aziz, user network and management networks are ucts of the power of each component and the corresponding specifically used for the login and job submission handling duration [59]. Generally, Power consumption in a system is whereas InfiniBand to parallelize the file system. Accord- categories in two types. ing to LINKPACK benchmarks, Aziz’s peak performance 1) System Specification was measured with 211.3 Tflops/s and 228.5 Tflops/s as 2) Application Specification theoretical performance [57]. Regarding software speficica- Since the system specification has GPU devices installed in tions, it run using Cent OS with release 6.4. For accelerated it, the power consumption is calculated by equation (4): programming CUDA recent toolkit 9.1 is installed. It also N M contained many other compilers required for HPC libraries. X X Psystem (w) = PiGPU (wi ) + PCPU ( (wj )) i=1 j B. PERFORMANCE MEASUREMENT + Pmainboard (w) (4) System performance is the primary aim of current and emerg- ing HPC systems. The performance metric contains different From equation 4, it can be speculated that the approximate factors that evaluates a system’s performance such as execu- power consumption of a system is the sum of the products tion time, number of achieved flops. Usually, in HPC systems, of the installed GPUs, CPUs and motherboard. 
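As a rough, hedged illustration of how equations (2) and (4) can be applied in practice, the following sketch accumulates component power samples into a system-level power and energy figure. The sampling structure, the component fields and the fixed sampling interval are placeholder assumptions, not the instrumentation actually used on Aziz.

#include <stddef.h>

/* One power sample per component, taken every dt seconds. */
typedef struct { double gpu_w[2]; double cpu_w; double mainboard_w; } power_sample_t;

/* Eq. (4): system power is the sum of the GPU, CPU and mainboard contributions. */
double system_power_w(const power_sample_t *s, int ngpus)
{
    double p = s->cpu_w + s->mainboard_w;
    for (int i = 0; i < ngpus; i++) p += s->gpu_w[i];
    return p;
}

/* Eq. (2), discretised: energy in joules accumulated from periodic samples (E = sum of P * dt). */
double energy_joules(const power_sample_t *samples, size_t n, int ngpus, double dt_s)
{
    double e = 0.0;
    for (size_t k = 0; k < n; k++)
        e += system_power_w(&samples[k], ngpus) * dt_s;
    return e;    /* energy efficiency is then achieved Gflops divided by (e / runtime) */
}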
The power the number of flops are calculated by dividing the number of consumption varies with the workload; however, on the appli- floating point operations (FPOps) by parallel execution time cation side, it can be quantified using equation (5): PEt as given in equation (1). Napp M X X FPOps Papp = PiGPU (wi ) + PCPU ( (wj )) Flops = (1) PE t i=1 j + Pmainboard (wapp ) (5) Following equation (1), we measured the number of achieved flops by executing different datasets of DMM algorithm in According to equations (4) and (5), the power consumption in MOC model. watts was measured at the idle state of the system, where only 5 watts of power were consumed by the motherboard and the C. POWER MEASUREMENT remaining power was consumed by the cores of system. Energy consumption is the primary challenge toward emerg- ing HPC systems. Although this challenges is somehow VI. EXPERIMENTAL RESULTS addressed in current computing systems but we cannot Tri-hybrid MOC model was implemented in linear algebraic increase the system performance under defined power con- dense matrix multiplication (DMM) algorithm [60]. We per- sumption limitations. However, the main theme of Exascale formed all the implementation on Aziz supercomputing avail- systems is to minimize the power consumption by selection able in King Abdulaziz University, Jeddah Saudi Arabia as optimal hardware and software frameworks [58]. Many novel the specification are described above. During experiments, 23102 VOLUME 6, 2018 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model TABLE 2. Naïve code and parameters of implemented DMMA. we observed two fundamental HPC metrics including perfor- mance and power consumption that are biggest challenges for current and emerging HPC systems. As Aziz contained mul- tiple GPU devices, however we computed different matrix sizes on multiple CUDA kernels (four, eight and twelve). FIGURE 4. Performance in DMM through multiple kernel configurations. During experiments, it was observed that multiple kernels outperformed to single kernel call with respect to perfor- mance by consuming low power. For a heterogeneous system, we noticed that larger than four kernels couldn’t perform well. This happens due to unnecessary utilization of resources and communication among the devices and host cores. According to new CUDA build 9.1, the overhead of communication can be minimized by using additional CUDA statements that allow to communicate with host cores only when necessary. In contrast, four kernels outperformed throughout the execu- tions due to an optimized resources utilization. A naïve code along with specific parameters of implemented DMM has been presented in table 2 as follows. Above code described in table 2 is not explained com- pletely due to space restrictions. However, readers can approach to Tiwari et al. [61] that has described DMM FIGURE 5. Energy efficiency in DMM for different multiple-kernel implementation and optimization strategy in details. In our configurations. implementation, z array was reused in the buffer register and x, y arrays were stored in caches to utilize in efficient way. However, in order to quantify the performance in selected performance. Conversely, bigger kernel sizes also achieved linear algebraic dense matrix multiplication application, an adequate performance but consumed more power due the number of floating point (FP) operations of DMM algo- to unnecessary communication among heterogeneous cores. rithm were determined. 
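For reference, a generic naive CUDA kernel for the DMM computation summarized in Table 2 might look as follows. This is an illustrative sketch only, using the x, y and z array names from the text; the tuned variant with register reuse of z and cache blocking of x and y, as described by Tiwari et al. [61], is more involved.

/* Naive dense matrix multiplication: z = x * y for N x N row-major matrices. */
__global__ void dmm_naive(const double *x, const double *y, double *z, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        double acc = 0.0;
        for (int k = 0; k < N; k++)
            acc += x[row * N + k] * y[k * N + col];   /* N multiplications and N-1 additions per element */
        z[row * N + col] = acc;
    }
}

Counting the operations per output element in this way is what leads to the total operation count in equation (6) below.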
Generally, in dense matrix multi- Alongside performance, we evaluated another essential met- plication algorithm (DMMA), it is apparent that the total ric, to be specific, energy consuming in the system that number (TN) of FP operations to compute a product of two was 28 Joules. At most extreme DMM for a dataset of size square matrices can be calculated using a simple formula as 10000 through an enhanced four kernel setup, the evaluated follows in equation (6). energy proficiency was 8.3 Gflops/W. The addition of assets TN FP Ops = (N 3 ) + ((N 2 ) ∗ (N − 1)) (6) influenced energy effectiveness significantly and diminished it to 5.6 Gflops/W, as appeared in Figure 5. Once the TN of FPOps are determined, we can calcu- Performance and energy efficiency are directly propor- late the number of flops using equation (1). According to tional to each other, however the trade-off between both DMM algorithm implementation, it was observed the metrics can be determined [62] as given: peak performance for datasets 1000-10000 reached up Performance Execution within the time unit to 1 Teraflops in four kernels. For all kernels implementation = Power Energy during the execution time unit the observed performance was achieved by 716 Gflops as an work average as shown in Figure 4. = We noticed that the implementation of four CPU threads energy per node along with equal number of CUDA kernels outper- Therefore trade-off between these metrics provides the rate of formed to all other kernel configurations and consequently accomplishable performance under given energy efficiency accomplished approximately 1 Teraflops with 68% peak as presented in Figure 6 where horizontal and vertical lines VOLUME 6, 2018 23103 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model FIGURE 8. Energy efficiency in DMM for MOC vs (CuBLAS and KBLAS). FIGURE 6. Performance-Energy Efficiency tradeoff. MOC outperformed throughout to KBLAS and CuBLAS in all executions matrix sizes. During power consumption quantification, there consequences were different now where MOC consumed less power as compared to all other imple- mentations as shown in figure 8. It was observed that MOC achieved 6.1 Gflops/w during smaller data computation whereas CuBLAS and KBLAS attained 5.2 and 5.7 number of Gflops/w respectively. By increasing matrix size, Gflops/w gradually changed in all implementations. We noticed that CuBLAS and KBLAS could achieved the maximum number of Gflops/w as 6.5-6.7 against maximum matrix size computations. In con- trast, MOC attained the similar number of Gflops/w at ini- tial matrix size executions. By increasing the matrix size, FIGURE 7. MOC performance comparison with KBLAS and CuBLAS number of Gflops/w increased gradually and reached up implementations. to 8.3. Although KBLAS and CuBLAS are also optimized approaches in solving linear algebraic systems whereas MOC outperformed by depicting massive parallelism. Furthermore, presents the entropy of energy efficiency and performance the additional factor was utilization of NVIDIA GPU that pro- respectively. Any interacting point in the graph confronts cessed the data in mili-seconds within small power consump- the peak values of achieved performance and energy effi- tion. However, we evaluated that MOC model accomplished ciency. We can settle the arrangement and parameters at any 1086 Gflops by consuming 130 w total power consumption crossing point to give most extreme execution and vitality during larger matrix execution. productivity. 
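As a quick check on these figures, equation (6) with N = 10000 gives TN ≈ N^3 + N^2(N − 1) ≈ 2.0 × 10^12 FP operations, so a run sustaining roughly 1 Tflops corresponds to about 2 s of compute; likewise, the reported peak of 1086 Gflops at 130 W corresponds to 1086 / 130 ≈ 8.35 Gflops/W, consistent with the 8.3 Gflops/W energy efficiency reported for the best configuration.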
These assessments discovered that the best per- formance and energy efficiency which is accomplishable by VII. EXASCALE COMPUTING SYSTEM DEMAND utilizing the proposed demonstrate on the Aziz supercom- The major challenge for future HPC supercomputing Exas- puter was 1086 Gflops, which relates to energy efficiency cale systems is that it doesn’t exist yet. However, the devel- of 8.3 Gflops/W. opment toward Exascale systems are being performed based In MOC implementations with different kernel sizes, on predictions and statistics in existing consequences. This we concluded that system performance is directly propor- section demonstrates a statistical analysis of performed tional to resource optimization and utilization during any experiments in current study. This factual investigation dataset processing. Sometime, larger number of resources explore two primary HPC metrics including performance become the reason of performance decreasing in the and power consumption which has been considered the system as (8,12 kernels) used in experiments. In order challenging factors for Exascale computing systems. The to compare the evaluated MOC performance and power current study was conducted on heterogeneous architecture consumption, we computed the similar DMM application based system that contained 11904 number of cores inte- matrix size in KBLAS and CuBLAS libraries. According to grated in 494 number of homogenous nodes. Moreover, two figure 7, KBLAS and CuBLAS could achieve maximum k-20 NVIDA GPU devices were configured in the sys- 810 and 630 Gflops/sec respectively whereas MOC achieved tem for accelerated computing. The peak performance of up to one Tflops during the same executions. selected platform was 211.3 Tflops/s and 228.5 Tflops/s with 23104 VOLUME 6, 2018 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model TABLE 3. Exascale computing system configurations. computing systems if we just scale the existing HPC system with fundamental resources required for Exascale systems. The determinations from current study, elicited several chal- lenging that open new research directions and thoughts as follows: • As HPC systems are not confirmed about its architec- ture, it may homogeneous / heterogeneous, however it must be investigated that how and which layer can manage the dynamic behaviour of the system and code irregularity as well. • In different studies, we noticed that algorithms enhance the system performance by consuming less power; it TABLE 4. An analysis of measuring parameters based on different must be investigated that which optimized approach can platforms. adopt this trade-off. • As describe in top ten challenges, memory management is one of those, however, what additional hooks can be used to increase system efficiency by reducing commu- nication cost. • Resource optimization should also be considered as sometime small input data occupy large number of resources. • How we can maintain the power consumption during larger executions within a particular environment? Linpack and theoretical performance respectively. Using this • How the communication overhead among the hetero- platform, MOC model achieved 1086 Gflops/s and consumed geneous cores can be reduced to dilute the power less than 28 joules energy for larger datasets. This rate of consumption? performance and energy consumption conceded the energy efficiency as 8.3 Gflops/Sec. 
This efficiency factor can be The above concluded facts open new research directions and determined by using formula given in equation 7. challenges for development communities and researchers. Based on these facts, the suggested MOC model must be E(j) implemented in different complex HPC applications and P (w) = t(s) observe the system behaviour. However, VIII. CONCLUSIONS Joule watt = or W = J / S (7) HPC innovation is being moved from the Petascale to the Second extraordinary ‘‘Exascale’’ processing framework. This pow- Following equation (7), the peak performance was achieved erful system will required massive power consumption to under the power consumption with 130 W. According to pre- provide Exaflops number of calculation in secs. However, dictive Exascale systems configuration given in Table 3, if we HPC pioneers, researchers and development communities enhance the selected platform with thousand-fold increase, defined some hard limitations that should be considered it will be capable to provide Exaflops number of calculations for any Exascale system. These limitations includes majorly per second. power consumption, performance level, delivery time and Based on the ratio of current computation and required number of configured cores. Based on these limitations, cur- resources, the predictive performance and power consump- rent technologies are facing several challenges where accom- tion were calculated, which are presented in Table 4. plishing massive parallelism under energy constrains is one Table 4 demonstrates the HPC current and emerging plat- of those. Current study proposed a new tri-hybrid MOC forms. Based on consequences, we categorized the selected (MPI + OpenMP + CUDA) parallel programming model that platform Aziz supercomputer into two domains such as attained massive performance through monolithic parallelism accomplished and prescient. However, the accomplished in the system. In order to evaluate MOC model, we imple- results against each metric demonstrates that if Aziz is scaled mented in linear algebraic dense matrix multiplication appli- with Exascale configurations, MOC model can provide the cation and observed different metrics such as performance predictive performance level and power consumption in it. and power consumption during different dataset executions. As scalability in current architecture doesn’t requires any It was observed that MOC with four kernels outperformed additional frameworks that however it can achieve the pre- against eight and twelve kernels implementations. Further, dictive figures by using MOC. Therefore, MOC model can MOC with peak performance was compared with other most be considered as promising model for emerging Exascale prominent implementations including KLBAS and CuBLAS. VOLUME 6, 2018 23105 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model We observed that MOC achieved a tremendous performance [19] (Jan. 18, 2018). OpenMPI: Open Source High Performance Computing. and reached up to 1 Teraflops within 130 W power consump- Accessed: Feb. 10, 2018. [Online]. Available: https://www.open-mpi.org/ [20] (Jan. 3, 2018). OpenMP. Accessed: Feb. 11, 2018. [Online]. Available: tion. Based on experimental consequences, we presented a http://www.openmp.org/ predictive performance and power consumption of selected [21] I. Karlin et al., ‘‘Early experiences porting three applications to platform if we increase it up to Exascale configurations. OpenMP 4.5,’’ in Proc. Int. Workshop OpenMP, 2016, pp. 
281–292. [22] A. Podobas and S. Karlsson, ‘‘Towards unifying OpenMP under Although these predictive results were not meeting the Exas- the task-parallel paradigm,’’ in Proc. Int. Workshop OpenMP, 2016, cale figures but enhance the performance by decreasing three pp. 116–129. time power consumption as in current computation systems. [23] NVIDIA CUDA Compute Unified Device Architecture Programming Guide. NVIDIA Corp., Santa Clara, CA, USA, 2017. However, the suggested MOC model can be conceived as a [24] (Jan. 24, 2018). NVIDIA Accelerated Computing. Accessed: Feb. 15, 2018. promising model for emerging Exascale computing systems. [Online]. Available: https://docs.nvidia.com/cuda/cuda-toolkit-release- By future perspectives, we must rethink and fix the deter- notes/index.html [25] (Jan. 30, 2018). CUDA. Accessed: Feb. 15, 2018. [Online]. Available: mined challenges toward Exascale systems. The major chal- https://en.wikipedia.org/wiki/CUDA lenge for Exascale is that it doesn’t exist yet. However, [26] H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, and B. Chapman, we cannot assure that it will be homogeneous or hetero- ‘‘High performance computing using MPI and OpenMP on multi- core parallel systems,’’ Parallel Comput., vol. 37, no. 9, pp. 562–575, geneous architecture based systems. Therefore, we need 2011. an adaptive hybrid programming model that can deal both [27] K.-L. Ma, C. Wang, H. Yu, and A. Tikhonova, ‘‘In-situ processing and homogenous and heterogeneous architecture systems. visualization for ultrascale simulations,’’ J. Phys., Conf. Ser., vol. 78, no. 1, p. 012043, 2007. [28] P. D. Mininni, D. Rosenberg, R. Reddy, and A. Pouquet, ‘‘A hybrid MPI– REFERENCES OpenMP scheme for scalable parallel pseudospectral computations for [1] D. Eddelbuettel, ‘‘CRAN task view: High-performance and parallel com- fluid turbulence,’’ Parallel Comput., vol. 37, nos. 6–7, pp. 316–326, 2011. puting with R,’’ Comprehensive R Arch. Netw., Tech. Rep. 2018-03-20, [29] S. Dong and G. E. Karniadakis, ‘‘Dual-level parallelism for high-order 2018. CFD methods,’’ Parallel Comput., vol. 30, no. 1, pp. 1–20, 2004. [2] Inside HPC. What is High Performance Computing. [30] M. U. Ashraf and F. E. Eassa, ‘‘Hybrid model based testing tool architec- Accessed: Jan. 10, 2018. [Online]. Available: http://insidehpc.com/ ture for exascale computing system,’’ Int. J. Comput. Sci. Secur., vol. 9, hpc-basictraining/what-is-hpc/ no. 5, pp. 245–252, 2015. [3] M. Zhou, ‘‘Petascale adaptive computational fluid dynamics,’’ [31] S. Jin and D. P. Chassin, ‘‘Thread group multithreading: Accelerating the Ph.D. dissertation, Rensselaer Polytech. Inst., Troy, NY, USA, 2009. computation of an agent-based power system modeling and simulation tool–C GridLAB-D,’’ in Proc. IEEE 47th Hawaii Int. Conf. Syst. Sci., [4] J. J. Dongarra and D. W. Walker, ‘‘The quest for petascale computing,’’ Jan. 2014, pp. 2536–2545. Comput. Sci. Eng., vol. 3, no. 3, pp. 32–39, May 2001. [32] M. Hennecke, W. Frings, W. Homberg, A. Zitz, M. Knobloch, and H. Böt- [5] R. Brower et al. (Oct. 2017). ‘‘Lattice QCD application development tiger, ‘‘Measuring power consumption on IBM Blue Gene/P,’’ Comput. within the US DOE exascale computing project.’’ [Online]. Available: Sci.-Res. Develop., vol. 27, no. 4, pp. 329–336, 2012. https://arxiv.org/abs/1710.11094 [33] T. Hoegg, G. Fiedler, C. Koehler, and A. Kolb, ‘‘Flow driven GPGPU [6] J.-L. Vay et al. (Jan. 2018). 
‘‘Warp-X: A new exascale computing platform programming combining textual and graphical programming,’’ in Proc. for beam-plasma simulations.’’ [Online]. Available: https://arxiv.org/abs/ ACM 7th Int. Workshop Program. Models Appl. Multicores Manycores, 1801.02568 2016, pp. 88–97. [7] S. Perarnau, R. Gupta, and P. Beckman, ‘‘Argo: An exascale operating [34] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, and A. Ramirez, system and runtime,’’ in Proc. Int. Conf. High Perform. Comput., Netw., ‘‘The low power architecture approach towards Exascale computing,’’ Storage Anal. (SC), 2015. J. Comput. Sci., vol. 4, no. 6, pp. 439–443, 2013. [8] J. Shalf, S. Dosanjh, and J. Morrison, ‘‘Exascale computing technology [35] N. Rajovic, A. Rico, N. Puzovic, C. Adeniyi-Jones, and A. Ramirez, challenges,’’ in Proc. Int. Conf. High Perform. Comput. Comput. Sci., 2010, ‘‘Tibidabo1 : Making the case for an ARM-based HPC system,’’ Future pp. 1–25. Generat. Comput. Syst., vol. 36, pp. 322–334, Jul. 2014. [9] D. A. Reed and J. Dongarra, ‘‘Exascale computing and big data,’’ Commun. [36] A. Duran and M. Klemm, ‘‘The Intel many integrated core architecture,’’ ACM, vol. 58, no. 7, pp. 56–68, 2015. in Proc. IEEE Int. Conf. High Perform. Comput. Simulation (HPCS), [10] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir, ‘‘Toward Jul. 2012, pp. 365–366. exascale resilience,’’ Int. J. High Perform. Comput. Appl., vol. 23, no. 4, [37] J. E. Stone, D. Gohara, and G. Shi, ‘‘OpenCL: A parallel program- pp. 374–388, 2009. ming standard for heterogeneous computing systems,’’ Comput. Sci. Eng., [11] ASCAC Subcommittee for the Top Ten Exascale Research Challenges, vol. 12, no. 3, pp. 66–73, 2010. U.S. Dept. Energy State, Washington, DC, USA, 2014. [38] M. U. Ashraf and F. E. Eassa, ‘‘OpenGL based testing tool architec- [12] T. N. Theis and H.-S. P. Wong, ‘‘The end of Moore’s Law: A new beginning ture for exascale computing,’’ Int. J. Comput. Sci. Secur., vol. 9, vol. 5, for information technology,’’ Comput. Sci. Eng., vol. 19, no. 2, pp. 41–50, pp. 238–244, 2015. 2017. [39] M. Wolfe et al., ‘‘Implementing the OpenACC data model,’’ in Proc. [13] D. A. Jacobsen and I. Senocak, ‘‘Multi-level parallelism for incompressible IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), flow computations on GPU clusters,’’ Parallel Comput., vol. 39, no. 1, May/Jun. 2017, pp. 662–672. pp. 1–20, 2013. [40] J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and [14] CUDA Toolkit 4.0 CUBLAS Library, Nvidia Corp., Santa Clara, CA, USA, E. J. Kelmelis, ‘‘CULA: Hybrid GPU accelerated linear algebra routines,’’ 2011, pp. 59–60. Proc. SPIE, vol. 7705, p. 770502, Apr. 2010. [15] A. Abdelfattah, D. Keyes, and H. Ltaief, ‘‘KBLAS: An optimized library [41] S. Tomov, J. Dongarra, and M. Baboulin, ‘‘Towards dense linear algebra for dense matrix-vector multiplication on GPU accelerators,’’ ACM Trans. for hybrid GPU accelerated manycore systems,’’ Parallel Comput., vol. 36, Math. Softw., vol. 42, no. 3, 2016, Art. no. 18. nos. 5–6, pp. 232–240, 2010. [16] J. Dongarra, ‘‘MPI: A message-passing interface standard version 3.0,’’ [42] S. J. Pennycook, S. D. Hammond, S. A. Jarvis, and G. R. Mudalige, High Performance Computing Center Stuttgart (HLRS), 2013. ‘‘Performance analysis of a hybrid MPI/CUDA implementation of the [17] J. Dinan, P. Balaji, D. Buntinas, D. Goodell, W. Gropp, and R. Thakur, NASLU benchmark,’’ ACM SIGMETRICS Perform. Eval. Rev., vol. 
38, ‘‘An implementation and evaluation of the MPI 3.0 one-sided communi- no. 4, pp. 23–29, 2011. cation interface,’’ Concurrency Comput., Pract. Exper., vol. 28, no. 17, [43] P. S. Rakić, D. D. Milašinović, Ž. Živanov, Z.Suvajdžin, M. Nikolić, and pp. 4385–4404, 2016. M. Hajduković, ‘‘MPI–CUDA parallelization of a finite-strip program [18] (Jun. 20, 2017.). Message Passing Interface. Accessed: Aug. 3, 2017. for geometric nonlinear analysis: A hybrid approach,’’ Adv. Eng. Softw., [Online]. Available: https://computing.llnl.gov/tutorials/mpi/ vol. 42, no. 5, pp. 273–285, 2011. 23106 VOLUME 6, 2018 M. U. Ashraf et al.: Performance and Power Efficient Massive Parallel Computational Model [44] J. Guan, S. Yan, and J.-M. Jin, ‘‘An openMP-CUDA implementation M. USMAN ASHRAF was born in Paddali, of multilevel fast multipole algorithm for electromagnetic simulation on Sialkot, Pakistan, in 1988. He received the B.Sc. multi-GPU computing systems,’’ IEEE Trans. Antennas Propag., vol. 61, degree in mathematics from the University of the no. 7, pp. 3607–3616, Jul. 2013. Punjab, Pakistan, in 2007, and the M.S. degree in [45] F. Lu, J. Song, F. Yin, and X. Zhu, ‘‘Performance evaluation of hybrid pro- computer science from the University of Lahore, gramming patterns for large CPU/GPU heterogeneous clusters,’’ Comput. Pakistan, in 2014. He is currently pursuing the Phys. Commun., vol. 183, no. 6, pp. 1172–1181, 2012. Ph.D. degree in computer science from King [46] R. Reyes and F. de Sande, ‘‘Optimization strategies in different CUDA Abdulaziz University, Jeddah, Saudi Arabia. From architectures using llCoMP,’’ Microprocess. Microsyst., vol. 36, no. 2, 2010 to 2014, he was a Senior Software Engi- pp. 78–87, 2012. [47] M. Howison, E. W. Bethel, and H. Childs, ‘‘Hybrid parallelism for volume neer with Coeus Software Solutions, GmbH. He rendering on large-, multi-, and many-core systems,’’ IEEE Trans. Vis. is currently a member of the Software Engineering Group, King Abdulaziz Comput. Graphics, vol. 18, no. 1, pp. 17–29, Jan. 2012. University Jeddah, Saudi Arabia. His research interests include high per- [48] K. Nakajima, ‘‘Three-level hybrid vs. flat MPI on the Earth Simulator: formance computing, parallel computing, exascale computing, and software Parallel iterative solvers for finite-element method,’’ Appl. Numer. Math., engineering. vol. 54, no. 2, pp. 237–255, 2005. [49] T. Nguyen-Thoi, G. R. Liu, K. Y. Lam, and G. Y. Zhang, ‘‘A face-based smoothed finite element method (FS-FEM) for 3D linear and geometrically non-linear solid mechanics problems using 4-node tetrahedral elements,’’ Int. J. Numer. Methods Eng., vol. 78, no. 3, pp. 324–353, 2009. FATHY ALBURAEI EASSA received the B.Sc. [50] S. Amarasinghe et al., ‘‘ASCR programming challenges for exascale com- degree in electronics and electrical communication puting,’’ Rep. Workshop Exascale Program. Challenges, 2011. engineering from Cairo University, Egypt, in 1978, [51] T. Hoshino, N. Maruyama, S. Matsuoka, and R. Takaki, ‘‘CUDA vs Ope- and the M.Sc. and Ph.D. degrees in computers and nACC: Performance case studies with kernel benchmarks and a memory- systems engineering from Al-Azhar University, bound CFD application,’’ in Proc. 13th IEEE/ACM Int. Symp. Cluster, Cairo, Egypt, in 1984 and 1989, respectively, with Cloud Grid Comput. (CCGrid), May 2013, pp. 136–143. [52] J. A. Herdman et al., ‘‘Accelerating hydrocodes with OpenACC, OpenCL joint supervision with the University of Colorado, and CUDA,’’ in Proc. IEEE SC Companion High Perform. 
Comput., Netw., USA, in 1989. He is currently a Full Professor Storage Anal. (SCC), Nov. 2012, pp. 465–471. with the Computer Science Department, Faculty [53] A. Lashgar, A. Majidi, and A. Baniasadi. (Dec. 2014) ‘‘IPMACC: Open of Computing and Information technology, King source OpenACC to CUDA/OpenCL translator.’’ [Online]. Available: Abdulaziz University, Saudi Arabia. His research interests include agent https://arxiv.org/abs/1412.1127 based software engineering, cloud computing, software engineering, big [54] S. Christgau, J. Spazier, B. Schnor, M. Hammitzsch, A. Babeyko, and data, distributed systems, and exascale system testing. J. Waechter, ‘‘A comparison of CUDA and OpenACC: Accelerating the tsunami simulation EasyWave,’’ in Proc. Workshop Archit. Comput. Syst. (ARCS), Feb. 2014, pp. 1–5. [55] (Sep. 22, 2014). Fujitsu to Provide High-Performance Computing and Services Solution to King Abdulaziz University. Accessed: Jul. 6, 2017. AIIAD AHMAD ALBESHRI received the B.S. [Online]. Available: http://www.fujitsu.com/global/about/resources/news/ degree in computer science from King Abdulaziz press-releases/2014/0922-01.html University, Jeddah, Saudi Arabia, and the Ph.D. [56] (Jun. 2015). King Abdulaziz University. Accessed: Aug. 3, 2017. [Online]. degree in computer science from the Queens- Available: https://www.top500.org/site/50585 [57] Aziz—Fujitsu PRIMERGY CX400, Intel Xeon E5-2695v2 12C 2.4GHz, land University of Technology, Australia, in 2013. Intel TrueScale QDR. Accessed: Aug. 3, 2017. [Online]. Available: He is currently an Assistant Professor with the https://www.top500.org/system/178571 Department of Computer Science, King Abdu- [58] L. A. Barroso, ‘‘The price of performance,’’ Queue, vol. 3, no. 7, pp. 48–53, laziz University, Saudi Arabia. His research inter- Sep. 2005. ests include cloud computing, security in cloud [59] D. Ren and R. Suda, ‘‘Power efficient large matrices multiplication by load computing, storage in cloud computing, parallel scheduling on multi-core and GPU platform with CUDA,’’ in Proc. IEEE computing, and big data. Int. Conf. Comput. Sci. Eng. (CSE), vol. 1. Aug. 2009, pp. 424–429. [60] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh, ‘‘Parallel algorithms for dense linear algebra computations,’’ SIAM Rev., vol. 32, no. 1, pp. 54–135, 1990. ABDULLAH ALGARNI received the bachelor’s [61] A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth, ‘‘A scal- degree from King Abdulaziz University Jeddah, able auto-tuning framework for compiler optimization,’’ in Proc. IPDPS, Saudi Arabia, and the master’s and Ph.D. degrees Rome, Italy, May 2009, pp. 1–12. from the College of Natural Sciences, Colorado [62] H. Anzt, B. Haugen, J. Kurzak, P. Luszczek, and J. Dongarra, ‘‘Experiences State University, USA, in 2016, all in computer in autotuning matrix multiplication for energy minimization on GPUs,’’ Concurrency Comput., Pract. Exper., vol. 27, no. 17, pp. 5096–5113, science. He is currently an Assistant Professor and 2015. the Chairman of the Computer Science Depart- [63] J. Y.-T. Leung, Ed., Handbook of Scheduling: Algorithms, Models, and ment, King Abdulaziz University, Jeddah, Saudi Performance Analysis. Boca Raton, FL, USA: CRC Press, 2004. Arabia. His research interest includes software [64] G. C. Fox, R. D. Williams, and G. C. Messina, Parallel Computing Works! vulnerabilities, software risk management and mit- New York, NY, USA: Elsevier, 2014. 
igation, quantitative evaluation, software risk assessment, software engineer- [65] Introduction to Parallel Computing. Accessed: Feb. 15, 2018. [Online]. ing, and software security. Available: https://computing.llnl.gov/ tutorials/parallel_comp/ VOLUME 6, 2018 23107