PROC. OF THE 20th PYTHON IN SCIENCE CONF. (SCIPY 2021)

MPI-parallel Molecular Dynamics Trajectory Analysis with the H5MD Format in the MDAnalysis Python Package

Edis Jakupovic‡, Oliver Beckstein‡*

‡ Arizona State University
* Corresponding author: [email protected]

Copyright © 2021 Edis Jakupovic et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract—Molecular dynamics (MD) computer simulations help elucidate details of the molecular processes in complex biological systems, from protein dynamics to drug discovery. One major issue is that MD simulation files are now commonly terabytes in size, which makes analyzing the data from these files a painstakingly expensive task. In the age of national supercomputers, methods of parallel analysis are becoming a necessity for the efficient use of time and high performance computing (HPC) resources, but for any approach to parallel analysis, simply reading the file from disk becomes the performance bottleneck that limits overall analysis speed. One promising way around this file I/O hurdle is to use a parallel message passing interface (MPI) implementation with the HDF5 (Hierarchical Data Format 5) file format to access a single file simultaneously with numerous processes on a parallel file system. Our previous feasibility study suggested that this combination can lead to favorable parallel scaling with hundreds of CPU cores, so we implemented a fast and user-friendly HDF5 reader (the H5MDReader class) that adheres to the H5MD (HDF5 for Molecular Dynamics) specifications. We made H5MDReader (together with an H5MD output class, H5MDWriter) available in the MDAnalysis library, a Python package that simplifies the process of reading and writing various popular MD file formats by providing a streamlined user interface that is independent of any specific file format. We benchmarked H5MDReader's parallel file reading capabilities on three HPC clusters: ASU Agave, SDSC Comet, and PSC Bridges. The benchmark consisted of a simple split-apply-combine scheme of an I/O-bound task that split a 90k frame (113 GiB) coordinate trajectory into N chunks for N processes, where each process performed the commonly used RMSD (root mean square distance after optimal structural superposition) calculation on its chunk of data, and then gathered the results back to the root process. For baseline performance, we found maximum I/O speedups at 2 full nodes, with Agave showing 20x, and a maximum computation speedup on Comet of 373x on 384 cores (all three HPCs scaled well in their computation task). We went on to test a series of optimizations attempting to speed up I/O performance, including adjusting the file system stripe count, implementing a masked array feature that only loads data relevant to the computation task, front loading all I/O by reading the entire trajectory into memory, and manually adjusting the HDF5 dataset chunk shapes. We found the largest improvement in I/O performance came from optimizing the chunk shape of the HDF5 datasets to match the iterative access pattern of our analysis benchmark. With respect to baseline serial performance, our best result was a 98x speedup at 112 cores on ASU Agave. In terms of absolute time saved, the analysis went from 4623 seconds in the baseline serial run to 47 seconds in the parallel, properly chunked run. Our results emphasize that file I/O performance depends not just on the access pattern of the file, but even more so on the synergy between the access pattern and the layout of the file on disk.

Index Terms—Molecular Dynamics Simulations, High Performance Computing, Python, MDAnalysis, HDF5, H5MD, MPI I/O

Introduction

The molecular dynamics (MD) simulation approach [HBD+19] is widely used across the biomolecular and materials sciences, accounting for more than one quarter of the total computing time [FQC+19] in the Extreme Science and Engineering Discovery Environment (XSEDE) network of national supercomputers in the US [TCD+14]. MD simulations, especially in the realm of studying protein dynamics, serve an important purpose in characterizing the dynamics, and ultimately the function, of a protein [Oro14]. For example, recent award-winning work [CDG+21] involving the SARS-CoV-2 spike protein used all-atom MD simulations to elucidate dynamics of the virus-to-human cell interaction that were inaccessible to experiment. While the parameters involved in fine-tuning the physics driving these simulations continue to improve, the computational demand of longer, more accurate simulations increases [DDG+12]. As high performance computing (HPC) resources continue to improve in performance, MD simulation files are now commonly terabytes in size, making serial analysis of these trajectory files impractical [CR15]. Parallel analysis is a necessity for the efficient use of both HPC resources and a scientist's time [BFJ18], [FQC+19]. MD trajectory analysis can be parallelized using task-based or MPI-based (message passing interface) approaches, each with their own advantages and disadvantages [PLK+18]. Here we investigate parallel trajectory analysis with the MDAnalysis Python library [MADWB11], [GLB+16]. MDAnalysis is a widely used package in the molecular simulation community that can read and write over 25 popular MD trajectory file formats while providing a common object-oriented interface that makes data available as numpy arrays [HMvdW+20]. MDAnalysis aims to bridge the entrenched user communities of different MD packages, allowing scientists to move between these communities more easily (and more productively). Previous work that focused on developing a task-based approach to parallel analysis found that an I/O-bound task only scaled to 12 cores due to a file I/O bottleneck [SFMLIP+19]. Our recent feasibility study suggested that parallel reading via MPI-IO and the HDF5 file format could lead to good scaling, although only a reduced-size custom HDF5 trajectory was investigated and no usable implementation of a true MD trajectory reader was provided [KPF+20].

H5MD, or "HDF5 for molecular data", is an HDF5-based file format that is used to store MD simulation data, such as particle coordinates, box dimensions, and thermodynamic observables [dBCH14]. A Python reference implementation for H5MD exists (pyh5md [dBCH14]), but the library is no longer maintained, so, with advice from the original author of pyh5md, we implemented native support for H5MD I/O in the MDAnalysis package.

HDF5 is a structured, binary file format that organizes data into two kinds of objects: groups and datasets. It implements a hierarchical, tree-like structure, where groups represent nodes of the tree and datasets represent the leaves [Col14]. An HDF5 file's datasets can be stored either contiguously on disk, or scattered across the disk in different locations as chunks. These chunks must be defined on initialization of the dataset, and for any element to be read from a chunk, the entire chunk must be read. The HDF5 library can be built on top of a message passing interface (MPI) implementation so that a file can be accessed in parallel on a parallel file system such as Lustre or BeeGFS. We implemented a parallel, MPI-IO capable, HDF5-based trajectory reader, H5MDReader, in MDAnalysis that adheres to the H5MD specifications. H5MDReader interfaces with h5py, a high-level Python package that provides a Pythonic interface to the HDF5 format [Col14]. In h5py, accessing a file in parallel is accomplished by passing a keyword argument to h5py.File, which then manages parallel disk access.

The BeeGFS and Lustre parallel file systems are well suited for multi-node MPI parallelization. One key feature of a Lustre parallel file system is file striping, the ability to store data from a file across multiple physical locations known as object storage targets (OSTs), where the "stripe count" refers to the number of OSTs across which a single file is striped. Thinking carefully about the synchronization of chunk shape and stripe settings can be crucial to establishing optimal I/O performance [How10]. We tested various algorithmic optimizations for our benchmark, including using various stripe counts (1, 48, 96), loading only the necessary coordinate information with numpy masked arrays [HMvdW+20], and front loading all I/O by reading the entire trajectory chunk into memory prior to the RMSD calculation.

We benchmarked H5MDReader's parallel reading capabilities with MDAnalysis on three HPC clusters: ASU Agave at Arizona State University, and SDSC Comet and PSC Bridges, which are part of XSEDE [TCD+14]. The benchmark consisted of a simple split-apply-combine scheme [Wic11] of an I/O-bound task that split a 90k frame (113 GiB) trajectory into N chunks for N processes, where each process performed a computation on its chunk of data, and the results were finally gathered back to the root process. For the computational task, we computed the time series of the root mean squared distance (RMSD) of the positions of the Cα (alpha carbon) atoms in the protein to their initial coordinates at the first frame of the trajectory. At each frame (time step) in the trajectory, the protein was optimally superimposed on the reference frame to remove translations and rotations. The RMSD calculation is a very common task performed to analyze the dynamics of the structure of a protein [MM14]. Because it is a fast computation that is bounded by how quickly data can be read from the file, it is a suitable task for testing the I/O capabilities of H5MDReader.

We tested the effects of HDF5 file chunking and file compression on I/O performance. In general, we found that altering the stripe count and loading only the necessary coordinates via masked arrays provided little improvement in benchmark times. Loading the entire trajectory into memory in one pass, instead of iterating through it frame by frame, showed the greatest improvement in performance. This was compounded by our results with HDF5 chunking: our baseline test file had been chunked by the auto-chunking algorithm in h5py, and when we recast the file into a contiguous form or a custom, optimized chunk layout, we saw improvements in serial I/O on the order of 10x. Additionally, applying gzip compression to the file showed no loss in performance at higher processor counts, indicating that H5MD files can be compressed without losing performance in parallel analysis tasks.

Methods

HPC environments

We tested the parallel MPI I/O capabilities of our H5MD implementation on three supercomputing environments: ASU Agave, PSC Bridges, and SDSC Comet. The Agave supercomputer offers 498 compute nodes. We utilized the Parallel Compute Nodes that offer 2 Intel Xeon E5-2680 v4 CPUs (2.40GHz, 14 cores/CPU, 28 cores/node, 128GB RAM/node) with a 1.2PB scratch BeeGFS file system that uses an Intel OmniPath interconnect. The Bridges supercomputer offers over 850 compute nodes that supply 1.3018 Pf/s and 274 TiB RAM. We utilized the Regular Shared Memory Nodes that offer 2 Intel Haswell E5-2695 v3 CPUs (2.3-3.3GHz, 14 cores/CPU, 28 cores/node, 128GB RAM/node) with a 10PB scratch Lustre parallel file system that uses an InfiniBand interconnect. The Comet supercomputer offers 2 Pf/s with 1944 standard compute nodes. We utilized the Intel Haswell Standard Compute Nodes that offer 2 Intel Xeon E5-2680 v3 CPUs (2.5GHz, 12 cores/CPU, 24 cores/node, 128GB RAM/node) with a 13PB scratch Lustre parallel file system that also uses an InfiniBand interconnect.

Our software library stacks were built with conda environments. Table 1 gives the versions of each library involved in the stack. We used GNU C compilers on Agave and Bridges and the Intel C compiler on Comet for MPI parallel jobs, as recommended by the Comet user guide. We used OpenMPI as the MPI implementation on all HPC resources, as this was generally the recommended environment and, in the past, we found it the easiest to build against [KPF+20]. The mpi4py [DPKC11] package was used to make MPI available in Python code, as required by h5py. In general, our software stacks were built in the following manner:

• module load anaconda3
• create a new conda environment
• module load the parallel HDF5 build
• module load the OpenMPI implementation
• install mpi4py with env MPICC=/path/to/mpicc pip install mpi4py
• install h5py with CC="mpicc" HDF5_MPI="ON" HDF5_DIR=/path/to/parallel-hdf5 pip install --no-binary=h5py h5py
• install development MDAnalysis as outlined in the MDAnalysis User Guide
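The steps above can be sketched as a shell session. This is an illustrative recipe only: the module names (anaconda3, phdf5, openmpi), the environment name, and all paths are site-specific placeholders that must be adapted to a given cluster's module system, and the final MDAnalysis install command is one possible way (assumed, not prescribed by this paper) to obtain a development version.

```shell
# Illustrative build recipe -- module names and paths are placeholders.
module load anaconda3
conda create --name h5md-mpi python=3.8 --yes
conda activate h5md-mpi

module load phdf5      # site's parallel HDF5 build
module load openmpi    # site's OpenMPI build

# mpi4py compiled against the loaded MPI implementation
env MPICC=$(which mpicc) pip install mpi4py

# h5py built from source against the parallel HDF5 library
CC="mpicc" HDF5_MPI="ON" HDF5_DIR=/path/to/parallel-hdf5 \
    pip install --no-binary=h5py h5py

# development MDAnalysis (one possible install route)
pip install "git+https://github.com/MDAnalysis/mdanalysis#subdirectory=package"
```

Building mpi4py and h5py from source (rather than installing prebuilt wheels) is what links them against the cluster's MPI and parallel HDF5 libraries; a binary h5py wheel would bundle a serial HDF5 without MPI support.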
System       | ASU Agave  | PSC Bridges | SDSC Comet
Python       | 3.8.5      | 3.8.5       | 3.6.9
C compiler   | gcc 4.8.5  | gcc 4.8.5   | icc 18.0.1
HDF5         | 1.10.1     | 1.10.2      | 1.10.3
OpenMPI      | 3.0.0      | 3.0.0       | 3.1.4
h5py         | 2.9.0      | 3.1.0       | 3.1.0
mpi4py       | 3.0.3      | 3.0.3       | 3.0.3
MDAnalysis   | 2.0.0-dev0 | 2.0.0-dev0  | 2.0.0-dev0

TABLE 1: Library versions installed for each HPC environment.

name            | format | file size (GiB)
H5MD-default    | H5MD   | 113
H5MD-chunked    | H5MD   | 113
H5MD-contiguous | H5MD   | 113
H5MD-gzipx1     | H5MD   | 77
H5MD-gzipx9     | H5MD   | 75
DCD             | DCD    | 113
XTC             | XTC    | 35
TRR             | TRR    | 113

TABLE 2: Data files benchmarked on all three HPCs. name is the name used to identify the file in this paper, format is the format of the file, and file size gives the size of the file in gibibytes. H5MD-default is the original data file written with pyh5md, which uses the auto-chunking algorithm in h5py. H5MD-chunked is the same file but written with chunk shape (1, n_atoms, 3), and H5MD-contiguous is the same file but written with no HDF5 chunking. H5MD-gzipx1 and H5MD-gzipx9 have the same chunk arrangement as H5MD-chunked but are written with gzip compression, where 1 is the lowest level of compression and 9 is the highest. DCD, XTC, and TRR are copies of H5MD-contiguous written with MDAnalysis.

Benchmark Data Files

The test data files used in our benchmark consist of a topology file YiiP_system.pdb with 111,815 atoms and a trajectory file YiiP_system_9ns_center100x.h5md with 90100 frames. The initial trajectory data file (H5MD-default in Table 2) was generated with pyh5md [dBCH14] from the XTC file YiiP_system_9ns_center.xtc [SFMLIP+19], [LRFK+21], using the "ChainReader" facility in MDAnalysis with the list 100 * ["YiiP_system_9ns_center.xtc"] as input. The rest of the test files were copies of H5MD-default that were written with MDAnalysis using different HDF5 chunking arrangements and compression settings. Table 2 lists all of the files benchmarked, how they are identified in this paper, and their corresponding file sizes.

Parallel Algorithm Benchmark

We implemented a simple split-apply-combine parallelization algorithm [Wic11], [SFMLIP+19], [KPF+20] that divides the number of frames in the trajectory evenly among all available processes. Each process receives a unique start and stop index and iterates through its own section of the trajectory. As the computational task, the root mean square distance (RMSD) of the protein Cα atoms after optimal structural superposition [MM14] is computed at each frame with the QCProt algorithm [The05], as described in our previous work [SFMLIP+19], [KPF+20].

In order to obtain detailed timing information we instrumented the code as shown below. Table 3 outlines the specific lines in the code that were timed in the benchmark.

line number | id            | description
11          | t_init_top    | load topology file
12          | t_init_traj   | load trajectory file
38          | t_I/O         | read data from time step into memory
39          | t_compute     | perform rmsd computation
42          | t_wait        | wait for processes to synchronize
47          | t_comm_gather | combine results back into root process

TABLE 3: All timings collected from the example benchmark code. id gives the reference name used in this paper to refer to the timing collected at the corresponding line number. description gives a short description of what that specific line of code is doing in the benchmark.

 1 import MDAnalysis as mda
 2 from MDAnalysis.analysis.rms import rmsd
 3 from mpi4py import MPI
 4 import numpy as np
 5
 6 comm = MPI.COMM_WORLD
 7 size = comm.Get_size()
 8 rank = comm.Get_rank()
 9
10 def benchmark(topology, trajectory):
11     u = mda.Universe(topology)
12     u.load_new(trajectory,
13                driver="mpio",
14                comm=comm)
15     CA = u.select_atoms("protein and name CA")
16     x_ref = CA.positions.copy()
17
18     # make_balanced_slices divides n_frames into
19     # equally sized blocks and returns start:stop
20     # indices for each block
21     slices = make_balanced_slices(n_frames,
22                                   size,
23                                   start=0,
24                                   stop=n_frames,
25                                   step=1)
26     start = slices[rank].start
27     stop = slices[rank].stop
28     bsize = stop - start
29
30     # sendcounts is used for Gatherv() to know how
31     # many elements are sent from each rank
32     sendcounts = np.array([
33         slices[i].stop - slices[i].start
34         for i in range(size)])
35
36     rmsd_array = np.empty(bsize, dtype=float)
37     for i, frame in enumerate(range(start, stop)):
38         ts = u.trajectory[frame]
39         rmsd_array[i] = rmsd(CA.positions,
40                              x_ref,
41                              superposition=True)
42     comm.Barrier()
43     rmsd_buffer = None
44     if rank == 0:
45         rmsd_buffer = np.empty(n_frames,
46                                dtype=float)
47     comm.Gatherv(sendbuf=rmsd_array,
48                  recvbuf=(rmsd_buffer, sendcounts), root=0)

The HDF5 file is opened with the mpio driver and the MPI.COMM_WORLD communicator to ensure the file is accessed in parallel via MPI I/O. The topology and trajectory initialization times must be analyzed separately because the topology file is not opened in parallel and represents a fixed cost each process must pay to open the file. MDAnalysis reads data from MD trajectory files one frame, or "snapshot", at a time. Each time u.trajectory[frame] is accessed, MDAnalysis reads the file and fills in numpy arrays [HMvdW+20] corresponding to that time step. Each MPI process runs an identical copy of the script but receives a unique start and stop variable such that the entire file is read in parallel.
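The make_balanced_slices helper in the listing above is part of our benchmark scripts; a minimal pure-Python sketch of the balanced partitioning it performs (a hypothetical reimplementation for illustration, not the exact code) looks like:

```python
def make_balanced_slices(n_frames, size, start=0, stop=None, step=1):
    """Divide frames [start:stop:step] into `size` contiguous,
    near-equal blocks, returned as a list of slice objects.
    The first (total % size) blocks receive one extra frame."""
    if stop is None:
        stop = n_frames
    total = len(range(start, stop, step))
    base, extra = divmod(total, size)
    slices, begin = [], start
    for block in range(size):
        n = base + (1 if block < extra else 0)
        end = begin + n * step
        slices.append(slice(begin, end, step))
        begin = end
    return slices
```

Each rank then indexes the returned list with its own rank number, so the blocks tile the trajectory without gaps or overlap; for the 90100-frame trajectory on 4 processes, every rank receives exactly 22525 frames.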
Gathering the results is A I/O B RMSD C Wait done collectively by MPI, which means all processes must finish 1043 10 their iteration blocks before the results can be returned. Therefore, 102 it is important to measure t wait as it represents the existence of 1010 10 "straggling" processes. If one process takes substantially longer 10 12 10 than the others to finish its iteration block, all processes are slowed 10 3 down. These 6 timings are returned and saved as an array for each 10 4 Time(s) benchmark run. D E F Initialize Topology Initialize Trajectory Communication We applied this benchmark scheme to H5MD test files on 1043 Agave, Bridges, and Comet. Each benchmark run received a 10 102 unique, freshly copied test file that was only used once so as 1010 10 to avoid any caching effects of the operating system or file 10 12 system. We also tested three algorithmic optimizations: Lus- 10 10 3 tre file striping, loading the entire trajectory into memory, and 10 4 1 28 56 112 1 28 56 112 1 28 56 112 using masked arrays in numpy to only load the Cα coor- dinates required for the RMSD calculation. For striping, we NProcesses ran the benchmark on Bridges and Comet with a file stripe Agave Bridges Comet count of 48 and 96. For the into memory optimization, we used MDAnalysis.Universe.transfer_to_memory() Fig. 1: Benchmark timings breakdown for the ASU Agave, PSC Bridges, and SDSC Comet HPC clusters. The benchmark was run to read the entire file in one go and pass all file on up to 4 full nodes on each HPC, where Nprocesses was 1, 28, 56, I/O to the HDF5 library. For the masked array optimiza- and 112 for Agave and Bridges, and 1, 24, 48, and 96 on Comet. tion, we allowed u.load_new() to take a list or ar- The H5MD-default file was used in the benchmark, where the ray of atom indices as an argument, sub, so that the trajectory was split in N chunks for each corresponding N process MDAnalysis.Universe.trajectory.ts arrays are in- benchmark. 
Points represent the mean over three repeats with the stead initialized as numpy.ma.masked_array instances and standard deviation shown as error bars. only the indices corresponding to sub are read from the file. Performance was quantified by measuring the I/O timing A B C Total Benchmark Time Scaling Efficiency returned from the benchmarks, and strong scaling was assessed 104 120 1.0 100 0.8 by calculating the speedup S(N) = t1 /tN and the efficiency E(N) = S(N)/N 103 80 S(N) = t1/tN 0.6 Time(s) E(N) = S(N)/N. 60 102 40 0.4 20 0.2 Data Sharing 101 0 0.0 All of our SLURM submission shell scripts and Python bench- 1 28 56 112 1 28 56 112 1 28 56 112 mark scripts for all three HPC environments are available in the NProcesses IO Agave Bridges Comet RMSD repository https://github.com/Becksteinlab/scipy2021-mpiH5MD- data and are archived under DOI 10.5281/zenodo.5083858. Fig. 2: Strong scaling I/O and RMSD performance of the RMSD analysis task of the H5MD-default data file on Agave, Bridges, Results and Discussion and Comet. Nprocesses ranged from 1 core, to 4 full nodes on each HPC, and the number of trajectory blocks was equal to the number Baseline Benchmarks of processes involved. Points represent the mean over three repeats We first ran benchmarks with the simplest parallelization scheme where the error bars are derived with the standard error propagation of splitting the frames of the trajectory evenly among all partici- from the standard deviation of absolute times. pating processes. The H5MD file involved in the benchmarks was written with the pyh5md library, the original Python reference implementation for the H5MD format [dBCH14]. The datasets in often highly variable in a crowded HPC environment [How10] and the data file were chunked automatically by the auto-chunking therefore we focus our analysis on the speedup and efficiency of algorithm in h5py. File I/O remains the largest contributor to each benchmark run. 
The maximum total I/O speedup observed is the total benchmark time, as shown by Figure 1 (A). Figure 1 only 15x and efficiencies at around 0.2 (Fig. 2 B, C). The RMSD (B, D-F) also show that the initialization, computation, and MPI computation scaling, on the other hand, remains high, with nearly communication times are negligible with regards to the overall ideal scaling on Bridges and Comet, with Agave trailing behind at analysis time. t wait , however, becomes increasingly relevant as the 71x speedup at 122 cores. Therefore, for a computationally bound number of processes increases (Figure 1 C), indicating a growing analysis task, our parallel H5MD implementation will likely scale variance in the iteration block time across all processes. In effect, well. t wait is measuring the occurrence of "straggling" processes, which has been previously observed to be an issue on busy, multi-user Effects of Algorithmic Optimizations on File I/O HPC environments [KPF+ 20]. We found that the total benchmark We tested three optimizations aimed at shortening file I/O time time continues to decrease as the number of processes increases for the same data file. In an attempt to optimize I/O, we tried to over 100 (from 4648 ± 319 seconds at N = 1 to 315.6 ± 59.8 to minimize "wasted I/O". For example, in any analysis task, seconds at N = 112 on Agave) (Fig. 2 A). While the absolute time not all coordinates in the trajectory may be necessary for the of each benchmark is important in terms of measuring the actual computation. In our analysis test case, the RMSD was calculated amount of time saved with our parallelization scheme, results are for only the Cα atoms of the protein backbone, therefore the 44 PROC. OF THE 20th PYTHON IN SCIENCE CONF. 
(SCIPY 2021) A I/O B RMSD C Wait A Total Benchmark Time B Scaling C Efficiency 1043 104 120 1.0 10 100 102 0.8 E(N) = S(N)/N 1010 103 80 S(N) = t1/tN 0.6 Time(s) 10 60 10 12 102 40 0.4 10 0.2 10 3 20 10 4 101 0 1 28 56 112 1 28 56 112 1 28 56 112 Time(s) D E F NProcesses IO Initialize Topology Initialize Trajectory Communication Agave Bridges Comet RMSD 1043 10 102 Fig. 4: Strong scaling performance of the RMSD analysis task with 1010 10 the masked_array optimization technique. The benchmark used the 10 12 H5MD-default data file on Agave, Bridges, and Comet. Nprocesses 10 10 3 ranged from 1 core, to 4 full nodes on each HPC, and the number 10 4 of trajectory blocks was equal to the number of processes involved. 1 28 56 112 1 28 56 112 1 28 56 112 Points represent the mean over three repeats where the error bars NProcesses are derived with the standard error propagation from the standard Agave Bridges Comet deviation of absolute times. Fig. 3: Benchmark timings breakdown for the ASU Agave, PSC Bridges, and SDSC Comet HPC clusters for the masked_array Agave’s serial I/O performance was boosted from 4623s to 891s optimization technique. The benchmark was run on up to 4 full nodes (Fig. 5 A) by loading the data into memory in one slurp rather on each HPC, where N processes was 1, 28, 56, and 112 for Agave than iterating through the trajectory frame by frame. Similarly, and Bridges, and 1, 24, 48, and 96 on Comet. The H5MD-default file was used in the benchmark, where the trajectory was split in N Comet’s serial I/O performance went from 4101s to 1740s, with chunks for each corresponding N process benchmark. Points represent multi-node performance continuing to show improvement versus the mean over three repeats with the standard deviation shown as the baseline numbers (excluding the peak at N = 48). Agave error bars. steady improvements in performance all the way to 4 full nodes, where the I/O time reached 73s (Fig. 5 A, Fig. 6 A). 
Figure 7 gives a direct comparison on Agave of the baseline benchmark coordinates of all other atoms read from the file is essentially performance with the two optimization methods outlined. With wasted I/O. To circumvent this issue, we implemented the use of respect to the baseline serial performance, loading into memory NumPy ma.masked_array [HMvdW+ 20], where the arrays of gives a 91x speedup (4658s at 1 core to 73s at 112 cores) (Figure 7, coordinate data are instead initialized as masked arrays that only A). This result was interesting in that the only difference between fill data from selected coordinate indices. We found that Bridges the two was the access pattern of the data - in one case, the file showed the best scaling with the masked array implementation, was read in small repeated bursts, while in the other the file was with a total scaling of 23x at 4 full nodes (1642 ± 115 seconds at read from start to finish with HDF5. We hypothesized that this N = 1 to 71 ± 33 seconds at N = 112 cores) as seen in Figure 4 was due to layout of the file itself on disk. (A, B). Agave showed a maximum scaling of 11x at 2 full nodes, Also, we found that the t wait does not increase as the number of while Comet showed 5x scaling at 4 full nodes (Figure 4 B). In processes increases as in all of the other benchmark cases (Figure some cases, the masked array implementation resulted in slower 5 C). In the other benchmarks, t wait was typically on the order of I/O times. For example, Agave went from 4623 seconds in the 10-200 seconds, whereas t wait on the order of 0.01 seconds for the baseline 1 core run to 5991 seconds with masked arrays. This memory benchmarks (Figure 7 C). This indicates that the cause could be due to the HDF5 library not being optimized to work of the iteration block time variance among processes stems from with masked arrays as with numpy arrays. On the other hand, for MPI rank coordination when many small read requests are made. 
Bridges and Comet, we observed an approximate 5x speedup in To investigate MPI rank competition, we increased the stripe I/O time (Fig. 4 B) for the masked array case when compared count on Bridge’s and Comet’s Lustre file system up to 96, where to the baseline benchmark. In terms of the RMSD computation found marginal I/O scaling improvements of 1.2x on up to 4 full scaling, we once again found all three systems scaled well, with nodes (not shown). While our data showed no improvement with Comet displaying ideal scaling all the way to 4 full nodes, while altering the stripe count, this may have been a byproduct the poor Agave and Bridges hovering around 85x at 112 cores. chunk layout of the original file on disk. In the next section we With an MPI implementation, processes participating in par- discuss the effects of HDF5 chunking on I/O performance. allel I/O communicate with one another. It is commonly un- derstood that repeated, small file reads performs worse than a Effects of HDF5 Chunking on File I/O large, contiguous read of data. With this in mind, we tested To test the hypothesis that the increase in serial file I/O between this concept in our benchmark by loading the entire trajectory the baseline performance in loading into memory performance was into memory prior to the RMSD task. Modern super computers caused by the layout of the file on disk, we created H5MDWriter, make this possible as they contain hundreds of GiB of memory an MDAnalysis file format writer class that gives one the ability per node. On Bridges, loading into memory strangely resulted in to write H5MD files with the MDAnalysis user interface. These slower I/O times (1466s baseline to 2196s at N = 1 and 307s files can be written with user-decided custom chunk layouts, file baseline to 523s at N = 112, Fig. 1 A and Fig. 5 A). Agave and compression settings, and can be opened with MPI parallel drivers Comet, on the other hand, showed surprisingly different results. 
that enable parallel writing. We ran some initial serial writing tests and found that writing from DCD, TRR, and XTC to H5MD typically took ~360 seconds on Agave. For the 113 GiB test file, this corresponds to a write bandwidth of 0.31 GiB/s.

Fig. 5: Benchmark timings breakdown for the ASU Agave, PSC Bridges, and SDSC Comet HPC clusters for the loading-into-memory optimization technique. The benchmark was run on up to 4 full nodes on each HPC, where N processes was 1, 28, 56, and 112 for Agave and Bridges, and 1, 24, 48, and 96 on Comet. The H5MD-default file was used in the benchmark, where the trajectory was split in N chunks for each corresponding N process benchmark. Points represent the mean over three repeats with the standard deviation shown as error bars.

Fig. 7: Benchmark timings on ASU Agave comparing the baseline benchmark with the masked array and loading into memory optimizations. Each benchmark was run on up to 4 full nodes where N processes was 1, 28, 56, and 112. The H5MD-default test file was used in all benchmarks. Points represent the mean over three repeats with the standard deviation shown as error bars.

We rewrote the H5MD-default test file and tested two cases: one in which the file is written with no chunking applied (H5MD-contiguous), and one in which we applied a custom chunk layout that matches the access pattern on the file (H5MD-chunked). Our benchmark follows a common MD trajectory analysis scheme in that it iterates through the trajectory one frame at a time. Therefore, we applied a chunk shape of (1, n_atoms, 3), which matches exactly the shape of the data read at each iteration step. An important feature of HDF5 chunking to note is that, for any element in a chunk to be read, the entire chunk must be read. When we investigated the chunk shape of the H5MD-default file that was auto-chunked by h5py's chunking algorithm, we found that each chunk contained data elements from multiple different time steps. This means that, for every time step of data read, an exorbitant amount of excess data was read and discarded at each iteration step. Before approaching the parallel tests, we tested how the chunk layout affects baseline serial I/O performance for the file.

We found that I/O performance strongly depends on the chunk layout of the file on disk. The auto-chunked H5MD-default file I/O time was 4101 s, while our custom chunk layout resulted in an I/O time of 460 s (Figure 8). So we effectively saw a 10x speedup just from optimizing the chunk layout alone, where even the file with no chunking applied showed similar improvements in performance. In our previous serial I/O tests, we found that H5MD performed worse than other file formats, so we repeated those tests with our custom chunked file, H5MD-chunked. For our test file of 111,815 atoms and 90100 frames, H5MD outperformed XTC and TRR while performing equally well to the DCD file, an encouraging result (Fig. 9).

Fig. 6: Strong scaling I/O performance of the RMSD analysis task with the loading-into-memory optimization technique. The benchmark used the H5MD-default data file on Agave, Bridges, and Comet. Nprocesses ranged from 1 core to 4 full nodes on each HPC, and the number of trajectory blocks was equal to the number of processes involved. Points represent the mean over three repeats where the error bars are derived with standard error propagation from the standard deviation of absolute times.

Next, we investigated what effect the chunk layout had on parallel I/O performance. We repeated our benchmarks on Agave (at this point, Bridges had been decommissioned and our Comet allocation had expired), but with the H5MD-chunked and H5MD-contiguous data files. For the serial one-process case, we found a similar result in that the I/O time was dramatically decreased, with an approximate 10x speedup for both the contiguous and the chunked file with respect to the baseline benchmark (Figure 10 A): they both performed substantially better for the N = 1 core case. The rest of the timings remained unaffected (Figure 10 B-F). Although the absolute total benchmark time is much improved (Figure 11 A), the scaling remains challenging, with a maximum observed speedup of 12x for the contiguous file (Figure 11 B). The N = 112 H5MD-contiguous run's I/O time was 47 s (Fig. 10 A); compared to the 4623 s baseline serial time, this is a 98x speedup. Similarly, the H5MD-chunked 4-node run resulted in an I/O time of 83 s, a 56x speedup compared to baseline serial performance. Therefore, the boost in performance seen by loading the H5MD-default trajectory into memory, rather than iterating frame by frame, is indeed most likely due to the original file's chunk layout.
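The chunk layouts compared above can be sketched with h5py. This is a minimal illustration, not the benchmark code: the file name, dataset paths, and sizes are made-up stand-ins (the actual benchmark file stores 111,815 atoms and 90100 frames under the H5MD hierarchy).

```python
import h5py
import numpy as np

# Illustrative sizes only; the benchmark file has 111,815 atoms and 90100 frames.
n_frames, n_atoms = 10, 1000
positions = np.zeros((n_frames, n_atoms, 3), dtype=np.float32)

with h5py.File("chunk_demo.h5", "w") as f:
    # chunks=True: h5py picks a chunk shape automatically; such a chunk may
    # mix coordinates from several time steps, so reading one frame forces
    # HDF5 to read (and discard) data belonging to other frames.
    d_auto = f.create_dataset("position/auto", data=positions, chunks=True)
    # No chunks argument: contiguous storage, no chunking applied.
    d_contig = f.create_dataset("position/contiguous", data=positions)
    # One chunk per frame, matching the frame-by-frame access pattern:
    # each read of frame i touches exactly one chunk.
    d_chunk = f.create_dataset("position/chunked", data=positions,
                               chunks=(1, n_atoms, 3))
    auto_chunks = d_auto.chunks
    contig_chunks = d_contig.chunks
    frame_chunks = d_chunk.chunks

print(frame_chunks)    # (1, 1000, 3): one time step per chunk
print(contig_chunks)   # None: contiguous layout
```

Reading frame i then reduces to a single-chunk access, f["position/chunked"][i], and gzip compression (discussed below) can be layered on the same layout by passing compression="gzip" and compression_opts to create_dataset.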
This emphasizes the point that one may garner substantial I/O improvements by thinking carefully not only about how an algorithm accesses the file, but also about how the file is actually stored on disk. The relationship between layout on disk and disk access pattern is crucial for optimized I/O. Furthermore, as the auto-chunked layout of the H5MD-default file scattered data from a single time step across multiple chunks, it is very likely that these chunks themselves were also scattered across stripes. In this case, multiple processes still attempt to read from the same chunk, which would nullify any beneficial effect striping has on file contention. We would have liked to further test the effects of striping with a proper chunk layout, but our XSEDE allocation expired.

Fig. 8: Serial I/O time for H5MD-default, H5MD-contiguous, and H5MD-chunked data files. Each file contained the same data (113 GiB, 90100 frames) but was written with a different HDF5 chunk arrangement, as outlined in Table 2. Each bar represents the mean of 5 repeat benchmark runs, with the standard deviation shown as error bars.

Fig. 9: Comparison of serial I/O time for various popular MD file formats. All files contain the same amount of data (90100 frames). Each bar represents the mean of 10 repeat benchmark runs, with the standard deviation shown as error bars.

Fig. 10: Benchmark timings breakdown on ASU Agave for the three chunk arrangements tested. The benchmark was run on up to 4 full nodes, where N processes was 1, 28, 56, and 112. H5MD-default was auto-chunked by h5py. H5MD-contiguous was written with no chunking applied, and H5MD-chunked was written with a chunk shape of (1, n_atoms, 3). The trajectory was split in N chunks for each corresponding N process benchmark. Points represent the mean over three repeats with the standard deviation shown as error bars.

Fig. 11: Strong scaling I/O performance of the RMSD analysis task with various chunk layouts tested on ASU Agave. Nprocesses ranged from 1 core to 4 full nodes, and the number of trajectory blocks was equal to the number of processes involved. Points represent the mean over three repeats where the error bars are derived with standard error propagation from the standard deviation of absolute times.

Effects of HDF5 GZIP Compression on File I/O

HDF5 files also offer the ability to compress the files. With our writer, users can easily apply any of the compression settings allowed by HDF5. To see how compression affected parallel I/O, we tested HDF5's gzip compression with a minimum setting of 1 and a maximum setting of 9. In the serial one-process case, we found that I/O performance is slightly hampered, with I/O times approximately 4x longer with compression applied (Figure 13 A). This is expected, as disk space is traded for the time it takes to decompress the file, as is seen with the highly compressed XTC format (Fig. 9). However, at increasing numbers of processes (N > 28), we found that this difference disappears (Figure 13 A and Figure 12 A). This shows a clear benefit of applying gzip compression to a chunked HDF5 file for parallel analysis tasks, as the compressed file is ~2/3 the size of the original. Additionally, we found speedups of up to 36x on 2 full nodes for the compressed data file benchmarks (Figure 13 B), although we recognize this number is slightly inflated by the slower serial I/O time. From these data we can safely assume that H5MD files can be compressed without fear of losing parallel I/O performance, which is a nice boon in the age of terabyte-sized trajectory files.

Fig. 12: Benchmark timings breakdown on ASU Agave for the minimum gzip compression 1 and maximum gzip compression 9. The benchmark was run on up to 4 full nodes, where N processes was 1, 28, 56, and 112. The trajectory was split in N chunks for each corresponding N process benchmark. Points represent the mean over three repeats with the standard deviation shown as error bars.

Fig. 13: Strong scaling I/O performance of the RMSD analysis task with minimum and maximum gzip compression applied. Nprocesses ranged from 1 core to 4 full nodes, and the number of trajectory blocks was equal to the number of processes involved. Points represent the mean over three repeats where the error bars are derived with standard error propagation from the standard deviation of absolute times.

Conclusions

The growing size of trajectory files demands parallelization of trajectory analysis. However, file I/O has become a bottleneck in the workflow of analyzing simulation trajectories. Our implementation of an HDF5-based file format trajectory reader in MDAnalysis can perform parallel MPI I/O, and our benchmarks on various national HPC environments show that speed-ups on the order of 20x for 48 cores are attainable. Scaling up to achieve higher parallel data ingestion rates remains challenging, so we developed several algorithmic optimizations in our analysis workflows that lead to improvements in I/O times. These optimization attempts led us to find that our original data file, which had been auto-chunked by h5py's chunking algorithm, had an incredibly inefficient chunk layout. With a custom, optimized chunk layout and gzip compression, we found maximum scaling of 36x on 2 full nodes on Agave. In terms of speedup with respect to the automatically chunked file, our properly chunked file led to I/O time speedups of 98x at 112 cores on Agave: carefully thinking not only about how a file is accessed, but also about how it is stored on disk, reduced the analysis time from 4623 to 47 seconds. To garner further improvements in parallel I/O performance, a more sophisticated I/O pattern may be required, such as two-phase MPI I/O or carefully synchronizing chunk sizes with Lustre stripes. The addition of the HDF5 reader provides a foundation for the development of parallel trajectory analysis with MPI and the MDAnalysis package.

Acknowledgments

The authors thank Dr. Pierre de Buyl for advice on the implementation of the H5MD format reading code and acknowledge Gil Speyer and Jason Yalim from the Research Computing Core Facilities at Arizona State University for support with the Agave cluster and BeeGFS. This work was supported by the National Science Foundation through a REU supplement to award ACI-1443054 and used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. The SDSC Comet computer at the San Diego Supercomputer Center and the PSC Bridges computer at the Pittsburgh Supercomputing Center were used under allocation TG-MCB130177. The authors acknowledge Research Computing at Arizona State University for providing HPC and storage resources that contributed to the research results reported within this paper.

References

[BFJ18] Oliver Beckstein, Geoffrey Fox, and Shantenu Jha. Convergence of data generation and analysis in the biomolecular simulation community. In Online Resource for Big Data and Extreme-Scale Computing Workshop, page 4, November 2018. URL: https://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/Beckstein-Fox-Jha_BDEC2_WP_0.pdf.

[CDG+21] Lorenzo Casalino, Abigail C Dommer, Zied Gaieb, Emilia P Barros, Terra Sztain, Surl-Hee Ahn, Anda Trifan, Alexander Brace, Anthony T Bogetti, Austin Clyde, Heng Ma, Hyungro Lee, Matteo Turilli, Syma Khalid, Lillian T Chong, Carlos Simmerling, David J Hardy, Julio DC Maia, James C Phillips, Thorsten Kurth, Abraham C Stern, Lei Huang, John D McCalpin, Mahidhar Tatineni, Tom Gibbs, John E Stone, Shantenu Jha, Arvind Ramanathan, and Rommie E Amaro. AI-driven multiscale simulations illuminate mechanisms of SARS-CoV-2 spike dynamics. The International Journal of High Performance Computing Applications, page 10943420211006452, April 2021. doi:10.1177/10943420211006452.

[Col14] Andrew Collette. Python and HDF5. O'Reilly Media, Inc., Sebastopol, CA, 2014.

[CR15] T. Cheatham and D. Roe. The impact of heterogeneous computing on workflows for biomolecular simulation and analysis. Computing in Science & Engineering, 17(2):30–39, 2015. doi:10.1109/MCSE.2015.7.

[dBCH14] Pierre de Buyl, Peter H. Colberg, and Felix Höfling. H5MD: A structured, efficient, and portable file format for molecular data. Computer Physics Communications, 185(6):1546–1553, 2014. doi:10.1016/j.cpc.2014.01.018.

[DDG+12] Ron O Dror, Robert M Dirks, J P Grossman, Huafeng Xu, and David E Shaw. Biomolecular simulation: a computational microscope for molecular biology. Annu Rev Biophys, 41:429–52, 2012. doi:10.1146/annurev-biophys-042910-155245.

[DPKC11] Lisandro D. Dalcín, Rodrigo R. Paz, Pablo A. Kler, and Alejandro Cosimo. Parallel distributed computing using Python. Advances in Water Resources, 34(9):1124–1139, 2011. doi:10.1016/j.advwatres.2011.04.013.

[FQC+19] Geoffrey Fox, Judy Qiu, David Crandall, Gregor von Laszewski, Oliver Beckstein, John Paden, Ioannis Paraskevakos, Shantenu Jha, Fusheng Wang, Madhav Marathe, Anil Vullikanti, and Thomas E. Cheatham III. Contributions to high-performance big data computing. In L. Grandinetti, G. R. Joubert, K. Michielsen, S. L. Mirtaheri, M. Taufer, and R. Yokota, editors, Future Trends of HPC in a Disruptive Scenario, volume 34 of Advances in Parallel Computing, pages 34–81. IOS Press, 2019. doi:10.3233/APC190005.

[GLB+16] Richard J. Gowers, Max Linke, Jonathan Barnoud, Tyler J. E. Reddy, Manuel N. Melo, Sean L. Seyler, David L Dotson, Jan Domański, Sébastien Buchoux, Ian M. Kenney, and Oliver Beckstein. MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations. In Sebastian Benthall and Scott Rostrup, editors, Proceedings of the 15th Python in Science Conference, pages 98–105, Austin, TX, 2016. SciPy. doi:10.25080/Majora-629e541a-00e.

[HBD+19] David J. Huggins, Philip C. Biggin, Marc A. Dämgen, Jonathan W. Essex, Sarah A. Harris, Richard H. Henchman, Syma Khalid, Antonija Kuzmanic, Charles A. Laughton, Julien Michel, Adrian J. Mulholland, Edina Rosta, Mark S. P. Sansom, and Marc W. van der Kamp. Biomolecular simulations: From dynamics and mechanisms to computational assays of biological activity. Wiley Interdisciplinary Reviews: Computational Molecular Science, 9(3):e1393, 2019. doi:10.1002/wcms.1393.

[HMvdW+20] Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández Del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020. doi:10.1038/s41586-020-2649-2.

[How10] Mark Howison. Tuning HDF5 for Lustre file systems. September 2010. URL: https://escholarship.org/uc/item/46r9d86r.

[KPF+20] Mahzad Khoshlessan, Ioannis Paraskevakos, Geoffrey C. Fox, Shantenu Jha, and Oliver Beckstein. Parallel performance of molecular dynamics trajectory analysis. Concurrency and Computation: Practice and Experience, 32:e5789, 2020. doi:10.1002/cpe.5789.

[LRFK+21] Maria Lopez-Redondo, Shujie Fan, Akiko Koide, Shohei Koide, Oliver Beckstein, and David L. Stokes. Zinc binding alters the conformational dynamics and drives the transport cycle of the cation diffusion facilitator YiiP. Journal of General Physiology, 153(8), July 2021. doi:10.1085/jgp.202112873.

[MADWB11] Naveen Michaud-Agrawal, Elizabeth Jane Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. J Comp Chem, 32:2319–2327, 2011. doi:10.1002/jcc.21787.

[MM14] Cameron Mura and Charles E. McAnany. An introduction to biomolecular simulations and docking. Molecular Simulation, 40(10-11):732–764, 2014. doi:10.1080/08927022.2014.935372.

[Oro14] Modesto Orozco. A theoretical view of protein dynamics. Chem. Soc. Rev., 43:5051–5066, 2014. doi:10.1039/C3CS60474H.

[PLK+18] Ioannis Paraskevakos, Andre Luckow, Mahzad Khoshlessan, George Chantzialexiou, Thomas E. Cheatham, Oliver Beckstein, Geoffrey Fox, and Shantenu Jha. Task-parallel analysis of molecular dynamics trajectories. In ICPP 2018: 47th International Conference on Parallel Processing, Eugene, OR, USA, Article No. 49. ACM, New York, NY, USA, 2018. doi:10.1145/3225058.3225128.

[SFMLIP+19] Shujie Fan, Max Linke, Ioannis Paraskevakos, Richard J. Gowers, Michael Gecht, and Oliver Beckstein. PMDA - Parallel Molecular Dynamics Analysis. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 134–142, Austin, TX, 2019. SciPy. doi:10.25080/Majora-7ddc1dd1-013.

[TCD+14] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr. XSEDE: Accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74, 2014. doi:10.1109/MCSE.2014.80.

[The05] Douglas L Theobald. Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Crystallogr A, 61(Pt 4):478–80, Jul 2005. doi:10.1107/S0108767305015266.

[Wic11] Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40(1):1–29, 2011. doi:10.18637/jss.v040.i01.