Because the number of observation frequencies is limited, the conventional frequency-division approach to parallelizing electromagnetic forward modeling and inversion does not scale: enlarging the cluster does little to increase computing speed. This work develops an efficient algorithm for magnetotelluric (MT) Occam inversion on small GPU clusters. We design an MPI-OpenMP-CUDA multi-level hybrid parallel scheme that extends the conventional coarse-grained frequency decomposition by exploiting fine-grained parallelism in linear-system solves and matrix operations. On the first level, the Message Passing Interface (MPI) distributes coarse-grained tasks across compute nodes by message passing. On the second level, OpenMP processes medium- and small-grained tasks within each node through shared memory. On the bottom level, the Compute Unified Device Architecture (CUDA) performs the kernel computations on each node's GPUs. We present the theoretical background and parallelism analysis, design the parallel workflow, and discuss the applicability of the scheme. Tests on several synthetic models verified the correctness of the code, assessed its numerical precision, and compared speedup performance. The results show that the algorithm is sound and efficient: with only four nodes, inversion of the larger-scale models (type 2 in this article) achieved an average speedup of 16× and a maximum speedup of up to 23×.
Jointly funded by the National Natural Science Foundation of China (Grant Nos. 41264005 and 41374079).