Efficient and transparent CPU-GPU data communication through partial page migration

  • Authors: ZHANG Shi-qing; YANG Yao-hua; SHEN Li; WANG Zhi-ying (School of Computer, National University of Defense Technology)
  • Keywords: heterogeneous "CPU + GPU" system; data communication; page migration
  • Journal: Computer Engineering & Science (计算机工程与科学), journal code JSJK
  • Publication date: 2019-07-15
  • Year: 2019
  • Volume/Issue: Vol. 41, No. 295, Issue 07
  • Language: Chinese
  • Article ID: JSJK201907004
  • Pages: 28-35 (8 pages)
  • CN: 43-1258/TP
Abstract

Despite increasing investment in integrated GPUs and next-generation interconnects, discrete GPUs connected by PCI Express still dominate the market, and the management of data communication between CPUs and GPUs continues to evolve. Initially, programmers explicitly controlled data transfers between CPUs and GPUs. To simplify programming, GPU vendors developed a programming model that provides a single virtual address space for "CPU + GPU" heterogeneous systems. The page migration engine in this model automatically transfers pages between CPUs and GPUs on demand. To meet the needs of high-performance workloads, page sizes tend to grow. Limited by low-bandwidth, high-latency interconnects, migrating a larger page takes longer, which can reduce the overlap of computation and transfer and cause severe performance degradation. We propose a partial page migration mechanism that transfers only the requested part of a page, shortening migration latency and avoiding the performance degradation that whole-page migration suffers as pages grow. Experiments show that with a 2 MB page size and 16 GB/s PCI Express bandwidth, partial page migration can largely hide the performance overhead of whole-page migration: compared with programmer-controlled data transfer, whole-page migration causes an average slowdown of 98.62%, while partial page migration achieves an average speedup of 1.29x. Additionally, we examine the impact of page size on TLB miss rate and of migration unit size on execution time, enabling designers to make informed decisions based on this information.
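To see why partial migration helps, a back-of-envelope latency model is useful. The sketch below is not from the paper: it assumes a purely bandwidth-bound transfer with no protocol overhead, uses the paper's 2 MB page size and 16 GB/s PCI Express bandwidth, and picks a hypothetical 4 KB migration unit for comparison.

```python
# Bandwidth-bound transfer-latency model (illustrative only; ignores
# DMA setup cost, protocol overhead, and fault-handling latency).

PCIE_BW = 16e9       # PCI Express bandwidth in bytes/s (16 GB/s, as in the paper)
PAGE = 2 * 1024**2   # 2 MB page, as in the paper's experiments
UNIT = 4 * 1024      # hypothetical 4 KB partial-migration unit

def migration_latency_us(bytes_moved, bw=PCIE_BW):
    """Pure transfer time in microseconds for a single migration."""
    return bytes_moved / bw * 1e6

whole = migration_latency_us(PAGE)    # ~131.1 us for the full 2 MB page
partial = migration_latency_us(UNIT)  # ~0.26 us for one 4 KB unit
print(f"whole 2 MB page: {whole:.1f} us")
print(f"4 KB unit:       {partial:.2f} us")
print(f"ratio:           {whole / partial:.0f}x")
```

Under these assumptions, a faulting access waits roughly 131 µs for a whole 2 MB page but under a microsecond for a 4 KB unit, which is why smaller migrations are far easier to overlap with GPU computation.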
