Research on Analysis and Optimization of Data Access for Memory Hierarchy (层次存储的访问分析与优化方法研究)

English title: Research on Analysis and Optimization of Data Access for Memory Hierarchy
Subtitle: Reuse, Similarity and Affinity
Author: 吴俊杰 (Wu Junjie)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: memory wall ; memory hierarchy ; data access ; cache ; memory ; reuse ; similarity ; affinity
Degree year: 2009
Advisor: 杨学军 (Yang Xuejun)
Discipline code: 081201
Degree-granting institution: 国防科学技术大学 (National University of Defense Technology)
Thesis submission date: 2009-09-01

Abstract

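The speed gap between processors and memory has long been a performance bottleneck of computer systems; this is the well-known "memory wall" problem. To address it, almost all computers adopt a hierarchical memory system. Research on the memory hierarchy has therefore become one of the key technologies for fully exploiting the performance of computer systems.
     Access to the memory hierarchy, that is, data access, is the "bridge" connecting the processor and the memory in the memory wall problem, so the study of data access characteristics is the foundation for solving that problem. We identify six important properties of data access: dependency, reuse, similarity, affinity, coherence/consistency, and liveness.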
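     ●Dependency describes the relative ordering among accesses to a data unit when at least one of them is a write; it constrains the correctness of program execution.
     ●Reuse describes the relationship among multiple accesses to the same data unit or to a set of adjacent data units; it is the precondition for data accesses to exhibit locality in the memory hierarchy.
     ●Similarity describes the relationship among the contents of corresponding data units in multiple execution instances of a program; it is used to reduce the amount of memory those instances occupy.
     ●Affinity describes the relationship among the frequencies with which multiple processors access a data unit; it determines how data distribution affects processor access performance.
     ●Coherence/consistency describes the relationship among the contents observed when accessing one or more copies of a datum; it affects the correctness of program execution.
     ●Liveness describes the relationship among the degrees of activity of multiple data accesses; it is an important constraint in solving resource allocation problems.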
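     These six characteristics are interrelated and complementary, and each reflects a different facet of data access. We divide them into two classes. Dependency and coherence/consistency mainly affect the correctness of program execution and strongly influence program transformation and architecture design. Reuse, similarity, affinity, and liveness mainly affect performance optimization techniques and are the basis on which software and hardware optimize performance. Among the four performance-related properties, reuse and similarity describe data access relationships that are resource-compatible: they describe the repeated use of a resource from the viewpoint of address and of value, respectively. Liveness and affinity describe relationships that are resource-exclusive: they describe the mutually exclusive use of a resource from the viewpoint of time and of space, respectively. Dividing the properties first by correctness versus performance, and then by resource compatibility versus resource exclusiveness, yields three groups, and the two properties within each group are orthogonal along two axes: time versus space, and address versus value. Dependency, reuse, and liveness all describe data access along the time dimension of program execution; coherence/consistency, similarity, and affinity describe it along spatial dimensions, namely multiple data copies, multiple data units, and multiple data locations, respectively. Likewise, dependency, reuse, and affinity characterize data access in terms of data units and are based on the address of the data; coherence/consistency, similarity, and liveness characterize it more in terms of data content and are based on the value of the data.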
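     This thesis focuses on the analysis and optimization of three of these properties: reuse, similarity, and affinity. Its main contributions are as follows:
     1. A parallel data reuse model is proposed. The model systematically analyzes the reuse of data accesses in parallel programs such as those written with OpenMP and OpenTM, and gives a classification of data reuse in parallel programs together with methods for computing it. It extends Wolf's serial data reuse model to the parallel domain and offers important guidance for research on the analysis and compiler optimization of parallel programs for shared-memory architectures.
     2. A data-object oriented cache technique is proposed. Through cooperative software and hardware management, the technique manages the data objects of a program in separate cache segments and gives each data object a cache policy suited to its reuse requirements, covering segment capacity, associativity, block size, and coherence protocol. Experimental results show that it adapts better to diverse reuse behavior and effectively improves cache utilization.
     3. An analysis method for data access similarity is proposed. The method classifies difference propagation in programs, studies the behavior of each type of difference propagation, builds a difference propagation model, and uses that model to analyze and compute the similar data in a program. It studies similarity systematically and quantitatively, laying an important foundation for similarity-oriented compiler optimization techniques.
     4. A similar-page technique for shared memory structures is proposed. Through cooperative management by the compiler and the operating system, the technique merges similar data of similar processes into the same physical page. Experimental results show that it effectively reduces the amount of data held in shared storage structures such as shared cache and main memory, improves system performance, and improves parallel scalability.
     5. An analysis method for data access affinity is proposed. The method studies affinity quantitatively from two angles, vertical affinity degree and horizontal affinity degree. It gives methods for computing both, gives the quantitative relationship between them, and uses affinity to reveal the key problem in research on distributed cache structures. It measures affinity quantitatively and offers important guidance for research on data distribution, task partitioning, and scheduling in distributed storage structures.
     6. Several data distribution optimization techniques for dynamic distributed caches are proposed. For the dynamic distributed cache structure, the thesis proposes smart multi-hop promotion, arbitrary-stride hardware prepromotion, software prepromotion, and a bank coherence technique that targets contention for shared data. Experimental results show that these techniques effectively optimize where data is stored and improve the access performance of dynamic distributed caches.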
The speed gap between processors and memories has always been one of the performance bottlenecks of computer systems; it is well known as the "memory wall" problem. To solve the memory wall problem, memory hierarchies are used in almost every computer. Therefore, the study of the memory hierarchy has become one of the key techniques for improving the performance of computer systems.
     Memory hierarchy access, or data access, is the "bridge" between processors and memories in the memory wall problem. Thus, the study of the characteristics of data access is the basis for solving the memory wall problem. We summarize six kinds of data access characteristics: dependency, reuse, similarity, affinity, coherence/consistency and liveness.
     ●Dependency describes the order of data accesses among which there is at least one write operation. It constrains the correctness of programs.
     ●Reuse describes two or more accesses to a data element or to a set of data elements at adjacent addresses. It is the precondition of locality in the memory hierarchy.
     ●Similarity describes the relationship among the values of corresponding data elements in multiple execution entities of one program. It is used to reduce the memory occupied by multiple execution entities.
     ●Affinity describes the relationship among the frequencies with which multiple processors access a data element. It determines how data distribution affects the access performance of processors.
     ●Coherence/consistency describes the relationship among the contents accessed through multiple copies of one datum. It affects the correctness of programs.
     ●Liveness describes the relationship among the lifespans of data accesses. It is used to solve some resource allocation problems.
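The reuse and affinity properties listed above can be made concrete on a toy access trace. The sketch below is only an illustration under stated assumptions: `reuse_distances` computes the common "number of distinct other addresses since the last access" metric, and `access_counts` tallies how often each processor touches each address. Both function names and the trace format are hypothetical, and neither is claimed to be the thesis's own reuse model or its affinity-degree definitions.

```python
from collections import Counter, defaultdict


def reuse_distances(trace):
    """For each access, count the distinct other addresses touched since the
    previous access to the same address (None for a first access)."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = set(trace[last_seen[addr] + 1:i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


def access_counts(trace):
    """Count accesses per address, split by processor id.

    trace is a list of (processor_id, address) pairs (an assumed format).
    """
    counts = defaultdict(Counter)
    for cpu, addr in trace:
        counts[addr][cpu] += 1
    return counts


if __name__ == "__main__":
    # Reuse: "a" and "b" are each reused after two other addresses,
    # then "b" is reused immediately.
    print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
    # prints [None, None, None, 2, 2, 0]

    # Affinity ingredients: "x" is touched mostly by processor 0.
    shared = access_counts([(0, "x"), (0, "x"), (1, "x"), (1, "y")])
    print({addr: dict(per_cpu) for addr, per_cpu in shared.items()})
```

A short reuse distance suggests the data would still be resident in a small cache, while skewed per-processor counts suggest placing the data near the processor that uses it most.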
     These characteristics are not independent of each other but are in fact correlated, and they reflect different sides of data access. We divide them into two classes. One class, including dependency and coherence/consistency, mainly affects the correctness of program execution and is often used to guide program transformations and architecture designs; the other class, including reuse, similarity, affinity and liveness, mainly affects performance and is the basis for software/hardware performance optimization. Besides, reuse and similarity show a kind of resource compatibility from the address and value sides respectively. Liveness and affinity express a kind of resource exclusiveness from the temporal and spatial sides respectively. Therefore, the data access characteristics are divided into three groups according to correctness versus performance, and resource compatibility versus resource exclusiveness. The two characteristics in each group are orthogonal in two aspects: the temporal and spatial dimensions of execution, and the address and value of the accessed data. Dependency, reuse and liveness all describe data access from the temporal dimension of program execution; coherence/consistency, similarity and affinity describe data access from multiple data copies, multiple data entities and multiple data positions respectively. Meanwhile, dependency, reuse and affinity are defined in terms of data addresses; coherence/consistency, similarity and liveness are defined in terms of data values.
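The grouping in this paragraph can be written out as a small table and checked mechanically. The sketch below records, for each property, the dimension (temporal or spatial) and the basis (address or value) stated in the text, then confirms that the two members of each group differ on both axes. Reading "orthogonal" as "differs on both axes" is an assumption made for this illustration, as are the names `PROPERTIES`, `GROUPS`, and `differs_on_both_axes`.

```python
# (dimension, basis) for each property, as stated in the paragraph above.
PROPERTIES = {
    "dependency": ("temporal", "address"),
    "reuse": ("temporal", "address"),
    "liveness": ("temporal", "value"),
    "coherence/consistency": ("spatial", "value"),
    "similarity": ("spatial", "value"),
    "affinity": ("spatial", "address"),
}

# The three groups named in the text, each holding two properties.
GROUPS = {
    "correctness": ("dependency", "coherence/consistency"),
    "resource compatibility": ("reuse", "similarity"),
    "resource exclusiveness": ("liveness", "affinity"),
}


def differs_on_both_axes(a, b):
    """True when properties a and b differ in dimension and in basis."""
    (dim_a, basis_a), (dim_b, basis_b) = PROPERTIES[a], PROPERTIES[b]
    return dim_a != dim_b and basis_a != basis_b


if __name__ == "__main__":
    for group, (a, b) in GROUPS.items():
        print(f"{group}: {a} vs {b} -> {differs_on_both_axes(a, b)}")
```

Every group passes the check, which matches the claim that each pair covers both the time/space axis and the address/value axis.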
     This thesis mainly studies three of them: reuse, similarity and affinity. The main innovations in this thesis can be summarized as follows:
     1. A parallel data reuse model is proposed. This model analyzes the reuse in parallel programs implemented in OpenMP and OpenTM, classifies the reuse in parallel programs, and gives the measurement of each class of data reuse. It extends the serial data reuse model proposed by Wolf to parallel computing, and provides important guidance for shared-memory parallel program analysis and compiler optimization.
     2. Data-Object Oriented Cache (DOOC) is put forward. Under the co-management of software and hardware, DOOC allocates data objects in programs to different segments of the cache and chooses suitable strategies for them. The segment strategies can vary in many aspects, including segment capacity, associativity, block size and coherence protocol. The experimental results show that DOOC matches the diversity of reuse in programs better than a traditional cache, and therefore improves cache efficiency.
     3. An analysis of similarity is given. This thesis classifies difference spreading, studies the behavior of each kind of difference spreading, sets up a difference spreading model, and solves for the similar data set in programs according to that model. The analysis studies similarity systematically and quantitatively, and is the basis for related compiler optimization techniques.
     4. A similar page technique for shared memory architecture is designed. It is an optimization technique co-managed by compilers and operating systems. The experimental results show that the similar page technique can reduce the amount of data in shared memory architecture, including shared cache and shared memory, by merging similar data in similar processes into one physical page. It can therefore improve system performance as well as parallel scalability.
     5. An analysis of affinity is given. This thesis defines the vertical affinity degree and the horizontal affinity degree and studies how to measure them. Besides, this thesis explains the key problem in the research of non-uniform cache architecture in terms of affinity. The approach proposed in this thesis measures affinity quantitatively, and provides important guidance for many optimization techniques in distributed storage architecture, such as data layout, task partitioning and task scheduling.
     6. Several optimization techniques for data distribution on dynamic non-uniform cache architecture (NUCA) are put forward. This thesis proposes smart multi-hop promotion, arbitrary stride hardware prepromotion and software prepromotion for dynamic NUCA on single-core platforms, and a bank coherence technique on multi-core platforms. The experimental results show that these techniques optimize the data position in dynamic NUCA and improve system performance.
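Item 4 describes merging similar data of similar processes into one physical page. The effect on memory footprint can be sketched as follows. This is a minimal illustration that assumes byte-identical page contents as the merge criterion; the thesis instead relies on cooperation between the compiler and the operating system, so content matching here is a stand-in, not the author's mechanism. The function `merge_similar_pages` and its input format are hypothetical, and write handling (for example copy-on-write) is not modeled.

```python
def merge_similar_pages(process_pages):
    """Assign physical frames so that identical pages share one frame.

    process_pages maps a process id to a list of page contents (bytes),
    indexed by virtual page number. Returns (page_table, frame_count),
    where page_table maps (pid, vpn) to a frame id.
    """
    frame_of_content = {}  # page contents -> frame id
    page_table = {}        # (pid, vpn) -> frame id
    for pid, pages in process_pages.items():
        for vpn, content in enumerate(pages):
            # Reuse the frame if these contents were seen before;
            # otherwise allocate the next unused frame id.
            frame = frame_of_content.setdefault(content, len(frame_of_content))
            page_table[(pid, vpn)] = frame
    return page_table, len(frame_of_content)


if __name__ == "__main__":
    # Two processes of the same program: two pages are identical,
    # one page holds per-process data.
    pages = {
        0: [b"code", b"table", b"rank0"],
        1: [b"code", b"table", b"rank1"],
    }
    table, frames = merge_similar_pages(pages)
    virtual = sum(len(p) for p in pages.values())
    print(f"{virtual} virtual pages backed by {frames} physical frames")
    # prints "6 virtual pages backed by 4 physical frames"
```

In this toy case the shared frames cut the footprint from six pages to four, which is the kind of reduction in shared cache and main memory occupancy that the abstract reports.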

引文

[1]John L. Hennessy, David A. Patterson. Computer architecture (4th ed.):a quan-titative approach. San Francisco, CA, USA:Morgan Kaufmann Publishers Inc., 2007.
    [2]Wikipedia. Random-access memory. http://en.wikipedia.org/wiki/Random-access_memory#Memory_wall,2009.9.14.
    [3]Katherine Yelick. Ten ways to waste a parallel computer. ISCA'09:Proceedings of the 36th annual international symposium on Computer architecture. New York, NY, USA:ACM,2009:1.
    [4]S. Przybylski. Cache and Memory Hierarchy Design:A Performance Directed Ap-proach. San Francisco, CA, USA:Morgan Kaufmann Publishers,1990.
    [5]Ruud van der Pas. Memory hierarchy in cache-based systems. Technical Report UCRL-53745, Sun Microsystems,2002.
    [6]Guohua Jin, Zhiyuan Li, Fujie Chen. An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing. IEEE Transactions on Computers.1998, 47(5):527-543.
    [7]Jesse Zhixi Fang, Mi Lu. An Iteration Partition Approach for Cache or Local Memory Thrashing on Parallel Processing. Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing. Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Comput-ing, Springer-Verlag,1992:313-327.
    [8]Bradford M. Beckmann. Managing wire delay in chip multiprocessor caches. Ph.D. thesis, Madison, WI, USA,2006. Adviser-Wood, David A..
    [9]Randy Allen, Ken Kennedy. Optimizing compilers for modern architectures. San Francisco, CA, USA:Morgan Kaufmann Publishers,2002.
    [10]David L. Kuck. Structure of Computers and Computations. New York, NY, USA: John Wiley & Sons Inc.,1978.
    [11]Leslie Lamport. The parallel execution of DO loops. Commun ACM.1974,17(2): 83～93.
    [12]D. J. Kuck, Y. Muraoka, Shyh-Ching Chen. On the Number of Operations Simulta-neously Executable in Fortran-Like Programs and Their Resulting Speedup. IEEE Trans Comput.1972,21(12):1293～1310.
    [13]Yoichi Muraoka. Parallelism exposure and exploitation in programs. Ph.D. thesis, Champaign, IL, USA,1971.
    [14]Randy Allen, Ken Kennedy. Automatic translation of FORTRAN programs to vector form. ACM Trans Program Lang Syst.1987,9(4):491～542.
    [15]John Randal Allen. Dependence analysis for subscripted variables and its application to program transformations. Ph.D. thesis, Houston, TX, USA,1983.
    [16]Randy Allen, Ken Kennedy. Automatic translation of FORTRAN programs to vector form. ACM Trans Program Lang Syst.1987,9(4):491～542.
    [17]Michael Joseph Wolfe. High Performance Compilers for Parallel Computing. Boston, MA, USA:Addison-Wesley Longman Publishing Co. Inc.,1995.
    [18]Gina Goff, Ken Kennedy, Chau-Wen Tseng. Practical dependence testing. SIG-PLAN Not.1991,26(6):15～29.
    [19]Utpal Banerjee. Speedup of ordinary programs. Ph.D. thesis, Champaign, IL, USA, 1979.
    [20]Utpal K. Banerjee. Dependence Analysis for Supercomputing. Norwell, MA, USA: Kluwer Academic Publishers,1988.
    [21]Michael G. Burke, Ron K. Cytron. Interprocedural dependence analysis and paral-lelization. SIGPLAN Not.2004,39(4):139～154.
    [22]M. Wolfe, C.W. Tseng. The Power Test for Data Dependence. IEEE Transactions on Parallel and Distributed Systems.1992,3(5):591～601.
    [23]David Klappholz, Kleanthis Psarris, Xiangyun Kong. On the perfect accuracy of an approximate subscript analysis test. SIGARCH Comput Archit News.1990,18(3b): 201～212.
    [24]T. Gross, P. Steenkiste. Structured dataflow analysis for arrays and its use in an optimizing complier. Softw Pract Exper.1990,20(2):133～155.
    [25]A. Lichnewsky, F. Thomasset. Introducing symbolic problem solving techniques in the dependence testing phases of a vectorizer. ICS'88:Proceedings of the 2nd international conference on Supercomputing. New York, NY, USA:ACM,1988: 396～406.
    [26]Mohammad R. Haghighat, Constantine D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Trans Program Lang Syst.1996,18(4):477～518.
    [27]Lee-Chung Lu, Marina C. Chen. Subdomain dependence test for massive parallelism. Supercomputing'90:Proceedings of the 1990 conference on Supercomputing. Los Alamitos, CA, USA:IEEE Computer Society Press,1990:962～972.
    [28]M. Gikar, C. Polychronopoulos. Compiling issues for supercomputers. Supercom- puting'88:Proceedings of the 1988 ACM/IEEE conference on Supercomputing. Los Alamitos, CA, USA:IEEE Computer Society Press,1988:164～173.
    [29]Michael Joseph Wolfe. Optimizing Supercompilers for Supercomputers. Cambridge, MA, USA:MIT Press,1990.
    [30]D. Callahan. A global approach to detection of parallelism. Ph.D. thesis, Houston, TX, USA,1987.
    [31]r. Allen, D. Callahan, K. Kennedy. Automatic decomposition of scientific programs for parallel execution. POPL'87:Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages. New York, NY, USA:ACM, 1987:63～76.
    [32]Randy Allen, Ken Kennedy. Automatic loop interchange. SIGPLAN Not.2004, 39(4):75～90.
    [33]Michael Wolfe. Loop skewing:the wavefront method revisited. Int J Parallel Pro-gram.1986,15(4):279～293.
    [34]D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, M. Wolfe. Dependence graphs and compiler optimizations. POPL'81:Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. ACM, 1981:207～218.
    [35]Walid Abdul-Karim Abu-Sufah. Improving the performance of virtual memory computers. Ph.D. thesis, Champaign, IL, USA,1979.
    [36]Joe Warren. A hierarchical basis for reordering transformations. POPL'84:Pro-ceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of pro-gramming languages. New York, NY, USA:ACM,1984:272～282.
    [37]Michael E. Wolf, Monica S. Lam. A data locality optimizing algorithm. PLDI '91:Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation. New York, NY, USA:ACM,1991:30～44.
    [38]Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng. Improving data locality with loop transformations. ACM Trans Program Lang Syst.1996,18(4):424～453.
    [39]Mahmut Kandemir, Prithviraj Banerjee, Alok Choudhary, J. Ramanujam, Eduard Ayguade. Static and Dynamic Locality Optimizations Using Integer Linear Pro-gramming. IEEE Trans Parallel Distrib Syst.2001,12(9):922～941.
    [40]Vincent Loechner, Benoit Meister, Philippe Clauss. Precise Data Locality Opti-mization of Nested Loops. J Supercomput.2002,21(1):37～76.
    [41]Uday Bondhugula, Albert Hartono, J. Ramanujam, P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not.2008, 43(6):101～113.
    [42]Stephanie Coleman, Kathryn S. McKinley. Tile size selection using cache organiza-tion and data layout. PLDI'95:Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation. New York, NY, USA:ACM, 1995:279～290.
    [43]MichalCierniak, Wei Li. Unifying data and control transformations for distributed shared-memory machines. PLDI'95:Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation. New York, NY, USA:ACM,1995:205～217.
    [44]Jennifer M. Anderson, Saman P. Amarasinghe, Monica S. Lam. Data and compu-tation transformations for multiprocessors. SIGPLAN Not.1995,30(8):166～178.
    [45]Scott McFarling. Cache replacement with dynamic exclusion. ISCA'92:Proceedings of the 19th annual international symposium on Computer architecture. New York, NY, USA:ACM,1992:191～200.
    [46]Tien-Fu Chen. An effective programmable prefetch engine for on-chip caches. MI-CRO 28:Proceedings of the 28th annual international symposium on Microarchi-tecture. Los Alamitos, CA, USA:IEEE Computer Society Press,1995:237～242.
    [47]Peter Petrov, Alex Orailglu. Towards effective embedded processors in codesigns: customizable partitioned caches. CODES'01:Proceedings of the ninth interna-tional symposium on Hardware/software codesign. New York, NY, USA:ACM, 2001:79～84.
    [48]Peter Petrov, Alex Orailoglu. Data cache energy minimizations through program-mable tag size matching to the applications. ISSS'01:Proceedings of the 14th international symposium on Systems synthesis. New York, NY, USA:ACM,2001: 113～117.
    [49]Rezaul Alam Chowdhury, Vijaya Ramachandran. The cache-oblivious gaussian elim-ination paradigm:theoretical framework, parallelization and experimental evalua-tion. SPAA'07:Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures. New York, NY, USA:ACM,2007:71-80.
    [50]Sriram Sellappa, Siddhartha Chatterjee. Cache-Efficient Multigrid Algorithms. Int J High Perform Comput Appl.2004,18(1):115-133.
    [51]Gayathri Venkataraman, Sartaj Sahni, Srabani Mukhopadhyaya. A blocked all-pairs shortest-paths algorithm. J Exp Algorithmics.2003,8:2.2.
    [52]The ACM Digital Library. A data locality optimizing algorithm. http://portal.acm.org,2009.9.17.
    [53]Wei Li. Compiling for numa parallel machines. Ph.D. thesis, Ithaca, NY, USA, 1993.
    [54]Javed Absar, Francky Catthoor. Reuse analysis of indirectly indexed arrays. ACM Trans Des Autom Electron Syst.2006,11(2):282-305.
    [55]Claudia Leopold. Exploiting non-uniform reuse for cache optimization. SAC'01: Proceedings of the 2001 ACM symposium on Applied computing. New York, NY, USA:ACM,2001:560-564.
    [56]Mahmut Kandemir, Alok Choudhary, Prith Banerjee, J. Ramanujam. On Reducing False Sharing While Improving Locality on Shared Memory Multiprocessors. In-ternational Conference on Parallel Architectures and Compilation Techniques. Los Alamitos, CA, USA:IEEE Computer Society,1999:203.
    [57]Mahmut Kandemir, Alok Choudhary, J. Ramanujam, Prith Banerjee. Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework. IEEE Transactions on Parallel and Distributed Systems.2003,14(4):337～354.
    [58]Uresh Vahalia. UNIX系统内幕(英文版).北京,中国：人民邮电出版社,2003.
    [59]Jonathan M. Smith, Gerald Q. Maguire. Effects of copy-on-write memory manage-ment on the response time of UNIX fork operations. Computing Systems.1988, 1(3):255～278.
    [60]Daniel G. Bobrow, Jerry D. Burchfiel, Daniel L. Murphy, Raymond S. Tomlinson. TENEX, a paged time sharing system for the PDP-10. Commun ACM.1972, 15(3):135～143.
    [61]E. Zayas. Attacking the process migration bottleneck. SIGOPS Oper Syst Rev. 1987,21(5):13～24.
    [62]Carl A. Waldspurger. Memory resource management in VMware ESX server. SIGOPS Oper Syst Rev.2002,36(SI):181～194.
    [63]Susmit Biswas, Diana Franklin, Alan Savage, Ryan Dixon, Timothy Sherwood, Fred-eric T. Chong. Multi-execution:multicore caching for data-similar executions. ISCA '09:Proceedings of the 36th annual international symposium on Computer archi-tecture. New York, NY, USA:ACM,2009:164～173.
    [64]Khalid Omar Thabit. Cache management by the compiler. Ph.D. thesis, Houston, TX, USA,1982.
    [65]Yutao Zhong, Xipeng Shen, Chen Ding. A Hierarchical Model of Reference Affinity.
    LCPC'03:Languages and Compilers for Parallel Computing,16th International Workshop.2003:48～63.
    [66]Yutao Zhong, Maksim Orlovich, Xipeng Shen, Chen Ding. Array regrouping and structure splitting using whole-program reference affinity. SIGPLAN Not.2004, 39(6):255～266.
    [67]Xipen Shen, Yaoqing Gao, Chen Ding, Roch Archambault. Lightweight reference affinity analysis. ICS'05:Proceedings of the 19th annual international conference on Supercomputing. New York, NY, USA:ACM,2005:131～140.
    [68]Trishul M. Chilimbi. Efficient representations and abstractions for quantifying and exploiting data reference locality. SIGPLAN Not.2001,36(5):191～202.
    [69]Trishul M. Chilimbi, Ran Shaham. Cache-conscious coallocation of hot data streams. SIGPLAN Not.2006,41(6):252～262.
    [70]Xianfeng Li, Hemendra Singh Negi, Tulika Mitra, Abhik Roychoudhury. Design space exploration of caches using compressed traces. ICS'04:Proceedings of the 18th annual international conference on Supercomputing. New York, NY, USA: ACM,2004:116～125.
    [71]Tushar Mohan, Bronis R. de Supinski, Sally A. McKee, Frank Mueller, Andy Yoo, Martin Schulz. Identifying and Exploiting Spatial Regularity in Data Memory Refer-ences. SC'03:Proceedings of the 2003 ACM/IEEE conference on Supercomputing. Washington, DC, USA:IEEE Computer Society,2003:49.
    [72]Trishul M. Chilimbi, Bob Davidson, James R. Larus. Cache-conscious structure definition. PLDI'99:Proceedings of the ACM SIGPLAN 1999 conference on Pro-gramming language design and implementation. New York, NY, USA:ACM,1999: 13～24.
    [73]Chen Ding, Ken Kennedy. Inter-array Data Regrouping. LCPC'99:Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Com-puting. London, UK:Springer-Verlag,2000:149～163.
    [74]Matthew L. Seidl, Benjamin G. Zorn. Segregating heap objects by reference behavior and lifetime. SIGOPS Oper Syst Rev.1998,32(5):12～23.
    [75]Rohit Chandra, Scott Devine, Ben Verghese, Anoop Gupta, Mendel Rosenblum. Scheduling and page migration for multiprocessor compute servers. ASPLOS-VI: Proceedings of the sixth international conference on Architectural support for pro-gramming languages and operating systems. New York, NY, USA:ACM,1994: 12～24.
    [76]Anoop Gupta, Andrew Tucker, Shigeru Urushibara. The impact of operating system scheduling policies and synchronization methods of performance of parallel applica-tions. SIGMETRICS Perform Eval Rev.1991,19(1):120-132.
    [77]Mark S. Squillante, Randolph D. Nelson. Analysis of task migration in shared-memory multiprocessor scheduling. SIGMETRICS Perform Eval Rev.1991,19(1): 143～155.
    [78]M. S. Squiillante, E. D. Lazowska. Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling. IEEE Trans Parallel Distrib Syst.1993, 4(2):131～143.
    [79]Rohit Chandra, Ding-Kai Chen, Robert Cox, Dror E. Maydan, Nenad Nedeljkovic, Jennifer M. Anderson. Data distribution support on distributed shared memory multiprocessors. PLDI'97:Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation. New York, NY, USA:ACM, 1997:334～345.
    [80]Silicon Graphics. MIPSpro Fortran-77 Programmer's Guide, Document number 007-2361-004,1996.
    [81]James R. Goodman. Using cache memory to reduce processor-memory traffic. ISCA '83:Proceedings of the 10th annual international symposium on Computer archi-tecture. New York, NY, USA:ACM,1983:124～131.
    [82]F. Baskett, T. Jermoluk, D. Solomon. The 4D-MP graphics superworkstation: computing+graphics=40 MIPS+MFLOPS and 100000 lighted polygons per second. Compcon Spring'88. Thirty-Third IEEE Computer Society International Confer-ence, Digest of Papers.1988:468～471.
    [83]A. W. Wilson, Jr. Hierarchical cache/bus architecture for shared memory multi-processors. ISCA'87:Proceedings of the 14th annual international symposium on Computer architecture. New York, NY, USA:ACM,1987:244～252.
    [84]Tom Lovett, Shreekant S. Thakkar. The Symmetry Multiprocessor System. ICPP 88: Proceedings of the International Conference on Parallel Processing.1988:303～310.
    [85]James Archibald, Jean-Loup Baer. Cache coherence protocols:evaluation using a multiprocessor simulation model. ACM Trans Comput Syst.1986,4(4):273～298.
    [86]C. K. Tang. Cache system design in the tightly coupled multiprocessor system. AFIPS'76:Proceedings of the June 7-10,1976, national computer conference and exposition. New York, NY, USA:ACM,1976:749～753.
    [87]Lucien M. Censier, Paul Feautrier. A new solution to coherence problems in multi- cache systems.2000:576-582.
    [88]Anant Agarwal, Richard Simoni, John Hennessy, Mark Horowitz. An evaluation of directory schemes for cache coherence. ISCA'98:25 years of the international symposia on Computer architecture (selected papers). New York, NY, USA:ACM, 1998:353-362.
    [89]Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, Monica S. Lam. The Stanford Dash Multi-processor. Computer.1992,25(3):63-79.
    [90]Kourosh Gharachorloo, Madhu Sharma, Simon Steely, Stephen Van Doren. Archi-tecture and design of AlphaServer GS320. SIGPLAN Not.2000,35(11):13-24.
    [91]A.J. Hu, M. Fujita, C. Wilson. Formal verification of the HAL S1 System cache coherence protocol. Los Alamitos, CA, USA:IEEE Computer Society,1997:438.
    [92]Fong Pong, Michel Dubois. Verification techniques for cache coherence protocols. ACM Comput Surv.1997,29(1):82-126.
    [93]Fong Pong, Michel Dubois. Formal verification of complex coherence protocols using symbolic state models. J ACM.1998,45(4):557-587.
    [94]Daniel J. Sorin, Manoj Plakal, Anne E. Condon, Mark D. Hill, Milo M. K. Martin, David A. Wood. Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol. IEEE Trans Parallel Distrib Syst.2002,13(6):556-578.
    [95]James Laudon, Daniel Lenoski. The SGI Origin:a ccNUMA highly scalable server. SIGARCH Comput Archit News.1997,25(2):241-251.
    [96]Rajarshi Mukherjee, Yozo Nakayama, Toshiya Mima. Verification of an Industrial CC-NUMA Server. ASP-DAC'02:Proceedings of the 2002 Asia and South Pacific Design Automation Conference. Washington, DC, USA:IEEE Computer Society, 2002:747.
    [97]L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Progranm. IEEE Trans Comput.1979,28(9):690-691.
    [98]Michel Dubois, Christoph Scheurich, Faye A. Briggs. Synchronization, Coherence, and Event Ordering in Multiprocessors. Computer.1988,21(2):9-21.
    [99]Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. ISCA'90:Proceedings of the 17th annual international symposium on Computer Architecture. New York, NY, USA:ACM,1990:15-26.
    [100]Sarita V. Adve, Kourosh Gharachorloo. Shared Memory Consistency Models:A Tutorial. Computer.1996,29(12):66-76.
    [101]Prosenjit Chatterjee, Hemanthkumar Sivaraj, Ganesh Gopalakrishnan. Shared Memory Consistency Protocol Verification Against Weak Memory Models:Refine-ment via Model-Checking. CAV'02:Computer Aided Verification.2002:123-136.
    [102]Albert Meixner, Daniel J. Sorin. Dynamic Verification of Sequential Consistency. ISCA'05:Proceedings of the 32nd annual international symposium on Computer Architecture. Washington, DC, USA:IEEE Computer Society,2005:482-493.
    [103]Anne E. Condon, Alan J. Hu. Automatable verification of sequential consistency. SPAA'01:Proceedings of the thirteenth annual ACM symposium on Parallel algo-rithms and architectures. New York, NY, USA:ACM,2001:113-121.
    [104]A. E. Condon, M. D. Hill, M. Plakal, D. J. Sorin. Using Lamport Clocks to Reason About Relaxed Memory Models. HPCA'99:Proceedings of the 5th International Symposium on High Performance Computer Architecture. Washington, DC, USA: IEEE Computer Society,1999:270.
    [105]胡伟武,夏培肃.顺序一致共享存储系统中的乱序执行技术——基本理论.计算机学报.1997,20(06)：481～490.
    [106]胡伟武,夏培肃.顺序一致共享存储系统中的乱序执行技术——模拟实现.计算机学报.1997,20(06)：491～500.
    [107]胡伟武.共享存储系统中的访存事件次序.Ph.D. thesis,北京,中国,1995.
    [108]Wikipedia. Live variable analysis. http://en.wikipedia.org/wiki/Live_variable_analysis, 2009.9.19.
    [109]G. J. Chaitin. Register allocation & spilling via graph coloring. SIGPLAN'82: Proceedings of the 1982 SIGPLAN symposium on Compiler construction.1982: 98-101.
    [110]Preston Briggs, Keith D. Cooper, Linda Torczon. Improvements to graph coloring register allocation. ACM Trans Program Lang Syst.1994,16(3):428-455.
    [111]Fred C. Chow, John L. Hennessy. The priority-based coloring approach to register allocation. ACM Trans Program Lang Syst.1990,12(4):501-536.
    [112]Lal George, Andrew W. Appel. Iterated register coalescing. ACM Trans Program Lang Syst.1996,18(3):300～324.
    [113]Alexandre E. Eichenberger, Kathryn O'Brien, Kevin O'Brien, Peng Wu, Tong Chen, Peter H. Oden, Daniel A. Prener, Janice C. Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, Michael Gschwind. Optimizing Compiler for the CELL Processor. PACT'05:Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA:IEEE Computer Society,2005:161～172.
    [114]A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, R. Koo. Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture. IBM Syst J.2006,45(1):59～84.
    [115]Lian Li, Lin Gao, Jingling Xue. Memory Coloring:A Compiler Approach for Scratchpad Memory Management. PACT'05:Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques.2005:329～338.
    [116]Li Wang, Xuejun Yang, Jingling Xue, Yu Deng, Xiaobo Yan, Tao Tang, Quan Hoang Nguyen. Optimizing Scientific Application Loops on Stream Processors. LCTES'08: Proceedings of the International Conference on Language, Compilers, and Tools for Embedded Systems.2008:161～170.
    [117]Xuejun Yang, Li Wang, Jingling Xue, Yu Deng, Ying Zhang. Comparability Graph Coloring for Optimizing Utilization of Stream Register Files in Stream Processors. PPoPP'09:Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.2009:111～120.
    [118]OpenMP Architecture Review Board. OpenMP Application Program Interface Ver-sion 3.0. http://www.openmp.org,2008.
    [119]Eduard Ayguade, Nawal Copty, Alejandro Duran, Jay Hoeflinger, Yuan Lin, Fed-erico Massaioli, Xavier Teruel, Priya Unnikrishnan, Guansong Zhang. The Design of OpenMP Tasks. IEEE Trans Parallel Distrib Syst.2009,20(3); 404～418.
    [120]Woo-Chul Jeun, Yang-Suk Kee, Soonhoi Ha, Changdon Kee. Overcoming perfor-mance bottlenecks in using OpenMP on SMP clusters. Parallel Comput.2008, 34(10):570～592.
    [121]陈永健,舒继武,李建江,王鼎兴.OpenMP指导语句全局嵌套类型的静态分析及应用(英文).软件学报.2005,16(02)：194～204.
    [122]Maurice Herlihy, J. Eliot B. Moss. Transactional memory:architectural support for lock-free data structures. SIGARCH Comput Archit News.1993,21(2):289～300.
    [123]Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, Kunle Olukotun. The OpenTM Transactional Application Programming Interface. PACT '07:Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques. Washington, DC, USA:IEEE Computer Society,2007: 376～387.
    [124]Vishal Aslot, Max J. Domeika, Rudolf Eigenmann, Greg Gaertner, Wesley B. Jones, Bodo Parady. SPEComp:A New Benchmark Suite for Measuring Parallel Com-puter Performance. WOMPAT'01:Proceedings of the International Workshop on OpenMP Applications and Tools. London, UK:Springer-Verlag,2001:1～10.
    [125]D. B. West,李建中,骆吉洲译.图论导引(原书第2版).机械工业出版社,2006.
    [126]D. Tarjan, S. Thoziyoor, N. Jouppi. Cacti 4.0:An integrated cache timing, power and area model. Technical report, HP Laboratories Palo Alto,2006.
    [127]Jiri Gaisler, Edvin Catovic, Marko Isomaki, Kristoffer Glembo, Sandi Habinc. GR-LIB IP Core User's Manual. http://wwwgaislercom/products/grlib/grippdf.2007.
    [128]Todd Austin, Eric Larson, Dan Ernst. SimpleScalar:An Infrastructure for Com-puter System Modeling. Computer.2002,35(2):59～67.
    [129]Laszlo A. Belady. A Study of Replacement Algorithms for Virtual-Storage Com-puter. IBM Systems Journal.1966,5(2):78～101.
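    [130]Wu Junjie, Zhang Baida, Zeng Kun, Zhou Haifang, Yang Xuejun. SimOPT:Design and Implementation of an Efficient OPT Cache Simulator. China National Computer Conference.2008:50.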
    [131]Alan Jay Smith. Cache Memories. ACM Comput Surv.1982,14(3):473～530.
    [132]Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hallberg, Johan Hogberg, Fredrik Larsson, Andreas Moestedt, Bengt Werner. Simics:A Full System Simulation Platform. Computer.2002,35(2):50～58.
    [133]D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, S. K. Weeratunga. The NAS Parallel Benchmarks. The International Journal of Supercomputer Applications.1991,5(3):63～73.
    [134]C. Lee, M. Potkonjak, W. H. Mangione-Smith. MediaBench:a tool for evaluating and synthesizing multimedia and communications systems.1997:330～335.
    [135]Yonghong Song, Zhiyuan Li. New tiling techniques to improve cache temporal locality. PLDI'99:Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation. New York, NY, USA:ACM,1999:215～228.
    [136]Andre Seznec. A case for two-way skewed-associative caches. ISCA'93:Proceedings of the 20th annual international symposium on Computer architecture. New York, NY, USA:ACM,1993:169～178.
    [137]Jaume Abella, Antonio Gonzalez. Heterogeneous way-size cache. ICS'06:Proceedings of the 20th annual international conference on Supercomputing. New York, NY, USA:ACM,2006:239～248.
    [138]Moinuddin K. Qureshi, David Thompson, Yale N. Patt. The V-Way Cache:Demand Based Associativity via Global Replacement. ISCA'05:Proceedings of the 32nd annual international symposium on Computer Architecture. Washington, DC, USA: IEEE Computer Society,2005:544～555.
    [139]Keshavan Varadarajan, S. K. Nandy, Vishal Sharda, Amrutur Bharadwaj, Ravi Iyer, Srihari Makineni, Donald Newell. Molecular Caches:A caching structure for dynamic creation of application-specific Heterogeneous cache regions. MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA:IEEE Computer Society,2006:433～442.
    [140]Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, Joel Emer. Adaptive insertion policies for high performance caching. ISCA'07:Proceedings of the 34th annual international symposium on Computer architecture. New York, NY, USA:ACM,2007:381～391.
    [141]E. Witchel, K. Asanovic. The Span Cache:Software Controlled Tag Checks and Cache Line Size. Workshop on Complexity-Effective Design,28th ISCA.2001: 1～12.
    [142]Antonio Gonzalez, Carlos Aliagas, Mateo Valero. A data cache with multiple caching strategies tuned to different types of locality. ICS'95:Proceedings of the 9th international conference on Supercomputing. New York, NY, USA:ACM,1995: 338～347.
    [143]S. Kim, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, M. J. Irwin, E. Geethanjali. Power-aware partitioned cache architectures. ISLPED'01:Proceedings of the 2001 international symposium on Low power electronics and design. New York, NY, USA:ACM,2001:64～67.
    [144]Hsien-Hsin S. Lee, Mikhail Smelyanskiy, Gary S. Tyson, Chris J. Newburn. Stack Value File:Custom Microarchitecture for the Stack. HPCA'01:Proceedings of the 7th International Symposium on High-Performance Computer Architecture. Washington, DC, USA:IEEE Computer Society,2001:5.
    [145]T. Juan, D. Royo, J. J. Navarro. Dynamic Cache Splitting. Proceedings of the XV International Conference of the Chilean Computer Society.1995:253～262.
    [146]Shekhar Srikantaiah, Mahmut Kandemir, Mary Jane Irwin. Adaptive set pinning: managing shared caches in chip multiprocessors. ASPLOS XIII:Proceedings of the 13th international conference on Architectural support for programming languages and operating systems. New York, NY, USA:ACM,2008:135～144.
    [147]David F. Bacon, Susan L. Graham, Oliver J. Sharp. Compiler transformations for high-performance computing. ACM Comput Surv.1994,26(4):345～420.
    [148]Gabriel Rivera, Chau Wen Tseng. Data transformations for eliminating conflict misses. SIGPLAN Not.1998,33(5):38～49.
    [149]J. Torrellas, H. S. Lam, J. L. Hennessy. False Sharing and Spatial Locality in Multiprocessor Caches. IEEE Trans Comput.1994,43(6):651～663.
    [150]Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman. Compilers:Principles, Techniques, and Tools (2nd edition, English edition). Beijing, China:Posts & Telecom Press,2008.
    [151]Xuejun Yang, Yunfei Du, Panfeng Wang, Hongyi Fu, Jia Jia, Zhiyuan Wang, Guang Suo. The Fault Tolerant Parallel Algorithm:the Parallel Recomputing Based Failure Recovery. PACT'07:Proceedings of International Conference on Parallel Architectures and Compilation Techniques.2007:199～212.
    [152]Xuejun Yang, Yunfei Du, Panfeng Wang, Hongyi Fu, Jia Jia. FTPA:Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing. IEEE Transactions on Parallel and Distributed Systems.2009,20(10):1471～1486.
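    [153]Andrew W. Appel, Maia Ginsburg. Modern Compiler Implementation in C (English edition). Beijing, China:Posts & Telecom Press,2005.
    [154]Xie Zheng, Dai Li. Combinatorial Graph Theory. Changsha, China:National University of Defense Technology Press,2003.
    [155]Zheng Baodong. Linear Algebra and Space Analytic Geometry. Harbin, China:Harbin Institute of Technology Press,2000.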
    [156]Changkyu Kim, Doug Burger, Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS-X:Proceedings of the 10th international conference on Architectural support for programming languages and operating systems. New York, NY, USA:ACM,2002:211～222.
    [157]Jean-Loup Baer, Tien-Fu Chen. Effective Hardware-Based Data Prefetching for High-Performance Processors. IEEE Trans Comput.1995,44(5):609～623.
    [158]Naveen Muralimanohar, Rajeev Balasubramonian, Norm Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. MICRO'07:Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA:IEEE Computer Society,2007:3～14.
    [159]F. H. McMahon. Livermore fortran kernels:A computer test of numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, CA,1986.
    [160]John L. Henning. SPEC CPU2000:Measuring CPU Performance in the New Millennium. Computer.2000,33(7):28～35.
    [161]A. Grbic, S. Brown, S. Caranci, R. Grindley, M. Gusat, G. Lemieux, K. Loveless, N. Manjikian, S. Srbljic, M. Stumm, Z. Vranesic, Z. Zilic. Design and implementation of the NUMAchine multiprocessor. DAC'98:Proceedings of the 35th annual conference on Design automation. New York, NY, USA:ACM,1998:66～69.
    [162]Tao Mu, Jie Tao, Martin Schulz, Sally A. McKee. Interactive locality optimization on NUMA architectures. SoftVis'03:Proceedings of the 2003 ACM symposium on Software visualization. New York, NY, USA:ACM,2003:133～141.
    [163]Jr. Richard Philip Larowe. Page placement for non-uniform memory access time (NUMA) shared memory multiprocessors. Ph.D. thesis, Duke University, Durham, NC, USA,1991.
    [164]Jr. Richard P. LaRowe, Mark A. Holliday, Carla Schlatter Ellis. An analysis of dynamic page placement on a NUMA multiprocessor. SIGMETRICS'92/PERFORMANCE'92:Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems. New York, NY, USA:ACM,1992:23～34.
    [165]Zeshan Chishti, Michael D. Powell, T. N. Vijaykumar. Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures. MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA:IEEE Computer Society,2003:55.
    [166]Bradford M. Beckmann, David A. Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO 37:Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA:IEEE Computer Society,2004:319～330.
    [167]Zeshan Chishti, Michael D. Powell, T. N. Vijaykumar. Optimizing Replication, Communication, and Capacity Allocation in CMPs. ISCA'05:Proceedings of the 32nd annual international symposium on Computer Architecture. Washington, DC, USA:IEEE Computer Society,2005:357～368.
    [168]Javier Merino, Valentin Puente, Pablo Prieto, Jose Angel Gregorio. SP-NUCA:a cost effective dynamic non-uniform cache architecture. SIGARCH Comput Archit News.2008,36(2):64～71.
    [169]Mahmut Kandemir, Feihui Li, Mary Jane Irwin, Seung Woo Son. A novel migration-based NUCA design for chip multiprocessors. SC'08:Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA:IEEE Press, 2008:1～12.
    [170]Sangyeun Cho, Lei Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. MICRO 39:Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA:IEEE Computer Society,2006:455～468.
    [171]A. Bardine, P. Foglia, G. Gabrielli, C. A. Prete, P. Stenstrom. Improving power efficiency of D-NUCA caches. SIGARCH Comput Archit News.2007,35(4):53～58.
    [172]Alessandro Bardine, Pierfrancesco Foglia, Giacomo Gabrielli, Cosimo Antonio Prete. Analysis of static and dynamic energy consumption in NUCA caches:initial results. MEDEA'07:Proceedings of the 2007 workshop on MEmory performance. New York, NY, USA:ACM,2007:105～112.
    [173]Alessandro Bardine, Manuel Comparetti, Pierfrancesco Foglia, Giacomo Gabrielli, Cosimo Antonio Prete, Per Stenstrom. Leveraging Data Promotion for Low Power D-NUCA Caches. DSD'08:Proceedings of the 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools. Washington, DC, USA: IEEE Computer Society,2008:307～316.
    [174]Jean-Loup Baer, Tien-Fu Chen. An effective on-chip preloading scheme to reduce data access penalty. Supercomputing'91:Proceedings of the 1991 ACM/IEEE conference on Supercomputing. New York, NY, USA:ACM,1991:176～186.
    [175]John W. C. Fu, Janak H. Patel, Bob L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsl.1992,23(1-2):102～110.
    [176]Ivan Sklenar. Prefetch unit for vector operations on scalar computers. SIGARCH Comput Archit News.1992,20(4):31～37.
    [177]Akio Kodama, Toshinori Sato. A Non-Uniform Cache Architecture on Networks-on-Chip:A Fully Associative Approach with Pre-Promotion. ISIC'04:10th International Symposium on Integrated Circuits, Devices and Systems, CD-ROM, September 2004.
    [178]Sunil Kim, Alexander V. Veidenbaum. Stride-directed Prefetching for Secondary Caches. ICPP'97:Proceedings of the international Conference on Parallel Processing. Washington, DC, USA:IEEE Computer Society,1997:314.
    [179]Kyle J. Nesbit, James E. Smith. Data Cache Prefetching Using a Global History Buffer. IEEE Micro.2005,25(1):90～97.
    [180]Yong Chen, Surendra Byna, Xian-He Sun. Data access history cache and associated data prefetching mechanisms. SC'07:Proceedings of the 2007 ACM/IEEE conference on Supercomputing. New York, NY, USA:ACM,2007:1～12.
    [181]David Bernstein, Doron Cohen, Ari Freund. Compiler techniques for data prefetching on the PowerPC. PACT'95:Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques. Manchester, UK: IFIP Working Group on Algol,1995:19～26.
    [182]Vatsa Santhanam, Edward H. Gornish, Wei-Chung Hsu. Data prefetching on the HP PA-8000. SIGARCH Comput Archit News.1997,25(2):264～273.
    [183]Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro. 1996,16(2):28～40.
