Improving Cache Performance through Program Analysis and Optimization
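Abstract
Over the past 40 years, processor speed has grown far faster than memory access speed, and the gap between the two keeps widening. The resulting "Memory Wall" problem has grown steadily worse and is now one of the main bottlenecks in overall system performance. Modern computer architectures widely use caches to mitigate this gap, which makes the cache an important factor in processor performance, energy consumption, and cost. How well a cache is used depends largely on the locality of the program itself, especially the locality of its data accesses, both temporal and spatial. Optimizing a program's data locality has therefore become one of the important ways to improve cache performance.
     Because it needs no hardware support and ports well across platforms, using program analysis and optimization to improve program locality, and thereby cache performance, has become an active research topic. Previous work follows two main lines. The first targets the data a program accesses at run time: using a cache behavior model, it builds program analysis tools that report cache performance bottlenecks at the source-code level and guide the programmer in applying program transformations that improve locality. The second develops optimization passes in a compiler, or dedicated program optimization tools, that analyze the program and then automatically apply transformations, both code transformations and data transformations, to improve locality.
     Following these two lines, this thesis makes the following contributions: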
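     (1) A study of program cache behavior models
     Although reuse distance has become an important metric of program cache behavior, its high complexity and the risk of memory overflow make it hard to apply. By introducing a maximum cache size that bounds the range of reuse distance analysis, this thesis proposes a limited reuse distance model. Limited reuse distance gives up analysis precision for a small number of accesses; in return it avoids the memory overflow that reuse distance analysis can cause and brings the analysis down to time linear in the number of data accesses. Experiments show that cache miss rate analysis based on limited reuse distance is feasible and correct.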
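     (2) Design and implementation of a program cache behavior analysis tool
     Analyzing and understanding a program's cache behavior helps programmers find its cache performance bottlenecks and guides them in applying program transformations that improve locality and cache performance. This thesis presents the design and implementation of a reuse-distance-based cache behavior analysis tool. The tool collects the program's memory access information through source-level instrumentation based on an intermediate information stack, then analyzes cache behavior with the limited reuse distance model. An example shows that the tool can both locate cache performance bottlenecks in the source code, guiding code transformation, and analyze the cache behavior relationships among variables, guiding data transformation.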
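     (3) Data layout optimization based on reuse distance distribution
     Analysis with the cache behavior tool showed that variables that are frequently accessed together have similar reuse distance distributions. Regrouping such variables to change their layout can improve cache performance. On this principle, this thesis designs and implements a data layout optimization framework that uses a variable relation model based on reuse distance distribution to find variables that are frequently accessed together. The framework currently implements a regrouping method for dynamic arrays, applied to the dynamic arrays identified by the variable relation model, to improve the program's data locality. Experiments on a subset of the SPEC CPU2000 benchmarks show that the framework is feasible and effective at optimizing data locality and improving data cache performance.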
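     (4) Improving cache performance through structure splitting
     Structure splitting is a common data transformation for improving cache performance. It is a special case because only the relative relationships among a structure's fields need to be analyzed. To avoid the complexity of reuse distance analysis, this thesis introduces a distance that is simpler than reuse distance and, on that basis, proposes a field relation model that measures the relationships among the fields of a structure. The model is then used to find an optimized layout for the program's structures, and structure splitting applies that layout to improve data cache performance. Experiments show that structure splitting based on this field relation model is effective at improving cache performance.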
Over the past 40 years, the ever-widening speed gap between processor and memory has formed the "Memory Wall", which has become one of the primary bottlenecks in computer systems. Modern computer architectures widely employ caches to reduce the impact of this gap, and cache performance has a growing influence on system speed, cost, and energy usage. How effectively a cache is used depends mostly on program locality, especially data locality, both temporal and spatial. Optimizing program data locality has therefore become an important way to improve cache performance.
     Because it does not depend on hardware or platform support, improving program locality through program analysis and optimization has become an active area of cache performance research. Previous work takes two primary approaches to optimizing program data locality. The first analyzes the cache behavior of program data and, based on a cache behavior model, builds a program analysis tool that exposes data cache performance bottlenecks and so directs the programmer in tuning data cache performance. The second applies program transformations through an optimizing compiler or a dedicated optimization tool, so that data cache performance is improved by the transformation itself.
     Building on these two approaches, this thesis presents the following work:
     (1) Research on Program Cache Behavior Model
     Although reuse distance has become an important metric of program cache behavior, its high complexity and the risk of memory overflow have largely restricted its application. By introducing a maximum cache size that bounds the range of reuse distance analysis, this thesis proposes a limited reuse distance analysis method. The method avoids the memory overflow that can occur in ordinary reuse distance analysis and reduces the time complexity of the analysis to linear in the number of data accesses. Experiments on several integer and floating-point programs demonstrate that cache miss rate analysis based on limited reuse distance is feasible and correct.
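     The idea of bounding reuse distance by a maximum cache size can be sketched as follows. This is a minimal illustration rather than the thesis's algorithm: the function names, the list-based LRU stack, and the fully associative LRU cache model are all assumptions made for the example. Each access costs at most `max_blocks` steps, so for a fixed bound the total work grows linearly with the trace length, and the stack never holds more than `max_blocks` entries.

```python
def bounded_reuse_distances(trace, max_blocks):
    """Return one reuse distance per access.

    The distance is the number of distinct blocks touched since the previous
    access to the same block. None means "at least max_blocks": either a first
    access or a reuse too far back to be tracked (this is the lost precision).
    """
    stack = []       # LRU stack, most recently used first, length <= max_blocks
    distances = []
    for block in trace:
        if block in stack:
            depth = stack.index(block)
            distances.append(depth)
            stack.pop(depth)
        else:
            distances.append(None)
            if len(stack) == max_blocks:
                stack.pop()          # forget the least recently used block
        stack.insert(0, block)
    return distances


def predicted_miss_rate(distances, cache_blocks):
    """Miss rate of a fully associative LRU cache with cache_blocks blocks.

    Only meaningful when cache_blocks <= the max_blocks used for the analysis.
    """
    misses = sum(1 for d in distances if d is None or d >= cache_blocks)
    return misses / len(distances)


# Hypothetical block-level access trace.
trace = ["a", "b", "c", "a", "b", "d", "a", "c"]
dists = bounded_reuse_distances(trace, max_blocks=4)
print(dists)                          # [None, None, None, 2, 2, None, 2, 3]
print(predicted_miss_rate(dists, 3))  # 0.625
print(predicted_miss_rate(dists, 4))  # 0.5
```

     One analysis pass with bound `max_blocks` yields miss rate estimates for every cache size up to that bound, which is why the bound is chosen as the largest cache of interest.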
     (2) Design and Implementation of Cache Behavior Analysis Tool
     Analyzing and understanding program cache behavior helps programmers find cache performance bottlenecks, optimize locality through program transformation, and improve program cache performance. This thesis introduces a reuse-distance-based program cache behavior analysis tool. It uses source-level instrumentation based on an intermediate information stack to collect program data access information, and a limited reuse distance model to analyze program cache behavior. An example shows that the tool can not only locate cache performance bottlenecks in the source code to guide code transformation, but also analyze the cache behavior relationships among variables to direct data reorganization.
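     How such a tool might attribute cache misses to source locations can be sketched as below. The sketch assumes the instrumentation has already produced a trace of (source location, memory block) pairs; the location labels `sum.c:12` and `sum.c:20`, the function name, and the fully associative LRU model are hypothetical and chosen only for illustration.

```python
from collections import Counter


def misses_by_location(trace, cache_blocks):
    """Count predicted misses per source location.

    trace: iterable of (location, block) pairs in execution order.
    An access is a predicted miss when its block is not among the
    cache_blocks most recently used blocks, i.e. when its reuse distance
    is at least cache_blocks or the block has never been seen.
    """
    stack = []                      # LRU stack, most recent first
    misses = Counter()
    for location, block in trace:
        if block in stack:
            stack.remove(block)     # predicted hit
        else:
            misses[location] += 1   # predicted miss
            if len(stack) == cache_blocks:
                stack.pop()
        stack.insert(0, block)
    return misses


# Hypothetical trace: a 4x4 matrix stored row by row, one row per cache block.
N = 4
trace = []
for i in range(N):                  # row-wise loop, labelled sum.c:12
    for j in range(N):
        trace.append(("sum.c:12", i))
for j in range(N):                  # column-wise loop, labelled sum.c:20
    for i in range(N):
        trace.append(("sum.c:20", i))

report = misses_by_location(trace, cache_blocks=2)
for location, count in report.most_common():
    print(location, count)
```

     In this toy trace the column-wise loop accounts for 16 of the 20 predicted misses, which is the kind of per-location ranking that would point a programmer toward a loop interchange.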
     (3) Data Layout Optimization Using Reuse Distance Distribution
     During cache behavior analysis, the reuse distance signatures of variables that are often accessed together turn out to be very similar. If the layout of such variables is optimized by regrouping them, cache performance can be improved. Motivated by this observation, this thesis outlines a data layout optimization framework. Within the framework, a variable relation model based on reuse distance signatures finds variables that are often accessed together, and a dynamic array regrouping method reorganizes the dynamic arrays identified by the model. Experiments on several benchmarks from SPEC CPU2000 show that the framework is feasible and effective at optimizing data locality and improving cache performance.
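     One way to compare reuse distance signatures is sketched below. The text does not specify the variable relation model, so the log2 binning, the Manhattan distance between normalised histograms, and all names here are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def signatures(trace):
    """Build a reuse distance histogram per variable.

    trace: iterable of (variable, block) pairs in execution order.
    Distances are grouped into log2 bins: bin 0 holds d = 0 and bin k holds
    2**(k-1) <= d < 2**k. First accesses go into a separate "cold" bin.
    """
    stack = []                          # LRU stack, most recent first
    hist = defaultdict(Counter)
    for var, block in trace:
        if block in stack:
            depth = stack.index(block)
            stack.pop(depth)
            hist[var][depth.bit_length()] += 1
        else:
            hist[var]["cold"] += 1
        stack.insert(0, block)
    return hist


def signature_distance(h1, h2):
    """Manhattan distance between normalised histograms (0 = identical, 2 = disjoint)."""
    n1, n2 = sum(h1.values()), sum(h2.values())
    return sum(abs(h1[b] / n1 - h2[b] / n2) for b in set(h1) | set(h2))


# Hypothetical trace: arrays a and b are walked together twice,
# array c is accessed alone with a much tighter reuse pattern.
trace = []
for _ in range(2):
    for i in range(4):
        trace.append(("a", ("a", i)))
        trace.append(("b", ("b", i)))
for i in range(8):
    trace.append(("c", ("c", i % 2)))

sig = signatures(trace)
print(signature_distance(sig["a"], sig["b"]))   # 0.0
print(signature_distance(sig["a"], sig["c"]))   # 1.5
```

     Under these assumptions, `a` and `b` would be candidates for regrouping into a single interleaved array, while `c` would be left alone.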
     (4) Improving Cache Performance by Structure Splitting
     Structure splitting is a common data reorganization technique for improving cache performance. It is a special case because it only requires analyzing the relative relations among fields. To avoid the complexity of reuse distance analysis, this thesis proposes a field relation model, based on a distance simpler than reuse distance, to quantify the relations among structure fields. The model is used to search for an optimized structure layout, which structure splitting then applies to improve program data cache performance. Experiments show that this method is effective at optimizing program data layout, improving data cache performance, and thus improving the performance of the whole program.
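     The effect of structure splitting can be illustrated with a small sketch. The field relation model is not specified in the text, so the windowed co-access count below is only a stand-in for the "simple distance"; the field names, record sizes, and 64-byte cache line are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def field_affinity(accesses, window=2):
    """Count how often each pair of fields appears within `window`
    consecutive field accesses (an assumed, simplified relation measure)."""
    affinity = Counter()
    for i in range(len(accesses) - window + 1):
        fields = sorted(set(accesses[i:i + window]))
        for pair in combinations(fields, 2):
            affinity[pair] += 1
    return affinity


def lines_touched(stride, offsets, count, line_size=64):
    """Distinct cache lines touched when reading the fields at byte `offsets`
    from `count` records laid out `stride` bytes apart, starting at address 0."""
    return len({(i * stride + off) // line_size
                for i in range(count) for off in offsets})


# Hypothetical field access sequence: key and next are used together,
# payload is touched rarely.
accesses = ["key", "next", "key", "next", "key", "next", "payload"]
affinity = field_affinity(accesses)
print(affinity[("key", "next")])       # 5
print(affinity[("next", "payload")])   # 1

# Original layout: key (4 bytes), next (4 bytes), payload (56 bytes) = 64 bytes.
# Split layout: hot part {key, next} = 8 bytes, payload moved to a separate array.
records = 1024
original = lines_touched(stride=64, offsets=(0, 4), count=records)
split = lines_touched(stride=8, offsets=(0, 4), count=records)
print(original, split)                 # 1024 128
```

     With the cold `payload` field moved out, a traversal that reads only `key` and `next` touches one eighth as many cache lines in this example.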
