用户名: 密码: 验证码:
面向软件管理片上存储器的编译优化技术研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
由于处理器性能和存储器性能的巨大差异,导致了“存储墙”问题的出现,使得存储系统成为系统的瓶颈。传统计算机体系结构均采用硬件管理的cache来解决存储墙问题。然而,随着应用和工艺的发展,Cache逐渐暴露出一些问题。相比之下,软件管理的片上存储器以其面积、功耗和访问时间等方面的优势,被认为是解决存储墙问题的一个有效途径。目前,软件管理的片上存储器已被普遍运用于嵌入式系统、流处理器和图形处理器中,并被逐渐运用到新型高性能计算机体系结构中。与硬件管理的cache不同,软件管理的片上存储器需要由软件通过数据传输语句显式地管理所有片上与片外存储器之间的数据传输,决定数据进入存储器的时机和位置。软件管理的片上存储器给编译提出了重要的挑战。如何在保证程序正确性的基础上,尽可能提高有限的片上存储器空间的利用率,尽量避免存储器碎片;充分捕获数据复用,优化存储层次间的通信,从而最小化存储器带宽需求;开发计算与访存并行,有效隐藏存储器访问延迟,是提高基于软件管理片上存储器的系统上程序性能的关键。本文重点研究了面向软件管理片上存储器的编译优化问题。本文的主要工作和创新概述如下:
     (1)提出了基于置换图着色的便笺存储器分配算法。现代嵌入式系统中,广泛地将片上存储器组织为软件管理的便笺存储器(Scratchpad Memory, SPM)。本文深入研究了面向嵌入式应用的SPM分配问题,首次发现了大部分嵌入式应用的相干图(Interference Graph)为置换图(Permutation Graph),从而能在线性时间内获得最优的SPM分配。本文首次提出了一个基于置换图着色的SPM分配算法。理论分析和实验表明,基于置换图着色的SPM分配算法与国际上最新的基于超完美图(Superperfect Graph)的SPM分配算法相比,流程更简洁,复杂度更低,性能更优。
     (2)提出了基于存储器着色的流寄存器文件分配框架。流体系结构是一种新兴的面向流应用的高性能计算机体系结构。流体系结构采用软件管理的片上存储器,称为流寄存器文件(Stream Register File, SRF),作为数据的核心存储部件。SRF是不可旁路的存储层次,软件必须保证计算需要的输入流提前加载到SRF中,并为输出流分配足够的SRF空间。优良的SRF分配方案还应能在避免引入额外的片外存储器传输的前提下,有效地捕获流应用中广泛存在的生产者消费者局部性,并尽可能地开发计算与访存并行。本文提出了一套基于存储器着色(即存储器划分加上图着色寄存器分配)技术的SRF分配框架。本文研究的新颖之处在于将开发重用和并行巧妙地整合到传统的图着色寄存器分配框架中。此外,针对应用的特点,本文对传统的图着色寄存器分配技术做出了一些改进,如提出了渐增的联合技术,寄存器排序技术。实验表明基于存储器着色的SRF分配框架能够在不引入溢出的前提下,有效地开发复用和并行。
     (3)提出了基于最佳有向路径寻找的流寄存器文件分配算法。基于存储器着色的SRF分配框架能够有效地开发复用和并行。但是,存储器着色技术在划分SRF以及对相干图进行着色时有一定的缺陷,容易引入SRF空间浪费。本文的另一个研究重点是在相干图确定的情况下(即操作流相干图开发复用和并行后),如何最小化需要的SRF空间,避免引入存储碎片。本文首次发现了大部分的流应用的相干图为可比图(Comparability Graph),或可以降解为多个可比子图,从而能够获得多项式时间的最优SRF分配。本文首次将SRF分配问题建模为最佳有向路径寻找问题,提出了一个新颖的SRF分配算法。严格的理论分析和大量的实验表明,我们的算法能获得最优或近似最优的SRF分配。相对目前普遍采用的基于First-Fit的启发式算法,我们的算法具有更好的性能。
     (4)提出了基于层次图着色的软件管理多级存储层次分配算法。现代的高性能计算机体系结构中,为了更有效地实现计算与访存的平衡,优化访存带宽和延迟,越来越多地采用软件管理的多级存储层次来替代硬件管理的多级cache存储层次。传统的编译优化研究大都面向单一存储层次,缺乏对存储层次全局的综合考虑,对存储层次间通信等的优化不足。而最小化存储层次间的通信,能大大减少存储器带宽需求,是影响性能的一个重要因素。本文扩展了图着色寄存器分配算法,首次将其运用到多级存储层次分配上。通过将存储层次建模为一个带权图,我们的方法可以运用到任何多级软件管理存储层次组织上。我们对传统数据相干图进行扩展,提出路径合并和路径消解技术,有效地减少存储层次间通信。通过数据生存期扩展技术,还能有效地进行计算与访存并行的开发,从而隐藏存储访问延迟。以上的优化都跟扩展后的图着色寄存器分配框架巧妙地整合在一起。实验表明,我们的算法有良好的性能。
Due to the big gap between processor performance and memory performance, it results in the memory wall problem, namely, the memory system has become the bottleneck. The traditional computer architecture adopts hardware-managed cache to address memory wall problem. However, along with the fast developing of application and technology, cache incurs some problems. In contrast to cache, software-managed on-chip memory has advantages in area, power and access speed, it is widely used in embedded-system, stream processor and graphics processing unit, and it is increasingly used in high performance architecture. It is regarded as an effective solution to memory wall problem. Different with hardware-managed cache, software-managed on-chip memory relies on software to explicitly manage all the data transfers between on-chip memory and off-chip memory, and decide when and where the data will enter into memory. Software-managed memory poses the compiler a significant challenge. How to guarantee the program correctness, utilize the limited on-chip memory space, avoid the memory fragments, fully capture the data reuse, optimize the inter-level communication, so as to minimize the memory bandwidth requirement, exploit the parallelism between computation and memory access, so as to hide the memory access latency, which are crucial for good performance of the programs running on software-managed memory system. This thesis emphasizes on compiler optimizations for software-managed on-chip memory allocation. In summary, this thesis makes the following contributions:
     (1) Proposes permutation graph coloring algorithm for scratchpad memory allocation. In embedded systems, the on-chip memory is often organized as scratchpad memory (SPM). This thesis researches the algorithms for SPM allocation, for the first time, gives the proof that most of the interference graphs of embedded applications are permutation graphs, enabling the optimal SPM allocation to be obtained in linear time. The theoretical analysis and experiments show that our algorithm improves the SPM utilization than the superperfect graph based SPM allocation algorithm, the best in the literature.
     (2) Proposes memory coloring based stream register file allocation framework. Stream architecture is high performance architecture for stream applications. It adopts software-managed on-chip memory, named stream register file (SRF), as the nexus. SRF is non-bypassing, thus all streams accessed by the computation must be made available in the SRF before the computation is started. A perfect SRF allocation scheme should avoid introducing extra off-chip memory transfers, efficiently capture the widely existed producer-consumer locality, and exploit the parallelism between computation and memory access. The thesis presents a memory coloring (memory partition plus graph coloring register allocation) based SRF allocation framework. The novelty in our research lies in the integration of reuse exploitation and parallelism exploitation into the classic graph coloring register allocation framework. In addition, according to the application characteristics, this thesis presents some improvements, such as incremental coalescing, register ordering, to the classic framework. The experiments show that the memory coloring based framework could efficiently exploit reuse and parallelism, without introducing spills.
     (3) Proposes an optimal directed path searching based stream register file allocation algorithm. Memory coloring based SRF allocation framework could efficiently exploit reuse and parallelism. However, the memory coloring technique easily introduces the SRF space waste. Another research emphasis of this thesis is when the interference graph is fixed (after the exploitation of reuse and parallelism through operating on the interference graph), how to minimize the needed SRF space, avoid introducing memory fragments. For the first time, we prove that the interference graphs of most of stream applications are comparability graphs, or can be decomposed into multiple comparability subgraphs, thus the optimal SRF allocation could be achieved in polynomial time. For the first time, we formalize the SRF allocation into optimal directed path searching, give a novel SRF allocation algorithm. The theoretical analysis and experiments show that our algorithm could find the optimal or near optimal SRF allocation, improve the SRF utilization than the widely used First-Fit algorithm, the best in the literature.
     (4) Proposes a hierarchical graph coloring algorithm for allocation of software- managed multi-level memory hierarchy. In modern high performance architecture, to balance the computation and memory access, optimize the memory bandwidth and latency, software-managed memory hierarchy is increasingly used instead of hardware-managed cache hierarchy. The traditional researches focus on single level memory, with insufficient considerations for global optimizations such as inter-level communication, which tend to be crucial for good performance. This thesis extends the graph coloring algorithm, for the first time, applies it to multi-level memory allocation. By modeling the memory hierarchy as a weighted graph, our method is applicable to any software-managed memory hierarchy organization. We also extend the concept of data interference graph, propose the path merging and path reduction techniques to reduce the inter-level communication. By using data live range extension technique, the parallelism is exploited to hide the memory access latency. All the optimizations are integrated into the graph coloring framework. The experiments show that our algorithm achieves a good performance.
引文
[1] John L. Hennessy and David A. Patterson. Computer Architecture: A Quanti-tative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,2007.
    [2] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle,A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M.Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K.Yazawa. The design and implementation of a first-generation cell processor.In Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC.2005 IEEE International, pages 184–592 Vol. 1, 2005.
    [3] J. Makino, K. Hiraki, and M. Inaba. Grape-dr: 2-p?ops massively-parallelcomputer with 512-core, 512-g?ops processor chips for scientific computing. InSC’07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing,page 1–11. ACM, 2007.
    [4] J.d. Cuvillo, W. Zhu, H.u. Ziang, and G.R. Gao. Fast: A functionally accu-rate simulation toolset for the cyclops64 cellular architecture. In MoBS2005:Workshop on Modeling, Benchmarking, and Simulation, pages 11–20. ACMPress, 2005.
    [5] Keith Boland and Apostolos Dollas. Predicting and precluding problems withmemory latency. IEEE Micro, 14(4):59–67, 1994.
    [6] Philip Machanick. Approaches to Addressing the Memory Wall. Technicalreport, School of IT and Electrical Engineering, University of Queensland,2002.
    [7] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications ofthe obvious. SIGARCH Comput. Archit. News, 23(1):20–24, 1995.
    [8] Peter J. Denning. The locality principle. Communications of the ACM,48(7):19–24, 2005.
    [9] Steven Przybylski. Cache and Memory Hierarchy Design: A PerformanceDirected Approach. 1990.
    [10] Wei fen Lin, Steven K. Reinhardt, and Doug Burger. Reducing dram latencieswith an integrated memory hierarchy design. In HPCA’01: In Proceedings ofthe 7th International Symposium on High Performance Computer Architecture,2001.
    [11] Ruud van der Pas. Memory herarchy in cache-based systems. Technical report,2002.
    [12] A.J.Smith. Cache memories. ACM Computing Surveys, 14(3):473–530, 1982.
    [13] Goodman. Using cache memory to reduce processor/memory tra?c. In Pro-ceedings of the Tenth International Symposium on Computer Architecture,pages 124–131, 1983.
    [14] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, M. Balakrishnan, and PeterMarwedel. Scratchpad memory: design alternative for cache on-chip memoryin embedded systems. In CODES’02: Proceedings of the tenth internationalsymposium on Hardware/software codesign, pages 73–78, New York, NY, USA,2002. ACM.
    [15] M. B. Kamble and K. Ghose. Analytical energy dissipation models for lowpower caches. In Proceedings of International Symposium on Low Power Elec-tronics and Design, pages 143–148, 1997.
    [16] Peter Marwedel, Lars Wehmeyer, Manish Verma, Stefan Steinke, and UrsHelmig. Fast, predictable and low energy memory references througharchitecture-aware compilation. In Proceedings of the 2004 Asia and SouthPacific Design Automation Conference, pages 4–11, 2004.
    [17] P. Marwedel and L. Wehmeyer. In?uence of memory hierarchies on predictabil-ity for time constrained embedded software. In Proceedings of the conferenceon Design, Automation and Test in Europe, pages 600–605, 2005.
    [18] J.D. Gee and A.J. Smith. The performance impact of vector processor cashes.In Proceedings of the Twenty-Fifth Hawaii International Conference on SystemSciences.
    [19] Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, AbelardoLópez-Lagunas, Peter R. Mattson, and John D. Owens. A bandwidth-e?cientarchitecture for media processing. In Proceedings of the 31st Annual Interna-tional Symposium on Microarchitecture, pages 3–13, 1998.
    [20] Peter Raymond Mattson. A programming system for the imagine media pro-cessor. PhD thesis, Stanford University, Stanford, CA, USA, 2002. Adviser-William J. Dally.
    [21] William J. Dally, Ujval J. Kapasi, Brucek Khailany, Jung Ho Ahn, and Ab-hishek Das. Stream processors: Programmability and e?ciency. Queue,2(1):52–62, 2004.
    [22] Swagato Basumallick and Kelvin Nilsen. Cache issues in real-time systems,1994.
    [23] Richard M. Russell. Communications of the ACM, 21(1):63–72, 1978.
    [24] Srinivas K. Raman, Vladimir Pentkovski, and Jagannath Keshava. Imple-menting streaming simd extensions on the pentium iii processor. IEEE Micro,20(4):47–57, 2000.
    [25] Intel. Intel Pentium Processor with MMX Technology Documentation,http://www. intel. com/design/archives/processors/mmx.
    [26] Power ISA v.2.03, http://www.power.org/resources/downloads/powerisa 203.public.pdf.
    [27] John D. Owens, Ujval J. Kapasi, Peter Mattson, Brian Towles, Ben Serebrin,Scott Rixner, and William J. Dally. Media processing applications on theimagine stream processor. In Proceedings of the IEEE International Conferenceon Computer Design, pages 295–302, September 2002.
    [28] Xuejun Yang, Xiaobo Yan, Zuocheng Xing, Yu Deng, Jiang Jiang, and YingZhang. A 64-bit stream processor architecture for scientific applications. InISCA’07: Proceedings of the 34th annual international symposium on Com-puter architecture, pages 210–219. ACM, 2007.
    [29] Koch K. the new roadrunner supercomputer: What, when, and how. In Pro-ceedings of International Conference on High Performance Computing, 2006.
    [30] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, andJung-Ho Ahn et al. Merrimac: Supercomputing with streams. In SC’03:Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35.IEEE Computer Society, 2003.
    [31] Mattan Erez, Jung Ho Ahn, Ankit Garg, William J. Dally, and Eric Darve.Analysis and performance results of a molecular modeling application on mer-rimac. In SC’04: Proceedings of the 2004 ACM/IEEE conference on Super-computing, page 42, Washington, DC, USA, 2004. IEEE Computer Society.
    [32] NVIDIA. CUDA, http://developer.nvidia.com/object/cuda.html.
    [33] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi,Sain-Zee Ueng, John A. Stratton, and Wen-mei W. Hwu. Program optimiza-tion space pruning for a multithreaded gpu. In CGO’08: Proceedings of thesixth annual IEEE/ACM international symposium on Code Generation andoptimization, pages 195–204, New York, NY, USA, 2008. ACM.
    [34] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J.Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler frameworkfor optimization of a?ne loop nests for gpgpus. In ICS’08: Proceedings of the22nd annual international conference on Supercomputing, pages 225–234, NewYork, NY, USA, 2008. ACM.
    [35] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorikhi, Sam S. Stone,David B. Kirk, and Wen-mei W. Hwu. Optimization principles and applicationperformance evaluation of a multithreaded gpu using cuda. In PPoPP’08:Proceedings of the 13th ACM SIGPLAN Symposium on principles and practiceof parallel programming, pages 73–82, New York, NY, USA, 2008. ACM.
    [36] Motorola/Freescale. MPC500 32-bit MCU Family, http://www.freescale.com/files/microcontrollers/doc/factsheet/mpc500fact.pdf.
    [37] Philips. LPC2290 16/32-bit Embedded CPU, http://www.semiconductors.philips.com/acrobatdownload/datasheets/lpc2290-01.pdf, 2004.
    [38] Intel. Intel XScale(R) PXA27x Processor Family, http://www.intel.com/des-ign/pca/probref/253820.htm.
    [39] Marvell. Marvell PXA320 Processor Series Brief, http://www.marvell.com/products/cellular/application/pxa320 pb r4.pdf.
    [40] Motorola/Freescale. Dragonball MC68SZ328 32-bit Embedded CPU, http://www. freescale.com/files/32bit/doc/fact sheet/mc68sz328fs.pdf.
    [41] Motorola. Motorola ColdFire MCF5 processor family, http://e-www.motorola.com.
    [42] ARM. ARM10E Family, http://www.arm.com/products/cpus/families/arm10e-family.html.
    [43] Motorola. CPU12 Reference Manual, http://e-www.motorola.com/brdata/pdfdb/microcontrollers/16bit/68hc12 family/ref mat/cpu12rm.pdf.
    [44] Motorola. M-CORE-MMC2001 Reference Manual, http://www.motorola.com/sps/mcore/info documentation.htm.
    [45] Preeti Ranjan Panda, Nikil D. Dutt, and Alexandru Nicolau. E?cient uti-lization of scratch-pad memory in embedded processor applications. In EDTC’97: Proceedings of the 1997 European conference on Design and Test, page 7,Washington, DC, USA, 1997. IEEE Computer Society.
    [46] Jan Sjodin, Bo Froderberg, and Lindgren Thomas. Allocation of global dataobjects in on-chip ram. Compiler and Architecture Support for Embedded Com-puting Systems, 1998.
    [47] Jan Sjodin and Carl von Platen. Storage allocation for embedded processors.In CASES’01: Proceedings of the 2001 international conference on Compilers,architecture, and synthesis for embedded systems, pages 15–23, New York, NY,USA, 2001. ACM.
    [48] Oren Avissar, Rajeev Barua, and Dave Stewart. Heterogeneous memory man-agement for embedded systems. In CASES’01: Proceedings of the 2001 in-ternational conference on Compilers, architecture, and synthesis for embeddedsystems, pages 34–43, New York, NY, USA, 2001. ACM.
    [49] Oren Avissar, Rajeev Barua, and Dave Stewart. An optimal memory alloca-tion scheme for scratch-pad-based embedded systems. ACM Trans. Embed.Comput. Syst., 1(1):6–26, 2002.
    [50] Jason D. Hiser and Jack W. Davidson. Embarc: an e?cient memory bankassignment algorithm for retargetable compilers. In LCTES’04: Proceedingsof the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers,and tools for embedded systems, pages 182–191, New York, NY, USA, 2004.ACM.
    [51] Lian Li, Lin Gao, and Jingling Xue. Memory coloring: A compiler approachfor scratchpad memory management. In PACT’05: Proceedings of the 14th In-ternational Conference on Parallel Architectures and Compilation Techniques,pages 329–338, 2005.
    [52] Erik G. Hallnor and Steven K. Reinhardt. A fully associative software-managed cache design. SIGARCH Comput. Archit. News, 28(2):107–116, 2000.
    [53] Csaba Andras Moritz, Matthew Frank, and Saman P. Amarasinghe. Flex-cache: A framework for ?exible compiler generated data caching. In IMS’00: Revised Papers from the Second International Workshop on IntelligentMemory Systems, pages 135–146, London, UK, 2001. Springer-Verlag.
    [54] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A.Parikh. Dynamic management of scratch-pad memory space. In DAC’01:Proceedings of the 38th conference on Design automation, pages 690–695, NewYork, NY, USA, 2001. ACM.
    [55] Sumesh Udayakumaran and Rajeev Barua. Compiler-decided dynamic mem-ory allocation for scratch-pad based embedded systems. In CASES’03: Pro-ceedings of the 2003 international conference on Compilers, architecture andsynthesis for embedded systems, pages 276–286, New York, NY, USA, 2003.ACM.
    [56] Lian Li, Quan Hoang Nguyen, and Jingling Xue. Scratchpad allocation fordata aggregates in superperfect graphs. In Proceedings of the 2007 ACM SIG-PLAN/SIGBED conference on Languages, compilers, and tools for embeddedsystems, pages 207–216. ACM, 2007.
    [57] Michael D. Smith, Norman Ramsey, and Glenn Holloway. A generalized al-gorithm for graph-coloring register allocation. In PLDI’04: Proceedings ofthe ACM SIGPLAN 2004 conference on Programming language design andimplementation, pages 277–288. ACM, 2004.
    [58] G. J. Chaitin. Register allocation & spilling via graph coloring. In SIGPLAN’82: Proceedings of the 1982 SIGPLAN symposium on Compiler construction,pages 98–101. ACM Press, 1982.
    [59] Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graphcoloring register allocation. ACM Trans. Program. Lang. Syst., 16(3):428–455,1994.
    [60] Lal George and Andrew W. Appel. Iterated register coalescing. ACM Trans.Program. Lang. Syst., 18(3):300–324, 1996.
    [61] Fred C. Chow and John L. Hennessy. The priority-based coloring approach toregister allocation. ACM Trans. Program. Lang. Syst., 12(4):501–536, 1990.
    [62] Donald E. Knuth. The art of computer programming, volume 3: (2nd ed.) sort-ing and searching. Addison Wesley Longman Publishing Co., Inc., RedwoodCity, CA, USA, 1998.
    [63] Abhishek Das, William J. Dally, and Peter Mattson. Compiling for streamprocessing. In PACT’06: Proceedings of the 15th international conferenceon Parallel architectures and compilation techniques, pages 33–42, New York,NY, USA, 2006. ACM.
    [64]邓宇.基于图着色的存储层次优化技术研究.博士学位论文,长沙:国防科学技术大学, 2007.
    [65] Randy Allen and Ken Kennedy. Vector register allocation. IEEE Transactionson Computers, 41(10):1290–1317, 1992.
    [66] Marc Gonz`alez, Nikola Vujic, Xavier Martorell, Eduard Ayguad′e, AlexandreE. Eichenberger, Tong Chen, Zehra Sura, Tao Zhang, Kevin O’Brien, andKathryn O’Brien. Hybrid access-specific software cache techniques for the cellbe architecture. In PACT’08: Proceedings of the 17th international confer-ence on Parallel architectures and compilation techniques, pages 292–302, NewYork, NY, USA, 2008. ACM.
    [67] Tong Chen, Haibo Lin, and Tao Zhang. Orchestrating data transfer for thecell/b.e. processor. In ICS’08: Proceedings of the 22nd annual internationalconference on Supercomputing, pages 289–298, New York, NY, USA, 2008.ACM.
    [68] Tong Chen, Tao Zhang, Zehra Sura, and Mar Gonzales Tallada. Prefetchingirregular references for software cache on cell. In CGO’08: Proceedings of thesixth annual IEEE/ACM international symposium on Code generation andoptimization, pages 155–164, New York, NY, USA, 2008. ACM.
    [69] Tao Liu, Haibo Lin, Tong Chen, John Kevin O’Brien, and Ling Shao. Dbdb:optimizing dmatransfer for the cell be architecture. In ICS’09: Proceedings ofthe 23rd international conference on Supercomputing, pages 36–45, New York,NY, USA, 2009. ACM.
    [70] C. Benthin, I. Wald, M. Scherbaum, and H. Friedrich. Ray tracing on the cellprocessor. In IEEE Symposium on Interactive Ray Tracing 2006, page 15–23,2006.
    [71] S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit andexplicit optimizations for stencil computations. In MSPC’06: Proceedingsof the 2006 Workshop on Memory System Performance and Correctness, page51–60, New York, NY, USA, 2006. ACM.
    [72] Mark Silberstein, Assaf Schuster, Dan Geiger, Anjul Patney, and John D.Owens. E?cient computation of sum-products on gpus through software-managed cache. In ICS’08: Proceedings of the 22nd annual internationalconference on Supercomputing, pages 309–318, New York, NY, USA, 2008.ACM.
    [73] Ozcan Ozturk, Mahmut Kandemir, Mary Jane Irwin, and Suleyman Tosun.Multi-level on-chip memory hierarchy design for embedded chip multiproces-sors. In ICPADS’06: Proceedings of the 12th International Conference onParallel and Distributed Systems, pages 383–390, Washington, DC, USA, 2006.IEEE Computer Society.
    [74] Hiroshige Hayashizaki, Yutaka Sugawara, Mary Inaba, and Kei Hiraki.Mcamp: communication optimization on massively parallel machines with hi-erarchical scratch-pad memory. In PACT’08: Proceedings of the 17th interna-tional conference on Parallel architectures and compilation techniques, pages102–111, New York, NY, USA, 2008. ACM.
    [75] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem,Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken,William J. Dally, and Pat Hanrahan. Sequoia: programming the memoryhierarchy. In SC’06: Proceedings of the 2006 ACM/IEEE conference on Su-percomputing, page 83, New York, NY, USA, 2006. ACM.
    [76] Timothy J. Knight, Ji Young Park, and Manman Ren et al. Compilationfor explicitly managed memory hierarchies. In PPoPP’07: Proceedings ofthe 12th ACM SIGPLAN symposium on Principles and practice of parallelprogramming, pages 226–236, New York, NY, USA, 2007. ACM.
    [77] Mike Houston, Ji-Young Park, Manman Ren, Timothy Knight, Kayvon Fa-tahalian, Alex Aiken, William Dally, and Pat Hanrahan. A portable runtimeinterface for multi-level memory hierarchies. In PPoPP’08: Proceedings ofthe 13th ACM SIGPLAN Symposium on Principles and practice of parallelprogramming, pages 143–152, New York, NY, USA, 2008. ACM.
    [78] Manman Ren, Ji Young Park, Mike Houston, Alex Aiken, and William J.Dally. A tuning framework for software-managed memory hierarchies. InPACT’08: Proceedings of the 17th international conference on Parallel ar-chitectures and compilation techniques, pages 280–291, New York, NY, USA,2008. ACM.
    [79] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J.Ramanujam, Atanas Rountev, and P. Sadayappan. Automatic data movementand computation mapping for multi-level parallel architectures with explicitlymanaged memories. In PPoPP’08: Proceedings of the 13th ACM SIGPLANSymposium on Principles and practice of parallel programming, pages 1–10,New York, NY, USA, 2008. ACM.
    [80] Martin Charles Golumbic. Algorithmic Graph Theory and Perfect Graphs(Annals of Discrete Mathematics, Vol 57). North-Holland Publishing Co.,Amsterdam, The Netherlands, The Netherlands, 2004.
    [81] Michael R. Garey and David S. Johnson. Computers and Intractability: AGuide to the Theory of NP-Completeness. W. H. Freeman & Co., New York,NY, USA, 1979.
    [82] Janet Fabri. Automatic storage optimization. SIGPLAN Not., 14(8):83–91,1979.
    [83] H. A. Kierstead. The linearity of first-fit coloring of interval graphs. SIAM J.Discret. Math., 1(4):526–530, 1988.
    [84] H. A. Kierstead. A polynomial time approximation algorithm for dynamicstorage allocation. Discrete Math., 87(2-3):231–237, 1991.
    [85] Jordan Gergov. Approximation algorithms for dynamic storage allocations.In ESA’96: Proceedings of the Fourth Annual European Symposium on Algo-rithms, pages 52–61, London, UK, 1996. Springer-Verlag.
    [86] Jordan Gergov. Algorithms for compile-time memory optimization. In SODA’99: Proceedings of the tenth annual ACM-SIAM symposium on Discrete al-gorithms, pages 907–908, Philadelphia, PA, USA, 1999. Society for Industrialand Applied Mathematics.
    [87] Adam L. Buchsbaum, Howard Karlo?, Claire Kenyon, Nick Reingold, andMikkel Thorup. Opt versus load in dynamic storage allocation. In STOC’03:Proceedings of the thirty-fifth annual ACM symposium on Theory of computing,pages 556–564, New York, NY, USA, 2003. ACM.
    [88] Giuseppe Confessore, Paolo Dell’Olmo, and Stefano Giordani. An approxi-mation result for the interval coloring problem on claw-free chordal graphs.Discrete Appl. Math., 120(1-3):73–90, 2002.
    [89] Sriram V. Pemmaraju, Sriram Penumatcha, and Rajiv Raman. Approximatinginterval coloring and max-coloring in chordal graphs. J. Exp. Algorithmics,10:2.8, 2005.
    [90] Ross M. McConnell and Jeremy P. Spinrad. Modular decomposition and tran-sitive orientation. Discrete Math., 201(1-3):189–241, 1999.
    [91] Ross M. McConnell and Jeremy P. Spinrad. Linear-time transitive orientation.In SODA’97: Proceedings of the eighth annual ACM-SIAM symposium onDiscrete algorithms, pages 19–25, Philadelphia, PA, USA, 1997.
    [92] Amir Pnueli, Abraham Lempel, and Shimon Even. Transitive orientation ofgraphs and identification of permutation graphs. Canad. J. Math, 23:160–175,1971.
    [93] Scott Rixner. Stream processor architecture. Kluwer Academic Publishers,Norwell, MA, USA, 2002.
    [94] Michael Bedford Taylor and Jason Kim et al. The raw microprocessor: Acomputational fabric for software circuits and general-purpose programs. IEEEMicro, 22(2):25–35, 2002.
    [95] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands,and Katherine Yelick. The potential of the cell processor for scientific com-puting. In CF’06: Proceedings of the 3rd conference on Computing frontiers,pages 9–20, New York, NY, USA, 2006. ACM.
    [96] Francois Labonte, Peter Mattson, William Thies, Ian Buck, ChristosKozyrakis, and Mark Horowitz. The stream virtual machine. In PACT’04:Proceedings of the 13th International Conference on Parallel Architectures andCompilation Techniques, pages 267–277, 2004.
    [97]杜静.流体系结构的编译技术研究.博士学位论文,长沙:国防科学技术大学,2008.
    [98] Jayanth Gummaraju and Mendel Rosenblum. Stream programming ongeneral-purpose processors. In MICRO 38: Proceedings of the 38th annualIEEE/ACM International Symposium on Microarchitecture, pages 343–354,Washington, DC, USA, 2005. IEEE Computer Society.
    [99] Jayanth Gummaraju, Joel Coburn, Yoshio Turner, and Mendel Rosen-blum. Streamware: programming general-purpose multicore processors usingstreams. In ASPLOS XIII: Proceedings of the 13th international conference onArchitectural support for programming languages and operating systems, pages297–307, New York, NY, USA, 2008. ACM.
    [100] Manjunath Kudlur and Scott Mahlke. Orchestrating the execution of streamprograms on multicore platforms. In PLDI’08: Proceedings of the 2008 ACMSIGPLAN conference on Programming language design and implementation,pages 114–124, New York, NY, USA, 2008. ACM.
    [101] Abhishek Udupa, R. Govindarajan, and Matthew J. Thazhuthaveetil. Syn-ergistic execution of stream programs on multicores with accelerators. InLCTES’09: Proceedings of the 2009 ACM SIGPLAN/SIGBED conference onLanguages, compilers, and tools for embedded systems, pages 99–108, NewYork, NY, USA, 2009. ACM.
    [102] Steven S. Muchnick. Advanced Compiler Design and Implementation. Aca-demic Press, 1997.
    [103] Michael R. Garey and David S. Johnson. Computers and Intractability; AGuide to the Theory of NP-Completeness. W. H. Freeman & Co., New York,NY, USA, 1990.
    [104] A. B. Kempe. On the geographical problem of the four colors. AmericanJournal of Mathemaics 2, pages 193–200, 1879.
    [105] Andrew W. Appel and Maia Ginsburg. Modern Compiler Implementation inC. Cambridge University Press, 1998.
    [106] D. Bernstein, M. Golumbic, y. Mansour, R. Pinter, D. Goldin, H. Krawczyk,and I. Nahshon. Spill code minimization techniques for optimizing compliers.In PLDI’89: Proceedings of the ACM SIGPLAN 1989 Conference on Pro-gramming language design and implementation, pages 258–263. ACM, 1989.
    [107] Preston Briggs, Keith D. Cooper, and Linda Torczon. Rematerialization. SIG-PLAN Not., 27(7):311–321, 1992.
    [108] Priyadarshan Kolte and Mary Jean Harrold. Load/store range analysis forglobal register allocation. In PLDI’93: Proceedings of the ACM SIGPLAN1993 conference on Programming language design and implementation, pages268–277, New York, NY, USA, 1993. ACM.
    [109] Peter Bergner, Peter Dahl, David Engebretsen, and Matthew O’Keefe. Spillcode minimization via interference region spilling. In PLDI’97: Proceedingsof the ACM SIGPLAN 1997 conference on Programming language design andimplementation, pages 287–295, New York, NY, USA, 1997. ACM.
    [110] Keith D. Cooper and L. Taylor Simpson. Live range splitting in a graphcoloring register allocator. In CC’98: Proceedings of the 7th InternationalConference on Compiler Construction, pages 174–187, London, UK, 1998.Springer-Verlag.
    [111] Preston Briggs, Keith D. Cooper, Ken Kennedy, and Linda Torczon. Coloringheuristics for register allocation. SIGPLAN Not., 39(4):283–294, 2004.
    [112] Takuya Nakaike, Tatsushi Inagaki, Hideaki Komatsu, and Toshio Nakatani.Profile-based global live-range splitting. In PLDI’06: Proceedings of the 2006ACM SIGPLAN conference on Programming language design and implemen-tation, pages 216–227, New York, NY, USA, 2006.
    [113] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F.Kenneth Zadeck. E?ciently computing static single assignment form and thecontrol dependence graph. ACM Trans. Program. Lang. Syst., 13(4):451–490,1991.
    [114] Jinpyo Park and Soo-Mook Moon. Optimistic register coalescing. ACM Trans.Program. Lang. Syst., 26(4):735–765, 2004.
    [115] Max Hailperin. Comparing conservative coalescing criteria. ACM Trans. Pro-gram. Lang. Syst., 27(3):571–582, 2005.
    [116] Florent Bouchez, Alain Darte, and Fabrice Rastello. On the complexity ofregister coalescing. In CGO’07: Proceedings of the International Symposiumon Code Generation and Optimization, pages 102–114, Washington, DC, USA,2007. IEEE Computer Society.
    [117] Florent Bouchez, Alain Darte, and Fabrice Rastello. Advanced conservativeand optimistic register coalescing. In CASES’08: Proceedings of the 2008 in-ternational conference on Compilers, architectures and synthesis for embeddedsystems, pages 147–156, New York, NY, USA, 2008. ACM.
    [118] Steven R. Vegdahl. Using node merging to enhance graph coloring. SIGPLANNot., 34(5):150–154, 1999.
    [119] David Callahan and Brian Koblenz. Register allocation via hierarchical graphcoloring. SIGPLAN Not., 26(6):192–203, 1991.
    [120] Brian R. Nickerson. Graph coloring register allocation for processors withmulti-register operands. SIGPLAN Not., 25(6):40–52, 1990.
    [121] Preston Briggs, Keith D. Cooper, and Linda Torczon. Coloring register pairs.ACM Lett. Program. Lang. Syst., 1(1):3–13, 1992.
    [122]张英.面向科学计算流处理器的编译存储优化技术研究.博士学位论文,长沙:国防科学技术大学, 2008.
    [123] Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, and William J. Dally. Streamregister files with indexed access. In HPCA’04: Proceedings of the 10th In-ternational Symposium on High Performance Computer Architecture, page 60,Washington, DC, USA, 2004. IEEE Computer Society.
    [124] J. Xue. A Loop Tiling Theory for Optimizing Compilers. Kluwer AcademicPublishers, 2000.
    [125]许卓群,张乃孝,杨冬青,唐世渭.数据结构.北京:高等教育出版社, 1999.
    [126] Matthew B. Squire. Generating the acyclic orientations of a graph. J. Algo-rithms, 26(2):275–290, 1998.
    [127] Valmir C. Barbosa and Jayme L. Szwarcfiter. Generating all the acyclic ori-entations of an undirected graph. Inf. Process. Lett., 72(1-2):71–74, 1999.
    [128] R. P. Stanley. Acyclic orientations of graphs. Discrete Math, 5:171–178, 1973.
    [129] Nathan Linial. Hard enumeration problems in geometry and combinatorics.SIAM J. Algebraic Discrete Methods, 7(2):331–335, 1986.
    [130] Leslie G. Valiant. The complexity of computing the permanent. TheoreticalComputer Science, 8:189–201, 1979.
    [131] Douglas B. West. Introduction To Graph Theory. Prentice Hall, 1996.
    [132] S. Louis Hakimi, Edward F. Schmeichel, and Neal E. Young. Orienting graphsto optimize reachability. Inf. Process. Lett., 63(5):229–235, 1997.
    [133] Pinar Heggernes, Federico Mancini, and Charis Papadopoulos. Minimal com-parability completions of arbitrary graphs. Discrete Appl. Math., 156(5):705–718, 2008.
    [134] Dieter Kratsch, Ross M. McConnell, Kurt Mehlhorn, and Jeremy P. Spinrad.Certifying algorithms for recognizing interval graphs and permutation graphs.In SODA’03: Proceedings of the fourteenth annual ACM-SIAM symposiumon Discrete algorithms, pages 158–167, Philadelphia, PA, USA, 2003.
    [135] A.V.Aho, J.E.Hopcroft, and J.D.Ullman. The Design and Analysis of Com-puter Algorithms. Addison Wesley, 1974.
    [136]徐俊明.图论及其应用.第二版,合肥:中国科学技术大学出版社, 2004.
    [137] G. Chen, O. Ozturk, M. Kandemir, and M. Karakoy. Dynamic scratch-padmemory management for irregular array access patterns. In DATE’06: Pro-ceedings of the conference on Design, automation and test in Europe, pages931–936, 3001 Leuven, Belgium, Belgium, 2006. European Design and Au-tomation Association.
    [138] NVIDIA. Quadro NVS 290, http://www.nvidia.com/page/quadronvs.html.
    [139] Tong Sun and Qing Yang. A comparative analysis of cache designs for vectorprocessing. IEEE Transactions on Computers, 48(3):331–344, 1999.
    [140] Alexandre E. Eichenberger, Peng Wu, and Kevin O’Brien. Vectorization forsimd architectures with alignment constraints. SIGPLAN Not., 39(6):82–93,2004.
    [141] Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian,Mark Horowitz, and Christos Kozyrakis. Comparing memory systems for chipmultiprocessors. In ISCA’07: Proceedings of the 34th annual internationalsymposium on Computer architecture, pages 358–368, New York, NY, USA,2007. ACM.
    [142] Sumesh Udayakumaran and Rajeev Barua. An integrated scratch-pad alloca-tor for a?ne and non-a?ne code. In DATE’06: Proceedings of the conferenceon Design, automation and test in Europe, pages 925–930, 2006.
    [143] Doosan Cho, Ilya Issenin, Nikil Dutt, Jonghee W. Yoon, and Yunheung Paek.Software controlled memory layout reorganization for irregular array accesspatterns. In CASES’07: Proceedings of the 2007 international conference onCompilers, architecture, and synthesis for embedded systems, pages 179–188,New York, NY, USA, 2007. ACM.
    [144] Angel Dominguez, Sumesh Udayakumaran, and Rajeev Barua. Heap dataallocation to scratch-pad memory in embedded systems. J. Embedded Comput.,1(4):521–540, 2005.
    [145] Angel Dominguez, Nghi Nguyen, and Rajeev K. Barua. Recursive functiondata allocation to scratch-pad memory. In CASES’07: Proceedings of the2007 international conference on Compilers, architecture, and synthesis forembedded systems, pages 65–74, New York, NY, USA, 2007. ACM.
    [146] Nghi Nguyen, Angel Dominguez, and Rajeev Barua. Memory allocation forembedded systems with a compile-time-unknown scratch-pad size. In CASES’05: Proceedings of the 2005 international conference on Compilers, archi-tectures and synthesis for embedded systems, pages 115–125, New York, NY,USA, 2005. ACM.
    [147] Bernhard Egger, Jaejin Lee, and Heonshik Shin. Scratchpad memory manage-ment for portable systems with a memory management unit. In EMSOFT’06:Proceedings of the 6th ACM & IEEE International conference on Embeddedsoftware, pages 321–330, New York, NY, USA, 2006. ACM.
    [148] Hyungmin Cho, Bernhard Egger, Jaejin Lee, and Heonshik Shin. Dynamicdata scratchpad memory management for a memory subsystem with an mmu.SIGPLAN Not., 42(7):195–206, 2007.
    [149] Bernhard Egger, Jaejin Lee, and Heonshik Shin. Scratchpad memory man-agement in a multitasking environment. In EMSOFT’08: Proceedings of the8th ACM international conference on Embedded software, pages 265–274, NewYork, NY, USA, 2008. ACM.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700