Research on VLIW Compile Optimization Technique for Compute-intensive Embedded Application


English title: Research on VLIW Compile Optimization Technique for Compute-intensive Embedded Application
Author: Guan Maolin (管茂林)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: Embedded Computing ; Stream Processor ; Shared Interconnect Architecture ; Distributed Register File ; Distributed and Hierarchical Register File ; Load Balance ; Variable Classification Scheduling
Degree year: 2012
Advisor: Zhang Chunyuan (张春元)
Discipline code: 0812
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Submission date: 2012-09-01

Abstract

With the continuous development of science and technology, the computing demands of applications keep increasing. Unlike traditional desktop applications, compute-intensive applications are becoming the main workload of microprocessors in many fields, such as scientific research, defense, business, and entertainment, and they are attracting growing attention. Driven by strong application demand, high-end embedded computing has developed very rapidly. The computing performance and power efficiency that embedded applications now demand already exceed the capability of current embedded processors. As an effective way of exploiting parallelism, VLIW technology still plays a very important role in current microprocessor design. VLIW reduces hardware complexity, raises chip frequency, and lowers power consumption, but it also poses a severe challenge to the compiler: processor performance depends more heavily on the quality of the compiler. As VLIW processor architectures continue to evolve and application domains continue to expand, the question of how the compiler can fully exploit the performance and power advantages of the architecture, and extract as much instruction-level parallelism as possible, raises different problems for different microprocessor architectures. Against this background, the author chose VLIW compile optimization techniques for compute-intensive embedded applications as the topic of this dissertation.
This dissertation focuses on several key issues that arise when compute-intensive embedded applications run on VLIW processors, including the software integration of data-level parallel multithreading on VLIW processors, load-balanced instruction scheduling for distributed register files, the design of a partially connected shared interconnect and its instruction scheduling for stream processors, and compiler optimization techniques for energy-efficient microprocessors. The main contributions and innovations are as follows:
     1. We present a novel approach that integrates lightweight data-level parallel threads through compilation for VLIW processors. Multithreading under the OpenCL specification is data-parallel, and each individual thread carries a light load, so executing such threads on a VLIW processor makes it difficult to exploit the processor's full performance. This dissertation implements the software-integrated parallel execution of OpenCL multithreaded programs on the arithmetic clusters of the MASA stream processor. Because data-level parallel threads share the same program structure, the compiler merges the operations of corresponding basic blocks from different threads into one basic block. This enlarges the instruction window available to the scheduler, converts data-level parallelism into instruction-level parallelism, and brings the performance of the VLIW processor into full play. Experimental results show that integrating an appropriate number of threads effectively improves program performance while keeping the demand on processor hardware resources within an acceptable range.
     2. We present register file load-balanced VLIW scheduling (RFBLS) for distributed register files. Load imbalance among distributed register files prevents the registers from being used effectively, and an excessive peak demand on a register file often leads to spills to memory, which reduces performance and offsets the advantages of the structure. By analyzing the control structure of the program and the producer-consumer relationships of variables, this dissertation presents a method for calculating variable lifetimes precisely and quantitatively during instruction scheduling. After each operation is scheduled, the load of every register file is calculated exactly, and variables are preferentially assigned to the register file with the lighter load, which balances the pressure among the register files. Experimental results show that this method effectively reduces the program's peak demand on the distributed register files and reduces spill memory accesses.
     3. We design a partially connected shared interconnect for stream processors and present an instruction scheduling optimization algorithm for it. The large number of functional units and the full-crossbar interconnect in a stream processor make the shared interconnect bus very large, which increases hardware resource overhead, transmission delay, and the difficulty of hardware layout and synthesis. Based on an analysis of program characteristics, this dissertation reduces the size of the shared interconnect bus through I/O unit multiplexing and a partially connected shared interconnect design. It also uses optimized compiler scheduling to exploit the available interconnect resources as fully as possible, which weakens the impact of the partial interconnect on program performance. Experimental results show that the optimized scheduling effectively avoids a sharp decline in program performance and greatly improves the utilization of interconnect resources, and that the partially connected design effectively reduces the hardware cost and energy consumption of the processor.
     4. We present a variable classification scheduling algorithm for the distributed and hierarchical register file (DHRF) structure and implement a Thread-level compiler for an energy-efficient micro-thread processor. This dissertation proposes research on a tera-scale embedded processor, introduces its lowest-level micro-thread processor architecture and the Thread-level programming model, and designs and implements a Thread-level compiler for the micro-thread processor. To reduce power consumption, the micro-thread processor employs a distributed and hierarchical register file structure. Because the capacity of the TORF is extremely small, much data must be stored in the ERF, which makes instruction scheduling considerably more difficult. Based on an analysis of program characteristics, this dissertation presents a variable classification scheduling algorithm for the DHRF structure, which spares programmers the difficulty of manual allocation and optimization. Experimental results show that, compared with the distributed register file structure, the algorithm substantially reduces both the energy consumed by register accesses and the total energy consumption of the processor at the cost of a slight loss in program performance, so that the power advantages of the DHRF structure are fully realized.
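
As an illustration of contribution 1, the following is a minimal Python sketch of how merging corresponding basic blocks from identical threads can widen the scheduling window. It is not the dissertation's compiler: the operation format, the register-renaming scheme, the function names `integrate_threads` and `schedule_length`, and the unit-latency greedy scheduler are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical operation format: (destination, opcode, source registers).
Op = Tuple[str, str, Tuple[str, ...]]


def integrate_threads(block: List[Op], num_threads: int) -> List[Op]:
    """Merge one basic block from `num_threads` identical threads.

    Each thread gets a private copy of every register (suffix _t<k>),
    so the copies are independent and can be scheduled side by side.
    """
    merged: List[Op] = []
    for dest, opcode, srcs in block:
        for t in range(num_threads):
            renamed = tuple(f"{s}_t{t}" for s in srcs)
            merged.append((f"{dest}_t{t}", opcode, renamed))
    return merged


def schedule_length(block: List[Op], issue_width: int) -> int:
    """Greedy in-order list scheduling with unit latency (an assumption).

    Returns the number of cycles needed on a machine that can issue
    `issue_width` operations per cycle.
    """
    ready: Dict[str, int] = {}   # register -> first cycle its value is usable
    issued: Dict[int, int] = {}  # cycle -> operations already issued
    length = 0
    for dest, _, srcs in block:
        # Live-in registers (not defined in the block) are ready at cycle 0.
        cycle = max((ready.get(s, 0) for s in srcs), default=0)
        while issued.get(cycle, 0) >= issue_width:
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        ready[dest] = cycle + 1
        length = max(length, cycle + 1)
    return length


# A dependent chain of three operations: one thread cannot fill a 4-wide machine.
block = [
    ("a", "load", ("p",)),
    ("b", "mul", ("a", "a")),
    ("c", "add", ("b", "a")),
]
single = schedule_length(block, issue_width=4)
merged = schedule_length(integrate_threads(block, 4), issue_width=4)
print(single, merged)
```

In this toy case the four integrated threads finish in the same number of cycles as a single thread, because the independent copies fill issue slots that the dependency chain would otherwise leave empty. Each thread also needs its own registers, which is the hardware-resource cost the abstract mentions.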
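
The load-balancing idea in contribution 2 can be sketched in a similar way. The sketch below is a simplified greedy stand-in, not the RFBLS algorithm itself: it assumes variable lifetimes are already known as half-open cycle intervals, ignores which functional units can reach which register file, and uses the hypothetical function name `assign_balanced`.

```python
from typing import List, Tuple


def assign_balanced(
    lifetimes: List[Tuple[int, int]], num_files: int
) -> Tuple[List[int], List[int]]:
    """Place each variable in the register file under the least pressure.

    `lifetimes` holds (start, end) cycle intervals with end > start,
    listed in the order the scheduler produces the variables.
    Returns the chosen file per variable and the peak load per file.
    """
    horizon = max(end for _, end in lifetimes)
    # load[f][c] = number of live variables in file f during cycle c
    load = [[0] * horizon for _ in range(num_files)]
    placement: List[int] = []
    for start, end in lifetimes:
        # Pressure a file already carries while this variable would be live.
        def pressure(f: int) -> int:
            return max(load[f][start:end])

        target = min(range(num_files), key=pressure)  # ties go to the lowest index
        for cycle in range(start, end):
            load[target][cycle] += 1
        placement.append(target)
    peaks = [max(row) for row in load]
    return placement, peaks


lifetimes = [(0, 4), (1, 3), (2, 5), (0, 2)]
placement, peaks = assign_balanced(lifetimes, num_files=2)
print(placement, peaks)
```

With these four example lifetimes, a single register file would have to hold three live values at once, while the greedy placement keeps the peak at two in either file. A lower peak is what reduces the likelihood of spills.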

