Tiled Stream Processor Architecture
Abstract
Problems introduced by nanometer-scale process technology, such as power dissipation, wire delay, and design complexity, constrain the development of processor architecture, and tiled architectural design is one way to address them. As a processor architecture targeting data-intensive applications, the tiled stream processor can exploit the abundant and inexpensive transistor resources brought by Moore's Law to make the processor architecture scalable. The basic idea of tiled design is to organize computation, storage, and interconnect resources into basic tiled structural units that are relatively simple, distributed in control, and reusable; a large number of such tiles are connected by energy-efficient, scalable on-chip networks. Whether the performance of a tiled stream processor is likewise scalable depends on the programming model it supports, its on-chip memory hierarchy, its on-chip interconnection network, and its computation model. This dissertation studies the tiled stream processor from four aspects: the computation model, the instruction set, the architecture, and the mapping of the stream programming model. The main research contents and contributions comprise the following four parts.
     (1) The principles of dataflow-like computation models are studied, a Single Block Multi Data (SBMD) computation model suited to organizing the computational resources of a tiled stream processor is proposed, and an instruction set, DISC-D, that supports this computation model is designed. SBMD means that multiple data items are processed inside one super-block, and each data item executes according to its own dataflow dependences. Predication converts the control-flow dependences of each data item in the program into dataflow dependences and eliminates control-flow transfers inside the super-block, so that each data item can execute along its own control-flow path. The SBMD model also supports explicit message passing between loop bodies of a program.
     (2) A tiled stream processor architecture, TPA-PD, is designed. TPA-PD organizes its computational resources with a dataflow-like driven computation model and exploits the data locality of applications through a software-managed on-chip memory hierarchy. Following the tiled design philosophy, TPA-PD interconnects the various on-chip resources with multiple on-chip networks.
     (3) The mapping of the stream programming model onto TPA-PD is designed and implemented. TPA-PD supports the StreamC/KernelC stream programming model, which was originally developed for the Imagine stream processor. To run code written in StreamC/KernelC on TPA-PD, a stream-level translator and a kernel-level binary translator are implemented; they translate the stream-level instruction information and the kernel-level microcode of the Imagine platform onto the TPA-PD platform, and the code expansion ratio after translation is below 2.
     (4) A software simulation environment for TPA-PD is implemented, and the effectiveness of the dataflow-like driven computation model and the TPA-PD architectural design is evaluated. The dissertation discusses the scalability of physical block resources, computational resources, and network resources, analyzes the parameter settings of the stream load/store unit, proposes a mechanism for optimizing the execution time of a single super-block, and studies the influence of instruction scheduling algorithms on program performance. Experiments on the simulator show that TPA-PD is architecturally scalable while outperforming stream processors whose computational resources are centrally controlled.
The development of traditional processor architecture is restricted by many problems arising in nanometer technology designs, such as power dissipation, wire delay, and design complexity. Tiled processor architecture is a potential solution to these challenges. The tiled stream processor is an architecture for compute-intensive applications; it can utilize the plentiful and cheap transistor resources introduced by Moore's Law and thereby become scalable. The tiled design method organizes computation, storage, and interconnect resources into basic tiled architectural units, which are relatively simple, distributed, and reusable. A processor can then be composed of many such tiled units, interconnected by highly efficient and scalable on-chip networks. The performance of a tiled stream processor is determined by its programming model, memory hierarchy, NoC (Network-on-Chip), and computation model. The computation model, instruction set, architecture, and mapping of the stream programming model are studied in this dissertation. The major research contributions include:
     (1) Based on the theory of dataflow-like computation models, the Single Block Multi Data (SBMD) computation model is proposed to organize the computational resources of a tiled stream processor, and an instruction set supporting this dataflow-like driven computation model is designed. SBMD means processing multiple data items in a super-block, in which each data item can be processed according to its own dataflow dependences. The control-flow dependences of each data item are converted into dataflow dependences by predicated execution, which eliminates control-flow transfers, so each data item can be processed along a different control-flow path (a small sketch of this predication idea is given after this abstract). Explicit message passing between loops in a program is supported by the SBMD model.
     (2) A dataflow-like driven architecture for the tiled stream processor, called TPA-PD, is designed. A dataflow-like driven computation model is employed to organize the computational resources. To exploit the data locality of applications, a software-managed memory hierarchy is used (a sketch of software-managed data staging is given after this abstract). The tiled method is applied to TPA-PD, in which several on-chip networks connect the various resources.
     (3) The mapping of the stream programming model onto TPA-PD is designed and implemented. TPA-PD adopts StreamC/KernelC, a two-level stream programming language originally developed for the Imagine processor, as its stream programming model (a sketch of the two-level structure is given after this abstract). To run StreamC/KernelC programs on TPA-PD, a stream-level translator and a kernel-level binary translator are implemented; they separately translate the stream-level instruction information and the kernel-level binary microcode of the Imagine platform onto TPA-PD. The translated code size expands by less than a factor of 2 on average.
     (4) An experimental platform for TPA-PD is implemented, and the effectiveness of the dataflow-like computation model and the architectural design is evaluated. The scalability of physical block resources, computational resources, and network resources is discussed; the parameter settings of the stream load/store unit are analyzed; a mechanism for optimizing the execution time of a single super-block is proposed; and instruction scheduling algorithms are studied to improve performance. Experiments on the simulator show that the TPA-PD architecture is not only scalable but also outperforms stream processors in which the computational resources are centrally controlled.
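
To make the predication idea in contribution (1) concrete, the following is a minimal C sketch, illustrative only and not code from the dissertation or the DISC-D instruction set, showing how a control-flow dependence can be rewritten as a dataflow dependence guarded by a predicate; this is what lets every data item inside a super-block follow its own path without branch instructions.

    #include <stdio.h>

    /* Branchy form: the value of y is control-flow dependent on (x > 0). */
    static int branchy(int x) {
        int y;
        if (x > 0)
            y = x * 2;
        else
            y = -x;
        return y;
    }

    /* Predicated form: both candidate values are computed as ordinary dataflow
       operations; a predicate selects between them, so no branch is needed and
       the result depends only on data (p, t1, t2). */
    static int predicated(int x) {
        int p  = (x > 0);    /* predicate */
        int t1 = x * 2;      /* "then" path, guarded by p  */
        int t2 = -x;         /* "else" path, guarded by !p */
        return p ? t1 : t2;  /* select: a pure data dependence */
    }

    int main(void) {
        for (int x = -2; x <= 2; x++)
            printf("x=%d  branchy=%d  predicated=%d\n", x, branchy(x), predicated(x));
        return 0;
    }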
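
Contribution (2) relies on a software-managed on-chip memory hierarchy. The hypothetical C sketch below is not TPA-PD code (LOCAL_WORDS, stage_in, and compute_tile are invented names); it only shows the programming style this implies: data is explicitly staged into a small local buffer that stands in for an on-chip scratchpad, computed on, and written back, so software rather than a hardware cache decides what stays on chip.

    #include <stdio.h>
    #include <stddef.h>

    #define LOCAL_WORDS 256                 /* capacity of the "scratchpad" buffer */

    static float local_buf[LOCAL_WORDS];    /* stands in for software-managed on-chip memory */

    /* Explicitly bring one tile of the input stream on chip (models a stream load). */
    static void stage_in(const float *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            local_buf[i] = src[i];
    }

    /* Operate only on data already resident in the local buffer. */
    static void compute_tile(float *dst, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = local_buf[i] * 2.0f + 1.0f;
    }

    /* Process a long stream tile by tile; locality is managed by software. */
    static void process_stream(const float *in, float *out, size_t total) {
        for (size_t off = 0; off < total; off += LOCAL_WORDS) {
            size_t n = (total - off < LOCAL_WORDS) ? (total - off) : LOCAL_WORDS;
            stage_in(in + off, n);
            compute_tile(out + off, n);
        }
    }

    int main(void) {
        enum { N = 1000 };
        static float in[N], out[N];
        for (int i = 0; i < N; i++) in[i] = (float)i;
        process_stream(in, out, N);
        printf("out[%d] = %.1f\n", N - 1, out[N - 1]);   /* expect 999*2+1 = 1999.0 */
        return 0;
    }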
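
Contribution (3) refers to the two-level StreamC/KernelC model. The sketch below mimics that split in plain C purely for illustration; it is not actual StreamC/KernelC syntax, and scale_kernel and run_kernel are invented names. Kernel-level code works on one record at a time, while stream-level code moves whole streams between kernel invocations.

    #include <stdio.h>
    #include <stddef.h>

    /* Kernel level: pure per-record computation. */
    static float scale_kernel(float record, float gain) {
        return record * gain;
    }

    /* Stream level: apply a kernel to an entire input stream. */
    static void run_kernel(const float *in, float *out, size_t n, float gain) {
        for (size_t i = 0; i < n; i++)
            out[i] = scale_kernel(in[i], gain);
    }

    int main(void) {
        float in[4]  = {1.0f, 2.0f, 3.0f, 4.0f};
        float mid[4], out[4];

        /* Stream-level program: a pipeline of two kernel invocations. */
        run_kernel(in,  mid, 4, 2.0f);   /* first kernel: scale by 2    */
        run_kernel(mid, out, 4, 0.5f);   /* second kernel: scale by 0.5 */

        for (int i = 0; i < 4; i++)
            printf("%.1f ", out[i]);
        printf("\n");
        return 0;
    }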
