基于词汇链和PageRank的多文档自动文摘研究

设为首页

收藏本站

网站地图 | English | 公务邮箱

远程访问

NSTL服务站

基于词汇链和PageRank的多文档自动文摘研究

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Research on Lexical Chains and PageRank Based Multi-document Summarization
作者：肖欣延
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：多文档自动文摘 ; 词汇链 ; PageRank
英文关键词：Multi-document summarization ; Lexical chains ; PageRank
学位年度：2008
导师：张东站
学科代码：081202
学位授予单位：厦门大学
论文提交日期：2008-05-01

摘要

随着网络信息的剧增,网络上的信息重复性很大。同一主题的相关文档通常是成千上万,它们的内容相似,但又有所不同,各有侧重点。人们迫切需要一种能够以简洁连贯的语言提供同一主题的多文档集合中全面而重要的信息的工具,多文档自动文摘技术顺应这种需求而产生。多文档自动文摘可以将多篇同一主题的文档进行汇总,提供给人们简洁、全面的信息,将人们从繁琐、冗余的信息中解脱出来。多文档自动文摘是信息时代发展到一定程度的必然需要,由于多文档自动文摘有良好的理论研究价值和应用前景,它已经成为文本处理领域的研究热点之一。
     本文首先对自动文摘的分类及发展历程进行简述,然后分别介绍单文档自动文摘及多文档自动文摘的相关技术。在此基础上,讨论了自动文摘的发展方向。
     其次详细介绍词汇链的概念、传统的构造算法。同时,在分析传统构造算法优缺点的基础上,提出了一种新的两阶段词汇链构造算法,实验表明,此算法提高了准确率并保证了较好的效率。
     然后介绍基于图的排序的方法及PageRank算法,讨论了如何将基于图的排序方法应用到文本处理中,进而介绍基于PageRank的句子抽取。
     最后,详细介绍基于词汇链及PageRank的多文档自动文摘系统。该系统采用词汇链表示多文档集合的子主题结构,对子主题排序,然后基于PageRank算法在各个子主题中选取句子生成文摘。这种方法能够保证文摘对多文档集合的各个重要子主题有较好的反映,而文摘本身冗余度较低。实验表明,这种综合的方法所生成的文摘质量较高。
With the explosion of information in internet, the redundancy of information has become a serious problem. There are often thousands of documents related to one topic. Most of their contents are quite similar, but also different in their emphasis. It is necessary to develop a tool that can provide the general and most important information in multi-document with brief and coherent language, thus, multi-document summarization is put forward for settling this problem. Multi-document summarization can generate a brief, fluent summary for multi-document, and finally release people from trivial and redundant information. Multi-document summarization is the inevitability of the developing of information age. With good theory value and prospect of application, multi-document summarization has become research spot in the domain of text processing.
     In this paper, we first briefly introduce the classes and development history of automatic summarization, and describe the traditional method of single document summarization and multi-document summarization. Then we discuss the future direction of research and development of automatic summarization.
     Secondly, we describe the concepts and traditional construction algorithms of lexical chains in detail. By analyzing advantages and disadvantages of traditional methods, we propose a new method called two phases lexical chains construct algorithm. Experimental results show that this method can improve the accuracy and has good efficiency.
     Then, we introduce graph-based ranking method and PageRank algorithm, and discuss the key problems in using graph-based ranking in text processing. Furthermore, we propose a PageRank-based sentence extraction method.
     Finally, we introduce lexical chains and PageRank based multi-document summarization system. This system used lexical chains to analyze the subtopic structure of the documents, sort the subtopics, and use PageRank algorithm to extract sentences in each subtopics to produce final Summary. Summary generated by this method can reflect all the important subtopics well, and also has a lower redundancy. Experimental results show that this integrated approach generate summary with high quality.

引文

[1]H.P.Luhn.The Automatic Creation of Literature Abstracts[J].IBM Journal of Research and Development,1958,2(2):159-165.
    [2]K.S.Jones,E.N Brigitte.Introduction:Automatic Summarizing[J].Information Processing &Management,1995,31(5):625-630.
    [3]秦兵,刘挺,李生.多文档自动文摘综述[J].中文信息学报,2005,Vol.19(06):13-20.
    [4]http://duc.nist.gov/[Z].
    [5]Jade Goldstein,Mark Kantrowitz,Vibhu Mittal,Jaime Carbonell.Summarizing Text Documents:Sentence Selection and Evaluation Metrics[C].In Proceedings of SIGIR-99,Berkeley,CA,1999:121-128.
    [6]Dragomir R.Radev,Hongyan Jing,Malgorzata Budzikowska.Centroid-based Summarization of Multiple Documents:Sentence Extraction,Utility-based Evaluation,and User Studies[C].In ANLP/NAACL 2000 Workshop,Seattle,Washington,USA,April 2000,21-29.
    [7]Endre Boros,Paul B.Kantor,David J.Neu.A Clustering Based Approach to Creating Multi-Document Summaries[C].In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,New Orleans,LA,2001.
    [8]刘挺,王开铸.基于篇章多级依存结构的自动文摘研究[J].计算机研究与发展,1999,32(4):479-488.
    [9]钟义信.自然语言理解的全信息方法论[J].北京邮电大学学报,2004 27(4):1-12.
    [10]姚天顺等.自然语言理解—一种让机器懂得人类语言的研究[M].清华大学出版社,厂西科学技术出版社,1995.
    [11]http://ir.hit.edu.cn/[Z].
    [12]刘德喜.基于基本要素的多文档自动文摘研究[D].武汉大学,博士学位论文,2007.
    [13]刘挺,王开铸.自动文摘的四种主要方法[J].情报学报,1999,18(1):10-20.
    [14]哈罗德·博科,查尔斯·L·贝尼埃合著,赖茂生,王知津合译[M].文摘的概念与方法.书目文献出版社,1991.
    [15]B.A.Mathis,J.E.Rush.Abstracting[J].Encyclopedia of Computer and Tehcnology,Vol.1,New York:Marcel Dekker Inc.,1975,102-142.
    [16]L.B.Doyle.Indexing and Abstracting by Association[J].American Documentation,October,1962,379-390.
    [17]R.Brandow,K.Mitze,L.F.Rau.Automatic Condensation of Electronic Publications by Sentence Selection[J].Information Processing & Management,1995,31(5):675-685.
    [18]H.P.Edmundson,R.E.Wyllys.Automatic Abstracting and Indexing—Survey and Recommendations[J].Communications of the ACM,1961,4(5):226-234.
    [19]H.P.Edmundson.Automatic Abstracting[J],TRW Computer Division.Thompson Ram Wooldridge,Inc.,Canoga Park,California:1963,AD 406 155.
    [20]H.P.Edmundson.Problems of automatic abstracting[J].Communications of ACM,1964,7(4):259-263.
    [21]H.P.Edmundson.New methods in automatic abstracting[J].Journal of the Association for Computing Machinery,1969,16(2):264-285.
    [22]J.J.Pollock,A.Zamora.Automatic Abstracting Research at Chemical Abstracts Service[J].Journal of Chemical Information and Computer Sciences,1975,15(4):226-232.
    [23]刘开瑛,郭炳炎[M].自然语言处理.科学出版社,1991.
    [24]马希文,李小滨,徐越.自然语言处理与自动文摘[J].智能技术与系统基础,1988,99-117.
    [25]L.F.Rau,P.S.Jacobs,Uri Zernik.Information Extracting and Text Summarization Using Linguistic Knowledge Acquisition[J].Information Processing & Management,1989,25(4):419-428.
    [26]P.S.Jacobs,L.F.Rau.Scisor:Extracting Information from Online News[J].Communication of the ACM,1990,33(11):88-97.
    [27]D.Fum,G.Guida,C.Tasso.Forward and Backward Reasoning in Automatic Abstracting[J].COLING 82,1982,83-88.
    [28]Seiji Miike,Etsuo Itoh,Kenji Ono,Kazuo Sumita.A Full Text Retrieval System with a Dynamic Abstract Generation Function[J].SIGIR Forum,1994,152-161.
    [29]K.Ono,K.Sumita,S.Miike.Abstract Generation Based on Rhetorical Structure Extraction [J].COLING 94,Kyoto,August,1994,344-348.
    [30]T.Maeda.An Approach toward Functional Text Structure Analysis of Scientific and Technical Documents[J].Information Processing & Management,1981,17(6):329-339.
    [31]J.G.Carbonell,J.Goldstein.The Use of MMR:Diversity-based Reranking for Reordering Documents and Producing Summaries[C].In Proceedings of SIGIR-98,Melbourne,Australia,Aug.1998.
    [32]王厚峰.指代消解的基本方法和实现技术[J].中文信息学报,Vol.16(6).
    [33] Kevin Knight. Summarization beyond sentence extraction: A probabilistic approach to sentence compression[J].Artificial Intelligence, 2002,91-107.
    [34] Vincent Vandeghinste, Yi Pan. Sentence Compression for Automated Subtitling: A Hybrid Approach [C]. In Proceedings of the 42nd Meeting of the Association of Computational Linguistics (ACL),2004.
    [35] Jenine Turner, Eugene Charniak. Supervised and Unsupervised Learning for Sentence Compression[C]. In Proceedings of the 43rd Meeting of the Association of Computational Linguistics(ACL),2005.
    [36] Yuya Unno, Takashi Ninomiya, Yusuke Miyao, Jun'ichi Tsujii. Trimming CFG Parse Trees for Sentence Compression Using Machine Learning Approaches[C], In Proceedings of the 44th Meeting of the Association of Computational Linguistics(ACL),2006.
    [37] Nitin Madnani, David Zajic, Bonnie Dorr. Multiple Alternative Sentence Compressions for Automatic Text Summarization[C].In Proceedings of the Document Understanding Conference (DUC07),2007.
    [38] R.Barzilay,E. Elhadad,K. McKeown. Inferring Strategies for Sentence Ordering in Multidocument Summarization[J]. Journal of Artificial Intelligence Research(JAIR),2002, 17:35-55.
    [39] N. Okazaki,Y. Matsuo,M. Ishizuka. Coherent Arrangement of Sentences Extracted from Multiple Newspaper Articles[C]. In Proceedings of PRICAI 2004,2004:882-891.
    [40] M. Lapata. Probabilistic Text Structuring: Experiments with Sentence Ordering[C]. In Proceedings of the 41st Meeting of the Association of Computational Linguistics,2003:545-552.
    [41] CY. Lin, E. Hovy. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics [C]. In Proceeding of the Human Language Technology Conference,Edmonton. 2003:71-78.
    [42] Ani Nenkova,Rebecca Passonneau. Evaluating Content Selection in Summarization:the Pyramid Method[C]. In Proceeding of HLT/NAACL 2004,2004.
    [43] J Morris, G Hirst. Lexical Cohesion Computed by Thesaural relations as an Indicator of the Structure of Text[J]. Computational Linguistics , 1991, 17 (1): 21-48.
    [44] Regina Barzilay, Michael Elhadad. Using Lexical Chains for Text Summarization[C] . In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS'97), Madrid ,1997.
    [45] Lei Yu,Jia Ma,Fuji Ren,Shingo Kuroiwa. Automatic Text Summarization Based on Lexical Chains and Structural Features[C]. In Proceedings of 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007), 2007.
    [46] Ercan G , Cicekli Y. Using lexical chains for keyword extraction[J]. Information Processing & Management, NOV 2007,43 (6): 1705-1714.
    [47] http://wordnet.princeton.edu/[Z].
    [48] G Hirst,D Stonge. Lexical Chains Representations of Context for the Detection and Correction of Malapropisms[M]. In C. Fellbaum, editor, WordNet:An electronic lexical database, 305-332. MIT Press,1998.
    [49] S. Brin,L. Page.The anatomy of a large-scale hypertexrual Web search engine[J]. Computer Networks and ISDN Systems, 30(1-7).
    [50] J.M. Kleinberg. Authoritative sources in a hyperlinked environment[J]. Journal of the ACM, 46(5):604-632.
    [51] Rada Mihalcea, Paul Tarau. TextRank: Bringing Order into Texts[C]. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain,2004.
    [52] Rada Mihalcea. Graph-based ranking algorithms for sentence extraction, applied to text summarization[C]. In Proceedings of the 42nd Meeting of the Association of Computational Linguistics (ACL),2004.
    [53] Gunes Erkan , Dragomir R. Radev. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization[J]. Journal of Artificial Intelligence Research (JAIR), 2004.
    [54] http://hedong.3322.org/archives/000199.hhnl[Z].
    [55] http://sourceforge.net/projects/jwordnet[Z].
    [56] Gunes Erkan , Dragomir R. Radev. The University of Michigan at DUC 2004[C]. In Proceedings of the Document Understanding Conference (DUC04), Boston, USA ,2004.
    [57] Chali, Y., Kolla, M.: University of lethridge summarizer at DUC04[A]. In Proceedings of the Document Understanding Conference (DUC04), Boston, USA ,2004.
    [58]Doran, W., Stokes, N., Newman, E., Dunnion, J., Carthy, J., Toolan, F.: News story gisting at university college Dublin[C]. In Proceedings of the Document Understanding Conference (DUC04), Boston, USA ,2004.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700