用户名: 密码: 验证码:
基于时间链接分析的页面排序优化算法研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
Web结构挖掘是从WWW的组织结构、Web文档结构和链接关系中推导知识的过程。目前基于网络的组织结构和链接关系进行挖掘的算法主要有两种,其中有代表性的算法主要是Larry Page等人设计的PageRank算法。通过分析和研究PageRank、HITS,TimedPagrank等基于链接结构的搜索结果排名算法,发现传统的页面排序算法偏重于旧网页,使得一些旧的页面经常出现在检索结果的前面。本文引入时间链接分析,使用爬虫抓起页面时HTTP协议反馈回来的修改时间作为页面和链接的时间,并综合考虑页面的出入链接个数和时间来计算页面的权重值。所开发出的WTPR算法,能使新网页集在排序中上升,同时,高质量的旧网页比普通的旧网页能获得较高的排序值。
     本文研究页面排序算法,主要做了如下工作:
     首先介绍Web结构数据挖掘,对Web链接分析的工作原理及相关定义进行了详细的介绍,分析了Web链接分析的研究现状和主要工作,为下面章节介绍基于超链接分析的页面排序算法打下了坚实的基础。
     其次针对PageRank算法存在的这些缺陷,引入时间链接分析,通过分析爬虫Websphnix抓起页面时HITP协议反馈回来的网页最后修改时间来表示网页年龄,并在此基础上对网络的组织结构和链接质量以及时间序列进行挖掘。根据网页年龄,提出了基于网页年龄衰减的页面排序算法Age-WPR,并进行了实验验证。
     继而针对静态网页年龄不能满足当前网络的动态变化,以及页面时间的不确定性,本文提出了兴趣区间的概念,并具体定义了动态网络环境下节点和链接的时间戳,并在此基础上提出了新鲜度概念,以区分新旧页面,然后结合网页节点质量因子开发出WTPR算法,从而克服现有Web超链接分析中的不足。
     最后概要介绍了java版页面排序模块的实现过程,并给出了PageRank计算的大体思路,最终根据改进的页面排序算法对网页快照数据进行实验验证,通过本程序验证,确定了WTPR算法中的权重因子,并且这些优化策略改进了搜索引擎的页面排序结果,同时保证了新旧两种页面的排序优化。
Web structure mining tries to discover the knowledge from the link structure of WWW of the hyperlinks,Web document and interlinkage at the inter-document level.At present there are mainly two mining algorithms based on the web's link structure and interlinkage,and the typical one of which is PageRank designed by Larry Page.etc.After analyzing these algorithms of ranking search results like PageRank,HITS and TimedPagrank which are based on the link structure,one defect found is that the traditional rank algorithms favour the old pages which made some old pages listed on the top of the searching results.So the Temporal-Link-Analyse technology is introduced into this paper,using the last-modified timestamp responsed by the HTTP protocol when spider crawling the web as the timestamp of the pages and links.The new improved algorithm WTPR can make the new pages ascend its rank in the result,while the old pages with high quality get higer rank values than common old pages.
     The main contributions of the paper are following:
     First the Web structure mining is briefly introduced,and the principle and the related definition of the Web Link-analyse are presented in detail.Then the Web link-analyse's state of the art and its major contribution is studied,which makes a good foundation to introduce the pageranking algorithm based on the link-analyse.
     Secondly for the limitation of PageRank,the Temporal-Link-Analyse technology is introduced in order to improve it.The last modified time of the pages, which is responsed by the HTTP protocol when the spider Websphnix is crawling the web,is used as the age of the pages on the web,and on this basis to mine the web link structure,link quality and the time serial.Based on the web age the pageranking algorithm Age-WPR is designed,also the experiment is verified.
     Furthermore for the fact that the static page age can not meet the dynamic changes of the web and the web pages' uncertainty,the paper proposed the definition of interest interval,and give a detailed definition of the nodes and links' timestamp in the dynamic web environment.Based on this the novelty value is proposed in order to differentiate the old and new pages.Then combined with the link quality the algorithm WTPR is designed,which overcomes the deficiency of the link-analyse at present.
     Lastly the design and implementation of the pageranking system in java is introduced,and the general computation step is given accordingly.Finally the improved Pageranking algorithm is verified through testing the web pags snapshot. And the weight factors of the WTPR are therefore determined.The experiment improved that the optimizing strategy used by the WTPR algorithm developed can make the old pages decline and new pages rose in the ranking result,while the old pages of high-quality get higher rank value than common old pages.
引文
[1]D.Fetterly,M.Manasse,M.Najork,and J.Wiener.A large-scale study of the evolution of web pages[C].In WW '03:Proceedings of the 12th international conference on World Wide Web,pages 669-678,New York,NY,USA,2003.ACM Press.
    [2]Adamic L.A,Huberman B A.The Web's Hidden Order Communication of the ACM,2001,44(9):55-59.
    [3]A.Ntoulas,J.Cho,and C.Olston.What's new on the web?:the evolution of the web from a search engine perspective[C].In WWW '04:Proceedings of the 13th international conference on World Wide Web,pages 1-12,New York,NY,USA,2004.ACM Press.
    [4]Pitkow J E.Characterizing World Wide Web Ecologies[PhD Thesis].Georgia Institute of Technology,1997-06.
    [5]Weise R,Veles B.HyPursuit:A Hierarchical Network SearchEnginethat Exploits Content-link Hypertext Clustering.In Proceedingsof the7th ACM Conference on Hypertext,1996-03.
    [6]Spertus E.Parasite:Mining Structural Information on the Web.In:Proc of the Sixth International World Wide Web Conference,1997-04.
    [7]Kleinberg J M.Authoritative Sources in a Hyperlinked Environment.Proc.9th ACM Press,New York and Siam Press,Philadelphia,1998:668-677.
    [8]Brin.S and L.Page.The Anatomy of a Large-Scale Hypertextual Web Search Engine.In Proceedings of the 7th International World Wide Web Conference,pages 107-117,1998.
    [9]Henzinger M R.Hyperlink Analysis for the Web.IEEE Internet Computing,2001,5(1):5-50.
    [10]Arasu A,et all Searching the Web1ACM Transactions on Internet Technology,2000,1(1):2-43.
    [11]蒋晓冬,金宇晖,谈征。 网上高质量智能信息检索系统的实现。 计算机工程与科学,1999,21(4):49-53.
    [12]Meghabghab G.Google's Web Page Ranking Applied to Different Topological Web Graph Structures.Journal of the American Society for Information Science and Technology(JASIS),2001,52(9):736-747.
    [13]Meghabghab G.Discovering Authorities and Hubs in Different Topological Web Graph Structures.Information Processing and Management,2002,38(1):111-140.
    [14]Wang Y T,Kitsuregawa M1Link Based Clustering of Web Search Results In:Web Age Information and Management(WAIM' 2001)Berlin:Spinger Verlage Berlin Heidelberg, 2001.225-236.
    [15]Dean J,Henzinger M R1Finding Related Pages in the World Wide WeblComputer Networks,1999,31:1467-1479.
    [16]Bharat K,et al1A Comparison of Techniques to Find Mirrored Hosts on the WWW1JASIS,2001,51(12):1114-1122.
    [17]黄奇,李伟。基于链接的学术性WWW网络资源评价与分类方法。情报学报,2001,20(2):186-192.
    [18]Ingwersen P1The Calculation of Web Impact Factors Journal of Documentation,1998,54(2):236-243.
    [19]Thelwall M.Web Impact Factors and Search Coverage.Journal of Documentation,2000,56(2):185-189.
    [20]Feldman and Dagan.Knowledge discovery in textual databases(kdt).In Proceedings of the First International Conference on Knowledge Discovery and Data mining(KDD-95),pages 112-117,Montreal Canada,1995.
    [21]杨炳儒,李岩,陈新中等.Web结构挖掘[J].计算机工程,2003,29(20):28-30.
    [22]沙勇忠,牛春华.中国信息化优秀企业网站链接分析与网络影响因子测度.兰州大学学报,2004(5):99-108.
    [23]A.Ntoulas,J.Cho,and C.Olston.What's new on the web?:the evolution of the web from a search engine perspective[C].In WWW '04:Proceedings of the 13th international conference on World Wide Web,pages 1-12,New York,NY,USA,2004.ACM Press.
    [24]Klaus Berberich,Michalis Vazirgiannisl,and Gerhard Weikum.T-Rank:Time-Aware Authority Ranking[C].WAW 2004,LNCS 3243,pp.131-142,2004.
    [25]Philip S.Yu,Xin Li,Bing Liu.On the Temporal Dimension of Search[C],WWW 2004,May 17-22,2004,New York,,USA.pp.448-449.
    [26]The Internet Archive.http://www.archive.org/.
    [27]Philip S.Yu,Xin Li,Bing Liu.Adding the Temporal Dimension to Search - A Case Study in Publication Search[C],WI'05,USA,2005.
    [28]Wenpu Xing,Ali Ghorbani.Weighted PageRank Algorithm[C],Proceedings of the Second Annual Conference on Communication Networks and Services R -esearch,IEEE,2004.
    [29]Alberto O.Mendelzon,Davood Rafiei.What do the neighbors think? Computing web page reputations.IEEE Data Engineering Bulletin,Page 9-16,September 2000.
    [30]Jullen G,Stefan MR.Link-based Approaches for Text Retrieval[J].Proceedings of TREC-10,2001:13-16.
    [31]Loetleydes dorff,Michael Curran.Mapping University-Industry-Government Relations on the Internet:the Construction of Indicators for a Knowledge-Based Economy.Cybermetrics:International Journal of Sciento-metrics,Informetrics and Bibliometrics,2000(4).
    [32]Ziv Bar-Yossef,Andrei Z.Broder.Sic Transit Gloria Telae:Towards an Understanding of the Web's Decay.[C]WWW 2004,May17-22,NewYork,USA.
    [33]Einat Amitay,David Carmel.Temporal link analysis[M],IBM Research Lab.2004.
    [34]Masashi Toyoda,Masaru Kitsuregawa.What's Really New On The Web?:Identifying New Pages from a Series of Unstable Web Snapshots[C].IW3C2,Scotland,2006.
    [35]K.Randall,R.Stata,R.Wickremesinghe,and J.Wiener.The Link Database:Fast Access to Graphs of the Web.SRC Research Report 175,Compaq Systems Research Center,2001.
    [36]Yue Liu,Bin Wang,Guo-Jie Li.Adding The Webpage Quality Factors into The Hyperlinks[C],Proceedings of the First International Conference on Machine Learing and Cybernetics,BeiJing,IEEE,2002.
    [37]W3C,Hypertext Transfer Protocol -- HTTP/1.1,Section14:Header Field Definitions.[S].
    [38]Egghe L.(2001).A Noninformetric Analysis of the Relationship between Citation Age and Journal Productivity[J].Journal of the American Society for Information Science and Technology(JASIST),52(5):371-377.
    [39]张岭,马范援.加速评估算法:一种提高Web结构挖掘质量的新方法[J].计算机研究与发展,2004,41(1):98-103.
    [40]黄德才,戚华春。PageRank算法研究[J]。计算机工程,2006,32(4):145-147.
    [41]K.Bharat,A.Broder,M.Henzinger,P.Kumar,and S.Venkatasubramanian.The Connectivity Server:Fast Access to Linkage Information on the Web.In Proceedings of the 7th International World Wide Web Conference,pages 14-18,1998.
    [42]M.Najork and J.L.Wiener.Breadth-First Crawling Yields High-Quality Pages[C].In Proceedings of the 10th International World Wide Web Conference,pages 114-118,2001.
    [43]J.Cho and S.Roy,"Impact of Web Search Engines on Page Popularity"[C],WWW,New York,USA,2004.pp.20-29.
    [44]王继成、潘金贵、张福炎:Web文本挖掘技术研究:计算机研究与发展;第37卷,513-520
    [45]Lads A.Aarnic.Network dynamics:the World Wide Web.The department of apple physics of university.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700