基于综合优先度和主机信息的暴雨灾害主题退火爬虫算法
  • English title: Focused Annealing Crawler Algorithm for Rainstorm Disasters Based on Comprehensive Priority and Host Information
  • Authors: 刘景发; 李帆; 蒋盛益
  • Authors (English): LIU Jing-fa; LI Fan; JIANG Sheng-yi; School of Computer & Software, Nanjing University of Information Science & Technology; School of Information Science and Technology, Guangdong University of Foreign Studies
  • Keywords (Chinese): 暴雨灾害; 网络主题爬虫; 综合优先度; 主机信息; 模拟退火算法
  • Keywords (English): Rainstorm disasters; Web focused crawler; Comprehensive priority; Host information; Simulated annealing algorithm
  • Journal code: JSJA
  • Journal title (English): Computer Science
  • Affiliations: School of Computer and Software, Nanjing University of Information Science and Technology; School of Information Science and Technology, Guangdong University of Foreign Studies
  • Publication date: 2019-02-15
  • Publisher: 计算机科学 (Computer Science)
  • Year: 2019
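  • Volume: 46
  • Funding: Major Bidding Project of the National Social Science Fund of China (16ZDA047); National Natural Science Foundation of China (61373016); Natural Science Foundation of Jiangsu Province (BK20181409, BK20171458)
  • Language: Chinese
  • Record ID: JSJA201902037
  • Page count: 8
  • Issue: 02
  • CN: 50-1075/TP
  • Pages: 224-231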
Abstract
The Internet now aggregates a wide variety of information related to rainstorm disasters, but searching webpages manually is inefficient, which makes focused web crawlers important. Building on a general web crawler, and in order to improve the precision of topic-relevance computation and prevent topic drift, this paper proposes a comprehensive priority evaluation method for hyperlinks that combines webpage content with link structure. The priority combines four components: the topic relevance of the link's anchor text, the topic relevance of the webpages that contain the link, the PageRank (PR) value of the webpage the link points to, and the topic relevance of that webpage. In addition, because the search process tends to become trapped in local optima, the paper designs what it describes as the first focused crawler algorithm to combine the crawler's memory of historical host information with simulated annealing (SA). Experiments with rainstorm disasters as the topic show that, for the same number of crawled pages, the proposed algorithm retrieves more topic-relevant webpages than the breadth-first search (BFS) strategy and the optimal priority search (OPS) strategy, and that crawling accuracy improves markedly.
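The four-part link priority and the annealing-style search described in the abstract could be sketched roughly as follows. This is only an illustration: the abstract does not give the paper's formulas, so the equal weights, the assumption that every score lies in [0, 1], the Metropolis acceptance rule, and the function names `comprehensive_priority` and `accept_link` are all hypothetical. The host-information memory is not modeled here.

```python
import math
import random


def comprehensive_priority(anchor_rel, parent_rel, target_pr, target_rel,
                           weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four signals named in the abstract into one score.

    anchor_rel: topic relevance of the link's anchor text
    parent_rel: topic relevance of the page(s) containing the link
    target_pr:  PR value of the page the link points to (assumed normalized)
    target_rel: topic relevance of the page the link points to
    The weighted sum and the default weights are assumptions, not the
    paper's actual formula.
    """
    w1, w2, w3, w4 = weights
    return w1 * anchor_rel + w2 * parent_rel + w3 * target_pr + w4 * target_rel


def accept_link(candidate_priority, current_priority, temperature, rng=random):
    """Metropolis-style acceptance, as commonly used in simulated annealing.

    A link at least as good as the current one is always accepted; a worse
    link is accepted with probability exp(-delta / temperature), which lets
    the crawler occasionally leave a locally optimal region.
    """
    if candidate_priority >= current_priority:
        return True
    if temperature <= 0:
        return False
    delta = current_priority - candidate_priority
    return rng.random() < math.exp(-delta / temperature)


if __name__ == "__main__":
    score = comprehensive_priority(0.8, 0.6, 0.3, 0.7)
    print("priority:", score)
    print("accepted:", accept_link(score, 0.9, temperature=0.5))
```

As the temperature falls, worse links are accepted less and less often, so under these assumptions the search gradually behaves like a best-first crawler.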
